A Source Code Vulnerability Detection Scheme Based on Code Sequence and Graph Structure

WANG Shouliang

Abstract: To address the low detection accuracy of traditional vulnerability detection schemes, this paper proposes a function-level source code vulnerability detection scheme that combines two intermediate representations of source code: the structure graph and the token sequence. First, an extended code property graph (CPG') is extracted, its nodes and edges are embedded, and a relational graph convolutional network (RGCN) is applied to process each edge type differently, producing a graph representation. Second, the token sequence is extracted and the pre-trained model CodeBert is applied to generate a sequence representation. Finally, the two representations are integrated and a three-layer fully connected network is applied to the result to ensure vulnerability detection performance. The scheme is evaluated comprehensively on two types of datasets: synthetic and real-world software. Experimental results show that, compared with existing sequence-based, graph-based, and combined vulnerability detection schemes, the proposed scheme significantly improves accuracy and F1 score, reaching up to 98.99% and 98.11%, respectively. In addition, controlled single-variable experiments further verify the effectiveness of the improvements made at each stage.
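The pipeline outlined in the abstract (relation-specific graph convolution, a sequence representation, fusion, and a three-layer fully connected classifier) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the edge-type names `ast`, `cfg`, `ddg`, the single RGCN layer with mean normalization, the mean-pooling readout, fusion by concatenation, the tiny dimensions, the random untrained weights, and the `seq_vec` placeholder that stands in for a CodeBert output.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rgcn_layer(h, edges, W_self, W_rel):
    """One relational graph convolution step.

    h: list of node vectors; edges: list of (src, dst, relation).
    Each relation has its own weight matrix, so different edge types
    are processed differently; messages are averaged per relation.
    """
    out = []
    for i in range(len(h)):
        acc = matvec(W_self, h[i])  # self-connection term
        for rel, W in W_rel.items():
            nbrs = [s for (s, d, r) in edges if d == i and r == rel]
            for j in nbrs:
                msg = matvec(W, h[j])
                acc = [a + m / len(nbrs) for a, m in zip(acc, msg)]
        out.append(relu(acc))
    return out

def mean_pool(h):
    """Readout: average node vectors into one graph vector (assumed)."""
    return [sum(col) / len(h) for col in zip(*h)]

def mlp3(x, layers):
    """Three fully connected layers; sigmoid turns the logit into a score."""
    x = relu(matvec(layers[0], x))
    x = relu(matvec(layers[1], x))
    logit = matvec(layers[2], x)[0]
    return 1.0 / (1.0 + math.exp(-logit))

def detect(node_feats, edges, seq_vec, params):
    h = rgcn_layer(node_feats, edges, params["W_self"], params["W_rel"])
    graph_vec = mean_pool(h)
    fused = graph_vec + seq_vec  # list concatenation (assumed fusion)
    return mlp3(fused, params["mlp"])

rng = random.Random(0)
DIM, SEQ_DIM, HID = 4, 3, 5  # toy sizes, hypothetical
params = {
    "W_self": rand_matrix(DIM, DIM, rng),
    "W_rel": {r: rand_matrix(DIM, DIM, rng) for r in ("ast", "cfg", "ddg")},
    "mlp": [
        rand_matrix(HID, DIM + SEQ_DIM, rng),
        rand_matrix(HID, HID, rng),
        rand_matrix(1, HID, rng),
    ],
}
node_feats = [[rng.random() for _ in range(DIM)] for _ in range(3)]
edges = [(0, 1, "ast"), (1, 2, "cfg"), (0, 2, "ddg")]
seq_vec = [rng.random() for _ in range(SEQ_DIM)]  # stand-in for CodeBert output
score = detect(node_feats, edges, seq_vec, params)
print(f"vulnerability score: {score:.3f}")
```

With untrained random weights the printed score carries no meaning; the sketch only shows how the two representations might flow into a single function-level prediction.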
