Abstract:
To address the low detection accuracy of traditional vulnerability detection schemes, this paper proposes a function-level source code vulnerability detection scheme that combines two intermediate representations of source code: the code structure graph and the token sequence. First, an extended code property graph (CPG) is extracted and its nodes and edges are embedded; a relational graph convolutional network (RGCN) then processes each edge type differently to generate a graph representation. Second, token sequences are extracted, and the pre-trained model CodeBERT is applied to generate sequence representations. Finally, the outputs of the two preceding steps are fused, and a three-layer fully connected network is applied to carry out vulnerability detection. The scheme is evaluated comprehensively on two types of datasets: synthetic and real-world software. The experimental results show that, compared with existing vulnerability detection schemes based on sequences, graphs, or a combination of both, the proposed scheme achieves significant improvements in accuracy and F1 score, reaching up to 98.99% and 98.11%, respectively. In addition, controlled experiments that vary a single factor at a time further verify the effectiveness of the improvements made at each stage.
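The data flow summarized in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the paper's implementation. The edge-type names ("AST", "CFG", "DDG"), the mean-pooling readout, fusion by concatenation, the layer sizes, and the random weights are all assumptions made for illustration, and a random vector stands in for the CodeBERT sequence embedding.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def rand_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rgcn_layer(node_feats, edges, rel_weights, self_weight):
    """One relational graph convolution step.

    edges is a list of (src, dst, rel) triples. Each edge type has its own
    weight matrix, so different edge types are processed differently.
    Messages are averaged per (target node, edge type) and added to a
    self-loop term before the ReLU.
    """
    out = [matvec(self_weight, h) for h in node_feats]
    counts = {}
    for _, dst, rel in edges:
        counts[(dst, rel)] = counts.get((dst, rel), 0) + 1
    for src, dst, rel in edges:
        msg = matvec(rel_weights[rel], node_feats[src])
        c = counts[(dst, rel)]
        out[dst] = [o + m / c for o, m in zip(out[dst], msg)]
    return [relu(h) for h in out]

def mean_pool(node_feats):
    """Readout (assumed): average node vectors into one graph vector."""
    return [sum(col) / len(node_feats) for col in zip(*node_feats)]

def mlp3(x, layers):
    """Three fully connected layers: ReLU in between, sigmoid at the end."""
    for i, (W, b) in enumerate(layers):
        x = [a + bi for a, bi in zip(matvec(W, x), b)]
        if i < len(layers) - 1:
            x = relu(x)
    return 1.0 / (1.0 + math.exp(-x[0]))

def detect(node_feats, edges, seq_vec, rel_weights, self_weight, layers):
    """Graph branch + sequence branch -> fused vector -> classifier score."""
    graph_vec = mean_pool(rgcn_layer(node_feats, edges, rel_weights, self_weight))
    fused = graph_vec + seq_vec  # list concatenation as the (assumed) fusion
    return fused, mlp3(fused, layers)

rng = random.Random(0)
dim = 4  # toy size; real embeddings are far larger

# Toy function graph: 3 nodes, edges labelled with hypothetical edge types.
node_feats = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
edges = [(0, 1, "AST"), (1, 2, "CFG"), (0, 2, "DDG")]
rel_weights = {r: rand_matrix(dim, dim, rng) for r in ("AST", "CFG", "DDG")}
self_weight = rand_matrix(dim, dim, rng)

# Placeholder for the sequence representation a CodeBERT encoder would give.
seq_vec = [rng.uniform(-1, 1) for _ in range(dim)]

layers = [
    (rand_matrix(8, 2 * dim, rng), [0.0] * 8),
    (rand_matrix(4, 8, rng), [0.0] * 4),
    (rand_matrix(1, 4, rng), [0.0]),
]
fused, score = detect(node_feats, edges, seq_vec, rel_weights, self_weight, layers)
print(f"fused dim = {len(fused)}, vulnerability score = {score:.3f}")
```

Because the weights are random, the printed score carries no meaning; the sketch only shows how a per-edge-type graph convolution, a sequence embedding, and a three-layer classifier might be wired together for a single function.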