Abundant Modalities Offer More Nutrients: Multi-Modal Based Function-level Vulnerability Detection

Software vulnerabilities are weaknesses in software systems that lead to serious cybersecurity problems. Recently, many deep learning-based approaches have been proposed to detect vulnerabilities at the function level by using one or a few modalities of the function (e.g., text representation, graph-based representation), and they have achieved promising performance. However, some of these studies do not fully leverage the available modalities, in particular the underutilized image modality, while those that do represent functions as images make inadequate use of the graph structure underlying the images.

In this paper, we propose MVulD+, a multi-modal function-level vulnerability detection approach that uses multi-modal features of the function (i.e., text representation, graph representation, and image representation) to detect vulnerabilities. Specifically, MVulD+ uses a pre-trained model (i.e., UniXcoder) to learn the semantic information of the textual source code, employs a graph neural network to distill the graph-based representation, and applies computer vision techniques to obtain the image representation while retaining the graph structure of the function. To investigate the effectiveness of MVulD+, we conduct a large-scale experiment (25,816 functions) comparing it with eight state-of-the-art baselines. Experimental results demonstrate that MVulD+ improves on the state-of-the-art baselines by 24.3%-125.7%, 5.2%-31.4%, 40.6%-192.2%, and 22.3%-186.9% in terms of F1-score, Accuracy, Precision, and PR-AUC, respectively.
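For reference, the threshold-based metrics named above can be computed from confusion-matrix counts. The following is a minimal sketch and not the evaluation code of MVulD+. The function names and the example counts are hypothetical, and `relative_improvement` assumes that the reported percentages are relative gains over a baseline score. PR-AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain over a baseline, in percent (assumed reporting convention)."""
    return (ours - baseline) / baseline * 100.0


# Illustrative counts only; these are NOT results from the paper.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
print(metrics)
print(relative_improvement(ours=0.6, baseline=0.4))
```

On an imbalanced dataset, Accuracy alone can look high even when few vulnerable functions are found, which is why Precision, F1-score, and PR-AUC are reported alongside it.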

Overview of MVulD+

An overview architecture of MVulD+

MVulD+ consists of four main phases:

1. Graph extraction: obtains the structural representation of the function.

2. Image generation: transforms the structural graph into a graphical (image) representation.

3. Multi-modal feature fusion: relates the modalities to one another to obtain enriched code representations.

4. Classification: determines whether a function is vulnerable.
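The data flow through the four phases can be sketched in plain Python as follows. This is a toy illustration under strong assumptions, not the MVulD+ implementation: every function name here is hypothetical, the "graph" simply chains consecutive source lines, the "image" is an adjacency matrix, fusion is plain concatenation, and the classifier is a hand-weighted logistic score. The actual approach uses UniXcoder, a graph neural network, and computer vision models in place of these stand-ins.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Graph:
    nodes: List[str]              # toy stand-in: one node per non-empty line
    edges: List[Tuple[int, int]]  # directed edges between node indices


def extract_graph(source: str) -> Graph:
    """Phase 1 (toy): chain consecutive statements into a graph."""
    nodes = [line.strip() for line in source.splitlines() if line.strip()]
    edges = [(i, i + 1) for i in range(len(nodes) - 1)]
    return Graph(nodes, edges)


def generate_image(graph: Graph) -> List[List[int]]:
    """Phase 2 (toy): rasterize the graph as an adjacency-matrix 'image'."""
    n = len(graph.nodes)
    image = [[0] * n for _ in range(n)]
    for src, dst in graph.edges:
        image[src][dst] = 1
    return image


def fuse_features(text_vec, graph_vec, image_vec) -> List[float]:
    """Phase 3 (toy): fuse modalities by concatenating their feature vectors."""
    return list(text_vec) + list(graph_vec) + list(image_vec)


def classify(fused, weights, bias=0.0, threshold=0.5) -> bool:
    """Phase 4 (toy): logistic score over the fused features."""
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold


source = "int f(char *s) {\n  char buf[8];\n  strcpy(buf, s);\n  return 0;\n}"
graph = extract_graph(source)
image = generate_image(graph)

# Hand-made one-dimensional "features" per modality, for illustration only.
text_vec = [float(len(graph.nodes))]
graph_vec = [float(len(graph.edges))]
image_vec = [float(sum(map(sum, image)))]
fused = fuse_features(text_vec, graph_vec, image_vec)

# Made-up weights; a real model learns its parameters from data.
print(classify(fused, weights=[0.1, 0.1, 0.1], bias=-1.0))
```

The point of the sketch is the ordering: the image is derived from the graph rather than directly from the text, so the image modality keeps the graph structure of the function.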

Dataset

In our experiment, we use the Big-Vul dataset provided by Fan et al., one of the largest vulnerability datasets, collected from 348 open-source GitHub projects and spanning 91 different vulnerability types. We normalize the source code by performing three filtering steps on the Big-Vul dataset. Our final dataset contains 25,816 functions in total: 4,069 vulnerable functions and 21,747 non-vulnerable functions. You can download the Big-Vul dataset and process it yourself, or download the preprocessed dataset from HERE and unzip it.
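The counts above imply a noticeable class imbalance, which the short sketch below quantifies. The `balanced_class_weights` helper is hypothetical and uses the common heuristic weight = total / (number of classes × class count). Whether MVulD+ applies any class weighting during training is not stated here, so treat that part as an assumption.

```python
# Counts taken from the dataset description above.
VULNERABLE = 4069
NON_VULNERABLE = 21747

total = VULNERABLE + NON_VULNERABLE
vulnerable_share = VULNERABLE / total          # fraction of vulnerable functions
imbalance_ratio = NON_VULNERABLE / VULNERABLE  # non-vulnerable per vulnerable


def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {label: n_total / (n_classes * count)
            for label, count in counts.items()}


weights = balanced_class_weights(
    {"vulnerable": VULNERABLE, "non_vulnerable": NON_VULNERABLE}
)

print(f"total functions:  {total}")                  # 25816
print(f"vulnerable share: {vulnerable_share:.1%}")   # 15.8%
print(f"imbalance ratio:  {imbalance_ratio:.2f}:1")  # 5.34:1
print(weights)
```

Roughly one function in six is vulnerable, so a classifier that always predicts "non-vulnerable" would already reach about 84% Accuracy while detecting nothing.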

Source Code

Conda environment

Create Conda environment

$ conda env create -f environment.yml

Activate the environment

$ source activate mvuld+

Dataset process

We provide dataset-processing scripts (covering graph extraction and image generation); see utils/README.md for details.

MVulD+:

To train and test the MVulD+ model, use the following commands:

sh train.sh
sh test.sh
