Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!

Xu Yang, Shaowei Wang, Yi Li, Shaohua Wang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

7 Scopus citations

Abstract

Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically and it has been demonstrated promising results at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. Prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously in real world imbalanced data and a 73% drop of F1-score on average across studied approaches. Such a significant performance drop can disable the practical usage of any DLVD approaches. Data sampling is effective in alleviating data imbalance for machine learning models and has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematical and extensive study to assess the impact of data sampling for data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space, typically random oversampling on raw data performs the best among all studied ones (including advanced one SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If the recall is pursued, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches for learning real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DVLD approach cannot reason their prediction based on real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.

Original languageEnglish (US)
Title of host publicationProceedings - 2023 IEEE/ACM 45th International Conference on Software Engineering, ICSE 2023
PublisherIEEE Computer Society
Pages2287-2298
Number of pages12
ISBN (Electronic)9781665457019
DOIs
StatePublished - 2023
Event45th IEEE/ACM International Conference on Software Engineering, ICSE 2023 - Melbourne, Australia
Duration: May 15 2023May 16 2023

Publication series

NameProceedings - International Conference on Software Engineering
ISSN (Print)0270-5257

Conference

Conference45th IEEE/ACM International Conference on Software Engineering, ICSE 2023
Country/TerritoryAustralia
CityMelbourne
Period5/15/235/16/23

All Science Journal Classification (ASJC) codes

  • Software

Keywords

  • Vulnerability detection
  • data sampling
  • deep learning
  • interpretable AI

Fingerprint

Dive into the research topics of 'Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!'. Together they form a unique fingerprint.

Cite this