TY - GEN
T1 - Does data sampling improve deep learning-based vulnerability detection? Yeas! and Nays!
AU - Yang, Xu
AU - Wang, Shaowei
AU - Li, Yi
AU - Wang, Shaohua
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically and it has been demonstrated promising results at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. Prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously in real world imbalanced data and a 73% drop of F1-score on average across studied approaches. Such a significant performance drop can disable the practical usage of any DLVD approaches. Data sampling is effective in alleviating data imbalance for machine learning models and has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematical and extensive study to assess the impact of data sampling for data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space, typically random oversampling on raw data performs the best among all studied ones (including advanced one SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If the recall is pursued, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches for learning real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DVLD approach cannot reason their prediction based on real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.
AB - Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically and it has been demonstrated promising results at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. Prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously in real world imbalanced data and a 73% drop of F1-score on average across studied approaches. Such a significant performance drop can disable the practical usage of any DLVD approaches. Data sampling is effective in alleviating data imbalance for machine learning models and has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematical and extensive study to assess the impact of data sampling for data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space, typically random oversampling on raw data performs the best among all studied ones (including advanced one SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If the recall is pursued, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches for learning real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DVLD approach cannot reason their prediction based on real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.
KW - Vulnerability detection
KW - data sampling
KW - deep learning
KW - interpretable AI
UR - http://www.scopus.com/inward/record.url?scp=85171735470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171735470&partnerID=8YFLogxK
U2 - 10.1109/ICSE48619.2023.00192
DO - 10.1109/ICSE48619.2023.00192
M3 - Conference contribution
AN - SCOPUS:85171735470
T3 - Proceedings - International Conference on Software Engineering
SP - 2287
EP - 2298
BT - Proceedings - 2023 IEEE/ACM 45th International Conference on Software Engineering, ICSE 2023
PB - IEEE Computer Society
T2 - 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023
Y2 - 15 May 2023 through 16 May 2023
ER -