TY - GEN
T1 - Is neuron coverage a meaningful measure for testing deep neural networks?
AU - Harel-Canada, Fabrice
AU - Wang, Lingxiao
AU - Gulzar, Muhammad Ali
AU - Gu, Quanquan
AU - Kim, Miryung
N1 - Publisher Copyright:
© 2020 Owner/Author.
PY - 2020/11/8
Y1 - 2020/11/8
N2 - Recent effort to test deep learning systems has produced an intuitive and compelling test criterion called neuron coverage (NC), which resembles the notion of traditional code coverage. NC measures the proportion of neurons activated in a neural network and it is implicitly assumed that increasing NC improves the quality of a test suite. In an attempt to automatically generate a test suite that increases NC, we design a novel diversity promoting regularizer that can be plugged into existing adversarial attack algorithms. We then assess whether such attempts to increase NC could generate a test suite that (1) detects adversarial attacks successfully, (2) produces natural inputs, and (3) is unbiased to particular class predictions. Contrary to expectation, our extensive evaluation finds that increasing NC actually makes it harder to generate an effective test suite: higher neuron coverage leads to fewer defects detected, less natural inputs, and more biased prediction preferences. Our results invoke skepticism that increasing neuron coverage may not be a meaningful objective for generating tests for deep neural networks and call for a new test generation technique that considers defect detection, naturalness, and output impartiality in tandem.
AB - Recent effort to test deep learning systems has produced an intuitive and compelling test criterion called neuron coverage (NC), which resembles the notion of traditional code coverage. NC measures the proportion of neurons activated in a neural network and it is implicitly assumed that increasing NC improves the quality of a test suite. In an attempt to automatically generate a test suite that increases NC, we design a novel diversity promoting regularizer that can be plugged into existing adversarial attack algorithms. We then assess whether such attempts to increase NC could generate a test suite that (1) detects adversarial attacks successfully, (2) produces natural inputs, and (3) is unbiased to particular class predictions. Contrary to expectation, our extensive evaluation finds that increasing NC actually makes it harder to generate an effective test suite: higher neuron coverage leads to fewer defects detected, less natural inputs, and more biased prediction preferences. Our results invoke skepticism that increasing neuron coverage may not be a meaningful objective for generating tests for deep neural networks and call for a new test generation technique that considers defect detection, naturalness, and output impartiality in tandem.
KW - Adversarial Attack
KW - Machine Learning
KW - Neuron Coverage
KW - Software Engineering
KW - Testing
UR - https://www.scopus.com/pages/publications/85097195733
UR - https://www.scopus.com/pages/publications/85097195733#tab=citedBy
U2 - 10.1145/3368089.3409754
DO - 10.1145/3368089.3409754
M3 - Conference contribution
AN - SCOPUS:85097195733
T3 - ESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
SP - 851
EP - 862
BT - ESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
A2 - Devanbu, Prem
A2 - Cohen, Myra
A2 - Zimmermann, Thomas
PB - Association for Computing Machinery, Inc
T2 - 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020
Y2 - 8 November 2020 through 13 November 2020
ER -