Clustering is a key machine learning technique, used in many high-stakes domains from medicine to self-driving cars. Many clustering algorithms have been proposed, and these algorithms have been implemented in many toolkits. Clustering users assume that clustering implementations are correct, reliable, and, for a given algorithm, interchangeable. We challenge these assumptions. We introduce SmokeOut, an approach and tool that pits clustering implementations against each other (and against themselves) while controlling for algorithm and dataset, in order to find datasets where clustering outcomes differ when they should not, and to measure this difference. We ran SmokeOut on 7 clustering algorithms (3 deterministic, 4 nondeterministic) implemented in 7 widely used toolkits, in a variety of scenarios, on the Penn Machine Learning Benchmark (162 datasets). SmokeOut reveals that clustering implementations are fragile: on a given input dataset and for a given clustering algorithm, clustering outcomes and accuracy vary widely across (1) successive runs of the same toolkit; (2) different input parameters for the same toolkit; and (3) different toolkits.
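The core idea of SmokeOut, differential testing of clustering runs while holding the algorithm and dataset fixed, can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the actual SmokeOut tool: it uses a small pure-Python Lloyd's k-means (standing in for a real toolkit implementation) and runs it twice with different random seeds on the same dataset, then measures the agreement of the two labelings with the Rand index (fraction of point pairs on which the labelings agree). A Rand index below 1.0 flags exactly the kind of run-to-run divergence the abstract describes.

```python
# Toy sketch of differential clustering testing (hypothetical; the real
# SmokeOut compares whole toolkits, not this stand-in k-means).
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=50):
    # Lloyd's algorithm with seeded random initialization
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return labels

def rand_index(a, b):
    # fraction of point pairs on which two labelings agree
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

# Same dataset, same algorithm, two seeds: any disagreement is fragility.
points = [(x, y) for x in (0, 1, 10, 11) for y in (0, 1)]
labels_a = kmeans(points, k=2, seed=0)
labels_b = kmeans(points, k=2, seed=1)
print("pairwise agreement:", rand_index(labels_a, labels_b))
```

Scaling this comparison up, across seeds, parameter settings, and toolkits, and over many benchmark datasets, yields the kind of variability measurements reported in the paper.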