TY - CONF
T1 - Implementation-induced Inconsistency and Nondeterminism in Deterministic Clustering Algorithms
AU - Yin, Xin
AU - Neamtiu, Iulian
AU - Patil, Saketan
AU - Andrews, Sean T.
N1 - Funding Information:
We thank the anonymous reviewers for their valuable feedback. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/10
Y1 - 2020/10
AB - A deterministic clustering algorithm is designed to always produce the same clustering solution on a given input. Therefore, users of clustering implementations (toolkits) naturally assume that implementations of a deterministic clustering algorithm A behave deterministically, that is: (1) two different implementations I1 and I2 of A are interchangeable, producing the same clustering on a given input D, and (2) an implementation produces the same clustering solution when run repeatedly on D. We challenge these assumptions. Specifically, we analyze clustering behavior on 528 datasets, three deterministic algorithms (Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering), and the deterministic portion of a fourth (K-means), as implemented in various toolkits; in total, we examined 13 algorithm-toolkit combinations. We found that different implementations of deterministic clustering algorithms make different choices, e.g., in default parameter settings, noise insertion, and handling of input dataset characteristics. As a result, clustering solutions for a fixed algorithm-dataset combination can differ across runs (nondeterminism) and across toolkits (inconsistency). We expose several root causes of such behavior and show that remedying these root causes improves determinism, increases consistency, and can even improve efficiency. Our approach and findings can benefit developers, testers, and users of clustering algorithms.
KW - ML reliability
KW - ML testing
KW - cluster analysis
UR - http://www.scopus.com/inward/record.url?scp=85091581734&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091581734&partnerID=8YFLogxK
U2 - 10.1109/ICST46399.2020.00032
DO - 10.1109/ICST46399.2020.00032
M3 - Conference contribution
AN - SCOPUS:85091581734
T3 - Proceedings - 2020 IEEE 13th International Conference on Software Testing, Verification and Validation, ICST 2020
SP - 231
EP - 242
BT - Proceedings - 2020 IEEE 13th International Conference on Software Testing, Verification and Validation, ICST 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IEEE International Conference on Software Testing, Verification and Validation, ICST 2020
Y2 - 23 March 2020 through 27 March 2020
ER -