Clustering is a key machine learning technique, used in many high-stakes domains from medicine to self-driving cars. Many clustering algorithms have been proposed, and these algorithms have been implemented in many toolkits. Clustering users assume that clustering implementations are correct, reliable, and, for a given algorithm, interchangeable. We challenge these assumptions. We introduce SmokeOut, an approach and tool that pits clustering implementations against each other (and against themselves) while controlling for algorithm and dataset, in order to find datasets where clustering outcomes differ when they should not, and to measure this difference. We ran SmokeOut on 7 clustering algorithms (3 deterministic, 4 nondeterministic) implemented in 7 widely used toolkits, in a variety of scenarios, on the Penn Machine Learning Benchmark (162 datasets). SmokeOut reveals that clustering implementations are fragile: on a given input dataset and for a given clustering algorithm, clustering outcomes and accuracy vary widely across (1) successive runs of the same toolkit; (2) different input parameters for the same toolkit; and (3) different toolkits.
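The core idea of SmokeOut, differential testing of clustering runs while holding the algorithm and dataset fixed, can be illustrated with a minimal sketch. The code below is an assumption-laden toy, not the actual SmokeOut tool: it uses a small pure-Python Lloyd's k-means (standing in for a real toolkit implementation) and runs it twice with different random seeds on the same dataset, then measures the agreement of the two labelings with the Rand index (fraction of point pairs on which the labelings agree). A Rand index below 1.0 flags exactly the kind of run-to-run divergence the abstract describes.

```python
# Toy sketch of differential clustering testing (hypothetical; the real
# SmokeOut compares whole toolkits, not this stand-in k-means).
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iters=50):
    # Lloyd's algorithm with seeded random initialization
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return labels

def rand_index(a, b):
    # fraction of point pairs on which two labelings agree
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

# Same dataset, same algorithm, two seeds: any disagreement is fragility.
points = [(x, y) for x in (0, 1, 10, 11) for y in (0, 1)]
labels_a = kmeans(points, k=2, seed=0)
labels_b = kmeans(points, k=2, seed=1)
print("pairwise agreement:", rand_index(labels_a, labels_b))
```

Scaling this comparison up, across seeds, parameter settings, and toolkits, and over many benchmark datasets, yields the kind of variability measurements reported in the paper.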