TY - GEN
T1 - ACE
T2 - 15th International Conference on Parallel Processing and Applied Mathematics, PPAM 2024
AU - Ahmed, Muyeed
AU - Neamtiu, Iulian
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Clustering is a key technique in a wide range of data analysis tasks. However, algorithms that ensure stable, deterministic, accurate clustering are computationally expensive, having superlinear complexity for both memory and time. Therefore, even when using HPC hardware, there are hard limits to dataset sizes that can be clustered, as clustering implementations can run out of memory or take unacceptably long. We introduce an approach called ACE that applies algorithm-independent, black-box parallelization to superlinear sequential clustering algorithms, thereby making the clustering of substantial datasets feasible, even on commodity desktop/laptop systems. ACE starts by partitioning data to fit onto a given machine, and via divide-and-conquer, reduce the complexity of clustering steps. Next, ACE uses parallel, automated hyperparameter search to find optimal parameters for the current dataset. Finally, ACE aggregates intermediate results effectively and efficiently so that the final clustering output does not sacrifice clustering quality compared to the original algorithm. An evaluation on four popular clustering algorithms – Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering, and Spectral Clustering – shows that ACE substantially reduces memory requirements and achieves linear processing time. ACE was able to process an entire suite of 164 datasets, including substantial datasets with 1.4M points or 1,000 dimensions, whereas the default implementations failed to process between 15 and 149 datasets from the suite. Moreover, for those datasets that could be processed by the default implementations, ACE achieved a 1.13x–102x time reduction.
AB - Clustering is a key technique in a wide range of data analysis tasks. However, algorithms that ensure stable, deterministic, accurate clustering are computationally expensive, having superlinear complexity for both memory and time. Therefore, even when using HPC hardware, there are hard limits to dataset sizes that can be clustered, as clustering implementations can run out of memory or take unacceptably long. We introduce an approach called ACE that applies algorithm-independent, black-box parallelization to superlinear sequential clustering algorithms, thereby making the clustering of substantial datasets feasible, even on commodity desktop/laptop systems. ACE starts by partitioning data to fit onto a given machine, and via divide-and-conquer, reduce the complexity of clustering steps. Next, ACE uses parallel, automated hyperparameter search to find optimal parameters for the current dataset. Finally, ACE aggregates intermediate results effectively and efficiently so that the final clustering output does not sacrifice clustering quality compared to the original algorithm. An evaluation on four popular clustering algorithms – Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering, and Spectral Clustering – shows that ACE substantially reduces memory requirements and achieves linear processing time. ACE was able to process an entire suite of 164 datasets, including substantial datasets with 1.4M points or 1,000 dimensions, whereas the default implementations failed to process between 15 and 149 datasets from the suite. Moreover, for those datasets that could be processed by the default implementations, ACE achieved a 1.13x–102x time reduction.
KW - Automatic Parallelization
KW - Clustering
KW - Machine Learning
UR - https://www.scopus.com/pages/publications/105002724820
UR - https://www.scopus.com/pages/publications/105002724820#tab=citedBy
U2 - 10.1007/978-3-031-85697-6_11
DO - 10.1007/978-3-031-85697-6_11
M3 - Conference contribution
AN - SCOPUS:105002724820
SN - 9783031856969
T3 - Lecture Notes in Computer Science
SP - 161
EP - 176
BT - Parallel Processing and Applied Mathematics - 15th International Conference, PPAM 2024, Revised Selected Papers
A2 - Wyrzykowski, Roman
A2 - Dongarra, Jack
A2 - Deelman, Ewa
A2 - Karczewski, Konrad
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 8 September 2024 through 11 September 2024
ER -