TY - GEN
T1 - REMOLD
T2 - 23rd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2017
AU - Liang, Mingfei
AU - Li, Qingyong
AU - Geng, Yangli Ao
AU - Wang, Jianzhu
AU - Wei, Zhi
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/2
Y1 - 2017/7/2
N2 - Density-based clustering algorithms have the distinctive advantage of discovering arbitrarily shaped clusters, but they usually require a procedure to compute the distance between every pair of data points, and this procedure is prohibitive for large datasets since it has quadratic computation complexity. In this paper, we propose a new distributed clustering algorithm, named REstore MOdel with Local Density estimation (REMOLD). Firstly, REMODL applies a balanced partitioning method to evenly divide an large dataset based on Local Sensitive Hashing (LSH). Then, it locally clusters each partition of the dataset, and uses a Gaussian model to represent each local cluster based on the observation that the density distribution of each local cluster shares similar shape with Gaussian distribution. Finally, these models are aggregated on a server where REMOLD restores global clusters based on these local Gaussian models. More specifically, model connection, which measures the density connectivity between two models, are defined to merge local models with an optimized procedure. In this aggregation, REMOLD requires low cost of network transmission for local Gaussian models, since the number of Gaussian models is often less than that of core objects for each partition. We evaluate REMOLD on three synthetic datasets and three real-world datasets on Spark, and the experiment results demonstrate that REMOLD is efficient and effective to find out clusters with complex shapes and it outperforms the established methods.
AB - Density-based clustering algorithms have the distinctive advantage of discovering arbitrarily shaped clusters, but they usually require a procedure to compute the distance between every pair of data points, and this procedure is prohibitive for large datasets since it has quadratic computation complexity. In this paper, we propose a new distributed clustering algorithm, named REstore MOdel with Local Density estimation (REMOLD). Firstly, REMODL applies a balanced partitioning method to evenly divide an large dataset based on Local Sensitive Hashing (LSH). Then, it locally clusters each partition of the dataset, and uses a Gaussian model to represent each local cluster based on the observation that the density distribution of each local cluster shares similar shape with Gaussian distribution. Finally, these models are aggregated on a server where REMOLD restores global clusters based on these local Gaussian models. More specifically, model connection, which measures the density connectivity between two models, are defined to merge local models with an optimized procedure. In this aggregation, REMOLD requires low cost of network transmission for local Gaussian models, since the number of Gaussian models is often less than that of core objects for each partition. We evaluate REMOLD on three synthetic datasets and three real-world datasets on Spark, and the experiment results demonstrate that REMOLD is efficient and effective to find out clusters with complex shapes and it outperforms the established methods.
KW - Density estimation
KW - Density-based clustering
KW - Distributed clustering
KW - Gaussian model
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85048373524&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048373524&partnerID=8YFLogxK
U2 - 10.1109/ICPADS.2017.00057
DO - 10.1109/ICPADS.2017.00057
M3 - Conference contribution
AN - SCOPUS:85048373524
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 376
EP - 383
BT - Proceedings - 2017 IEEE 23rd International Conference on Parallel and Distributed Systems, ICPADS 2017
PB - IEEE Computer Society
Y2 - 15 December 2017 through 17 December 2017
ER -