Regularized k-means clustering of high-dimensional data and its asymptotic consistency

Wei Sun, Junhui Wang, Yixin Fang

Research output: Contribution to journal › Article › peer-review

78 Scopus citations

Abstract

K-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized k-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the k-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized k-means clustering with diverging dimension is established. The effectiveness of the regularized k-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering.
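As a rough illustration of the key idea, the sketch below evaluates a k-means loss plus a weighted group lasso penalty on the cluster centers, where each group collects one variable's coordinates across all centers. The abstract does not state the objective explicitly, so the 1/n scaling, the `weights` argument, and the function names `regularized_kmeans_objective` and `selected_variables` are assumptions made for illustration, not the authors' exact formulation. The data are assumed to be centered, so a variable whose center coordinates are all zero carries no clustering information.

```python
import numpy as np


def regularized_kmeans_objective(X, centers, labels, lam, weights=None):
    """Within-cluster squared error plus a weighted group lasso penalty.

    X       : (n, p) data matrix, assumed centered column-wise
    centers : (K, p) cluster centers
    labels  : (n,) integer cluster assignment for each row of X
    lam     : non-negative tuning parameter
    weights : (p,) adaptive weights, one per variable (assumed; defaults to 1)
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    if weights is None:
        weights = np.ones(p)

    # Model-fitting term: average squared distance to the assigned center.
    fit = np.sum((X - centers[labels]) ** 2) / n

    # Group lasso term: one Euclidean norm per variable, taken over the
    # K center coordinates of that variable (the columns of `centers`).
    group_norms = np.linalg.norm(centers, axis=0)
    penalty = lam * np.sum(weights * group_norms)
    return fit + penalty


def selected_variables(centers, tol=1e-12):
    """Indices of variables whose center coordinates are not all zero."""
    group_norms = np.linalg.norm(np.asarray(centers, dtype=float), axis=0)
    return np.flatnonzero(group_norms > tol)
```

Because the penalty acts on whole columns of the center matrix, driving a column's norm to zero removes that variable from every cluster at once, which is how clustering and variable elimination can happen simultaneously.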

Original language: English (US)
Pages (from-to): 148-167
Number of pages: 20
Journal: Electronic Journal of Statistics
Volume: 6
State: Published - 2012
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Keywords

  • Diverging dimension
  • K-means
  • Lasso
  • Selection consistency
  • Stability
  • Variable selection
