ClusterTree: Integration of cluster representation and nearest-neighbor search for large data sets with high dimensions

Dantong Yu, Aidong Zhang

Research output: Contribution to journalArticlepeer-review

34 Scopus citations

Abstract

In this paper, we introduce the ClusterTree, a new indexing approach to representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of the subclusters. The subclusters can be further decomposed, which isolates the most related groups within the clusters. A ClusterTree is a hierarchy of clusters and subclusters which incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that most existing indexing techniques degrade rapidly as the dimensions increase. The ClusterTree provides a practical solution to index clustered data sets and supports the retrieval of the nearest-neighbors effectively without having to linearly scan the high-dimensional data set. We also discuss an approach to dynamically reconstruct the ClusterTree when new data is added. We present the detailed analysis of this approach and justify it extensively with experiments.

Original languageEnglish (US)
Pages (from-to)1316-1337
Number of pages22
JournalIEEE Transactions on Knowledge and Data Engineering
Volume15
Issue number5
DOIs
StatePublished - Sep 1 2003
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Keywords

  • Cluster representation
  • High-dimensional data sets
  • Indexing
  • Nearest-neighbor search

Fingerprint Dive into the research topics of 'ClusterTree: Integration of cluster representation and nearest-neighbor search for large data sets with high dimensions'. Together they form a unique fingerprint.

Cite this