Abstract
Many modern applications and systems represent and exchange data in tree-structured form and process and produce large tree datasets. Discovering informative patterns in large tree datasets is an important research area that has many practical applications. Along the years, research has evolved from mining induced patterns to mining embedded patterns. Embedded patterns allow for discovering useful relationships hidden deeply in the datasets which cannot be captured by induced patterns. Unfortunately, previous embedded tree pattern mining approaches cannot scale satisfactorily when the size of the dataset increases. As a consequence, they focus almost exclusively on mining patterns from a collection of small trees and they are incapable of mining patterns from large data trees. However, given the ubiquitous use of tree data, this pattern mining problem needs efficient solutions. In this paper, we address the problem of mining frequent unordered embedded tree patterns from large data trees. We propose a novel approach that exploits efficient homomorphic pattern matching algorithms to compute pattern support incrementally and avoids the costly enumeration of all pattern matchings required by previous approaches. To reduce space consumption, matching information of already computed patterns is materialized as bitmaps. We further optimize our basic support computation method by designing an algorithm which incrementally generates the bitmaps of the embeddings of a new candidate pattern without first explicitly computing the embeddings of this pattern. Our extensive experimental results on real and synthetic large-tree datasets show that our approach displays orders of magnitude performance improvements over a state-of-the-art tree mining algorithm and a recent graph mining algorithm.
Original language | English (US) |
---|---|
Pages (from-to) | 37-53 |
Number of pages | 17 |
Journal | Big Data Research |
Volume | 14 |
DOIs | |
State | Published - Dec 2018 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Management Information Systems
- Information Systems
- Computer Science Applications
- Information Systems and Management
Keywords
- Bitmap view
- Embedded tree pattern mining
- Holistic twig-join algorithm
- Homomorphism