Abstract
Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods including FP-Growth have been proposed. FP-Growth is a classical frequent itemset mining method, which generates pattern databases without candidates. Many improvements have been made in the literature due to the high time complexity and memory usage of FP-Growth. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix-Tree (AP-Tree), and propose a new algorithm, Array Prefix-Tree Growth (APT-Growth), which is further parallelized as a Spark workflow, referred to as PAPT-Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT-Growth to ensure its running time performance. We conduct extensive experiments on different thresholds and multiple datasets, and experimental results show the performance superiority of PAPT-Growth in comparison with several state-of-the-art methods such as PFP, YAFIM, and DFPS. The analysis on density reveals a changing point, which justifies the necessity and validity of adaptive algorithm selection.
Original language | English (US) |
---|---|
Article number | e6313 |
Journal | Concurrency Computation Practice and Experience |
Volume | 34 |
Issue number | 14 |
DOIs | |
State | Published - Jun 25 2022 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Software
- Theoretical Computer Science
- Computer Networks and Communications
- Computer Science Applications
- Computational Theory and Mathematics
Keywords
- Spark workflow
- adaptive algorithm
- array prefix-tree
- frequent itemsets mining
- parallel algorithm