On a parallel spark workflow for frequent itemset mining based on array prefix-tree

Xinzheng Niu, Peng Wu, Chase Q. Wu, Aiqin Hou, Mideng Qian

Research output: Contribution to journalArticlepeer-review

Abstract

Extracting frequent itemsets from datasets is an important problem in data mining, for which several mining methods including FP-Growth have been proposed. FP-Growth is a classical frequent itemset mining method, which generates pattern databases without candidates. Many improvements have been made in the literature due to the high time complexity and memory usage of FP-Growth. However, most of them still suffer from performance issues on large datasets. In this paper, we design an auxiliary structure, Array Prefix-Tree (AP-Tree), and propose a new algorithm, Array Prefix-Tree Growth (APT-Growth), which is further parallelized as a Spark workflow, referred to as PAPT-Growth. Based on a density threshold, we incorporate an adaptive algorithm selection process into PAPT-Growth to ensure its running time performance. We conduct extensive experiments on different thresholds and multiple datasets, and experimental results show the performance superiority of PAPT-Growth in comparison with several state-of-the-art methods such as PFP, YAFIM, and DFPS. The analysis on density reveals a changing point, which justifies the necessity and validity of adaptive algorithm selection.

Original languageEnglish (US)
JournalConcurrency Computation Practice and Experience
DOIs
StateAccepted/In press - 2021

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics

Keywords

  • Spark workflow
  • adaptive algorithm
  • array prefix-tree
  • frequent itemsets mining
  • parallel algorithm

Fingerprint

Dive into the research topics of 'On a parallel spark workflow for frequent itemset mining based on array prefix-tree'. Together they form a unique fingerprint.

Cite this