On a parallel Spark workflow for frequent itemset mining based on array prefix-tree

Xinzheng Niu, Mideng Qian, Chase Wu, Aiqin Hou

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Scopus citation

Abstract

Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among many existing algorithms, FP-Growth is considered a milestone achievement that discovers frequent itemsets without generating candidates. However, due to the high complexity of its mining process and the high cost of its memory usage, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure and, based on it, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need to recursively construct conditional FP-Trees as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APT-Growth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.
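
For readers unfamiliar with the problem the abstract addresses, the following is a minimal sketch of what "frequent itemset mining" computes: every set of items that occurs in at least a given number of transactions. It is a brute-force enumeration written for illustration only. It is not FP-Growth, APT-Growth, or PAPT-Growth, and it does not reflect the authors' implementation. The function name `frequent_itemsets`, the parameter `min_support`, and the sample transactions are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Brute force: enumerates every subset of every transaction, so the cost is
    exponential in transaction length. Illustrative only; tree-based miners
    such as FP-Growth exist precisely to avoid this enumeration.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # drop duplicate items in a transaction
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy dataset: five transactions over the items a, b, c.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "c"],
]
result = frequent_itemsets(transactions, min_support=3)
```

With a support threshold of 3, each single item and each pair qualifies, while the triple {a, b, c} appears in only two transactions and is excluded.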

Original language: English (US)
Title of host publication: Proceedings of WORKS 2019
Subtitle of host publication: 14th Workshop on Workflows in Support of Large-Scale Science - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 50-59
Number of pages: 10
ISBN (Electronic): 9781728159973
State: Published - Nov 2019
Event: 14th IEEE/ACM Workshop on Workflows in Support of Large-Scale Science, WORKS 2019 - Denver, United States
Duration: Nov 17 2019 → …

Publication series

Name: Proceedings of WORKS 2019: 14th Workshop on Workflows in Support of Large-Scale Science - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 14th IEEE/ACM Workshop on Workflows in Support of Large-Scale Science, WORKS 2019
Country/Territory: United States
City: Denver
Period: 11/17/19 → …

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management

Keywords

  • Array prefix-tree
  • Frequent itemsets mining
  • Parallel algorithm
  • Spark workflow
