TY - GEN
T1 - On a parallel spark workflow for frequent itemset mining based on array prefix-tree
AU - Niu, Xinzheng
AU - Qian, Mideng
AU - Wu, Chase
AU - Hou, Aiqin
PY - 2019/11
Y1 - 2019/11
N2 - Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among many existing algorithms, FP-Growth is considered as a milestone achievement that discovers frequenti temsets without generating candidates. However, due to the high complexity of its mining process and the high cost of its memory usage, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure, and based on that, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need of recursively constructing conditional FP-Tree as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APTGrowth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.
AB - Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among many existing algorithms, FP-Growth is considered as a milestone achievement that discovers frequenti temsets without generating candidates. However, due to the high complexity of its mining process and the high cost of its memory usage, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure, and based on that, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need of recursively constructing conditional FP-Tree as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APTGrowth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.
KW - Array prefix-tree
KW - Frequent itemsets mining
KW - Parallel algorithm
KW - Spark workflow
UR - http://www.scopus.com/inward/record.url?scp=85078082963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078082963&partnerID=8YFLogxK
U2 - 10.1109/WORKS49585.2019.00011
DO - 10.1109/WORKS49585.2019.00011
M3 - Conference contribution
T3 - Proceedings of WORKS 2019: 14th Workshop on Workflows in Support of Large-Scale Science - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 50
EP - 59
BT - Proceedings of WORKS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE/ACM Workshop on Workflows in Support of Large-Scale Science, WORKS 2019
Y2 - 17 November 2019
ER -