TY - GEN
T1 - Large Scale String Analytics in Arkouda
AU - Du, Zhihui
AU - Rodriguez, Oliver Alvarado
AU - Bader, David A.
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Large-scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented by strings, and suffix arrays are highly efficient data structures for string analysis. However, our personal devices and corresponding exploratory data analysis (EDA) tools cannot handle big data sets that exceed local memory. Arkouda is a framework under early development that brings together the productivity of Python on the user side with the high performance of Chapel on the server side. In this paper, an efficient suffix array data structure design and integration method are given first. A method for integrating a library of suffix array algorithms, rather than a single suffix array algorithm, is presented to enable runtime performance optimization in Arkouda, since different suffix array algorithms may perform very differently in practice on strings from various applications. A parallel suffix array construction algorithm framework is given to further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental results show that the proposed solution can provide data scientists with an easy and efficient method to build suffix arrays with high performance in Python. All our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda/tree/string-suffix-Array-functionality).
AB - Large-scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented by strings, and suffix arrays are highly efficient data structures for string analysis. However, our personal devices and corresponding exploratory data analysis (EDA) tools cannot handle big data sets that exceed local memory. Arkouda is a framework under early development that brings together the productivity of Python on the user side with the high performance of Chapel on the server side. In this paper, an efficient suffix array data structure design and integration method are given first. A method for integrating a library of suffix array algorithms, rather than a single suffix array algorithm, is presented to enable runtime performance optimization in Arkouda, since different suffix array algorithms may perform very differently in practice on strings from various applications. A parallel suffix array construction algorithm framework is given to further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the provided suffix array integration method and measure the end-to-end performance. Experimental results show that the proposed solution can provide data scientists with an easy and efficient method to build suffix arrays with high performance in Python. All our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda/tree/string-suffix-Array-functionality).
KW - Arkouda
KW - exploratory data analysis
KW - large scale string sets
KW - suffix array construction algorithm
UR - http://www.scopus.com/inward/record.url?scp=85123503545&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123503545&partnerID=8YFLogxK
U2 - 10.1109/HPEC49654.2021.9622810
DO - 10.1109/HPEC49654.2021.9622810
M3 - Conference contribution
AN - SCOPUS:85123503545
T3 - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
BT - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Y2 - 20 September 2021 through 24 September 2021
ER -