TY - GEN
T1 - Using RAPIDS AI to Accelerate Graph Data Science Workflows
AU - Hricik, Todd
AU - Bader, David
AU - Green, Oded
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/9/22
Y1 - 2020/9/22
N2 - Scale-free networks are abundant in many natural, social, and engineering phenomena for which there exists a substantial corpus of theory able to elucidate many of their underlying properties. In this paper, we study the scalability of some widely available Python-based tools for the empirical investigation of scale-free network data in a typical early-stage analysis pipeline. We demonstrate how porting serial implementations of commonly used pipeline data structures and methods to parallel hardware via the NVIDIA RAPIDS AI API requires minimal rewriting of code. As a measure of utility for each pipeline, we recorded the time required to complete the analysis for both the serial and parallelized workflows on a task-wise basis. Furthermore, we review a statistically based methodology for fitting a power law to empirical data. Maximum likelihood estimates of the scale parameter were inferred after using Kolmogorov-Smirnov-based methods to determine location estimates. Our serial implementation of a typical early-stage network analysis workflow uses a combination of widely used data structures and algorithms provided by the NumPy, Pandas and NetworkX frameworks. We then parallelized our workflow using the APIs provided by NVIDIA's RAPIDS AI open data science libraries and measured the relative time to completion for the tasks of ingesting raw data, creating a graph representation of the data and finally fitting a power-law distribution to the empirical observations. The results of our experiments, run on graphs ranging in size from 1 million to 20 million edges, demonstrate that the parallelized workflow requires significantly less time to complete the tasks of generating a graph from an edge list, computing the degree of all nodes in the graph and fitting the scale and location parameters to the observed data.
KW - GPU computing
KW - data science
KW - graph analytics
UR - http://www.scopus.com/inward/record.url?scp=85099376269&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099376269&partnerID=8YFLogxK
U2 - 10.1109/HPEC43674.2020.9286224
DO - 10.1109/HPEC43674.2020.9286224
M3 - Conference contribution
AN - SCOPUS:85099376269
T3 - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
BT - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE High Performance Extreme Computing Conference, HPEC 2020
Y2 - 21 September 2020 through 25 September 2020
ER -