TY - GEN
T1 - Logarithmic Radix Binning and Vectorized Triangle Counting
AU - Green, Oded
AU - Fox, James
AU - Watkins, Alex
AU - Tripathy, Alok
AU - Gabert, Kasimir
AU - Kim, Euna
AU - An, Xiaojing
AU - Aatish, Kumar
AU - Bader, David A.
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/11/26
Y1 - 2018/11/26
N2 - Triangle counting is a building block for numerous graph applications and given the fact that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting - some of which are compute-bound rather than memory bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in-part due to the high dependency between the control flow, input data, and limited utilization of vector instructions. Not surprising, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model we show to remove control flow restrictions by replacing branches with an equivalent set of arithmetic operations. More so, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2×-5× faster than scalar counterparts. We also present a new load balancing method, Logarithmic Radix Binning (LRB) that ensures that threads and the vector data lanes execute a near equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions such as the KOKKOS framework and a GPU based algorithm by anywhere from 1.5× and up to 14×.
AB - Triangle counting is a building block for numerous graph applications and given the fact that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting - some of which are compute-bound rather than memory bound. Even for compute-bound algorithms, one of the key challenges is the limited control flow available on the processor. This is in-part due to the high dependency between the control flow, input data, and limited utilization of vector instructions. Not surprising, compilers are not always able to detect these data dependencies and vectorize the algorithms. Using the branch-avoiding model we show to remove control flow restrictions by replacing branches with an equivalent set of arithmetic operations. More so, we show how these can be vectorized using Intel's AVX-512 instruction set and that our new vectorized algorithms are 2×-5× faster than scalar counterparts. We also present a new load balancing method, Logarithmic Radix Binning (LRB) that ensures that threads and the vector data lanes execute a near equal amount of work at any given time. Altogether, our algorithm outperforms several 2017 HPEC Graph Challenge Champions such as the KOKKOS framework and a GPU based algorithm by anywhere from 1.5× and up to 14×.
UR - http://www.scopus.com/inward/record.url?scp=85060103013&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060103013&partnerID=8YFLogxK
U2 - 10.1109/HPEC.2018.8547581
DO - 10.1109/HPEC.2018.8547581
M3 - Conference contribution
AN - SCOPUS:85060103013
T3 - 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018
BT - 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE High Performance Extreme Computing Conference, HPEC 2018
Y2 - 25 September 2018 through 27 September 2018
ER -