TY - GEN
T1 - Performance impact of memory channels on sparse and irregular algorithms
AU - Green, Oded
AU - Fox, James
AU - Young, Jeff
AU - Shirako, Jun
AU - Bader, David
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we show that this is not necessarily the case. We demonstrate that the key factor in the utilization of the memory system for graph algorithms is not the raw bandwidth, or even latency of memory requests, but instead is the number of memory channels available to handle small data transfers with low locality. Using several widely used graph frameworks, including Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), we characterize two very distinct memory hierarchies with respect to key graph analytics kernels. Our results show that the differences in peak bandwidths of several of the latest Pascal-generation GPU memory subsystems aren't reflected in the performance of various analytics. Furthermore, our experiments on CPU and Xeon Phi systems show that the number of memory channels utilized can be a decisive factor in performance across several different applications. For CPU systems with smaller thread counts, the memory channels can be underutilized while systems with high thread counts can oversaturate the memory subsystem, which leads to limited performance. Lastly, we model the performance of including more channels with narrower access widths than those found in existing memory subsystems, and we analyze the trade-offs in terms of the two most prominent types of memory accesses found in graph algorithms, streaming and random accesses.
AB - Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we show that this is not necessarily the case. We demonstrate that the key factor in the utilization of the memory system for graph algorithms is not the raw bandwidth, or even latency of memory requests, but instead is the number of memory channels available to handle small data transfers with low locality. Using several widely used graph frameworks, including Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), we characterize two very distinct memory hierarchies with respect to key graph analytics kernels. Our results show that the differences in peak bandwidths of several of the latest Pascal-generation GPU memory subsystems aren't reflected in the performance of various analytics. Furthermore, our experiments on CPU and Xeon Phi systems show that the number of memory channels utilized can be a decisive factor in performance across several different applications. For CPU systems with smaller thread counts, the memory channels can be underutilized while systems with high thread counts can oversaturate the memory subsystem, which leads to limited performance. Lastly, we model the performance of including more channels with narrower access widths than those found in existing memory subsystems, and we analyze the trade-offs in terms of the two most prominent types of memory accesses found in graph algorithms, streaming and random accesses.
UR - http://www.scopus.com/inward/record.url?scp=85078200509&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078200509&partnerID=8YFLogxK
U2 - 10.1109/IA349570.2019.00016
DO - 10.1109/IA349570.2019.00016
M3 - Conference contribution
AN - SCOPUS:85078200509
T3 - 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms, IA3 2019
SP - 67
EP - 70
BT - 2019 IEEE/ACM 9th Workshop on Irregular Applications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms, IA3 2019
Y2 - 18 November 2019
ER -