TY - GEN
T1 - GPU merge path - A GPU merging algorithm
AU - Green, Oded
AU - McColl, Robert
AU - Bader, David A.
PY - 2012
Y1 - 2012
N2 - Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when computation is performed using certain types of memory access patterns. Merging two sorted arrays is a useful primitive and is a basic building block for numerous applications such as joining database queries, merging adjacency lists in graphs, and set intersection. An efficient parallel merging algorithm partitions the sorted input arrays into sets of non-overlapping sub-arrays that can be independently merged on multiple cores. For optimal performance, the partitioning should be done in parallel and should divide the input arrays such that each core receives an equal size of data to merge. In this paper, we present an algorithm that partitions the workload equally amongst the GPU Streaming Multiprocessors (SM). Following this, we show how each SM performs a parallel merge and how to divide the work so that all the GPU's Streaming Processors (SP) are utilized. All stages in this algorithm are parallel. The new algorithm demonstrates good utilization of the GPU memory hierarchy. This approach demonstrates an average of 20X and 50X speedup over a sequential merge on the x86 platform for integer and floating point, respectively. Our implementation is 10X faster than the fast parallel merge supplied in the CUDA Thrust library.
AB - Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when computation is performed using certain types of memory access patterns. Merging two sorted arrays is a useful primitive and is a basic building block for numerous applications such as joining database queries, merging adjacency lists in graphs, and set intersection. An efficient parallel merging algorithm partitions the sorted input arrays into sets of non-overlapping sub-arrays that can be independently merged on multiple cores. For optimal performance, the partitioning should be done in parallel and should divide the input arrays such that each core receives an equal size of data to merge. In this paper, we present an algorithm that partitions the workload equally amongst the GPU Streaming Multiprocessors (SM). Following this, we show how each SM performs a parallel merge and how to divide the work so that all the GPU's Streaming Processors (SP) are utilized. All stages in this algorithm are parallel. The new algorithm demonstrates good utilization of the GPU memory hierarchy. This approach demonstrates an average of 20X and 50X speedup over a sequential merge on the x86 platform for integer and floating point, respectively. Our implementation is 10X faster than the fast parallel merge supplied in the CUDA Thrust library.
KW - Graphics processors
KW - Measurement of multiple-processor systems
KW - Parallel algorithms
KW - Parallel systems
UR - http://www.scopus.com/inward/record.url?scp=84864053701&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864053701&partnerID=8YFLogxK
U2 - 10.1145/2304576.2304621
DO - 10.1145/2304576.2304621
M3 - Conference contribution
AN - SCOPUS:84864053701
SN - 9781450313162
T3 - Proceedings of the International Conference on Supercomputing
SP - 331
EP - 340
BT - ICS'12 - Proceedings of the 2012 ACM International Conference on Supercomputing
T2 - 26th ACM International Conference on Supercomputing, ICS'12
Y2 - 25 June 2012 through 29 June 2012
ER -