TY - GEN
T1 - A partition-merge based cache-conscious parallel sorting algorithm for CMP with shared cache
AU - Hao, Song
AU - Du, Zhihui
AU - Bader, David A.
AU - Ye, Yin
PY - 2009
Y1 - 2009
N2 - To explore chip-level parallelism, the PSC (Parallel Shared Cache) model is provided in this paper to describe high performance shared cache of Chip Multi-Processors (CMP). Then for a specific application, parallel sorting, a cache-conscious parallel algorithm, PMCC (Partition-Merge based Cache-Conscious) is designed based on the PSC model. The PMCC algorithm consists of two steps: the partition-based in-cache sorting and merge-based k-way merge sorting. In the first stage, PMCC first divides the input dataset into multiple blocks so that each block can fit into the shared L2 cache, and then employs multiple cores to perform parallel cache sorting to generate sorted blocks. In the second stage, PMCC first selects an optimized parameter k which can not only improve the parallelism but also reduce the cache missing rate, then performs a k-way merge sorting to merge all the sorted blocks. The I/O complexity of the in-cache sorting step and k-way merge step are analyzed in detail. The simulation results show that the PSC based PMCC algorithm can out-performance the latest PEM based cache-conscious algorithm and the scalability of PMCC is also discussed. The low I/O complexity, high parallelism and the high scalability of PMCC can take advantage of CMP to improve its performance significantly and deal with large scale problem efficiently.
AB - To explore chip-level parallelism, the PSC (Parallel Shared Cache) model is provided in this paper to describe high performance shared cache of Chip Multi-Processors (CMP). Then for a specific application, parallel sorting, a cache-conscious parallel algorithm, PMCC (Partition-Merge based Cache-Conscious) is designed based on the PSC model. The PMCC algorithm consists of two steps: the partition-based in-cache sorting and merge-based k-way merge sorting. In the first stage, PMCC first divides the input dataset into multiple blocks so that each block can fit into the shared L2 cache, and then employs multiple cores to perform parallel cache sorting to generate sorted blocks. In the second stage, PMCC first selects an optimized parameter k which can not only improve the parallelism but also reduce the cache missing rate, then performs a k-way merge sorting to merge all the sorted blocks. The I/O complexity of the in-cache sorting step and k-way merge step are analyzed in detail. The simulation results show that the PSC based PMCC algorithm can out-performance the latest PEM based cache-conscious algorithm and the scalability of PMCC is also discussed. The low I/O complexity, high parallelism and the high scalability of PMCC can take advantage of CMP to improve its performance significantly and deal with large scale problem efficiently.
KW - Cache-conscious algorithm
KW - Chip multi-processors (CMP)
KW - Parallel sorting
UR - http://www.scopus.com/inward/record.url?scp=77951478845&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77951478845&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2009.26
DO - 10.1109/ICPP.2009.26
M3 - Conference contribution
AN - SCOPUS:77951478845
SN - 9780769538020
T3 - Proceedings of the International Conference on Parallel Processing
SP - 396
EP - 403
BT - ICPP-2009 - The 38th International Conference on Parallel Processing
T2 - 38th International Conference on Parallel Processing, ICPP-2009
Y2 - 22 September 2009 through 25 September 2009
ER -