TY - GEN
T1 - Sustainable GPU computing at scale
AU - Shi, Justin Y.
AU - Taifi, Moussa
AU - Khreishah, Abdallah
AU - Wu, Jie
PY - 2011
Y1 - 2011
AB - General-purpose GPU (GPGPU) computing has produced the fastest-running supercomputers in the world. For continued sustainable progress, GPU computing at scale must also address two open issues: a) how to increase an application's mean time between failures (MTBF) as the supercomputer's component count grows, and b) how to minimize unnecessary energy consumption. Since energy consumption is determined by the number of components used, we consider an HPC application sustainable if it can deliver better performance and reliability at the same time as computing or communication components are added. This paper reports a two-tier semantic statistical multiplexing framework for sustainable HPC at scale. The idea is to leverage the power of statistical multiplexing to tame the nagging HPC scalability challenges. We include the theoretical model, a sustainability analysis, and computational experiments with automatic system-level containment of multiple CPU/GPU failures. Our results show that, assuming a threefold slowdown of the statistical multiplexing layer, for an application using 1024 processors with 35% checkpoint overhead, the two-tier framework produces sustained time and energy savings for MTBF of less than 6 hours. With 5% checkpoint overhead, a 1.5-hour MTBF would be the break-even point. These results suggest the practical feasibility of the proposed two-tier framework.
KW - Data parallel processing
KW - Fault tolerant GPU computing
KW - Semantic statistical multiplexing
KW - Tuple switching network
UR - http://www.scopus.com/inward/record.url?scp=81455155109&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=81455155109&partnerID=8YFLogxK
U2 - 10.1109/CSE.2011.55
DO - 10.1109/CSE.2011.55
M3 - Conference contribution
AN - SCOPUS:81455155109
SN - 9780769544779
T3 - Proc. - 14th IEEE Int. Conf. on Computational Science and Engineering, CSE 2011 and 11th Int. Symp. on Pervasive Systems, Algorithms, and Networks, I-SPAN 2011 and 10th IEEE Int. Conf. on IUCC 2011
SP - 263
EP - 272
BT - Proc. - 14th IEEE Int. Conf. on Computational Science and Engineering, CSE 2011 and 11th Int. Symp. on Pervasive Systems, Algorithms, and Networks, I-SPAN 2011 and 10th IEEE Int. Conf. IUCC 2011
T2 - 14th IEEE Int. Conf. on Computational Science and Engineering, CSE 2011, the 11th International Symposium on Pervasive Systems, Algorithms, and Networks, I-SPAN 2011, and the 10th IEEE Int. Conf. on Ubiquitous Computing and Communications, IUCC 2011
Y2 - 24 August 2011 through 26 August 2011
ER -