TY - GEN
T1 - SpotMPI
T2 - 11th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2011
AU - Taifi, Moussa
AU - Shi, Justin Y.
AU - Khreishah, Abdallah
PY - 2011
Y1 - 2011
N2 - Economies of scale offer cloud computing virtually unlimited, cost-effective processing potential. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computation. Fairness is ensured by mutual agreement between sellers and buyers, and resource-use efficiency is automatically optimized in the process. While cloud providers have no lack of incentives to offer auction-based computing platforms, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame these challenges for MPI applications. Unlike existing MPI fault-tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then an HPC application toolkit named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing vs. communication complexities. Our results offer non-trivial insights into optimal bidding and application scaling strategies.
AB - Economies of scale offer cloud computing virtually unlimited, cost-effective processing potential. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computation. Fairness is ensured by mutual agreement between sellers and buyers, and resource-use efficiency is automatically optimized in the process. While cloud providers have no lack of incentives to offer auction-based computing platforms, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame these challenges for MPI applications. Unlike existing MPI fault-tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then an HPC application toolkit named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing vs. communication complexities. Our results offer non-trivial insights into optimal bidding and application scaling strategies.
UR - http://www.scopus.com/inward/record.url?scp=80455140325&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80455140325&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-24669-2_11
DO - 10.1007/978-3-642-24669-2_11
M3 - Conference contribution
AN - SCOPUS:80455140325
SN - 9783642246685
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 109
EP - 120
BT - Algorithms and Architectures for Parallel Processing - 11th International Conference, ICA3PP 2011, Proceedings
Y2 - 24 October 2011 through 26 October 2011
ER -