A Practical Algorithm Design and Evaluation for Heterogeneous Elastic Computing with Stragglers

Nicholas Woolsey, Jorg Kliewer, Rong Rong Chen, Mingyue Ji

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Our extensive measurements on Amazon EC2 show that virtual instances often have different computing speeds even when they share the same configuration. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems, in which machines with different computing speeds join and leave the network arbitrarily across computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage so that the file placement does not have to be redefined with each elastic event. Computation assignment algorithms are then used to minimize the computation time given the computation speeds of the machines. Previous studies of heterogeneous CSEC do not account for stragglers, that is, machines that are slow during the computation; we develop a new framework for heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm, built on our previously proposed approach for heterogeneous CSEC, such that the system can handle any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off between computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluation using real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for two applications, power iteration and linear regression. Evaluation results show that the proposed heterogeneous CSEC algorithms outperform state-of-the-art designs by more than 30%.
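To illustrate why MDS-coded storage lets the placement stay fixed across elastic events, the following is a minimal sketch and not the algorithm from the paper. It assumes a toy setting in which each data block is a single row, the code is a Vandermonde code over the rationals, and each machine computes one coded inner product. The names `encode`, `solve`, and `decode` and the evaluation points 1..N are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import combinations


def encode(blocks, n_machines):
    """Encode K data rows into N coded rows (one per machine).

    Machine n (n = 1..N) stores sum_j n**j * blocks[j], i.e. a row of a
    Vandermonde generator matrix applied to the data rows.
    """
    k = len(blocks)
    width = len(blocks[0])
    coded = []
    for n in range(1, n_machines + 1):
        coeffs = [Fraction(n) ** j for j in range(k)]
        coded.append([sum(c * blk[i] for c, blk in zip(coeffs, blocks))
                      for i in range(width)])
    return coded


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact)."""
    k = len(a)
    m = [[Fraction(v) for v in a[i]] + [Fraction(b[i])] for i in range(k)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(k):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]


def decode(results, k):
    """Recover X @ w from the coded results of any k machines.

    `results` maps a machine's evaluation point to its coded inner product.
    """
    points = sorted(results)[:k]
    a = [[Fraction(p) ** j for j in range(k)] for p in points]
    return solve(a, [results[p] for p in points])


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Toy data: K = 3 rows stored across N = 5 machines.
X = [[1, 2], [3, 4], [5, 7]]
w = [2, -1]
coded = encode(X, 5)
expected = [dot(row, w) for row in X]

# Whichever 3 of the 5 machines are present, the product is recoverable.
for available in combinations(range(1, 6), 3):
    results = {p: dot(coded[p - 1], w) for p in available}
    assert decode(results, 3) == expected
```

Because every set of three machines recovers the same product, machines can leave or rejoin without the coded rows being rewritten. A real CSEC system would additionally split the work unevenly according to machine speed and assign extra redundancy for stragglers, which this sketch does not model.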

Original language: English (US)
Title of host publication: 2021 IEEE Global Communications Conference, GLOBECOM 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728181042
State: Published - 2021
Event: 2021 IEEE Global Communications Conference, GLOBECOM 2021 - Madrid, Spain
Duration: Dec 7 2021 → Dec 11 2021

Publication series

Name: 2021 IEEE Global Communications Conference, GLOBECOM 2021 - Proceedings

Conference

Conference: 2021 IEEE Global Communications Conference, GLOBECOM 2021
Country/Territory: Spain
City: Madrid
Period: 12/7/21 → 12/11/21

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Health Informatics
