On Machine Learning-based Stage-aware Performance Prediction of Spark Applications

Guangjun Ye, Wuji Liu, Chase Q. Wu, Wei Shen, Xukang Lyu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The data volume of large-scale applications in science, engineering, and business domains has grown explosively over the past decade and now far exceeds the computing capability and storage capacity of any single server. As a viable solution, such data is often stored in distributed file systems and processed by parallel computing engines such as Spark, which has gained popularity over the traditional MapReduce framework due to its fast in-memory processing of streaming data. Spark engines are generally deployed in cloud environments such as Amazon EC2 and Alibaba Cloud. However, storage and computing resources in these cloud environments are typically provisioned on a pay-as-you-go basis, so an accurate estimate of the execution time of Spark workloads is critical to fully utilizing cloud resources and meeting the performance requirements of end users. Our insight is that the execution patterns of many Spark workloads are qualitatively similar, which makes it possible to leverage historical performance data to predict the execution time of a given Spark application. We use the execution information extracted from Spark History Server as training data and develop a stage-aware hierarchical neural network model for performance prediction. Experimental results show that the proposed hierarchical model achieves higher accuracy than a holistic prediction model at the end-to-end level, and also outperforms other existing regression-based prediction methods.
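
The stage-aware idea in the abstract can be illustrated with a minimal sketch. This is not the authors' hierarchical neural network: it swaps in a closed-form linear fit per stage, assumes stages run sequentially so that their predicted durations can simply be summed, and uses invented stage names ("map", "reduce") and synthetic timing data. The function names fit_line, fit_stage_models, and predict_total are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All inputs identical: fall back to predicting the mean duration.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def fit_stage_models(history):
    """Fit one model per stage.

    history: list of past runs; each run is a list of
    (stage_id, input_gb, seconds) tuples (synthetic, illustrative schema).
    """
    by_stage = defaultdict(lambda: ([], []))
    for run in history:
        for stage_id, input_gb, seconds in run:
            xs, ys = by_stage[stage_id]
            xs.append(input_gb)
            ys.append(seconds)
    return {sid: fit_line(xs, ys) for sid, (xs, ys) in by_stage.items()}


def predict_total(models, stages):
    """Sum per-stage predictions (assumes stages execute one after another)."""
    return sum(models[sid][0] * gb + models[sid][1] for sid, gb in stages)


# Synthetic history: three past runs of a two-stage job at 1, 2, and 3 GB.
history = [
    [("map", 1.0, 12.0), ("reduce", 1.0, 5.0)],
    [("map", 2.0, 22.0), ("reduce", 2.0, 8.0)],
    [("map", 3.0, 32.0), ("reduce", 3.0, 11.0)],
]
models = fit_stage_models(history)
total = predict_total(models, [("map", 4.0), ("reduce", 4.0)])
print(f"predicted end-to-end time: {total:.1f} s")
```

A holistic model, by contrast, would fit a single predictor to the total run time; modeling each stage separately lets stages with different scaling behavior be captured individually before they are combined.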

Original language: English (US)
Title of host publication: 2020 IEEE 39th International Performance Computing and Communications Conference, IPCCC 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728198293
State: Published - Nov 6 2020
Event: 39th IEEE International Performance Computing and Communications Conference, IPCCC 2020 - Austin, United States
Duration: Nov 6 2020 to Nov 8 2020

Publication series

Name: 2020 IEEE 39th International Performance Computing and Communications Conference, IPCCC 2020

Conference

Conference: 39th IEEE International Performance Computing and Communications Conference, IPCCC 2020
Country: United States
City: Austin
Period: 11/6/20 to 11/8/20

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality

Keywords

  • Big data computing
  • Spark
  • in-memory processing
  • performance modeling
