The data volume of large-scale applications in various science, engineering, and business domains has grown explosively over the past decade, far exceeding the computing capability and storage capacity of any single server. As a viable solution, such data is often stored in distributed file systems and processed by parallel computing engines, as exemplified by Spark, which has gained popularity over the traditional MapReduce framework owing to its fast in-memory data processing. Spark engines are commonly deployed in cloud environments such as Amazon EC2 and Alibaba Cloud. However, storage and computing resources in these environments are typically provisioned on a pay-as-you-go basis, so an accurate estimate of the execution time of Spark workloads is critical to fully utilizing cloud resources and meeting the performance requirements of end users. Our insight is that many Spark workloads exhibit qualitatively similar execution patterns, which makes it possible to leverage historical performance data to predict the execution time of a given Spark application. We use execution information extracted from the Spark History Server as training data and develop a stage-aware hierarchical neural network model for performance prediction. Experimental results show that the proposed hierarchical model achieves higher end-to-end accuracy than a holistic prediction model, and also outperforms existing regression-based prediction methods.