On Performance Modeling and Prediction for Spark-HBase Applications in Big Data Systems

Haifa Alquwaiee, Chase Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Many large-scale applications in various business and scientific domains require both parallel computing and distributed data management for big data processing. One typical scenario is the use of the Spark computing engine to process a large amount of data managed by HBase in Hadoop. Such computing workflows provide an opportunity to optimize application performance through strategic resource allocation with suitable parameter settings. Recommending optimal system configurations to end users therefore requires accurate modeling and prediction of application performance. However, this is a challenging problem for multiple reasons, mainly the large parameter space and the dynamic interactions between different technology layers of big data systems. In this paper, we propose a class of regression-based machine learning models to predict the execution performance of Spark-HBase applications in Hadoop. We first explore and identify an exhaustive set of system parameters across multiple layers, including Spark and HBase, and then conduct an in-depth exploratory analysis of their effects on the execution time of Spark-HBase applications. Based on the results of this analysis, we design a performance predictor using regression-based machine learning algorithms. Experimental results show that the resulting predictor achieves high accuracy across the different algorithms compared. The proposed approach can facilitate automatic system configuration and has the potential to be applied to other similar systems for big data processing.
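
The abstract does not name the parameters studied or the regression algorithms compared, so the following is only a minimal sketch of the general idea of regressing execution time on configuration parameters. Everything in it is an assumption made for illustration: the three features (loosely modeled on the real settings `spark.executor.cores`, `spark.executor.memory`, and `hbase.client.scanner.caching`), the helper names, the choice of ordinary least squares, and above all the training data, which is synthetic and generated from a made-up linear formula rather than taken from any measurement.

```python
# Illustrative sketch only: synthetic data and a hand-picked parameter set,
# not the paper's features, algorithms, or measurements.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    xs = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(weights, row):
    """Predicted execution time for one configuration."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

def synthetic_time(cores, memory_gb, scanner_caching_hundreds):
    """Hypothetical ground truth, invented purely to generate training data."""
    return 120.0 - 6.0 * cores - 2.0 * memory_gb - 0.5 * scanner_caching_hundreds

# Hypothetical configuration grid: (executor cores, executor memory in GB,
# scanner caching in hundreds of rows).
configs = [(c, m, s) for c in (2, 4, 8) for m in (4, 8) for s in (1, 5, 10)]
times = [synthetic_time(*cfg) for cfg in configs]

weights = fit_least_squares(configs, times)
predicted = predict(weights, (6, 6, 3))  # a configuration not in the grid
```

Because the synthetic data is exactly linear and noise-free, the fitted model reproduces the generating formula; measured execution times would be noisy and likely nonlinear, which is one reason to compare several regression algorithms rather than rely on a single linear fit.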

Original language: English (US)
Title of host publication: ICC 2022 - IEEE International Conference on Communications
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3685-3690
Number of pages: 6
ISBN (Electronic): 9781538683477
State: Published - 2022
Externally published: Yes
Event: 2022 IEEE International Conference on Communications, ICC 2022 - Seoul, Korea, Republic of
Duration: May 16 2022 → May 20 2022

Publication series

Name: IEEE International Conference on Communications
Volume: 2022-May
ISSN (Print): 1550-3607

Conference

Conference: 2022 IEEE International Conference on Communications, ICC 2022
Country/Territory: Korea, Republic of
City: Seoul
Period: 5/16/22 → 5/20/22

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Electrical and Electronic Engineering

Keywords

  • HBase
  • Spark
  • big data
  • machine learning
  • performance modeling and prediction
  • representation learning
