NoStop: A Novel Configuration Optimization Scheme for Spark Streaming

Qianwen Ye, Wuji Liu, Chase Q. Wu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

An increasing number of big data applications in various domains generate datasets continuously, and these datasets must be processed in a timely manner. As one of the most popular streaming data processing systems, Spark Streaming applies a batch-based mechanism: it receives real-time input data streams and divides the data into batches before passing them to the Spark processing engine. As a result, inappropriate system configurations, such as a poorly chosen batch interval or executor count, may lead to unstable states and undermine the capability and efficiency of real-time computing. Determining suitable configurations is therefore crucial to the performance of such systems. Many machine learning- and search-based algorithms have been proposed to provide configuration recommendations for streaming applications whose input data streams are fed at a constant speed, which is extremely rare in practice. Most real-life streaming applications process data streams arriving at a time-varying rate and hence require real-time system monitoring and continuous configuration adjustment, a problem that remains largely unexplored. We propose a novel streaming optimization scheme based on Simultaneous Perturbation Stochastic Approximation (SPSA), referred to as NoStop, which dynamically tunes system configurations to optimize real-time system performance with negligible overhead and proven convergence. The performance superiority of NoStop is illustrated by real-life experiments in comparison with Bayesian Optimization and Spark Back Pressure solutions. Extensive experimental results show that NoStop is able to keep track of the changing pattern of input data in real time and provide optimal configuration settings to achieve the best system performance. This optimization scheme could also be applied to other streaming data processing engines with tunable parameters.
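
For readers unfamiliar with SPSA, the following is a minimal sketch of the generic technique named in the abstract, not the NoStop algorithm itself. It shows why SPSA suits this kind of tuning: every parameter is perturbed at once, so each iteration needs only two measurements of the cost no matter how many parameters are tuned. The function names, the gain constants, and the quadratic `toy_loss` are all assumptions made for illustration; in a real deployment the cost would be measured from the running system rather than computed from a formula.

```python
import random


def spsa_minimize(loss, theta, iterations=300, a=0.5, c=0.1, A=10.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """Generic SPSA minimizer (illustrative; gain constants are assumptions)."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha  # decaying step size
        c_k = c / (k + 1) ** gamma      # decaying perturbation size
        # Perturb all parameters simultaneously with random +1/-1 signs.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Two loss evaluations per iteration, independent of dimension.
        diff = loss(plus) - loss(minus)
        # Gradient estimate: the same difference divided by each component.
        grad = [diff / (2.0 * c_k * d) for d in delta]
        theta = [t - a_k * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(x):
    # Hypothetical stand-in for a measured cost; its minimum is at (2.0, 4.0).
    # The two coordinates could be read as a scaled batch interval and a
    # scaled executor count, but that mapping is purely illustrative.
    return (x[0] - 2.0) ** 2 + (x[1] - 4.0) ** 2


result = spsa_minimize(toy_loss, [0.0, 0.0])
print(result)
```

On this noise-free toy problem the estimate settles close to the minimizer. With a time-varying input rate, as the abstract describes, the cost surface itself moves, which is why a scheme of this kind has to keep iterating rather than stop at a fixed answer.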

Original language: English (US)
Title of host publication: 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450390682
State: Published - Aug 9 2021
Event: 50th International Conference on Parallel Processing, ICPP 2021 - Virtual, Online, United States
Duration: Aug 9 2021 to Aug 12 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 50th International Conference on Parallel Processing, ICPP 2021
Country/Territory: United States
City: Virtual, Online
Period: 8/9/21 to 8/12/21

All Science Journal Classification (ASJC) codes

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Keywords

  • Big Data Systems
  • Performance Optimization
  • Spark Streaming
  • Stochastic Approximation
