TY - GEN
T1 - NoStop
T2 - 50th International Conference on Parallel Processing, ICPP 2021
AU - Ye, Qianwen
AU - Liu, Wuji
AU - Wu, Chase Q.
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/8/9
Y1 - 2021/8/9
N2 - An increasing number of big data applications in various domains generate datasets continuously, which must be processed for various purposes in a timely manner. As one of the most popular streaming data processing systems, Spark Streaming applies a batch-based mechanism, which receives real-time input data streams and divides the data into multiple batches before passing them to Spark processing engine. As such, inappropriate system configurations including batch interval and executor count may lead to unstable states, hence undermining the capability and efficiency of real-time computing. Hence, determining suitable configurations is crucial to the performance of such systems. Many machine learning- and search-based algorithms have been proposed to provide configuration recommendations for streaming applications where input data streams are fed at a constant speed, which, however, is extremely rare in practice. Most real-life streaming applications process data streams arriving at a time-varying rate and hence require real-time system monitoring and continuous configuration adjustment, which still remains largely unexplored. We propose a novel streaming optimization scheme based on Simultaneous Perturbation Stochastic Approximation (SPSA), referred to as NoStop, which dynamically tunes system configurations to optimize real-time system performance with negligible overhead and proved convergence. The performance superiority of NoStop is illustrated by real-life experiments in comparison with Bayesian Optimization and Spark Back Pressure solutions. Extensive experimental results show that NoStop is able to keep track of the changing pattern of input data in real time and provide optimal configuration settings to achieve the best system performance. This optimization scheme could also be applied to other streaming data processing engines with tunable parameters.
AB - An increasing number of big data applications in various domains generate datasets continuously, which must be processed for various purposes in a timely manner. As one of the most popular streaming data processing systems, Spark Streaming applies a batch-based mechanism, which receives real-time input data streams and divides the data into multiple batches before passing them to Spark processing engine. As such, inappropriate system configurations including batch interval and executor count may lead to unstable states, hence undermining the capability and efficiency of real-time computing. Hence, determining suitable configurations is crucial to the performance of such systems. Many machine learning- and search-based algorithms have been proposed to provide configuration recommendations for streaming applications where input data streams are fed at a constant speed, which, however, is extremely rare in practice. Most real-life streaming applications process data streams arriving at a time-varying rate and hence require real-time system monitoring and continuous configuration adjustment, which still remains largely unexplored. We propose a novel streaming optimization scheme based on Simultaneous Perturbation Stochastic Approximation (SPSA), referred to as NoStop, which dynamically tunes system configurations to optimize real-time system performance with negligible overhead and proved convergence. The performance superiority of NoStop is illustrated by real-life experiments in comparison with Bayesian Optimization and Spark Back Pressure solutions. Extensive experimental results show that NoStop is able to keep track of the changing pattern of input data in real time and provide optimal configuration settings to achieve the best system performance. This optimization scheme could also be applied to other streaming data processing engines with tunable parameters.
KW - Big Data Systems.
KW - Performance Optimization
KW - Spark Streaming
KW - Stochastic Approximation
UR - http://www.scopus.com/inward/record.url?scp=85117161137&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117161137&partnerID=8YFLogxK
U2 - 10.1145/3472456.3472515
DO - 10.1145/3472456.3472515
M3 - Conference contribution
AN - SCOPUS:85117161137
T3 - ACM International Conference Proceeding Series
BT - 50th International Conference on Parallel Processing, ICPP 2021 - Main Conference Proceedings
PB - Association for Computing Machinery
Y2 - 9 August 2021 through 12 August 2021
ER -