TY - GEN
T1 - Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework
AU - Ye, Qianwen
AU - Wu, Chase Q.
AU - Liu, Wuji
AU - Hou, Aiqin
AU - Shen, Wei
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Big data processing and analysis increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of big data workflows is now commonly supported on reliable and scalable data storage and computing platforms such as Hadoop. There are a variety of factors affecting workflow performance across multiple layers of big data systems, including the inherent properties (such as scale and topology) of the workflow, the parallel computing engine it runs on, the resource manager that orchestrates distributed resources, the file system that stores data, as well as the parameter setting of each layer. Optimizing workflow performance is challenging because the compound effects of the aforementioned layers are complex and opaque to end users. Generally, tuning their parameters requires an in-depth understanding of big data systems, and the default settings do not always yield optimal performance. We propose a profiling-based cross-layer coupled design framework to determine the best parameter setting for each layer in the entire technology stack to optimize workflow performance. To tackle the large parameter space, we reduce the number of experiments needed for profiling with two approaches: i) identify a subset of critical parameters with the most significant influence through feature selection; and ii) minimize the search process within the value range of each critical parameter using stochastic approximation. Experimental results show that the proposed optimization framework provides the most suitable parameter settings for a given workflow to achieve the best performance. This profiling-based method could be used by end users and service providers to configure and execute large-scale workflows in complex big data systems.
AB - Big data processing and analysis increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of big data workflows is now commonly supported on reliable and scalable data storage and computing platforms such as Hadoop. There are a variety of factors affecting workflow performance across multiple layers of big data systems, including the inherent properties (such as scale and topology) of the workflow, the parallel computing engine it runs on, the resource manager that orchestrates distributed resources, the file system that stores data, as well as the parameter setting of each layer. Optimizing workflow performance is challenging because the compound effects of the aforementioned layers are complex and opaque to end users. Generally, tuning their parameters requires an in-depth understanding of big data systems, and the default settings do not always yield optimal performance. We propose a profiling-based cross-layer coupled design framework to determine the best parameter setting for each layer in the entire technology stack to optimize workflow performance. To tackle the large parameter space, we reduce the number of experiments needed for profiling with two approaches: i) identify a subset of critical parameters with the most significant influence through feature selection; and ii) minimize the search process within the value range of each critical parameter using stochastic approximation. Experimental results show that the proposed optimization framework provides the most suitable parameter settings for a given workflow to achieve the best performance. This profiling-based method could be used by end users and service providers to configure and execute large-scale workflows in complex big data systems.
KW - Big data workflows
KW - coupled design
KW - performance optimization
KW - stochastic approximation
KW - workflow profiling
UR - http://www.scopus.com/inward/record.url?scp=85092686840&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092686840&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-60248-2_14
DO - 10.1007/978-3-030-60248-2_14
M3 - Conference contribution
AN - SCOPUS:85092686840
SN - 9783030602475
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 197
EP - 217
BT - Algorithms and Architectures for Parallel Processing - 20th International Conference, ICA3PP 2020, Proceedings
A2 - Qiu, Meikang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 20th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2020
Y2 - 2 October 2020 through 4 October 2020
ER -