Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework

Qianwen Ye, Chase Q. Wu, Wuji Liu, Aiqin Hou, Wei Shen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Big data processing and analysis increasingly rely on workflow technologies for knowledge discovery and scientific innovation. The execution of big data workflows is now commonly supported on reliable and scalable data storage and computing platforms such as Hadoop. There are a variety of factors affecting workflow performance across multiple layers of big data systems, including the inherent properties (such as scale and topology) of the workflow, the parallel computing engine it runs on, the resource manager that orchestrates distributed resources, the file system that stores data, as well as the parameter setting of each layer. Optimizing workflow performance is challenging because the compound effects of the aforementioned layers are complex and opaque to end users. Generally, tuning their parameters requires an in-depth understanding of big data systems, and the default settings do not always yield optimal performance. We propose a profiling-based cross-layer coupled design framework to determine the best parameter setting for each layer in the entire technology stack to optimize workflow performance. To tackle the large parameter space, we reduce the number of experiments needed for profiling with two approaches: i) identify a subset of critical parameters with the most significant influence through feature selection; and ii) minimize the search process within the value range of each critical parameter using stochastic approximation. Experimental results show that the proposed optimization framework provides the most suitable parameter settings for a given workflow to achieve the best performance. This profiling-based method could be used by end users and service providers to configure and execute large-scale workflows in complex big data systems.

Original language: English (US)
Title of host publication: Algorithms and Architectures for Parallel Processing - 20th International Conference, ICA3PP 2020, Proceedings
Editors: Meikang Qiu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 197-217
Number of pages: 21
ISBN (Print): 9783030602475
State: Published - 2020
Externally published: Yes
Event: 20th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2020 - New York, United States
Duration: Oct 2 2020 to Oct 4 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12454 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2020
Country/Territory: United States
City: New York
Period: 10/2/20 to 10/4/20

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

Keywords

  • Big data workflows
  • coupled design
  • performance optimization
  • stochastic approximation
  • workflow profiling
