TY - GEN
T1 - Performance optimization of Hadoop workflows in public clouds through adaptive task partitioning
AU - Shu, Tong
AU - Wu, Chase Q.
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/10/2
Y1 - 2017/10/2
N2 - Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models such as MapReduce are widely applied to meet stringent performance requirements. The granularity of task partitioning in each moldable job has a significant impact on workflow completion time and financial cost. We investigate the properties of moldable jobs and design a big-data workflow mapping model, based on which, we formulate a workflow mapping problem to minimize workflow makespan under a budget constraint in public clouds. We show this problem to be strongly NP-complete and design i) a fully polynomial-time approximation scheme (FPTAS) for a special case with a pipeline-structured workflow executed on virtual machines in a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines in multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms.
AB - Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models such as MapReduce are widely applied to meet stringent performance requirements. The granularity of task partitioning in each moldable job has a significant impact on workflow completion time and financial cost. We investigate the properties of moldable jobs and design a big-data workflow mapping model, based on which, we formulate a workflow mapping problem to minimize workflow makespan under a budget constraint in public clouds. We show this problem to be strongly NP-complete and design i) a fully polynomial-time approximation scheme (FPTAS) for a special case with a pipeline-structured workflow executed on virtual machines in a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines in multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms.
UR - http://www.scopus.com/inward/record.url?scp=85034022406&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85034022406&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM.2017.8057204
DO - 10.1109/INFOCOM.2017.8057204
M3 - Conference contribution
AN - SCOPUS:85034022406
T3 - Proceedings - IEEE INFOCOM
BT - INFOCOM 2017 - IEEE Conference on Computer Communications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE Conference on Computer Communications, INFOCOM 2017
Y2 - 1 May 2017 through 4 May 2017
ER -