TY - GEN
T1 - On MapReduce Scheduling in Hadoop Yarn on Heterogeneous Clusters
AU - Wang, Meng
AU - Wu, Chase Q.
AU - Cao, Huiyan
AU - Liu, Yang
AU - Wang, Yongqiang
AU - Hou, Aiqin
N1 - Funding Information:
This research is sponsored in part by the U.S. National Science Foundation under Grant No. CNS-1560698 with New Jersey Institute of Technology, and by the National Natural Science Foundation of China under Grant No. 61472320 and the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization under Grant No. U1609202 with Northwest University, P.R. China.
PY - 2018/9/5
Y1 - 2018/9/5
AB - Hadoop is a distributed computing system widely used for big data processing in various domains. As the data volume continues to increase rapidly, Hadoop systems have become a critical contributor to the success of many big data applications. The MapReduce scheduler is a key component that determines the overall performance of a Hadoop cluster. In this paper, we formulate and investigate a task scheduling problem in a heterogeneous Hadoop cluster to minimize the completion time of a batch of MapReduce jobs. We first design a prediction model to predict the end time of a task, which is used for placing the corresponding data block on a node in advance to reduce the data transmission time and the overall job completion time. Based on this prediction model, we propose a task matching-based scheduling algorithm, referred to as TMSA, to schedule the tasks in the task queue in Hadoop, by taking into account the real-time performance of each node in the cluster and the matching degree between nodes and tasks. Experimental results show that the prediction model achieves high accuracy and TMSA significantly reduces the completion time of a batch of MapReduce jobs compared to existing schedulers.
KW - Hadoop
KW - MapReduce
KW - YARN
KW - distributed computing
KW - task scheduler
UR - http://www.scopus.com/inward/record.url?scp=85054085042&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054085042&partnerID=8YFLogxK
U2 - 10.1109/TrustCom/BigDataSE.2018.00264
DO - 10.1109/TrustCom/BigDataSE.2018.00264
M3 - Conference contribution
AN - SCOPUS:85054085042
SN - 9781538643877
T3 - Proceedings - 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018
SP - 1747
EP - 1754
BT - Proceedings - 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018
Y2 - 31 July 2018 through 3 August 2018
ER -