TY - GEN
T1 - Semantics-aware prediction for analytic queries in MapReduce environment
AU - Yu, Weikuan
AU - Liu, Zhuo
AU - Ding, Xiaoning
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/8/13
Y1 - 2018/8/13
N2 - MapReduce has emerged as a powerful data processing engine that supports large-scale complex analytics applications, most of which are written in declarative query languages such as HiveQL and Pig Latin. Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for balanced distribution of tasks and effective utilization of resources. However, there is a lack of query-level semantics in the purely task-based scheduling algorithms, resulting in resource thrashing among queries and an overall degradation of performance. Therefore, we introduce a semantics-aware query prediction framework to address these problems systematically. Our framework includes three major techniques: cross-layer semantics percolation, selectivity estimation, and multivariate time prediction for analytic queries. Multivariate query prediction allows us not only to gauge the dynamic size of analytics datasets, but also to accurately predict the resource usage (e.g., numbers of map and reduce tasks) of individual MapReduce jobs and whole queries. In addition, the accurate prediction and queuing of queries can be potentially exploited by Hadoop scheduling for optimizing overall query performance. Based on the query prediction, our case study scheduler demonstrates significant performance improvement compared to traditional Hadoop schedulers.
AB - MapReduce has emerged as a powerful data processing engine that supports large-scale complex analytics applications, most of which are written in declarative query languages such as HiveQL and Pig Latin. Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for balanced distribution of tasks and effective utilization of resources. However, there is a lack of query-level semantics in the purely task-based scheduling algorithms, resulting in resource thrashing among queries and an overall degradation of performance. Therefore, we introduce a semantics-aware query prediction framework to address these problems systematically. Our framework includes three major techniques: cross-layer semantics percolation, selectivity estimation, and multivariate time prediction for analytic queries. Multivariate query prediction allows us not only to gauge the dynamic size of analytics datasets, but also to accurately predict the resource usage (e.g., numbers of map and reduce tasks) of individual MapReduce jobs and whole queries. In addition, the accurate prediction and queuing of queries can be potentially exploited by Hadoop scheduling for optimizing overall query performance. Based on the query prediction, our case study scheduler demonstrates significant performance improvement compared to traditional Hadoop schedulers.
KW - Analytics query
KW - MapReduce
KW - Scheduling
KW - Semantics-aware
UR - http://www.scopus.com/inward/record.url?scp=85054831007&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054831007&partnerID=8YFLogxK
U2 - 10.1145/3229710.3229713
DO - 10.1145/3229710.3229713
M3 - Conference contribution
AN - SCOPUS:85054831007
SN - 9781450365239
T3 - ACM International Conference Proceeding Series
BT - 47th International Conference on Parallel Processing, ICPP 2018
PB - Association for Computing Machinery
T2 - 47th International Conference on Parallel Processing, ICPP 2018
Y2 - 13 August 2018 through 16 August 2018
ER -