Abstract
Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for a balanced distribution of tasks and effective utilization of resources. However, such a simplistic policy is unable to reconcile the dynamics of different jobs in complex analytic queries, resulting in unfair treatment of different queries, low utilization of system resources, prolonged execution time, and low query throughput. Therefore, we introduce a scheduling framework to address these problems systematically. Our framework includes two techniques: multivariate DAG modeling and two-level query scheduling. Cross-layer semantics percolation allows query semantics and job dependencies in the DAG to flow to the MapReduce scheduler. With this richer semantic information, we build a multivariate model that can accurately predict the execution time of individual MapReduce jobs and gauge the changing size of analytics datasets through selectivity approximation. Furthermore, we introduce two-level query scheduling that maximizes intra-query job-level concurrency and, at the same time, shortens query-level completion time based on accurate prediction and queuing of queries. At the job level, we focus on detecting query semantics and predicting the query completion time through an online multivariate linear regression model, thereby increasing job-level parallelism and maximizing data sharing across jobs. At the task level, we focus on balanced data distribution, maximal slot utilization, and optimal data locality of task scheduling. Our experimental results on a set of complex query benchmarks demonstrate that our scheduling framework can significantly improve both fairness and throughput of Hive queries.
It can improve query response time by up to 43.9% and 72.8% on average, compared to the Hadoop Fair Scheduler and the Hadoop Capacity Scheduler, respectively. In addition, our two-level scheduler achieves query fairness that is, on average, 59.8% better than that of the Hadoop Fair Scheduler.
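The interplay between per-job time prediction and query-level ordering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature choice (input size and selectivity), the weights, the least-mean-squares update used as a stand-in for the online regression, and the "smallest predicted remaining time first" rule are all assumptions made for illustration, and every function name is hypothetical.

```python
def predict_job_time(features, weights, bias=0.0):
    """Linear model: predicted time = bias + sum(w_i * x_i)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def update_weights(weights, features, actual_time, lr=0.1):
    """One online least-mean-squares step (an assumed stand-in for the
    paper's online multivariate regression, not its actual estimator)."""
    error = actual_time - predict_job_time(features, weights)
    return tuple(w + lr * error * x for w, x in zip(weights, features))


def schedule(queries, weights):
    """Return a dispatch order of (query_id, job_id) pairs.

    queries maps query_id -> {job_id: (features, [dependency job_ids])}.
    A job becomes ready once all its dependencies in the same query have
    been dispatched. Among ready jobs, the one whose query has the
    smallest predicted remaining time goes first (assumed policy).
    """
    # Query level: predicted total work still outstanding per query.
    remaining = {
        q: sum(predict_job_time(f, weights) for f, _ in jobs.values())
        for q, jobs in queries.items()
    }
    pending = {(q, j) for q, jobs in queries.items() for j in jobs}
    done = set()
    order = []
    while pending:
        # Job level: a job is ready when its DAG dependencies are satisfied.
        ready = [
            (q, j) for (q, j) in pending
            if all((q, d) in done for d in queries[q][j][1])
        ]
        if not ready:
            raise ValueError("dependency cycle: plan is not a DAG")
        q, j = min(ready, key=lambda qj: (remaining[qj[0]], qj))
        order.append((q, j))
        done.add((q, j))
        pending.remove((q, j))
        remaining[q] -= predict_job_time(queries[q][j][0], weights)
    return order


# Hypothetical features: (input size in GB, estimated selectivity).
example_queries = {
    "q1": {"j1": ((10, 0.5), []), "j2": ((5, 0.2), ["j1"])},
    "q2": {"j1": ((2, 0.1), [])},
}
example_order = schedule(example_queries, weights=(2.0, 10.0))
```

In this toy run the short query `q2` is dispatched before the two-job query `q1`, and `q1`'s second job waits for its first. The sketch dispatches one job at a time for clarity; a real scheduler would also account for slot availability and data locality at the task level.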
Original language | English (US) |
---|---|
Pages (from-to) | 66-78 |
Number of pages | 13 |
Journal | Parallel Computing |
Volume | 85 |
State | Published - Jul 2019 |
All Science Journal Classification (ASJC) codes
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Computer Graphics and Computer-Aided Design
- Artificial Intelligence
Keywords
- MapReduce
- Multivariate modeling
- Query scheduling