Abstract
Many computing workloads in big data and machine learning applications are structured as directed acyclic graphs (DAG) and deployed on PC clusters for parallel execution using multiple physical or virtual machines. The scheduling of such workloads is critical to the application performance such as execution time and a plethora of techniques have been developed, taking into account various aspects such as data locality, network bandwidth, and server capability. We formulate DAG-structured workload scheduling as a nonlinear integer programming (NIP) problem and prove it to be NP-complete. Our empirical study reveals a positive correlation between scheduling plan distance (SPD) and finish time gap (FTG), and based on this finding, we propose a running time gap strategy (RTGS) to tackle this scheduling problem in multiprocessor environments. RTGS follows the main optimization strategy in the family of whale optimization algorithm (WOA). We derive a new function and use a greedy algorithm to generate an effective scheduling plan in RTGS. Extensive experiments with real production traces from Alibaba on simulation environments and realistic Hadoop environments show that our approach significantly improves the stability of WOA when applied to the scheduling problem of DAG-structured workloads, and also reduces the workload completion time by up to 93% in comparison with seven state-of-the-art baseline algorithms.
| Original language | English (US) |
|---|---|
| Article number | 911 |
| Journal | Journal of Supercomputing |
| Volume | 81 |
| Issue number | 8 |
| DOIs | |
| State | Published - Jun 2025 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Software
- Information Systems
- Hardware and Architecture
Keywords
- Directed acyclic graph
- Parallel computing
- Whale optimization algorithm
- Workload scheduling