On a Meta Learning-Based Scheduler for Deep Learning Clusters

Jin Yang, Liang Bao, Wenjing Liu, Rong Yang, Chase Q. Wu

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Deep learning (DL) has become a dominant type of workload on AI computing platforms. The performance of such platforms depends heavily on how distributed DL jobs are scheduled. Reinforcement learning (RL)-based schedulers have been extensively studied and are capable of modeling interference between concurrent jobs competing for resources. However, existing RL-based schedulers must learn from a large number of samples and adapt to workload changes in real systems, which is costly for production clusters. This paper proposes an intelligent, autonomous scheduler that employs sample-efficient RL for real-world resource scheduling on complex DL clusters. Specifically, we design a closed-loop meta-RL-based worker placement algorithm for DL training jobs. Instead of exploring randomly, we encourage the scheduler to explore combinatorial subspaces where the performance model might be inaccurate, which improves the sampling efficiency of the scheduler agent. Extensive experimental results demonstrate that our algorithm outperforms the baselines in average job completion time by 12.29% to 16.24%. Further experiments with workload variations yield improvements of 15.76% to 22.13%.

Original language: English (US)
Pages (from-to): 3631-3642
Number of pages: 12
Journal: IEEE Transactions on Cloud Computing
Volume: 11
Issue number: 4
DOIs
State: Published - Oct 1 2023

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computer Science Applications

Keywords

  • Deep learning cluster
  • meta learning
  • reinforcement learning
  • worker placement
