Project Details
Description
To keep up with trends in computer architecture and software, high-performance computing (HPC) must adapt to heterogeneity, extreme scale, and dynamism. HPC systems are growing more complex as hardware vendors strive to improve computing capability while managing power consumption. To achieve high performance and throughput, scientific computing applications must fully leverage these large-scale, heterogeneous resources. Meanwhile, scientific simulation and analysis workflows are also becoming more complex and dynamic, incorporating components with varying execution times and resource requirements. Traditional bulk-synchronous execution models are no longer sufficient to capture these patterns. Allocating resources manually requires application developers to spend significant time and effort understanding their workloads and computing resources. Even with that understanding, deciding how to allocate resources remains highly challenging. Manual and heuristic-based decisions often result in under-utilized resources and inadequate performance. This project aims to develop an intelligent scheduling framework that automates resource allocation and delivers better performance for adaptive scientific workflows on HPC systems. Specifically, we will design a tree-search-based approach to find high-quality resource allocations for specific workflows and distill the knowledge in the trees into a policy. We will devise deep-learning-based surrogate models that quickly evaluate the quality of allocations sampled by the trees, without executing the scientific workflows on HPC systems for every allocation. To further improve the accuracy of the surrogate models and the performance of the distilled policy, we will adopt active learning strategies to guide the acquisition and use of additional real execution data.
The techniques developed in this project will have a direct impact on HPC systems, improving resource efficiency, reducing energy consumption, and lowering overall operating costs.
Status | Active |
---|---|
Effective start/end date | 7/1/23 → 6/30/28 |
Funding
- Advanced Scientific Computing Research: $875,000.00