Diagnosing the Interference on CPU-GPU Synchronization Caused by CPU Sharing in Multi-Tenant GPU Clouds

Youssef Elmougy, Weiwei Jia, Xiaoning Ding, Jianchen Shan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The GPU-accelerated cloud, enabled by maturing GPU virtualization techniques, has become the most attractive platform for high-performance computing and machine learning workloads. However, it is notoriously challenging to build a multi-tenant GPU cloud in which resources such as CPUs and GPUs can be shared. One well-known and heavily studied reason is that workloads suffer from poor performance isolation and low GPU utilization when GPUs are shared. Little attention has been paid to another fundamental yet understudied problem: how does sharing CPUs among GPU instances affect workload performance? Targeting this problem, the paper conducts experiments to measure the performance slowdown and the decrease in vGPU utilization under interference from CPU sharing. The results show that GPU workloads suffer from poor and unpredictable performance and heavy vGPU under-utilization because of CPU sharing. We find that this interference results from the complex interplay between the characteristics of CPU-GPU interactions and a special behavior of shared vCPUs: vCPU discontinuity. To diagnose how vCPU discontinuity causes the interference, the paper uses NVIDIA Nsight Systems for fine-grained profiling and reports the following findings: 1) vCPU discontinuity causes inefficient CPU-GPU synchronizations; 2) vCPU discontinuity delays task offloading to the vGPU; 3) polling-based CPU-GPU synchronization suffers more from the interference than blocking-based CPU-GPU synchronization; 4) GPU workloads with frequent task offloads and synchronizations are more vulnerable. Based on these findings, the paper proposes a novel polling-then-blocking CPU-GPU synchronization primitive. Evaluation shows that it can improve performance by 4.2x.
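
The abstract names a polling-then-blocking synchronization primitive but does not describe its internals. The following is only a minimal Python sketch of the general poll-then-block idea, not the paper's implementation: a `threading.Event` stands in for a GPU completion signal, and the names `poll_then_block`, `run_task`, and `spin_budget_s` are hypothetical, introduced here for illustration.

```python
import threading
import time


def poll_then_block(done: threading.Event, spin_budget_s: float = 0.001) -> str:
    """Wait for `done` and report which phase observed completion.

    `done` is a stand-in for a GPU completion signal (an assumption of
    this sketch); `spin_budget_s` is a hypothetical tuning knob.
    """
    deadline = time.perf_counter() + spin_budget_s
    # Phase 1: poll (busy-wait) for a bounded time. This gives low latency
    # when the offloaded task finishes quickly.
    while time.perf_counter() < deadline:
        if done.is_set():
            return "polled"
    # Phase 2: block. The waiting thread gives up the CPU instead of
    # spending the rest of its time slice spinning.
    done.wait()
    return "blocked"


def run_task(duration_s: float, spin_budget_s: float) -> str:
    """Simulate an offloaded task that completes after `duration_s` seconds."""
    done = threading.Event()
    worker = threading.Timer(duration_s, done.set)  # fires done.set() later
    worker.start()
    phase = poll_then_block(done, spin_budget_s)
    worker.join()
    return phase


if __name__ == "__main__":
    # A task that outlasts the spin budget is expected to end in the
    # blocking phase; a shorter one is expected to be caught while polling.
    print(run_task(duration_s=0.05, spin_budget_s=0.001))
    print(run_task(duration_s=0.001, spin_budget_s=0.5))
```

With the CUDA runtime, one plausible mapping (again an assumption, not something the abstract states) would be to poll with `cudaEventQuery` for a bounded time and then fall back to `cudaEventSynchronize` on an event created with the `cudaEventBlockingSync` flag.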

Original language: English (US)
Title of host publication: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665443319
State: Published - 2021
Event: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021 - Austin, United States
Duration: Oct 29 2021 to Oct 31 2021

Publication series

Name: Conference Proceedings of the IEEE International Performance, Computing, and Communications Conference
Volume: 2021-October
ISSN (Print): 1097-2641

Conference

Conference: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
Country/Territory: United States
City: Austin
Period: 10/29/21 to 10/31/21

All Science Journal Classification (ASJC) codes

  • Engineering(all)
