SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

Authors: Shaoqi Wang (University of Colorado, Colorado Springs); Oscar J. Gonzalez (Nokia Bell Labs); Xiaobo Zhou (University of Colorado, Colorado Springs); and Thomas Williams, Brian D. Friedman, Martin Havemann, and Thomas Woo (Nokia Bell Labs)

Abstract: We propose an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and efficient method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training processes with a different number of GPUs. We implemented the framework as plugins to Kubernetes and conducted evaluations on a 16-GPU cluster. Results show that our scheduling framework reduces the makespan and average JCT by up to 45% and 63%, respectively, compared to the default scheduler.
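The abstract's elastic reallocation idea can be illustrated with a minimal sketch: pause a running job while preserving its progress, then restart it with a different GPU count. All names here (`TrainingJob`, `sidecar_reallocate`) are illustrative assumptions for exposition, not the paper's actual API.

```python
# Hedged sketch of a "SideCar"-style elastic GPU reallocation step.
# Assumption: a training job can be paused (checkpointing its progress)
# and resumed with a different number of GPUs.

class TrainingJob:
    """Toy stand-in for a DL training job that can be paused and resumed."""

    def __init__(self, gpus):
        self.gpus = gpus      # current GPU allocation
        self.step = 0         # training progress (e.g., iterations completed)
        self.running = True

    def pause(self):
        # Temporarily stop the training processes and return a checkpoint.
        self.running = False
        return self.step

    def resume(self, gpus, checkpoint):
        # Restart training from the checkpoint with a new GPU allocation.
        self.gpus = gpus
        self.step = checkpoint
        self.running = True


def sidecar_reallocate(job, new_gpu_count):
    """Pause the job, then restart it with a different number of GPUs."""
    checkpoint = job.pause()
    job.resume(new_gpu_count, checkpoint)
    return job


if __name__ == "__main__":
    job = TrainingJob(gpus=2)
    job.step = 100                       # simulate prior training progress
    sidecar_reallocate(job, new_gpu_count=4)
    print(job.gpus, job.step, job.running)  # 4 100 True
```

The key property the sketch captures is that reallocation preserves training progress: only the GPU count changes across the stop/restart boundary.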