An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems
Accelerators, FPGA, and GPUs
Machine Learning, Deep Learning and Artificial Intelligence
Performance/Productivity Measurement and Evaluation
Reliability and Resiliency
Time: Thursday, 19 November 2020, 2pm - 2:30pm EDT
Description: We propose an efficient, non-intrusive GPU scheduling framework that combines an adaptive GPU scheduler with an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and efficient method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training processes with a different number of GPUs. We implemented the framework as plugins to Kubernetes and conducted evaluations on a 16-GPU cluster. Results show that our scheduling framework reduces the makespan and average job completion time (JCT) by up to 45% and 63%, respectively, compared to the default scheduler.
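The abstract does not give the scheduling algorithm itself, but the core idea of allocating GPUs to jobs based on observed training progress can be illustrated with a minimal sketch. The job names, the per-GPU speedup numbers, and the greedy marginal-gain policy below are all assumptions for illustration, not the paper's actual algorithm:

```python
from dataclasses import dataclass

TOTAL_GPUS = 16  # matches the 16-GPU evaluation cluster


@dataclass
class Job:
    name: str
    gpus: int       # current GPU allocation
    speedup: dict   # hypothetical measured throughput at each GPU count


def marginal_gain(job: Job) -> float:
    """Throughput gained by giving this job one more GPU."""
    cur = job.speedup.get(job.gpus, 0.0)
    nxt = job.speedup.get(job.gpus + 1, cur)
    return nxt - cur


def allocate(jobs):
    """Greedy progress-aware allocation: every job gets at least one GPU,
    and each remaining GPU goes to the job with the largest marginal gain."""
    for j in jobs:
        j.gpus = 1
    for _ in range(TOTAL_GPUS - len(jobs)):
        best = max(jobs, key=marginal_gain)
        if marginal_gain(best) <= 0:
            break  # no job benefits from another GPU
        best.gpus += 1
    return {j.name: j.gpus for j in jobs}


jobs = [
    Job("resnet", 0, {1: 1.0, 2: 1.9, 3: 2.6, 4: 3.1}),
    Job("bert",   0, {1: 1.0, 2: 1.7, 3: 2.1, 4: 2.2}),
]
print(allocate(jobs))  # → {'resnet': 4, 'bert': 4}
```

In the framework described above, whenever such a recomputation changes a job's allocation, the SideCar process would stop and restart that job's training processes with the new GPU count; here the reallocation decision alone is sketched.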