BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160105Z
LOCATION:Track 3
DTSTART;TZID=America/New_York:20201119T140000
DTEND;TZID=America/New_York:20201119T143000
UID:submissions.supercomputing.org_SC20_sess164_pap159@linklings.com
SUMMARY:An Efficient and Non-Intrusive GPU Scheduling Framework for Deep L
 earning Training Systems
DESCRIPTION:Paper\n\nAn Efficient and Non-Intrusive GPU Scheduling
  Framework for Deep Learning Training Systems\n\nWang\, Gonzalez\,
  Zhou\, Williams\, Friedman...\n\nWe propose an efficient\,
  non-intrusive GPU scheduling framework that employs a combination of
  an adaptive GPU scheduler and an elastic GPU allocation mechanism to
  reduce the completion time of DL training workloads and improve
  resource utilization. Specifically\, the adaptive GPU scheduler
  includes a scheduling algorithm that uses training job progress
  information to determine the most efficient allocation and
  reallocation of GPUs for incoming and running jobs at any given time.
  The elastic GPU allocation mechanism works in concert with the
  scheduler. It offers a lightweight and efficient method to
  reallocate GPUs based on a “SideCar” process that temporarily
  stops and restarts the job’s DL training processes with a
  different number of GPUs. We implemented the framework as plugins
  to Kubernetes and conducted evaluations on a 16-GPU cluster.
  Results show that our scheduling framework reduces the makespan and
  average JCT by up to 45% and 63%\, respectively\, compared to the
  default scheduler.\n\nTag: Accelerators\, FPGA\, and GPUs\, Machine
  Learning\, Deep Learning and Artificial Intelligence\,
  Performance/Productivity Measurement and Evaluation\, Reliability and
  Resiliency\n\nRegistration Category: Tech Program Reg Pass
END:VEVENT
END:VCALENDAR

