BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160102Z
LOCATION:Track 5
DTSTART;TZID=America/New_York:20201118T153000
DTEND;TZID=America/New_York:20201118T160000
UID:submissions.supercomputing.org_SC20_sess174_pap325@linklings.com
SUMMARY:CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UV
 M
DESCRIPTION:Paper\n\nCRAC: Checkpoint-Restart Architecture for CUDA with
  Streams and UVM\n\nJain\, Cooperman\n\nThe share of the top 500
  supercomputers with Nvidia GPUs is now over 25% and continues to grow.
  While fault tolerance is a critical issue for supercomputing\, there
  does not currently exist an efficient\, scalable solution for CUDA
  applications on Nvidia GPUs. CRAC is a new checkpoint-restart solution
  for fault tolerance that supports the full range of CUDA applications.
  CRAC combines low runtime overhead (less than 1%)\, fast
  checkpoint-restart\, support for scalable CUDA streams (for efficient
  usage of all of the thousands of GPU cores)\, and support for the full
  features of Unified Virtual Memory (eliminating the programmer's
  burden of migrating memory between device and host). CRAC achieves its
  flexible architecture by segregating application code (checkpointed)
  and its external GPU communication via non-reentrant CUDA libraries
  (not checkpointed) within a single process's memory. This eliminates
  the high IPC overhead of earlier approaches.\n\nTag: Accelerators\,
  FPGA\, and GPUs\, Fault Tolerance\, Power\, Reliability and
  Resiliency\n\nRegistration Category: Tech Program Reg Pass
END:VEVENT
END:VCALENDAR

