BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160153Z
LOCATION:Track 6
DTSTART;TZID=America/New_York:20201117T103000
DTEND;TZID=America/New_York:20201117T110000
UID:submissions.supercomputing.org_SC20_sess290_sotp127@linklings.com
SUMMARY:Deploying Checkpoint/Restart for Production Workloads at NERSC
DESCRIPTION:State of the Practice Talk\n\nDeploying Checkpoint/Restart for
  Production Workloads at NERSC\n\nZhao\, Hartman-Baker\, Cooperman\n\n
 Checkpoint/restart (C/R) is a critical component of fault-tolerant
  computing\, and provides scheduling flexibility for computing centers
  to support diverse workloads with different priorities. Because
  existing C/R tools are often research-oriented\, there is a gap to
  close before they can be used reliably with production workloads\,
  especially on cutting-edge HPC systems. In this talk\, we present
  our strategy to enable C/R capabilities on NERSC production
  workloads\, which are dominated by MPI and hybrid MPI+OpenMP
  applications. We share our journey to prepare a production-ready
  MPI-Agnostic Network-Agnostic (MANA) Distributed Multi-Threaded
  CheckPointing (DMTCP) tool for NERSC. We also present variable-time
  job scripts to automate preempted job submissions\, queue policies
  and configurations we have adopted to incentivize C/R usage\, our
  user training effort to increase NERSC users' uptake of C/R\, and
  our effort to build an active C/R community. Finally\, we showcase
  some applications enabled by C/R.\n\nTag: Best Practices\, System
  Management\n\nRegistration Category: Tech Program Reg Pass
END:VEVENT
END:VCALENDAR

