BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160553Z
LOCATION:Track 11
DTSTART;TZID=America/New_York:20201111T125500
DTEND;TZID=America/New_York:20201111T132500
UID:submissions.supercomputing.org_SC20_sess204_ws_ftxs109@linklings.com
SUMMARY:Checkpointing OpenSHMEM Programs Using Compiler Analysis
DESCRIPTION:Workshop\n\nCheckpointing OpenSHMEM Programs Using Compiler An
 alysis\n\nShahneous Bari, Basu, Lu, Curtis, Chapman\n\nThe importance of f
 ault-tolerance continues to increase for HPC applications. The continued g
 rowth in size and complexity of HPC systems, and of the applications thems
 elves, is leading to an increased likelihood of failures during execution.
  Most HPC programming models, however, lack a built-in fault-tolerance mec
 hanism. Instead, application developers usually rely on external support s
 uch as application-level checkpoint-restart (C/R) libraries to make their 
 codes fault-tolerant. This adds to the burden on application developers, 
 who must use the libraries carefully to ensure correct behavior and minimi
 ze overheads. The C/R routines are used to save the values of all needed p
 rogram variables at the places in the code where they are invoked. It's im
 portant for correctness that the program data is in a consistent state at 
 these places. It is non-trivial to determine such points in OpenSHMEM due 
 to its one-sided communication nature. The amount of data to be saved and
  the frequency of C/R calls must also be tuned carefully due to the C/R ca
 lls' extremely high overhead.\n\nThere is very little prior work on checkp
 oint-restart support in the context of OpenSHMEM. In this paper, we introd
 uce OpenSHMEM and describe the challenges it poses for checkpointing. We i
 dentify the safest places for inserting C/R calls in an OpenSHMEM program 
 and describe a straightforward approach for identifying the data that need
 s to be checkpointed at these positions. We provide these two functionalit
 ies in a tool that exploits compiler analyses to propose checkpoints, and 
 the data to save at those points, to the application developer.\n\nTag: Ex
 treme Scale Computing, Fault Tolerance, Reliability and Resiliency\n\nRegi
 stration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR

