BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160553Z
LOCATION:Track 11
DTSTART;TZID=America/New_York:20201111T100500
DTEND;TZID=America/New_York:20201111T103500
UID:submissions.supercomputing.org_SC20_sess204_ws_ftxs101@linklings.com
SUMMARY:Improving Scalability of Silent-Error Resilience for Message-Passi
 ng Solvers via Local Recovery and Asynchrony
DESCRIPTION:Workshop\n\nImproving Scalability of Silent-Error Resilience f
 or Message-Passing Solvers via Local Recovery and Asynchrony\n\n
 Kolla\, Mayo\, Teranishi\, Armstrong\n\n
 Benefits of local recovery (restarting only a f
 ailed process or task) have been previously demonstrated in parallel solve
 rs. Local recovery has a reduced impact on application performance due to 
 masking of failure delays (for message-passing codes) or dynamic load bala
 ncing (for asynchronous many-task codes). In this paper\, we impl
 ement MPI-
 process-local checkpointing and recovery of data (as an extension of the F
 enix library) in combination with an existing method for local detection o
 f silent errors in partial differential equation solvers\, to show a pa
 th for incorporating lightweight silent-error resilience. In addit
 ion\, we demo
 nstrate how asynchrony introduced by maximizing computation-communication 
 overlap can halt the propagation of delays. For a prototype stencil solver
  (including an iterative-solver-like variant) with injected memory bit fli
 ps\, results show greatly reduced overhead under weak scaling compared t
 o global recovery\, and high failure-masking efficiency. The appr
 oach is expec
 ted to be generalizable to other MPI-based solvers.\n\nTag: Extreme Scale 
 Computing\, Fault Tolerance\, Reliability and Resiliency\n\n
 Registration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR

