BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160553Z
LOCATION:Track 11
DTSTART;TZID=America/New_York:20201111T103500
DTEND;TZID=America/New_York:20201111T110500
UID:submissions.supercomputing.org_SC20_sess204_ws_ftxs103@linklings.com
SUMMARY:Towards Distributed Software Resilience in Asynchronous Many-Task 
 Programming Models
DESCRIPTION:Workshop\n\nTowards Distributed Software Resilience in Asynchr
 onous Many-Task Programming Models\n\nGupta, Mayo, Lemoine, Kaiser\n\nExce
 ptions and errors occurring within mission-critical applications due to ha
 rdware failures have a high cost. With the emerging next-generation platfo
 rms (NGPs), the rate of hardware failures will likely increase. Designing 
 our applications to be resilient, therefore, is a critical concern in orde
 r to retain the reliability of results while meeting the constraints on po
 wer budgets. In this paper, we discuss software resilience in AMTs at both
  local and distributed scale. We choose HPX to prototype our resiliency de
 signs. We implement two resiliency APIs that we expose to the application 
 developers, namely task replication and task replay. Task replication repe
 ats a task n times and executes the repeated tasks asynchronously. Task re
 play reschedules a task up to n times until a valid output is returned. Fu
 rthermore, we expose algorithm-based fault tolerance (ABFT) using user-pro
 vided predicates (e.g., checksums) to validate the returned results. We be
 nchmark the resiliency scheme for both synthetic and real-world applicatio
 ns at local and distributed scale and show that most of the added executio
 n time arises from the replay, replication or data movement of the tasks a
 nd not the boilerplate code added to achieve resilience.\n\nTag: Extreme S
 cale Computing, Fault Tolerance, Reliability and Resiliency\n\nRegistratio
 n Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR

