BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160553Z
LOCATION:Track 11
DTSTART;TZID=America/New_York:20201111T122500
DTEND;TZID=America/New_York:20201111T125500
UID:submissions.supercomputing.org_SC20_sess204_ws_ftxs107@linklings.com
SUMMARY:A Generic Strategy for Node-Failure Resilience for Certain
  Iterative Linear Algebra Methods
DESCRIPTION:Workshop\n\nA Generic Strategy for Node-Failure Resilience
  for Certain Iterative Linear Algebra Methods\n\n
 Pachajoa\, Ernstbrunner\, Gansterer\n\n
 Resilience is an important research topic in HPC. As computer clusters
  go to extreme scales\, work in this area is necessary to keep these
  machines reliable.\n\n
 In this work\, we introduce a generic method to endow iterative
  algorithms in linear algebra based on sparse matrix-vector products\,
  such as linear system solvers and eigensolvers\, with resilience to
  node failures. This generic method traverses the dependency graph of
  the variables of the iterative algorithm. If the iterative method
  exhibits certain properties\, it is possible to produce an exact state
  reconstruction (ESR) algorithm\, enabling the recovery of the state of
  the iterative method in the event of a node failure. This reconstruction
  is exact\, except for small perturbations caused by floating-point
  arithmetic. The generic method exploits redundancy in the matrix-vector
  product to protect the vector that is the argument of the product.\n\n
 We illustrate the use of this generic approach on three iterative
  methods: the conjugate gradient method\, the BiCGStab method and the
  Lanczos algorithm. The resulting ESR algorithms enable the
  reconstruction of their state after a node failure from a few
  redundantly stored vectors.\n\n
 Unlike previous work on the preconditioned conjugate gradient method\,
  this generic method produces ESR algorithms that work with general
  matrices. Consequently\, we can no longer assume that local diagonal
  submatrices used to reconstruct vectors are non-singular. Thus\, we
  also propose an approach for deriving non-singular local linear
  systems for the reconstruction process with reduced condition numbers\,
  based on a communication-avoiding rank-revealing QR factorization with
  column pivoting.\n\n
 Tag: Extreme Scale Computing\, Fault Tolerance\, Reliability and
  Resiliency\n\n
 Registration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR