SC20 Is Everywhere We Are

SC20 Virtual Platform
A Generic Strategy for Node-Failure Resilience for Certain Iterative Linear Algebra Methods
Event Type
Extreme Scale Computing
Fault Tolerance
Reliability and Resiliency
Registration Categories
TimeWednesday, 11 November 202012:25pm - 12:55pm EST
LocationTrack 11
DescriptionResilience is an important research topic in HPC. As computer clusters go to extreme scales, work in this area is necessary to keep these machines reliable.

In this work, we introduce a generic method to endow iterative algorithms in linear algebra based on sparse matrix-vector products, such as linear system solvers, eigensolvers, with resilience to node failures. This generic method traverses the dependency graph of the variables of the iterative algorithm. If the iterative method exhibits certain properties, it is possible to produce an exact state reconstruction (ESR) algorithm, enabling the recovery of the state of the iterative method in the event of a node failure. This reconstruction is exact, except for small perturbations caused by floating point arithmetic. The generic method exploits redundancy in the matrix-vector product to protect the vector that is the argument of the product.

We illustrate the use of this generic approach on three iterative methods: the conjugate gradient method, the BiCGStab method and the Lanczos algorithm. The resulting ESR algorithms enable the reconstruction of their state after a node failure from a few redundantly stored vectors.

Unlike previous work in preconditioned conjugate gradient, this generic method produces ESR algorithms that work with general matrices. Consequently, we can no longer assume that local diagonal submatrices used to reconstruct vectors are non-singular. Thus, we also propose an approach for deriving non-singular local linear systems for the reconstruction process with reduced condition numbers, based on a communication-avoiding rank-revealing QR factorization with column pivoting.
Back To Top Button