SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony


Workshop: FTXS: Workshop on Fault-Tolerance for HPC at Extreme Scale

Authors: Hemanth Kolla, Jackson R. Mayo, Keita Teranishi, and Robert C. Armstrong (Sandia National Laboratories)


Abstract: Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial differential equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.
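As a rough illustration of the checkpoint, detect, and roll back cycle the abstract describes, the following single-process sketch runs a 1D explicit diffusion stencil, injects one memory bit flip, and recovers from an in-memory checkpoint. Everything in it is an assumption made for illustration: the function names (`flip_bit`, `step`, `looks_corrupted`, `run`) are hypothetical, the detector is a simple bounds check based on the maximum principle rather than the detection method used in the paper, and no Fenix or MPI calls are shown. In a message-passing setting, each rank would hold its own checkpoint and only the rank that detects an error would roll back.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 binary64 encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def step(u, alpha=0.25):
    """One explicit diffusion step; the two end values are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def looks_corrupted(u, lo, hi, tol=1e-12):
    """Assumed detector: for alpha <= 0.5 the update is a convex combination,
    so values must stay within the initial bounds. NaN also fails this test."""
    return any(not (lo - tol <= v <= hi + tol) for v in u)


def run(u0, steps, ckpt_every=5, fault=None):
    """Advance `steps` iterations with periodic local checkpoints.

    `fault` is an optional (step_index, cell, bit) tuple describing a single
    injected bit flip. Returns the final state and the number of rollbacks.
    """
    lo, hi = min(u0), max(u0)
    u = u0[:]
    ckpt_step, ckpt = 0, u0[:]          # in-memory, process-local checkpoint
    rollbacks = 0
    n = 0
    while n < steps:
        u = step(u)
        n += 1
        if fault is not None and n == fault[0]:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
            fault = None                # inject the error only once
        if looks_corrupted(u, lo, hi):
            u, n = ckpt[:], ckpt_step   # roll back and recompute
            rollbacks += 1
            continue
        if n % ckpt_every == 0:
            ckpt_step, ckpt = n, u[:]   # checkpoint only verified state
    return u, rollbacks


if __name__ == "__main__":
    u0 = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
    clean, _ = run(u0, 20)
    recovered, rollbacks = run(u0, 20, fault=(7, 5, 62))
    print("rollbacks:", rollbacks)
    print("matches fault-free run:", recovered == clean)
```

A bounds check of this kind only catches flips that push a value outside the physically admissible range, such as a flip in a high exponent bit; flips in low-order mantissa bits would pass unnoticed, which is one reason a solver-aware detector is needed in practice.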




