Workshop: ExaMPI: Workshop on Exascale MPI
Authors: Derek J. Schafer (University of Tennessee, Chattanooga); Ignacio Laguna (Lawrence Livermore National Laboratory); Anthony Skjellum (University of Tennessee, Chattanooga); Nawrin Sultana (Intel Corporation); and Kathryn Mohror (Lawrence Livermore National Laboratory)
Abstract: Here, we expand upon our effective, checkpoint-based approach to fault tolerance for MPI: "MPI Stages", an extension of the Reinit model of fault-tolerant MPI (introduced by some of us and others), notably without the use of setjmp/longjmp. MPI Stages saves internal MPI state in a separate checkpoint, in coordination with the application state saved by the application. Although it is currently implemented on top of ExaMPI, a research implementation of MPI whose new object-oriented design is meant to simplify checkpointing of state, MPI Stages' main requirement on an underlying MPI implementation is that its internal state be checkpointable, apart from support for the new syntax and semantics of the MPI Stages model itself. At present, MPI Stages supports communicators, groups, and limited forms of derived datatypes used with point-to-point and collective communication. We report on success with a substantial MPI application, SW4, which uses many of the features common to data-parallel MPI applications. We reinforce the pre-main-type resilience interposition model and introduce serialization and deserialization of MPI opaque objects. We also introduce MPIX_FT_errno, akin to POSIX errno, and new functions to better support use of the Stages model in hierarchical code and in legacy software that does not faithfully check MPI error codes. These MPI Stages concepts appear useful for other fault-tolerant MPI models and so are near-term standardization targets. We describe the future steps needed to make MPI Stages more production-ready, standardizable, and integrable with other MPI fault-tolerance models.