SC20 Virtual Platform
Checkpointing OpenSHMEM Programs Using Compiler Analysis
Event Type
Extreme Scale Computing
Fault Tolerance
Reliability and Resiliency
Time: Wednesday, 11 November 2020, 12:55pm - 1:25pm EDT
Location: Track 11
Description: The importance of fault tolerance continues to increase for HPC applications. The continued growth in the size and complexity of HPC systems, and of the applications themselves, makes failures during execution increasingly likely. Most HPC programming models, however, lack a built-in fault-tolerance mechanism. Instead, application developers usually rely on external support such as application-level checkpoint-restart (C/R) libraries to make their codes fault-tolerant. This increases the burden on application developers, who must use the libraries carefully to ensure correct behavior and minimize overheads. C/R routines save the values of all needed program variables at the places in the code where they are invoked. For correctness, the program data must be in a consistent state at these places. Determining such points is non-trivial in OpenSHMEM because of its one-sided communication model. The amount of data saved and the frequency of C/R calls must also be tuned carefully, because C/R calls have extremely high overhead.

There is very little prior work on checkpoint-restart support in the context of OpenSHMEM. In this paper, we introduce OpenSHMEM and describe the challenges it poses for checkpointing. We identify the safest places for inserting C/R calls in an OpenSHMEM program and describe a straightforward approach for identifying the data that needs to be checkpointed at these positions. We provide both capabilities in a tool that uses compiler analyses to propose checkpoint locations, and the data to save at each of them, to the application developer.
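
The consistency problem described above can be illustrated with a toy model. The sketch below is not the paper's tool or analysis, and it does not use the OpenSHMEM API. It is a minimal Python simulation built on two assumptions: one-sided puts may still be in flight until a global synchronization, and a barrier is one point at which no puts are outstanding. The names `ToyShmem`, `PE`, and `run` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PE:
    """A toy processing element holding one symmetric variable."""
    rank: int
    value: int = 0


class ToyShmem:
    """Toy model (an assumption, not real OpenSHMEM): puts are queued
    and become visible on the target only when a barrier is reached."""

    def __init__(self, n_pes):
        self.pes = [PE(rank=r) for r in range(n_pes)]
        self.in_flight = []  # (target_rank, new_value) pairs not yet delivered

    def put(self, target, value):
        # One-sided: the target PE takes no action and is not notified.
        self.in_flight.append((target, value))

    def barrier_all(self):
        # In this toy model, the barrier delivers every outstanding put.
        for target, value in self.in_flight:
            self.pes[target].value = value
        self.in_flight.clear()

    def checkpoint(self):
        # Snapshot of each PE's local memory only; in-flight puts are missed.
        return [pe.value for pe in self.pes]


def run(n_pes=4):
    """Each PE puts (rank + 100) into its right neighbour, then we compare
    a checkpoint taken before the barrier with one taken after it."""
    world = ToyShmem(n_pes)
    for pe in world.pes:
        world.put((pe.rank + 1) % n_pes, pe.rank + 100)

    unsafe = world.checkpoint()  # puts still in flight: inconsistent state
    world.barrier_all()
    safe = world.checkpoint()    # all puts delivered: consistent state
    return unsafe, safe
```

With four PEs, `run(4)` returns `([0, 0, 0, 0], [103, 100, 101, 102])`: the checkpoint taken before the barrier misses every update, while the one taken after it captures them all. A real OpenSHMEM implementation may deliver puts at any time before the synchronization point, so this model shows only the worst case.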