From Tasks Graphs to Asynchronous Distributed Checkpointing with Local Restart

SC20 Proceedings

Workshop: FTXS: Workshop on Fault-Tolerance for HPC at Extreme Scale

Authors: Romain Lion (French Institute for Research in Computer Science and Automation (INRIA) - Bordeaux, University of Bordeaux) and Samuel Thibault (University of Bordeaux, French Institute for Research in Computer Science and Automation (INRIA) - Bordeaux)

Abstract: The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data processed by today's applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail.

The current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, and thus scales to very large platforms. We also show how a synergy between the application data transfers and the checkpointing transfers can keep the additional network load reasonable: it was measured at less than 10% on a dense linear algebra example.
