SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

A Workflow Hierarchy-Aware Fault Tolerance System

Authors: Subhendu Behera (North Carolina State University), Dong H. Ahn and Stephen Herbein (Lawrence Livermore National Laboratory), Frank Mueller (North Carolina State University), and Barry L. Rountree (Lawrence Livermore National Laboratory)

Abstract: Complex scientific workflows present unprecedented challenges to fault tolerance support in high-performance computing (HPC). While existing solutions such as checkpoint/restart (C/R) and resource over-provisioning work well at the application level, they do not scale to the demands of complex workflows. Because workflows are composed of a large variety of components, they must detect, propagate, and recover from a fault in a highly coordinated way, lest the handling action itself do more harm than good. We propose the Workflow Hierarchy-aware Exception Specification Language (WHESL), a novel solution that allows a modern workflow to specify and handle faults and exceptions among its disparate components in an easy and coordinated fashion. Our preliminary study, using a prototype built on top of Flux, a next-generation hierarchical resource and job management system (RJMS), shows that WHESL can significantly extend the traditional HPC fault tolerance support for complex workflows.

Best Poster Finalist (BP): no
