A Workflow Hierarchy-Aware Fault Tolerance System
Time: Thursday, 19 November 2020, 8:30am - 5pm EDT
Description: Complex scientific workflows present unprecedented challenges to fault tolerance support in high-performance computing (HPC). While existing solutions such as checkpoint/restart (C/R) and resource over-provisioning work well at the application level, they do not scale to the demands of complex workflows. Because workflows are composed of a large variety of components, they must detect, propagate, and recover from a fault in a highly coordinated way, lest the handling action itself do more harm than good. We propose the Workflow Hierarchy-aware Exception Specification Language (WHESL), a novel solution that allows a modern workflow to specify and handle faults and exceptions among its disparate components in an easy and coordinated fashion. A preliminary study using our prototype, built on top of Flux, a next-generation hierarchical resource and job management system (RJMS), shows that WHESL can significantly extend traditional HPC fault tolerance support for complex workflows.