SC20 Virtual Platform
Fingerprinting the Checker Policies of Parallel File Systems
Event Type
Workshop
Tags
Big Data
Data Analytics, Compression, and Management
Data Movement
File Systems and I/O
Storage
Registration Categories
W
Time
Thursday, 12 November 2020, 3:23pm - 3:46pm EST
Location
Track 5
Description
Parallel file systems (PFSes) play an essential role in high-performance computing. To protect their integrity, many PFSes are designed with a checker component, which serves as the last line of defense for bringing a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruption, in this paper we perform a fine-grained study of the capabilities of PFS checkers. We apply type-aware fault injection to specific PFS structures and meticulously examine the detection and repair policies of PFS checkers using a well-defined taxonomy. Results for two representative PFS checkers show that they can handle a wide range of corruptions to important data structures. On the other hand, neither of them is perfect: in multiple cases the checkers behave sub-optimally, leading to problems such as kernel panics and wrong repairs. Our work has led to a new patch for Lustre. We hope to develop our methodology into a generic framework for analyzing the checkers of diverse PFSes, and to enable more elegant designs of PFS checkers for reliable high-performance computing.
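
As a rough illustration of what type-aware fault injection and policy classification can look like, here is a minimal Python sketch. It is not the authors' tool and does not model any real PFS: the toy `Inode` record, its fields, the `check` function, and the outcome labels are all hypothetical, chosen only to show the idea of corrupting one typed field at a time and recording how a checker reacts.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels; the talk's actual taxonomy is not given here.
    NOT_DETECTED = "not detected"
    DETECTED_ONLY = "detected, not repaired"
    WRONG_REPAIR = "detected, repaired incorrectly"
    REPAIRED = "detected and repaired"


@dataclass(frozen=True)
class Inode:
    # Toy metadata record standing in for a real PFS structure (assumption).
    ino: int
    parent: int
    nlink: int


def inject(inode, field, value):
    """Type-aware injection: overwrite one named field, keep the rest."""
    return replace(inode, **{field: value})


def check(inode, known_inodes):
    """Toy checker: returns (detected, repaired_inode_or_None)."""
    if inode.nlink < 1:
        # The toy repair rule always resets the link count to 1.
        return True, replace(inode, nlink=1)
    if inode.parent not in known_inodes:
        # Detected, but this toy checker has no repair rule for it.
        return True, None
    return False, None


def classify(original, corrupted, known_inodes):
    """Compare the checker's result against the pre-injection state."""
    detected, repaired = check(corrupted, known_inodes)
    if not detected:
        return Outcome.NOT_DETECTED
    if repaired is None:
        return Outcome.DETECTED_ONLY
    if repaired == original:
        return Outcome.REPAIRED
    return Outcome.WRONG_REPAIR


def run_campaign():
    """Inject one fault per field of a healthy inode and classify each."""
    good = Inode(ino=2, parent=1, nlink=1)
    known = {1, 2}
    faults = {"nlink": 0, "parent": 99, "ino": 7}
    return {f: classify(good, inject(good, f, v), known)
            for f, v in faults.items()}
```

Calling `run_campaign()` returns a mapping from the corrupted field name to an `Outcome`, which is the kind of per-structure table such a study would tabulate. Comparing the repaired record against the pre-injection record is one simple way to separate a correct repair from a wrong one; a real study would need a much richer notion of file system state.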