Fingerprinting the Checker Policies of Parallel File Systems
Data Analytics, Compression, and Management
Time: Thursday, 12 November 2020, 3:23pm - 3:46pm EST
Description: Parallel file systems (PFSes) play an essential role in high-performance computing. To ensure their integrity, many PFSes are designed with a checker component, which serves as the last line of defense for bringing a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruption, we perform a fine-grained study of the capability of PFS checkers in this paper. We apply type-aware fault injection to specific PFS structures and meticulously examine the detection and repair policies of PFS checkers via a well-defined taxonomy. The study results on two representative PFS checkers show that they are able to handle a wide range of corruptions on important data structures. On the other hand, neither of them is perfect: there are multiple cases where the checkers behave sub-optimally, leading to problems such as kernel panics and wrong repairs. Our work has led to a new patch on Lustre. We hope to develop our methodology into a generic framework for analyzing the checkers of diverse PFSes, and to enable more elegant designs of PFS checkers for reliable high-performance computing.