SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Recovering Silent Data Corruption through Spatial Prediction


Student: Sarah A. Placke (Clemson University, Dept of Electrical and Computer Engineering)
Supervisor: Jon C. Calhoun (Clemson University, Dept of Electrical and Computer Engineering)

Abstract: High-performance computing (HPC) applications drive advances in many fields of science and engineering, and that progress depends on the assumed reliability of the HPC system. However, as system size grows and hardware components are run at near-threshold voltages, transient upset events become more likely. Many works have explored the detection of silent data corruption, but recovery is often left to checkpoint-restart or application-specific techniques. This poster explores the use of spatial similarity to recover from silent data corruption. We evaluate eight reconstruction methods and find that Linear Regression yields the best results, with over 90% of its corrections having less than 1% relative error.
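Illustrative sketch: the abstract does not say how the Linear Regression reconstruction is set up, so the following is only a minimal sketch of the general idea, not the poster's method. It assumes a 1D data field in which an element already flagged as corrupted is re-estimated from a least-squares line fit to its spatial neighbors. The function names, the `radius` parameter, and the sample data are all hypothetical.

```python
def linear_regression_predict(values, idx, radius=2):
    """Estimate values[idx] from a least-squares line fit to its neighbors.

    Assumption: values is a 1D sequence and idx has already been flagged
    as corrupted by some separate detector (not shown here).
    """
    # Gather neighbor positions within `radius`, skipping the corrupted element.
    lo = max(0, idx - radius)
    hi = min(len(values), idx + radius + 1)
    xs = [i for i in range(lo, hi) if i != idx]
    ys = [values[i] for i in xs]
    if len(xs) < 2:
        raise ValueError("need at least two neighbors to fit a line")

    # Ordinary least-squares fit of y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Evaluate the fitted line at the corrupted position.
    return slope * idx + intercept


def relative_error(estimate, truth):
    """Relative error of a reconstruction against the known true value."""
    return abs(estimate - truth) / abs(truth)


# Hypothetical smooth field, with one element overwritten to mimic a corruption.
field = [0.5 * i + 10.0 for i in range(10)]
truth = field[4]
field[4] = 1.0e30

recovered = linear_regression_predict(field, 4)
error = relative_error(recovered, truth)
```

Because the sample field is exactly linear, the fit recovers the overwritten value almost exactly. Real application data is only approximately smooth, so the size of the neighborhood and the smoothness of the field would determine how often a correction stays under a given relative-error bound.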

ACM-SRC Semi-Finalist: yes

Poster: PDF
Poster Summary: PDF

