SC20 Virtual Platform
Cost-Aware Prediction of Uncorrected DRAM Errors in the Field
Event Type
Paper
Tags
Machine Learning, Deep Learning and Artificial Intelligence
Requirements, Performance, and Benchmarks
Reliability and Resiliency
Registration Categories
TP
Time
Wednesday, 18 November 2020, 1pm - 1:30pm EST
Location
Track 5
Description
This paper presents and evaluates a method to predict uncorrected DRAM errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production operation of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node hours per year. We release all source code as open source.

We also discuss and clarify aspects of methodology that are essential for a DRAM error prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient on their own, and base the evaluation on a cost–benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and comes with a clear cost–benefit calculation.
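
To illustrate why precision and recall alone may not settle whether a predictor is worth deploying, the following is a minimal sketch of a cost-benefit calculation. It is not the paper's model: the `Confusion` class, the `net_saving` function, the assumption that every correctly predicted failure is fully mitigated, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Prediction outcomes over some evaluation period (hypothetical counts)."""
    true_pos: int   # failures that were predicted in time
    false_pos: int  # alarms raised with no failure following
    false_neg: int  # failures that were not predicted


def precision(c: Confusion) -> float:
    return c.true_pos / (c.true_pos + c.false_pos)


def recall(c: Confusion) -> float:
    return c.true_pos / (c.true_pos + c.false_neg)


def net_saving(c: Confusion, failure_cost: float, mitigation_cost: float) -> float:
    """Node-hours saved compared with running no predictor at all.

    Simplifying assumptions (not taken from the paper):
    - each true positive avoids one failure but pays for one mitigation,
    - each false positive pays for one unnecessary mitigation,
    - missed failures cost the same as they would without a predictor.
    """
    avoided = c.true_pos * (failure_cost - mitigation_cost)
    wasted = c.false_pos * mitigation_cost
    return avoided - wasted


if __name__ == "__main__":
    outcome = Confusion(true_pos=40, false_pos=160, false_neg=60)
    print(f"precision={precision(outcome):.2f} recall={recall(outcome):.2f}")
    # Same predictor, same precision and recall, two mitigation costs.
    for cost in (2.0, 30.0):
        saving = net_saving(outcome, failure_cost=100.0, mitigation_cost=cost)
        print(f"mitigation cost {cost:>4} node-hours -> net saving {saving:+.0f}")
```

In this sketch the same predictor, with identical precision and recall, yields a net gain when mitigation is cheap and a net loss when it is expensive, which is the kind of distinction a cost-benefit analysis is meant to expose.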