SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Runtime-Guided ECC Protection using Online Estimation of Memory Vulnerability

Authors: Luc Jaulmes, Miquel Moretó, and Mateo Valero (Barcelona Supercomputing Center, Polytechnic University of Catalonia); Mattan Erez (University of Texas); and Marc Casas (Barcelona Supercomputing Center, Polytechnic University of Catalonia)

Abstract: Diminishing reliability of semiconductor technologies and decreasing power budgets per component hinder the design of next-generation high-performance computing (HPC) systems. Both constraints strongly impact memory subsystems, as DRAM main memory accounts for as much as 30 to 50 percent of a node’s overall power consumption and is the subsystem most subject to faults. Improving reliability requires stronger error correcting codes (ECCs), which incur additional power and storage costs. It is critical to develop strategies that uphold memory reliability while minimising these costs, with the goal of improving the power efficiency of computing machines.

We introduce a methodology to dynamically estimate the vulnerability of data and adjust ECC protection accordingly. Our methodology relies on information readily available to runtime systems in task-based dataflow programming models, and on existing Virtualized Error Correcting Code (VECC) schemes to provide adaptable protection. Guiding VECC using vulnerability estimates is more energy efficient than using stronger uniform ECC.
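To make the idea concrete, the following is a minimal sketch, not the paper's estimator. It assumes that a data region is vulnerable during the time it waits in memory before being read, and that a runtime can observe task accesses as timestamped reads and writes. The names `Access`, `vulnerable_time`, `choose_protection`, the labels "strong" and "weak", and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Access:
    time: float     # hypothetical timestamp of the task's access
    region: str     # data region named in a task dependency
    is_write: bool  # True for an output dependency, False for an input


def vulnerable_time(accesses):
    """Per region, sum the time spent waiting in memory before each read.

    Assumption: data that is overwritten before being read is dead, so the
    interval leading up to a write adds no exposure. A read with no earlier
    recorded access also adds nothing, since its start time is unknown.
    """
    last_seen = {}
    exposure = {}
    for a in sorted(accesses, key=lambda a: a.time):
        exposure.setdefault(a.region, 0.0)
        if not a.is_write and a.region in last_seen:
            exposure[a.region] += a.time - last_seen[a.region]
        last_seen[a.region] = a.time
    return exposure


def choose_protection(exposure, threshold):
    """Map each region to a hypothetical protection level by exposure."""
    return {
        region: ("strong" if t > threshold else "weak")
        for region, t in exposure.items()
    }


accesses = [
    Access(0.0, "A", True),
    Access(1.0, "A", False),
    Access(5.0, "A", False),
    Access(0.0, "B", True),
    Access(0.5, "B", False),
    Access(9.0, "B", True),
]
exposure = vulnerable_time(accesses)
levels = choose_protection(exposure, threshold=2.0)
```

In this example, region A is read long after it was last touched and receives the stronger level, while region B is read quickly and then overwritten, so it stays at the weaker level. A real runtime would derive such intervals from its task dependency graph rather than from recorded timestamps.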
