HPC I/O Throughput Bottleneck Analysis with Explainable Local Models

SC20 Proceedings

Authors: Mihailo Isakov and Eliakin del Rosario (Texas A&M University); Sandeep Madireddy, Prasanna Balaprakash, Phillip H. Carns, and Robert Ross (Argonne National Laboratory (ANL)); and Michel A. Kinsy (Texas A&M University)

Abstract: With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years' worth of Darshan logs from the Argonne Leadership Computing Facility's Theta supercomputer to understand the causes of poor I/O throughput. We present Gauge: a data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding the behaviors of clusters of jobs, and interpreting I/O bottlenecks. By finding groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and analyzing these groups instead of individual jobs, we reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study in which a system owner using Gauge identified several clusters that do not conform to conventional I/O behaviors and found several potential improvements at both the application level and the system level.
