SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications

Authors: Tirthak Patel (Northeastern University); Zhengchun Liu, Rajkumar Kettimuthu, Paul Rich, and William Allcock (Argonne National Laboratory (ANL)); and Devesh Tiwari (Northeastern University)

Abstract: Analysis of HPC workloads and their resource consumption characteristics is key to driving better operational practices, informing system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community does not have easy access to long-term introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. This study is one of the largest of its kind, covering trends and characteristics for over three billion compute hours and 750 thousand jobs across a decade. We confirm several pieces of long-held conventional wisdom and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirements of future jobs with high accuracy, using features available prior to job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
