Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications
Data Analytics, Compression, and Management
Performance/Productivity Measurement and Evaluation
Resource Management and Scheduling
Time: Thursday, 19 November 2020, 11am - 11:30am EDT
Description: HPC workload analysis and resource consumption characteristics are key to driving better operational practices, informing system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community lacks easy access to long-term, introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. It is one of the largest studies of its kind, covering trends and characteristics for over three billion compute hours and 750 thousand jobs, spanning a decade. We confirm several pieces of long-held conventional wisdom and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirements of future jobs with high accuracy, using features available prior to job submission and without requiring any application-specific tracing or application-intrusive instrumentation.