Log-Based Identification, Classification, and Behavior Prediction of HPC Applications
Time: Friday, 13 November 2020, 12:30pm - 12:50pm EST
Description: Leadership supercomputers, such as those operated by the Argonne Leadership Computing Facility (ALCF), provide an important avenue for scientific exploration and discovery, enabling simulation, data analysis and visualization, and artificial intelligence at massive scale. As we move into the exascale supercomputing era in 2021 with the advent of Aurora, Frontier, and other exascale machines, it is important that we understand the interactions between the applications being run and the hardware they run on, so that we can optimize the use of these expensive and high-demand resources.
In previous work, we analyzed a collection of production machine scheduling and performance logs to better understand application behaviors and characteristics. This work further refines our understanding of how scientific users leverage leadership computing resources. We show that, for certain data analysis tasks, system-level hardware performance counters can serve as a lightweight, low-overhead alternative to more performance-intensive benchmarking and logging instrumentation. We also demonstrate a method for predicting application runtimes on leadership computing resources using data gathered from logging sources at job submission time.