SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool


Workshop: ISAV 2020: In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization

Authors: Christopher Kelly (Brookhaven National Laboratory); Sungsoo Ha (Amazon Web Services); Kevin Huck (University of Oregon); Hubertus Van Dam and Line Pouchard (Brookhaven National Laboratory); Gyorgy Matyasfalvi (Princeton University); Li Tang (Los Alamos National Laboratory); and Nicholas D’Imperio, Wei Xu, Shinjae Yoo, and Kerstin Van Dam (Brookhaven National Laboratory)


Abstract: Due to the sheer volume of data, it is typically impractical to analyze the detailed performance of an HPC application running at scale. While conventional small-scale benchmarking and scaling studies are often sufficient for simple applications, many modern workflow-based applications couple multiple elements with competing resource demands and complex inter-communication patterns, so their performance cannot easily be studied in isolation or at small scale. This work discusses Chimbuko, a performance analysis framework that provides real-time, in situ anomaly detection. By focusing specifically on performance anomalies and their origin (a.k.a. provenance), data volumes are dramatically reduced without losing necessary details. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework. We demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.




