Explainable Machine Learning Frameworks for Managing HPC Systems

SC20 Proceedings

Workshop: 2nd Workshop on Machine Learning for Computing Systems

Authors: Burak Aksar (Boston University, Sandia National Laboratories); Emre Ateş (Boston University); Vitus J. Leung (Sandia National Laboratories); and Ayse K. Coskun (Boston University)

Abstract: Recent research on supercomputing proposes a variety of machine learning frameworks that can detect performance variations, find optimal application configurations, perform intelligent scheduling or node allocation, and improve system security. Although these goals align well with HPC systems' needs, barriers such as the lack of user trust or the difficulty of debugging need to be overcome to enable the widespread adoption of such frameworks in production systems. This paper evaluates a new counterfactual time series explainability method and compares it against state-of-the-art explainability methods for supervised machine learning frameworks that use multivariate HPC system telemetry data. The counterfactual time series explainability method outperforms existing methods in terms of comprehensibility and robustness. We also show how explainability techniques can be used to debug machine learning frameworks and gain a better understanding of HPC system telemetry data.
