Explainable Machine Learning Frameworks for Managing HPC Systems
Time: Friday, 13 November 2020, 4:25pm - 4:55pm EST
Description: Recent research on supercomputing proposes a variety of machine learning frameworks that are able to detect performance variations, find optimum application configurations, perform intelligent scheduling or node allocation and improve system security. Although these goals align well with HPC systems' needs, barriers such as the lack of user trust or the difficulty of debugging need to be overcome to enable the widespread adoption of such frameworks in production systems. This paper evaluates a new counterfactual time series explainability method and compares it against state-of-the-art explainability methods for supervised machine learning frameworks that use multivariate HPC system telemetry data. The counterfactual time series explainability method outperforms existing methods in terms of comprehensibility and robustness. We also show how explainability techniques can be used to debug machine learning frameworks and gain a better understanding of HPC system telemetry data.