SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Early Prediction of High-Performance Computing Job Outcomes via Modeling System Text Logs

Workshop: 2nd Workshop on Machine Learning for Computing Systems

Authors: Alexandra DeLucia (Johns Hopkins University) and Elisabeth (Lissa) Moore (Los Alamos National Laboratory)

Abstract: The massive scale of high-performance computing (HPC) machines necessitates using automatic statistical methods to assist human operators in monitoring day-to-day behavior. We address the problem of identifying problematic compute jobs by modeling system logs, which record all activities on the machine in near-natural language form. We apply techniques from relational learning and human language technology, combined with domain knowledge, to extract features from system logs produced by approximately 10,000 HPC jobs. We evaluate the usefulness of these features via a random forest model to predict job outcome state. We compare our models to a baseline that mimics state-of-the-art human operator behavior, and find that the best-performing feature set is one that combines domain knowledge with simple aggregate metrics. Our method predicts job outcomes with an F1 score approaching 0.9 after a job has been running for 30 minutes, giving an average lead time of three hours before failure.
