Workshop: 2nd Workshop on Machine Learning for Computing Systems
Authors: Alexandra DeLucia (Johns Hopkins University) and Elisabeth (Lissa) Moore (Los Alamos National Laboratory)
Abstract: The massive scale of high-performance computing (HPC) machines necessitates using automatic statistical methods to assist human operators in monitoring day-to-day behavior. We address the problem of identifying problematic compute jobs by modeling system logs, which record all activities on the machine in near-natural language form. We apply techniques from relational learning and human language technology, combined with domain knowledge, to extract features from system logs produced by approximately 10,000 HPC jobs. We evaluate the usefulness of these features by training a random forest model to predict job outcome state. We compare our models to a baseline that mimics state-of-the-art human operator behavior, and find that the best-performing feature set is the one that combines domain knowledge with simple aggregate metrics. Our method predicts job outcomes with an F1 score approaching 0.9 after a job has been running for 30 minutes, giving an average lead time of three hours before failure.
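As a rough illustration of what "domain knowledge combined with simple aggregate metrics" might look like in practice, the sketch below computes a few per-job counts from raw log lines and evaluates predictions with an F1 score. It is a minimal sketch under stated assumptions, not the authors' pipeline: the keyword list, the tokenization rule, and the function names `aggregate_features` and `f1_score` are all hypothetical, and the random forest classifier mentioned in the abstract is deliberately left out so the example stays dependency-free.

```python
import re
from collections import Counter

# Hypothetical keywords standing in for operator "domain knowledge".
# These are illustrative assumptions, not terms taken from the paper.
DOMAIN_KEYWORDS = ("error", "timeout", "oom", "killed")


def aggregate_features(log_lines):
    """Turn one job's log lines into a dict of simple aggregate counts."""
    # Lowercase and split on non-alphanumeric characters (assumed tokenization).
    tokens = [tok for line in log_lines
              for tok in re.findall(r"[a-z0-9_]+", line.lower())]
    counts = Counter(tokens)
    features = {
        "n_lines": len(log_lines),
        "n_tokens": len(tokens),
        "n_unique_tokens": len(counts),
    }
    # One count per domain keyword; Counter returns 0 for absent tokens.
    for kw in DOMAIN_KEYWORDS:
        features["kw_" + kw] = counts[kw]
    return features


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like the one the abstract describes, feature dictionaries of this kind would be computed from the first 30 minutes of each job's logs and passed to a classifier, with the F1 score measured on held-out jobs.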