Workshop: 2nd Workshop on Machine Learning for Computing Systems
Authors: Virginia K. Felkner (University of Southern California) and Elisabeth (Lissa) Moore (Los Alamos National Laboratory)
Abstract: Each node of a supercomputer produces a detailed log of its operation, called a syslog. It is impossible for system administrators to review all syslog data produced by the thousands of compute nodes associated with a single HPC machine. Analysis of these logs to detect and predict failures, however, is crucial to maintaining the health of supercomputers. The majority of prior work using machine learning to study syslog has relied heavily on the semi-structured nature of system logs. This work investigates syslogs as unstructured, purely textual natural language data. We confirm that treating syslog output as unstructured natural language text does not perform well for node failure prediction, and that researchers must exploit the structure within syslog data to produce more useful results. To extract features from syslog text, we employ several popular word embeddings and then cluster both word- and message-level vectors. Finally, we prepare a dataset for supervised learning by aggregating the syslog into 15-minute time windows and extracting the distribution of clusters within each window. Our failure prediction models achieved a relatively low AUC of 0.59 using a gradient-boosted random forest. This barely outperforms random guessing, but does suggest the presence of signal that could be amplified in future work. We conclude that the incorporation of domain knowledge and structural information into predictive models, rather than a unilateral application of natural language processing techniques, is crucial to building deployable tools.
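
The windowing step mentioned in the abstract (aggregating messages into 15-minute windows and taking the distribution of clusters in each window) can be illustrated with a minimal sketch. This is not the authors' code: the function name `window_cluster_distributions`, the input format of `(unix_timestamp, cluster_id)` pairs, and the toy data are all assumptions made for illustration, and the cluster IDs are presumed to come from an earlier embedding-and-clustering step that is not shown.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 15 * 60  # 15-minute windows, as stated in the abstract


def window_cluster_distributions(messages, n_clusters):
    """Map each window start time to a normalized cluster histogram.

    `messages` is an iterable of (unix_timestamp, cluster_id) pairs.
    This input format is a hypothetical choice, not taken from the paper.
    """
    counts = defaultdict(Counter)
    for timestamp, cluster_id in messages:
        # Floor the timestamp to the start of its 15-minute window.
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start][cluster_id] += 1

    features = {}
    for window_start, counter in sorted(counts.items()):
        total = sum(counter.values())
        # Fraction of the window's messages that fall in each cluster.
        features[window_start] = [counter[c] / total for c in range(n_clusters)]
    return features


# Hypothetical toy data: three messages in the first window, one in the next.
example = [(0, 0), (10, 1), (899, 1), (900, 2)]
features = window_cluster_distributions(example, n_clusters=3)
```

Each resulting vector could serve as one row of a supervised-learning dataset, paired with a label indicating whether a node failure followed the window. How the labels are defined is not specified in the abstract, so it is left out of this sketch.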