SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

A Year of Automated Anomaly Detection in a Datacenter

Workshop: 2nd Workshop on Machine Learning for Computing Systems

Authors: Rufaida Ahmed, Joseph Porter, Abubaker Abdelmutalab, and Robert Ricci (University of Utah)

Abstract: Anomaly detection based on machine learning can be a powerful tool for understanding the behavior of large, complex computer systems in the wild. The set of anomalies seen, however, can change over time: as the system evolves, is put to different uses and encounters different workloads, both its typical behavior and the anomalies that it encounters can change as well. This naturally raises two questions: how effective is automated anomaly detection in this setting, and how much does anomalous behavior change over time?

In this paper, we examine these questions for a dataset taken from a system that manages the lifecycle of servers in datacenters. We look at logs from one year of operation of a datacenter of about 500 servers. Applying state-of-the-art techniques for finding anomalous events, we find that a core set of anomaly patterns persists over the entire period studied, but that to track the evolution of the system, we must re-train the detector periodically. Working with the administrators of this system, we find that, despite these changes, the patterns still contain actionable insights.
