SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

SEFEE: Lightweight Storage Error Forecasting in Large-Scale Enterprise Storage Systems


Authors: Amirhessam Yazdi (University of Nevada, Reno); Xing Lin (NetApp Inc); and Lei Yang and Feng Yan (University of Nevada, Reno)

Abstract: With their rapid growth in scale and complexity, today's enterprise storage systems must handle a large number of errors. Existing proactive methods mainly rely on machine learning models trained on SMART measurements. Such methods, however, are usually expensive to use in practice and apply only to a limited set of error types at a limited scale. We collected more than 23 million storage events over two years from 87 deployed NetApp-ONTAP systems managing 14,371 disks, and we propose SEFEE, a lightweight, training-free storage error forecasting method. SEFEE employs tensor decomposition to directly analyze storage error-event logs and perform online error prediction for all error types in all storage nodes. SEFEE explores hidden spatiotemporal information embedded at the global scale of storage systems to achieve record-breaking error forecasting accuracy with minimal prediction overhead.
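
Illustrative sketch (not the authors' implementation): the abstract does not specify how the tensor is built or which decomposition is used. The Python example below assumes error-event logs are aggregated into a count tensor indexed by (storage node, error type, time window), factorizes it with a plain CP decomposition fitted by alternating least squares, and forecasts the next window by averaging the most recent rows of the temporal factor. The tensor layout, the rank, the synthetic data, and the names `cp_als` and `forecast_next` are all assumptions made for illustration.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R) -> (I*J x R)."""
    r = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, r)

def unfold(x, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest in C order."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def cp_als(x, rank, n_iter=200, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor via alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Gram matrix of the Khatri-Rao product, computed cheaply as a
            # Hadamard product of the two small R x R Gram matrices.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = unfold(x, mode) @ kr @ np.linalg.pinv(gram)
    return factors

def forecast_next(factors, window=3):
    """Hypothetical forecasting rule: average the last `window` temporal
    factor rows and rebuild the (node x error type) slice for the next step."""
    node_f, type_f, time_f = factors
    t_next = time_f[-window:].mean(axis=0)
    return np.einsum('ir,jr,r->ij', node_f, type_f, t_next)

# Synthetic stand-in for aggregated event counts: 4 nodes, 3 error types,
# 12 time windows, generated from known rank-2 factors.
rng = np.random.default_rng(1)
true = [rng.random((4, 2)), rng.random((3, 2)), rng.random((12, 2))]
x = np.einsum('ir,jr,kr->ijk', *true)

factors = cp_als(x, rank=2)
pred = forecast_next(factors)   # predicted intensity per (node, error type)
print(pred.shape)
```

No model is trained ahead of time in this sketch: the factors are fitted directly to the observed tensor, which is one way to read "training-free", though the paper's actual procedure may differ.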



