BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160059Z
LOCATION:Track 2
DTSTART;TZID=America/New_York:20201118T133000
DTEND;TZID=America/New_York:20201118T140000
UID:submissions.supercomputing.org_SC20_sess166_pap113@linklings.com
SUMMARY:Live Forensics for HPC Systems: A Case Study on Distributed Storag
 e Systems
DESCRIPTION:Paper\n\nLive Forensics for HPC Systems: A Case Study on Distr
 ibuted Storage Systems\n\nJha, Cui, Banerjee, Xu, Enos...\n\nLarge-scale h
 igh-performance computing systems frequently experience a wide range of fa
 ilure modes, such as reliability failures (e.g., hang or crash), and resou
 rce overload-related failures (e.g., congestion, collapse), impacting syst
 ems and applications. Despite the adverse effects of these failures, curre
 nt systems do not provide methodologies for proactively detecting, localiz
 ing and diagnosing failures. We present Kaleidoscope, a near real-time fai
 lure detection and diagnosis framework, consisting of hierarchical domain-
 guided machine learning models that identify the failing components and th
 e corresponding failure mode, and point to the most likely cause indicativ
 e of the failure in near real-time (within one minute of failure occurrenc
 e). Kaleidoscope has been deployed on the Blue Waters supercomputer and ev
 aluated with more than two years of production telemetry data. Our evaluat
 ion shows that Kaleidoscope successfully localized 99.3% and pinpointed th
 e ro
 ot causes of 95.8% of 843 real-world production issues, with less than 0.0
 1% runtime overhead.\n\nTag: Fault Tolerance, Reliability and Resiliency, 
 Storage\n\nRegistration Category: Tech Program Reg Pass\n\nAward Finalist:
  Best Paper Finalist, Best Student Paper Finalists
END:VEVENT
END:VCALENDAR

