BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160103Z
LOCATION:Track 2
DTSTART;TZID=America/New_York:20201119T100000
DTEND;TZID=America/New_York:20201119T103000
UID:submissions.supercomputing.org_SC20_sess151_pap551@linklings.com
SUMMARY:Scalable Heterogeneous Execution of a Coupled-Cluster Model with P
 erturbative Triples
DESCRIPTION:Paper\n\nScalable Heterogeneous Execution of a Coupled-Cluster
  Model with Perturbative Triples\n\nKim\, Panyala\, Peng\, Kowalski\, Sada
 yappan...\n\nThe CCSD(T) coupled-cluster model with perturbative triples i
 s considered a gold standard for computational modeling of the correlated
  behavior of electrons in molecular systems. A fundamental constraint is t
 he relatively small global memory capacity in GPUs compared to the main me
 mory capacity on host nodes\, necessitating relatively smaller tile sizes
  for high-dimensional tensor contractions in NWChem's GPU-accelerated impl
 ementation of the CCSD(T) method. A coordinated redesign is described to a
 ddress this limitation and associated data movement overheads\, including
  a novel fused GPU kernel for a set of tensor contractions\, along with in
 ter-node communication optimization and data caching. The new implementati
 on of GPU-accelerated CCSD(T) improves overall performance by 3.4x. We dis
 cuss the trade-offs in using this fused algorithm on current and future su
 percomputing platforms.\n\nTag: Applications\, Scalable Computing\n\nRegis
 tration Category: Tech Program Reg Pass
END:VEVENT
END:VCALENDAR