BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160544Z
LOCATION:Poster Module
DTSTART;TZID=America/New_York:20201119T083000
DTEND;TZID=America/New_York:20201119T170000
UID:submissions.supercomputing.org_SC20_sess337_rpost115@linklings.com
SUMMARY:Optimization of Tensor-Product Operations in Nekbone on GPUs
DESCRIPTION:Posters, Research Posters\n\nOptimization of Tensor-Product Op
 erations in Nekbone on GPUs\n\nKarp, Jansson, Podobas, Schlatter, Markidis
 \n\nIn the CFD solver Nek5000, the computation is dominated by the evaluat
 ion of small tensor operations. Nekbone is a proxy app for Nek5000 and has
  previously been ported to GPUs with a mixed OpenACC and CUDA approach. In
  this work, we continue this effort and further optimize the main tensor-p
 roduct operation in Nekbone. Our optimization is implemented in CUDA and u
 ses a different (2D) thread structure to perform the computations layer by
  layer. Th
 e results show that our implementation outperforms previous GPU Nekbone im
 plementations by 6% to 10% on Pascal and Volta GPU architectures. Compared
  to a measured roofline, we obtain 77% to 92% of the peak performance for 
 both NVIDIA P100 and V100 GPUs for inputs with 1024 to 4096 elements and p
 olynomial degree 9. In this poster we discuss our findings and bring up fu
 ture considerations as we move toward exascale CFD simulations.\n\nRegistr
 ation Category: Tech Program Reg Pass, Exhibits Reg Pass
END:VEVENT
END:VCALENDAR