BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210402T160555Z
LOCATION:Track 8
DTSTART;TZID=America/New_York:20201112T154500
DTEND;TZID=America/New_York:20201112T161000
UID:submissions.supercomputing.org_SC20_sess214_ws_lasalss107@linklings.co
 m
SUMMARY:Recursive Basic Linear Algebra Operations on TensorCore GPU
DESCRIPTION:Workshop\n\nRecursive Basic Linear Algebra Operations on Tenso
 rCore GPU\n\nZhang, Karihaloo, Wu\n\nMotivated by the demand for
  high-speed matrix computations and deep neural network training,
  TensorCore was introduced in NVIDIA GPUs to further accelerate
  matrix-matrix multiplication. It supports very fast half-precision
  general matrix-matrix multiplications (GEMMs), which are around 8x
  faster than single-precision CUDA core GEMMs. So far, the use of
  TensorCore GPUs for matrix operations other than matrix-matrix
  multiplication is underdeveloped. In this paper, we propose efficient
  BLAS3 operations that exploit TensorCore. The experimental results show
  that the proposed algorithms outperform the corresponding cuBLAS
  routines and the naive TensorCore implementation with up to 4.7x
  speedup.\n\nTag: Algorithms, Extreme Scale Computing,
  Performance/Productivity Measurement and Evaluation, Scalable
  Computing, Scientific Computing\n\nRegistration Category: Workshop Reg
  Pass
END:VEVENT
END:VCALENDAR

