Recursive Basic Linear Algebra Operations on TensorCore GPU
Performance/Productivity Measurement and Evaluation
Time: Thursday, 12 November 2020, 3:45pm - 4:10pm EDT
Description: Motivated by the demand for high-speed matrix computations and for training deep neural networks, TensorCore was introduced in NVIDIA GPUs
to further accelerate matrix-matrix multiplication. It supports very fast half-precision general matrix-matrix multiplications (GEMMs), around 8x faster than single-precision CUDA core GEMMs.
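As a numeric illustration (not code from the paper): TensorCore matrix units multiply FP16 operands but accumulate the partial products in FP32, which recovers much of the accuracy lost by pure half-precision arithmetic. A minimal NumPy sketch of that data path:

```python
import numpy as np

def gemm_fp16_fp32_accumulate(A, B):
    """Mimic the TensorCore MMA data path: FP16 inputs, FP32 accumulation."""
    A16 = A.astype(np.float16)  # round inputs to half precision
    B16 = B.astype(np.float16)
    # Promote to float32 before the product so the sum of products
    # accumulates in FP32 rather than FP16.
    return A16.astype(np.float32) @ B16.astype(np.float32)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

ref = A @ B                              # full FP32 reference
mixed = gemm_fp16_fp32_accumulate(A, B)  # FP16 inputs, FP32 accumulation
rel_err = np.max(np.abs(ref - mixed)) / np.max(np.abs(ref))
print(f"max relative error: {rel_err:.2e}")
```

The error here comes almost entirely from rounding the inputs to FP16; the FP32 accumulator keeps the dot-product summation itself accurate.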
So far, the use of TensorCore GPUs for matrix operations other than
matrix-matrix multiplication remains underdeveloped.
In this paper, we propose efficient BLAS3 operations that exploit TensorCore. Experimental results show that the proposed algorithms outperform the corresponding cuBLAS routines and a naive TensorCore implementation, with speedups of up to 4.7x.
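The abstract does not detail the algorithms, but a standard way recursive BLAS3 routines exploit a fast GEMM is to split the problem so most flops land in GEMM updates. A hypothetical NumPy sketch for a recursive lower-triangular solve (TRSM), where the off-diagonal update is exactly the GEMM a TensorCore kernel would accelerate:

```python
import numpy as np

def recursive_trsm(L, B, base=32):
    """Hypothetical sketch (not the authors' code): solve L X = B for
    lower-triangular L by recursive splitting. The dominant cost is the
    GEMM update, which a TensorCore GEMM could accelerate."""
    n = L.shape[0]
    if n <= base:
        # Small base case: direct solve on the triangular block.
        return np.linalg.solve(L, B)
    k = n // 2
    # Solve the top-left triangular block.
    X1 = recursive_trsm(L[:k, :k], B[:k], base)
    # GEMM update of the bottom half -- the TensorCore-friendly step.
    B2 = B[k:] - L[k:, :k] @ X1
    # Solve the bottom-right triangular block.
    X2 = recursive_trsm(L[k:, k:], B2, base)
    return np.vstack([X1, X2])

rng = np.random.default_rng(1)
n = 128
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # well-conditioned
B = rng.standard_normal((n, 8))
X = recursive_trsm(L, B)
print(np.allclose(L @ X, B))  # → True
```

The recursion is a design choice: each level converts half of the remaining work into GEMM, so the fraction of time spent in triangular base cases shrinks geometrically.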