Authors: Martin Karp, Niclas Jansson, Artur Podobas, Philipp Schlatter, and Stefano Markidis (KTH Royal Institute of Technology)
Abstract: In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and further optimize the main tensor-product operation in Nekbone. Our optimization is implemented in CUDA and uses a different (2D) thread structure to perform the computations layer by layer. The results show that our implementation outperforms previous GPU Nekbone implementations by 6% to 10% on the Pascal and Volta GPU architectures. Compared to a measured roofline, we obtain 77% to 92% of the peak performance on both NVIDIA P100 and V100 GPUs for inputs with 1024 to 4096 elements and polynomial degree 9. In this poster, we discuss our findings and bring up future considerations as we move toward exascale CFD simulations.
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF