Two-Stage Asynchronous Iterative Solvers for Multi-GPU Clusters
Event Type: Workshop
Algorithms
Extreme Scale Computing
Performance/Productivity Measurement and Evaluation
Scalable Computing
Scientific Computing
Time: Thursday, 12 November 2020, 12:40pm - 1:05pm EDT
Location: Track 8
Description: Given the trend of supercomputers accumulating much of their compute power in GPU accelerators composed of thousands of cores and operating in streaming mode, global synchronization points become a bottleneck, severely limiting application performance. As a consequence, asynchronous methods that break up the bulk-synchronous programming model are becoming increasingly attractive. In this paper, we study a GPU-focused asynchronous version of the Restricted Additive Schwarz (RAS) method that employs preconditioned Krylov subspace methods as subdomain solvers. We analyze the method for various parameters, such as the local solver tolerance and iteration count. Leveraging the multi-GPU architecture of Summit, we show that these two-stage methods are more memory- and time-efficient than asynchronous RAS using direct subdomain solvers. We also demonstrate their superiority over synchronous counterparts and present results using one-sided CUDA-aware MPI on up to 36 NVIDIA V100 GPUs.
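To make the two-stage structure concrete, the outer fixed-point update can be sketched as follows (generic RAS notation, not taken from the paper: R_i restricts to overlapping subdomain i, \tilde{R}_i^T prolongs without the overlap, and p is the number of subdomains):

    x^{k+1} = x^{k} + \sum_{i=1}^{p} \tilde{R}_i^{T} \, \widehat{A}_i^{-1} R_i \left( b - A x^{k} \right), \qquad A_i = R_i A R_i^{T},

where \widehat{A}_i^{-1} \approx A_i^{-1} stands for the inner stage: a preconditioned Krylov solve truncated at a prescribed tolerance or iteration count. In the asynchronous variant, each subdomain applies this update using the most recently received neighbor values rather than a globally consistent iterate x^k.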
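As an illustration of the communication style the abstract mentions, the following is a minimal C sketch of a one-sided halo exchange, assuming a CUDA-aware MPI build that supports RMA on device buffers. The buffer length, 1-D neighbor choice, and interleaved local solves are placeholders, not the authors' implementation.

/*
 * Minimal sketch (not the paper's code): one-sided halo exchange with
 * CUDA-aware MPI. Each rank exposes a GPU-resident buffer through an
 * MPI window; neighbors deposit boundary data with MPI_Put and flush,
 * so no rank ever waits in a matching receive or a global barrier.
 */
#include <mpi.h>
#include <cuda_runtime.h>

static void exchange_halo(const double *d_send, MPI_Win win,
                          int neighbor, int n)
{
    /* Deposit n doubles from our GPU buffer directly into the
       neighbor's GPU-resident window (displacement 0 here). */
    MPI_Put(d_send, n, MPI_DOUBLE, neighbor, 0, n, MPI_DOUBLE, win);
    /* Complete the transfer without requiring the neighbor to
       participate -- the asynchronous ingredient. */
    MPI_Win_flush(neighbor, win);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;              /* placeholder halo length */
    double *d_halo, *d_send;         /* device buffers          */
    cudaMalloc((void **)&d_halo, n * sizeof(double));
    cudaMalloc((void **)&d_send, n * sizeof(double));
    cudaMemset(d_send, 0, n * sizeof(double));

    MPI_Win win;
    MPI_Win_create(d_halo, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch: all windows stay accessible for the whole
       solve, so subdomain sweeps and puts can interleave freely. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    int neighbor = (rank + 1) % size;    /* placeholder 1-D neighbor */
    exchange_halo(d_send, win, neighbor, n);
    /* ... local preconditioned Krylov sweeps interleaved with
       further exchange_halo() calls would go here ... */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    cudaFree(d_halo);
    cudaFree(d_send);
    MPI_Finalize();
    return 0;
}

The passive-target epoch (MPI_Win_lock_all / MPI_Win_unlock_all) is what removes the global synchronization point: each GPU's subdomain solve can consume whatever neighbor data has arrived, matching the asynchronous iteration sketched above.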