Workshop: ExaMPI: Workshop on Exascale MPI
Authors: Bharath Ramesh, Kaushik Kandadi Suresh, Nick Sarkauskas, Mohammadreza Bayatpour, Jahanzeb Maqbool Hashmi, Hari Subramoni, and Dhabaleswar K. Panda (Ohio State University)
Abstract: The Message Passing Interface (MPI) is the de facto standard for designing and executing applications on massively parallel hardware. MPI collectives provide a convenient abstraction for multiple processes/threads to communicate with one another. Mellanox’s HDR InfiniBand switches provide Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) capabilities to offload collective communication to the network and reduce CPU involvement in the process. In this paper, we propose, design, and implement SHARP-based solutions for MPI_Reduce and MPI_Barrier in MVAPICH2-X. We evaluate the impact of the proposed and existing SHARP-based solutions for the MPI_Allreduce, MPI_Reduce, and MPI_Barrier operations on the performance of these collectives on the 8th-ranked TACC Frontera HPC system. Our experimental evaluation of the SHARP-based designs shows latency reductions of up to 5.4x for MPI_Reduce, 5.1x for MPI_Allreduce, and 7.1x for MPI_Barrier over a host-based solution at the full system scale of 7,861 nodes.