SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

A Statistical Analysis of Error in MPI Reduction Operations


Workshop: Correctness 2020: 4th International Workshop on Software Correctness for HPC Applications

Authors: Samuel D. Pollard and Boyana Norris (University of Oregon)


Abstract: This work explores the effects of the nonassociativity of floating-point addition on Message Passing Interface (MPI) reduction operations. Previous work indicates that floating-point summation error comprises two independent factors: error based on the summation algorithm and error based on the summands themselves. We find evidence to suggest that, for MPI reductions, the error based on the summands has a much greater effect than the error based on the summation algorithm. We begin by sampling from the state space of all possible summation orders for MPI reduction algorithms. Next, we show the effect of different random number distributions on summation error, taking a 1000-digit precision floating-point accumulator as ground truth. Our results show empirical error bounds that are much tighter than existing analytical bounds. Finally, we simulate different allreduce algorithms on the high performance computing (HPC) proxy application Nekbone and find that the error is relatively stable across algorithms. Our approach provides HPC application developers with more realistic error bounds for MPI reduction operations. Quantifying the small but nonzero discrepancies between reduction algorithms can help developers ensure correctness and aid reproducibility across MPI implementations and cluster topologies.
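The order dependence described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the two summation orders (a left-to-right chain and a balanced pairwise tree) are only stand-ins for the orders an MPI reduction might use, and `math.fsum` is assumed here as a convenient correctly rounded reference in place of the 1000-digit accumulator the abstract mentions.

```python
import math
import random


def left_to_right_sum(values):
    """Sum in list order, like a linear chain of additions."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Sum with a balanced binary tree of additions (assumes a non-empty list)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


def sample_order_errors(values, trials=100, seed=0):
    """Shuffle the summands repeatedly and record the absolute error of each order.

    The reference is math.fsum, which returns a correctly rounded sum and
    therefore does not depend on the order of the summands.
    """
    rng = random.Random(seed)
    reference = math.fsum(values)
    shuffled = list(values)
    chain_errors, tree_errors = [], []
    for _ in range(trials):
        rng.shuffle(shuffled)
        chain_errors.append(abs(left_to_right_sum(shuffled) - reference))
        tree_errors.append(abs(pairwise_sum(shuffled) - reference))
    return chain_errors, tree_errors


if __name__ == "__main__":
    # Hypothetical input: 1024 summands drawn uniformly from [-1, 1).
    rng = random.Random(42)
    summands = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
    chain, tree = sample_order_errors(summands)
    print("max chain error:", max(chain))
    print("max tree error: ", max(tree))
```

Changing the distribution of `summands` (for example, all-positive values versus values of mixed sign) and comparing the resulting error samples gives a rough feel for the two factors the abstract separates: the summation order and the summands themselves.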




