Workshop: PAW-ATM 2020: The 3rd Annual Parallel Applications Workshop, Alternatives To MPI+X
Authors: Alexandre Bardakoff (National Institute of Standards and Technology (NIST); LIMOS, University of Clermont Auvergne); Walid Keyrouz and Timothy Blattner (National Institute of Standards and Technology (NIST)); Bruno Bachelet (LIMOS, University of Clermont Auvergne); Loïc Yon (ISIMA; LIMOS, University of Clermont Auvergne); and Gerson Kroiz (Department of Mathematics and Statistics, University of Maryland, Baltimore County (UMBC))
Abstract: Obtaining performance on high-end heterogeneous nodes is challenging because of the large semantic gap between a computation's specification, possibly mathematical formulas or an abstract sequential algorithm, and its parallel implementation; this gap obscures the program's parallel structure and how it gains or loses performance. We present Hedgehog, a library aimed at coarse-grained parallelism. It explicitly creates a data-flow graph within a program and uses this graph at runtime to drive the program's execution so that it takes advantage of hardware parallelism (multicore CPUs and multiple accelerators).
Hedgehog provides a separation of concerns and distinguishes between compute and state maintenance tasks. Its API reflects this separation and allows a developer to gain a better understanding of performance when executing the graph.
Hedgehog is implemented as a C++17 header-only library. One feature of the framework is its low overhead: it transfers control of data between two nodes in ~1 microsecond. This low overhead combines with Hedgehog's API to provide essentially cost-free profiling of the graph, which enables performance experimentation and enhances a developer's insight into a program's performance.
Hedgehog's asynchronous data-flow graph supports a data-streaming programming model both within and between graphs. We demonstrate the effectiveness of this approach with streaming implementations of two numerical linear algebra routines whose performance is comparable to existing libraries: matrix multiplication achieves 95% of the theoretical peak of 4 GPUs, and LU decomposition with partial pivoting starts streaming blocks of the final result 40x earlier than an implementation that waits for the full result.