A Case Study and Characterization of a Many-Socket, Multi-Tier NUMA HPC Platform
Parallel Programming Languages, Libraries, and Models
Resource Management and Scheduling
Time: Wednesday, 11 November 2020, 11:15am - 11:40am EST
Description: As the number of processor cores and sockets on HPC compute nodes increases and systems expose more hierarchical non-uniform memory access (NUMA) architectures, efficiently scaling applications within even a single shared memory system is becoming more challenging. It is now common for HPC compute nodes to have two or more sockets and dozens of cores, but future generation systems may contain an order of magnitude more of each. We conduct experiments on a state-of-the-art Intel Xeon Platinum system with 12 processor sockets, totaling 288 cores (576 hardware threads), arranged in a multi-tier NUMA hierarchy. Platforms of this scale and memory hierarchy are uncommon today, giving us a unique opportunity to empirically evaluate, rather than model or simulate, an architecture potentially representative of future HPC compute nodes. We quantify the platform's multi-tier NUMA patterns, then evaluate its suitability for HPC workloads using a modern HPC metagenome assembler application as a case study, along with other HPC benchmarks that use a variety of parallelization techniques, to characterize the system's performance, scalability, I/O patterns, and performance/power behavior. Our results demonstrate near-perfect scaling for embarrassingly parallel and weak scaling workloads, but challenges for random memory access workloads. For the latter, we find poor scaling performance with the default scheduling approaches (e.g., those that do not pin threads), suggesting that userspace or kernel schedulers may require changes to better manage the multi-tier NUMA hierarchies of very large shared memory platforms.
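
To make the thread-pinning point in the description concrete, the following is a minimal Python sketch of two worker-to-core placement policies on a machine with the stated 12 sockets and 288 cores. It is not the authors' method. The contiguous core numbering in `socket_of`, the even split of cores across sockets, and the names `compact_placement`, `scatter_placement`, and `pin_current_thread` are all assumptions made for illustration; real core numbering should be read from the operating system.

```python
import os
from typing import List

SOCKETS = 12                          # from the description
CORES = 288                           # from the description
CORES_PER_SOCKET = CORES // SOCKETS   # assumes cores are split evenly across sockets


def socket_of(core: int) -> int:
    """Hypothetical mapping: assumes cores are numbered contiguously per socket."""
    return core // CORES_PER_SOCKET


def compact_placement(n_workers: int) -> List[int]:
    """Fill one socket completely before using the next (keeps workers close)."""
    if not 0 <= n_workers <= CORES:
        raise ValueError("n_workers must be between 0 and the core count")
    return list(range(n_workers))


def scatter_placement(n_workers: int) -> List[int]:
    """Spread workers round-robin across sockets (maximizes memory controllers used)."""
    if not 0 <= n_workers <= CORES:
        raise ValueError("n_workers must be between 0 and the core count")
    return [(w % SOCKETS) * CORES_PER_SOCKET + (w // SOCKETS)
            for w in range(n_workers)]


def pin_current_thread(core: int) -> bool:
    """Pin the calling thread to one core; returns False where the OS lacks support."""
    if not hasattr(os, "sched_setaffinity"):   # the call is only available on some Unix systems
        return False
    os.sched_setaffinity(0, {core})            # 0 means "the calling thread"
    return True
```

Under these assumptions, a worker would call `pin_current_thread(placement[i])` once at startup with its own index `i`. Which policy suits a given workload depends on its memory access pattern, which is the kind of question the study measures empirically.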