GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training
Accelerators, FPGA, and GPUs
Machine Learning, Deep Learning and Artificial Intelligence
Time: Wednesday, 18 November 2020, 11am - 11:30am EDT
Description: Data-parallelism has become an established paradigm for training DNNs that fit in GPU memory on large-scale HPC systems. Model-parallelism, however, is required to train out-of-core DNNs, that is, models too large for the memory of a single GPU. In this paper, we deal with the emerging requirements of very large DNNs trained on the high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS, a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes, namely GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid, that offer excellent speedups over state-of-the-art systems such as Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-layer ResNet-1k model on 1024 Volta V100 GPUs with 97.32% scaling efficiency. For a real-world histopathology whole-slide image (WSI) of 100,000 x 100,000 pixels, we train a custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.
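The figures quoted in the description can be put in perspective with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper names `tiles_per_slide` and `speedup` are hypothetical, and the tile count assumes non-overlapping tiles with the slide padded up to a whole number of tiles per side, which the description does not specify.

```python
import math


def tiles_per_slide(slide_px: int, tile_px: int) -> int:
    """Number of square tiles needed to cover a square slide.

    Assumption (not stated in the description): tiles do not overlap and
    the last row/column is padded, hence the ceiling.
    """
    per_side = math.ceil(slide_px / tile_px)
    return per_side * per_side


def speedup(before_minutes: float, after_minutes: float) -> float:
    """Ratio of the original training time to the reduced training time."""
    return before_minutes / after_minutes


if __name__ == "__main__":
    # 100,000 x 100,000 pixel WSI cut into 1024 x 1024 tiles.
    print("tiles per slide:", tiles_per_slide(100_000, 1024))
    # Seven hours reduced to 28 minutes.
    print("training speedup:", speedup(7 * 60, 28))
```

Under these assumptions a single slide yields 98 tiles per side, or 9,604 tiles in total, and the reduction from seven hours to 28 minutes corresponds to a 15x speedup.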