SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training


Authors: Arpan Jain, Ammar Ahmad Awan, Asmaa M. Aljuhani, Jahanzeb Maqbool Hashmi, Quentin G. Anthony, Hari Subramoni, Dhabaleswar K. Panda, Raghu Machiraju, and Anil Parwani (Ohio State University)

Abstract: Data-parallelism has become an established paradigm for training DNNs that fit in GPU memory on large-scale HPC systems. Model-parallelism, however, is required to train out-of-core DNNs. In this paper, we address emerging requirements brought forward by very large DNNs being trained on the high-resolution images common in digital pathology. To address these requirements, we propose, design, and implement GEMS, a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes, namely GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid, that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-layer ResNet-1k model using 1024 Volta V100 GPUs with 97.32% scaling efficiency. For real-world histopathology whole-slide images (WSIs) of 100,000 x 100,000 pixels, we train a custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.
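
For readers unfamiliar with layer-wise model parallelism, the following is a minimal, illustrative sketch of the general idea: partitioning a network's layers across two GPUs so that a model too large for one device's memory can still be trained. This is a generic PyTorch example written for this abstract, not the GEMS implementation; the layer split, device names (cuda:0, cuda:1), and the toy CNN are assumptions for illustration only.

# Minimal sketch of layer-wise model parallelism (illustrative, not GEMS).
# Assumes PyTorch and two visible GPUs: cuda:0 and cuda:1.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """A toy CNN split across two GPUs: early layers on cuda:0,
    later layers and the classifier on cuda:1."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.part0 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        ).to("cuda:0")
        self.part1 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        ).to("cuda:1")

    def forward(self, x):
        x = self.part0(x.to("cuda:0"))
        # Activations cross the GPU boundary here; a memory-aware system
        # would overlap this transfer with computation.
        return self.part1(x.to("cuda:1"))

model = TwoGPUModel()
tile = torch.randn(1, 3, 1024, 1024)   # one 1024 x 1024 image tile
logits = model(tile)

Combining such a model-parallel replica with data-parallelism (replicating the split model across many GPU groups and averaging gradients) is the general pattern the abstract refers to when describing training on 1024 GPUs.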



