
SC20 Virtual Platform
Herring: Rethinking the Parameter Server at Scale for the Cloud
Event Type
Paper
Tags
Accelerators, FPGA, and GPUs
Machine Learning, Deep Learning and Artificial Intelligence
Scalable Computing
Registration Categories
TP
Time: Wednesday, 18 November 2020, 10:30am - 11am EST
Location: Track 3
Description: Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a rethinking of the parameter-server approach to address its fundamental inefficiencies.

To this end, we introduce a novel communication library, Herring, which is designed to alleviate the performance bottlenecks in parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like BERT using Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs.
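For readers unfamiliar with the pattern the abstract revisits, here is a minimal sketch of parameter-server-style gradient reduction: workers push local gradients to a central server, which averages them, applies the update, and lets workers pull the new parameters. All names and structure here are illustrative assumptions, not Herring's actual API.

```python
# Illustrative sketch of the parameter-server gradient-reduction pattern.
# This is NOT Herring's implementation; class and method names are assumed.

class ParameterServer:
    """Holds the model parameters; workers push gradients each step,
    then pull the updated parameters back."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr
        self.pending = []  # gradients received from workers this step

    def push(self, grad):
        # A worker sends its locally computed gradient to the server.
        self.pending.append(grad)

    def apply_and_pull(self):
        # Average the pushed gradients element-wise and take an SGD step.
        n = len(self.pending)
        avg = [sum(g[i] for g in self.pending) / n
               for i in range(len(self.params))]
        self.params = [p - self.lr * a for p, a in zip(self.params, avg)]
        self.pending = []
        # Workers pull the refreshed parameters for the next iteration.
        return list(self.params)


# Two workers push their gradients; the server averages and updates.
server = ParameterServer([0.0, 0.0], lr=0.5)
for g in ([1.0, 2.0], [3.0, 4.0]):
    server.push(g)
print(server.apply_and_pull())  # averaged grad [2.0, 3.0] -> params [-1.0, -1.5]
```

In contrast, all-reduce methods compute the same averaged gradient peer-to-peer with no central server, which is what made them attractive as cluster sizes grew; Herring's claim is that modern cloud fabrics (EFA, SRD) remove the bottlenecks that originally pushed the field away from the server-based design.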