Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He (Microsoft Corporation)
Abstract: Large deep learning models offer significant accuracy gains, but training models with billions of parameters is challenging. Existing solutions have fundamental limitations: they struggle to fit these models into limited device memory while remaining efficient. Our solution, the Zero Redundancy Optimizer (ZeRO), optimizes memory, vastly improving throughput while increasing the model size that can be trained. ZeRO eliminates memory redundancies, allowing the model size to scale in proportion to the number of devices with sustained high efficiency. ZeRO can scale beyond 1 trillion parameters using today’s hardware.
Our implementation of ZeRO can train models of over 100B parameters on 400 GPUs with super-linear speedup, achieving 15 petaflops. This represents an 8x increase in model size and a 10x increase in achievable performance. ZeRO can train large models of up to 13B parameters without requiring model parallelism, which is harder for scientists to apply. Researchers have used ZeRO to create the world’s largest language model (17B parameters) with record-breaking accuracy.
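The claim that model size scales in proportion to the number of devices can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the abstract. It assumes mixed-precision Adam training at roughly 16 bytes of model state per parameter, and it ignores activations, temporary buffers, and fragmentation. The function names and the 32 GB device size are hypothetical choices made for illustration.

```python
# Assumption: 16 bytes of model state per parameter under mixed-precision Adam
# (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights, momentum, variance).
BYTES_PER_PARAM = 16


def model_state_bytes_per_device(num_params: int, num_devices: int, partitioned: bool) -> float:
    """Model-state memory each device holds, in bytes.

    partitioned=False models plain data parallelism, where every device
    keeps a full replica. partitioned=True models the redundancy-free case,
    where the states are split evenly across devices.
    """
    total = num_params * BYTES_PER_PARAM
    return total / num_devices if partitioned else float(total)


def max_params(device_bytes: int, num_devices: int, partitioned: bool) -> int:
    """Largest parameter count whose model states fit in the given device memory."""
    per_device = device_bytes // BYTES_PER_PARAM
    return per_device * num_devices if partitioned else per_device


if __name__ == "__main__":
    device_bytes = 32 * 10**9  # hypothetical 32 GB accelerator
    for n in (1, 64, 1024):
        replicated = max_params(device_bytes, n, partitioned=False)
        sharded = max_params(device_bytes, n, partitioned=True)
        print(f"{n:>5} devices: replicated {replicated:.2e} params, partitioned {sharded:.2e} params")
```

Under these assumptions, the replicated limit stays fixed no matter how many devices are added, while the partitioned limit grows linearly with the device count. A model with 1 trillion parameters would need about 16 TB of model state in total, or roughly 15.6 GB per device when split across 1024 devices.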