SC20 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Designing and Building Next-Generation Computer Systems for Deep Learning

Authors: Volodymyr Kindratenko (University of Illinois, National Center for Supercomputing Applications (NCSA)), Morris Riedel (University of Iceland, Juelich Supercomputing Centre), Yangang Wang (Chinese Academy of Sciences)

Abstract: Deep learning (DL) relies heavily on fast hardware and parallel algorithms to train complex neural networks. This BoF will bring together researchers and developers working on the design of next-generation computer systems for DL and on parallel DL algorithms that can exploit the potential of these new systems. Research teams working on major deep learning systems deployed in the field will be invited to discuss the latest hardware and software trends and to exchange views on the role that Artificial Intelligence (AI) in general, and DL in particular, will play in the near future for big data analytics and HPC applications.

Long Description: Deep learning (DL) has emerged in the last few years as an enabling technology for a range of novel applications that were previously considered impossible to realize because of high computational complexity and the unavailability of sufficiently large data sets. Examples include self-driving cars, real-time speech translators, and personal assistants, to name a few. The high-performance computing (HPC) community is also making use of these techniques, with notable examples ranging from studying molecular systems for drug discovery to gravitational wave analysis for estimating properties of colliding black holes.

GPUs made the deep networks at the core of these applications trainable in an acceptable time. Parallel training algorithms are now readily available in popular frameworks, such as TensorFlow and PyTorch, and are widely used. However, significant challenges remain in bringing DL closer to the edge, where real-time decisions need to be made, and in scaling up DL algorithms to enable faster processing of bigger data sets while using more complex models. These challenges are being addressed in part through the development of novel computer architectures tailored towards DL algorithms, such as Google’s TPUs; the re-emergence of FPGA-based accelerators from Xilinx and Intel tailored towards DL; and scalable parallel network training algorithms that can make use of a large number of compute units, such as the Horovod framework from Uber. This BoF is envisioned as a place for leading developers of DL hardware and software to brief the community on upcoming systems and future plans, both in the academic research domain and in commercial applications, and to stimulate discussion about these systems and their potential use by the HPC community.

The BoF is envisioned as a community-building event co-organized by researchers from the National Center for Supercomputing Applications in the US, Juelich Supercomputing Centre in Europe, and the Computer Network Information Center, Chinese Academy of Sciences. All participating organizations have deployed systems tailored for DL applications and are working on the development of next-generation systems for DL at scale. Leading technology developers and providers from the US, Europe, and China will be invited to present on both the latest hardware for DL and software frameworks. In addition, research centers that have recently deployed major DL systems, or are working on an upcoming deployment, will be invited to share their views on the architecture, the applicability and limitations of existing DL frameworks, and case studies.

This BoF was previously organized at SC18 and SC19, where it was a well-attended and highly engaging event with discussions continuing well past the 1.5-hour time allocation. This BoF will complement the 6th Workshop on Machine Learning in HPC Environments (MLHPC’20) by focusing on future trends in next-generation computer systems for DL and, to some degree, on parallel DL algorithms tailored for these systems, whereas the main focus of the workshop is on the software aspects of DL at scale. Our experience from SC18 and SC19 indicates that this BoF has very little, if any, overlap with the workshop topics.

