Toward Interactive, Reproducible Analytics at Scale on HPC Systems
Time: Friday, 13 November 2020, 4:15pm - 4:40pm EDT
Description: The growth in scientific data volumes has resulted in a need to scale up processing and analysis pipelines using high-performance computing (HPC) systems. These workflows need interactive, reproducible analytics at scale. The Jupyter platform provides core capabilities for interactivity but was not designed for HPC systems. In this paper, we outline our efforts to bring together core technologies based on the Jupyter platform to create interactive, reproducible analytics at scale on HPC systems. Our work is grounded in a real-world science use case: applying geophysical simulations and inversions for imaging the subsurface. We describe a user study that we conducted to identify gaps and requirements to drive our work. Our core platform addresses three key areas of the scientific analysis workflow: reproducibility, scalability and interactivity. We describe our implementation of a system, based on Binder software, which allows us to capture a set of Jupyter notebooks, along with the software environment needed to run them, as a container. These reproducible containers, based on Docker technology, can be launched on NERSC HPC systems as live Jupyter analysis environments. We discuss our approach to capturing provenance in these environments with the Science Capsule framework. We show how these analyses can then be scaled up on HPC compute resources using the Dask framework, while interactively capturing and rendering results in the notebook. Finally, we describe how we can interactively visualize real-time streams of HDF5 data from our underlying science application.