Understanding I/O behavior of Scientific Deep Learning Applications in HPC systems
Time: Thursday, 19 November 2020, 8:30am - 5pm EDT
Description: Deep learning has been widely utilized in various science domains to achieve unprecedented results. These applications typically rely on massive datasets to train the networks. As the size of datasets grows rapidly, I/O becomes a major bottleneck in large-scale distributed training. We characterize the I/O behavior of several scientific deep learning applications running on our production machine, Theta, at the Argonne Leadership Computing Facility, with the goal of identifying potential bottlenecks and providing guidance for developing efficient parallel I/O libraries for scientific deep learning. We found that workloads utilizing the TensorFlow data pipeline can achieve efficient I/O by overlapping I/O with computation; however, they face potential scaling issues at larger scales because POSIX I/O is used underneath rather than parallel I/O. We also identified directions for I/O optimization for workloads utilizing a custom data streaming function. These workloads can potentially benefit from data prefetching, data sieving, and asynchronous I/O.
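The overlap of I/O with computation mentioned above can be illustrated with a minimal sketch: a background thread reads ahead into a bounded buffer while the consumer processes the current batch, which is the core idea behind TensorFlow's `Dataset.prefetch`. The `prefetch` helper and the `load_batches` loader below are hypothetical names for illustration, not part of any application described in the talk.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, reading ahead in a background thread
    so that I/O overlaps with downstream computation (the idea behind
    tf.data's Dataset.prefetch)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Hypothetical loader standing in for reading training samples from storage.
def load_batches(n):
    for i in range(n):
        yield [i] * 4  # pretend this batch was read from disk

batches = list(prefetch(load_batches(3)))
```

With a real storage-backed loader, the producer thread would spend its time in read calls while the training step consumes already-buffered batches, hiding I/O latency as long as reads keep pace with computation.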