The Piz Daint supercomputer at CSCS provides an ideal platform for intensive deep learning workloads: it comprises thousands of NVIDIA Tesla GPU compute nodes connected by a high-speed interconnect.
In this two-day course, we looked at how to run distributed deep learning workloads with TensorFlow on Piz Daint. TensorFlow is one of the most popular numerical libraries for deep learning and contains an extensive collection of algorithms optimized to exploit hardware as efficiently as possible. Using simple examples, we demonstrated best practices for building efficient input pipelines that maximize the throughput of deep learning models.
The course included the following topics:
- Running TensorFlow on Piz Daint.
- Creating efficient input pipelines with TensorFlow's Dataset API to optimize throughput on Piz Daint.
- Reading and writing data as TFRecords files.
- Understanding the stochastic gradient descent and distributed synchronous stochastic gradient descent algorithms.
- Performing distributed training with TensorFlow and the ring allreduce algorithm implemented in Horovod (Keras and TensorFlow's Estimator API).
- Understanding Horovod and TensorFlow's operations timeline.
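As an illustration of the first topics above, writing samples to a TFRecords file and reading them back through a tf.data input pipeline can be sketched as follows. This is a minimal sketch assuming TensorFlow 2.x; the file name `sample.tfrecords` and the feature layout (one float vector plus one integer label) are illustrative choices, not the ones used in the course notebooks.

```python
# Minimal sketch: write a TFRecords file, then read it back with the
# Dataset API. File name and feature layout are illustrative assumptions.
import tensorflow as tf

path = "sample.tfrecords"

# --- Writing: serialize each sample as a tf.train.Example protobuf ---
with tf.io.TFRecordWriter(path) as writer:
    for label in range(4):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(
                float_list=tf.train.FloatList(value=[0.1 * label] * 8)),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())

# --- Reading: build an input pipeline over the serialized records ---
feature_spec = {
    "x": tf.io.FixedLenFeature([8], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

dataset = (
    tf.data.TFRecordDataset(path)
    .map(lambda rec: tf.io.parse_single_example(rec, feature_spec),
         num_parallel_calls=tf.data.AUTOTUNE)  # parse records in parallel
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)                # overlap input with compute
)

for batch in dataset:
    print(batch["x"].shape, batch["label"].numpy())
```

The `num_parallel_calls` and `prefetch` settings are what make such a pipeline efficient on a machine like Piz Daint: parsing runs in parallel, and the next batch is prepared while the GPU consumes the current one.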
The material presented during the course (slides & notebooks) can be found on GitHub: https://github.com/eth-cscs/tensorflow-training
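To give a flavour of the ring allreduce that Horovod uses to combine gradients in distributed synchronous SGD, the algorithm can be sketched in plain Python. Real implementations exchange chunks over the network via MPI/NCCL; here the "workers" are simply rows of a nested list, and all names are illustrative:

```python
# Plain-Python sketch of the ring allreduce: N workers sum their gradient
# vectors so that every worker ends up with the full sum. In synchronous
# SGD the result would then be divided by N to get the average gradient.

def ring_allreduce(grads):
    """Sum equal-length gradient vectors; return one summed copy per worker.

    Each vector is split into N chunks. Phase 1 (scatter-reduce): each chunk
    travels around the ring accumulating partial sums, so after N-1 steps
    worker w holds the complete sum of chunk (w+1) mod N. Phase 2
    (allgather): the finished chunks travel around the ring once more so
    every worker holds every summed chunk.
    """
    n = len(grads)
    length = len(grads[0])
    # chunk c covers indices [starts[c], starts[c+1])
    starts = [c * length // n for c in range(n + 1)]
    data = [list(g) for g in grads]  # working buffers, one per worker

    # Phase 1: scatter-reduce. Snapshot outgoing chunks first so every
    # "send" in a step uses that step's values (simulating simultaneity).
    for step in range(n - 1):
        outgoing = []
        for w in range(n):
            c = (w - step) % n  # chunk worker w sends at this step
            outgoing.append((c, data[w][starts[c]:starts[c + 1]]))
        for w in range(n):
            dst = (w + 1) % n
            c, chunk = outgoing[w]
            for j, i in enumerate(range(starts[c], starts[c + 1])):
                data[dst][i] += chunk[j]  # accumulate into the neighbour

    # Phase 2: allgather. Finished chunks are copied around the ring.
    for step in range(n - 1):
        outgoing = []
        for w in range(n):
            c = (w + 1 - step) % n  # finished chunk worker w passes on
            outgoing.append((c, data[w][starts[c]:starts[c + 1]]))
        for w in range(n):
            dst = (w + 1) % n
            c, chunk = outgoing[w]
            data[dst][starts[c]:starts[c + 1]] = chunk
    return data


grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, two-element gradients
print(ring_allreduce(grads))      # [[4.0, 6.0], [4.0, 6.0]]
```

The appeal of this scheme, and the reason Horovod adopted it, is that each worker sends and receives a fixed amount of data per step regardless of the number of workers, so the bandwidth cost per worker stays roughly constant as the job scales out.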