February 12, 2019 - by CSCS

The Piz Daint supercomputer at CSCS provides an ideal platform for supporting intensive deep learning workloads as it comprises thousands of Tesla GPU compute nodes communicating through a high-speed interconnect. In this two-day course, we will look at how to run distributed deep learning workloads with TensorFlow on Piz Daint. We will use simple examples to demonstrate best practices for building efficient input pipelines to maximize the throughput of deep learning models with TensorFlow. TensorFlow is one of the most popular numerical libraries for deep learning and contains an extensive collection of algorithms optimized to exploit hardware as efficiently as possible.

The course will include the following topics:

  • Running TensorFlow on Piz Daint.
  • Creating efficient input pipelines with TensorFlow's Dataset API for optimizing the throughput on Piz Daint.
  • Reading and writing data as TFRecords files.
  • Understanding the stochastic gradient descent and distributed synchronous stochastic gradient descent algorithms.
  • Performing distributed training with TensorFlow and the ring allreduce algorithm implemented in Horovod, using both Keras and TensorFlow's Estimator API.
  • Understanding Horovod and TensorFlow's operations timeline.
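
As a rough illustration of the ring allreduce idea named in the topics above, the following is a minimal pure-Python sketch. It is not Horovod's implementation and uses neither TensorFlow nor MPI. The function name `ring_allreduce` and the list-of-chunks data layout are assumptions made for this example only. It simulates N workers arranged in a ring, each holding a gradient split into N chunks, and shows how every worker ends up with the summed gradient after 2(N-1) communication steps.

```python
def ring_allreduce(worker_chunks):
    """Simulate a ring allreduce (sum) across workers.

    worker_chunks[i][c] is chunk c (a list of floats) held by worker i.
    There must be exactly as many chunks per worker as there are workers.
    Returns a new nested list in which every worker holds the summed chunks.
    This is a hypothetical teaching sketch, not Horovod's API.
    """
    n = len(worker_chunks)
    # Copy the input so the caller's data is left untouched.
    data = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # fully summed chunk with index (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends happen "simultaneously".
        messages = []
        for i in range(n):
            c = (i - step) % n
            messages.append(((i + 1) % n, c, list(data[i][c])))
        for dst, c, payload in messages:
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Phase 2 (allgather): each fully summed chunk travels around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            c = (i + 1 - step) % n
            messages.append(((i + 1) % n, c, list(data[i][c])))
        for dst, c, payload in messages:
            data[dst][c] = payload

    return data


# Three workers, each with a three-element gradient split into three chunks.
gradients = [
    [[1.0], [2.0], [3.0]],
    [[4.0], [5.0], [6.0]],
    [[7.0], [8.0], [9.0]],
]
summed = ring_allreduce(gradients)
# Synchronous SGD would use the average, so divide the sum by the worker count.
averaged = [[[v / len(gradients) for v in chunk] for chunk in w] for w in summed]
print(averaged[0])
```

In this simulation each worker only ever exchanges one chunk per step with its ring neighbour, which is the property that keeps the per-worker communication volume roughly independent of the number of workers.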

This course is aimed at scientists who are planning, or are already running, intensive machine learning workloads and wish to start using TensorFlow on Piz Daint.

All participants must register for the course. Deadline for registration: Tuesday, March 5, 2019.

For more information and to register, please visit the event page.