The Swiss National Supercomputing Centre (CSCS) is operated by ETH Zurich and is located in Lugano, Switzerland. CSCS develops and provides key supercomputing capabilities for solving important problems in science and society.

CSCS is offering different types of internships in the field of High Performance Computing for the development and optimization of scientific applications, in order to exploit coming generations of supercomputing hardware.

For the positions below, applicants must be enrolled at a Swiss university (Bachelor's/Master's level), and the internship must be part of their mandatory education. The candidate must be a student in one of the following fields: Computer Science, Mathematics, Physics, or a related field. Please note that due to Swiss labour laws, only EU 25 / EFTA nationals can be considered. Ph.D. students will not be considered.

The ideal candidate is a team player and feels comfortable working in an international environment. Excellent command of written and spoken English (our official working language) is a must.

For further information, please contact Dr. Maria Grazia Giuffreda, Associate Director, by email (no applications).

We look forward to receiving your complete online application, which we ask you to submit to Stephanie Frequente, CSCS, Via Trevano 131, 6900 Lugano. Please explicitly specify a maximum of two topics that fit your interests. As demand for internships is high in certain periods and we can only offer two internships per quarter, kindly also state your availability.
The closing date for applications is 30th November 2018. Applications will be reviewed only after the closing date.

  • Quantum Computing Simulators: performance evaluation

    Quantum Computing is an emerging computing model that is believed to offer a dramatic speedup in many fields, ranging from Chemistry [1] to Linear Systems, Graph Theory, Machine Learning and others. One of the most famous examples of a quantum algorithm is Shor's Prime Factorization algorithm [2] that provides exponential speedup over the fastest known classical algorithm.

    Simulating a Quantum Computer is demanding because it requires an exponential amount of storage and number of operations. One of the record-breaking simulation results came in 2017, when scientists from ETH Zurich performed a 0.5-petabyte simulation of a 45-qubit quantum circuit [3] at the National Energy Research Scientific Computing Center (NERSC). Today, a variety of Quantum Computer simulators exist, like ProjectQ [4] and Qiskit [5], but none of them has been installed or used on Piz Daint so far.

    The main goal of this internship is to investigate the potential of existing Quantum Computer simulators and evaluate their parallel and GPU performance on Piz Daint, as was done for TensorFlow in a previous internship. If time permits, some well-known quantum algorithms will be implemented and evaluated, with possible candidates being algorithms for problems like Max-Cut, 3-SAT, Boolean Matrix-Matrix Multiplication [6] or the simulation of Lithium-Hydride [1].
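    Storing the full state of an n-qubit system requires 2^n complex amplitudes, which is what makes these simulations so demanding. As a minimal illustration of state-vector simulation (plain Python only; not the API of any of the toolkits above), the following sketch prepares a two-qubit Bell state with a Hadamard and a CNOT gate:

```python
import math

def apply_single(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector of 2^n amplitudes."""
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        if amp == 0:
            continue
        bit = (i >> target) & 1
        base = i & ~(1 << target)
        for out in (0, 1):
            new[base | (out << target)] += gate[out][bit] * amp
    return new

def apply_cnot(state, control, target):
    """Flip qubit `target` in every basis state where `control` is set."""
    new = list(state)
    for i in range(len(state)):
        if (i >> control) & 1:
            new[i] = state[i ^ (1 << target)]
    return new

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

state = [1 + 0j, 0j, 0j, 0j]                    # |00>
state = apply_single(state, H, target=0)        # (|00> + |01>) / sqrt(2)
bell = apply_cnot(state, control=0, target=1)   # (|00> + |11>) / sqrt(2)
```

    Doubling the qubit count squares the length of the state vector, which is why large simulations such as [3] need petabytes of memory.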

    Duration: 4 months
    Supervisor(s): Dr Nur Fadel, Marko Kabić, Dr Joost VandeVondele

    Prerequisites:

    • Bachelor in Physics, Computer Science, Mathematics or Chemistry
    • Background in Quantum Computing
    • Basic knowledge of Python

    Working Timeline:

    • Setting up the Toolkits (1 month): Install, evaluate and compare the available toolkits. Run the readily available algorithms.
    • Performance on Piz Daint (2 months): Focus on the most promising simulator and install it on Piz Daint. The candidate will benchmark the toolkit and evaluate its parallel and GPU performance. In addition, the available algorithms will be evaluated with higher numbers of qubits, exploring the limits of the system.
    • Quantum Algorithms Implementation (1 month): The candidate will implement some of the more involved quantum algorithms like Max-Cut, 3-SAT and Boolean Matrix-Matrix Multiplication.


    [1] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, “Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets,” Nature, vol. 549, no. 7671, p. 242, 2017.
    [2] P. W. Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pp. 124–134, IEEE, 1994.
    [3] T. Häner and D. S. Steiger, “0.5 petabyte simulation of a 45-qubit quantum circuit,” arXiv preprint arXiv:1704.01127, 2017.
    [4] D. S. Steiger, T. Häner, and M. Troyer, “ProjectQ: an open source software framework for quantum computing,” Quantum, vol. 2, p. 49, 2018.
    [5] A. Cross, “The IBM Q experience and QISKit open-source quantum computing software,” Bulletin of the American Physical Society, 2018.
    [6] F. Le Gall and H. Nishimura, “Quantum algorithms for matrix products over semirings,” in Scandinavian Workshop on Algorithm Theory, pp. 331–343, Springer, 2014.

    Working place: Zürich

  • Using machine learning to help identify root causes of system slowdowns

    Large HPC production systems sometimes exhibit performance fluctuations, usually occurring in only a limited part of the machine. Such a performance hit can have many different causes, ranging from failures or misbehavior of system components to OS- and/or application-related issues, like high I/O loads or high network usage, and even inefficient or wrong use of job submission scripts.

    Indicators of performance fluctuations can come from various sources, e.g. the output of a regression test suite, which unfortunately only gives information about the system's state at the moment of the test, or simply tickets from users reporting issues. Understanding the phenomena responsible for a slowdown involves different groups of an HPC center, from the operations team to the application support group. Common to all the different root causes is the need to parse large sets of monitoring data gathered in different databases in order to track down the issue.

    Therefore, during this 3-month internship in the UES unit, the intern will analyze the existing monitoring data and perform meaningful statistical analyses to determine correlations between jobs marked as failed/timeout by SLURM and abnormal machine performance. In a first step, this will provide a set of users, applications, or nodes that have a higher probability of being related to a performance issue.

    In a second step, the accuracy of the results produced in the previous phase should be improved by using machine learning and, thereby, including information about the system's state around identified failures. Validation of the analyses will be achieved by detecting issues that had been identified in the past, e.g. based on user reports and observations by the operations team.

    The methods developed by the intern will be implemented in a software tool, which is intended to be used and adapted beyond the scope of this internship. It is expected to allow a more proactive behavior with respect to system performance issues observed in the future. The intern would gain experience in working in an HPC environment and insight into the details of how a supercomputer’s network functions.
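    The first analysis step described above, flagging nodes with an unusually high rate of failed/timeout jobs, can be sketched in a few lines (node names and records below are made up; real input would come from the Slurm accounting and monitoring databases):

```python
from collections import Counter

def flag_suspect_nodes(jobs, threshold=1.5):
    """Return nodes whose failure rate exceeds `threshold` times the
    machine-wide failure rate; "FAILED" and "TIMEOUT" count as failures."""
    total, failed = Counter(), Counter()
    for node, state in jobs:
        total[node] += 1
        if state in ("FAILED", "TIMEOUT"):
            failed[node] += 1
    overall = sum(failed.values()) / max(sum(total.values()), 1)
    return sorted(node for node in total
                  if failed[node] / total[node] > threshold * overall)

# hypothetical job records: (node, final Slurm state)
jobs = [("nid001", "COMPLETED"), ("nid001", "COMPLETED"),
        ("nid002", "FAILED"), ("nid002", "TIMEOUT"), ("nid002", "COMPLETED"),
        ("nid003", "COMPLETED"), ("nid003", "FAILED")]
suspects = flag_suspect_nodes(jobs)  # -> ["nid002"]
```

    The same counting could equally be keyed by user or application; it serves only as a starting point for the machine-learning stage of the project.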

    Duration: 3 months
    Supervisor(s): Dr Matthias Kraushaar, Dr Victor Holanda Rusu

    Prerequisites:

    • Python
    • Background in machine learning is advantageous

    Main tasks during internship:

    • Familiarization with the systems of CSCS
    • Defining the machine learning approach to be used for the analysis
    • Implementing the data extraction and the machine learning algorithm
    • Validation of the tool
    • Documentation and presentation

    Working place: Lugano

  • Continuation of the development of GREASY

    GREASY is a tool for scheduling the execution of a list of tasks in diverse HPC settings. Although created and initially developed at the Barcelona Supercomputing Center, CSCS supports an in-house customized version of GREASY, which is used as a metascheduler to alleviate the overload of Slurm caused by an excessive number of calls to the srun command. GREASY has proven very useful in workflows involving the submission of a large number of jobs at once, in particular for short-running calculations.

    This modified version of GREASY consists of extensions to the original code that enable major features critical for its use at CSCS: notably, support for the execution of MPI tasks and the possibility of specifying different directories for running jobs, as opposed to the single-rank, single-directory fashion of the original software. However, there is still room for improvement. For instance, GREASY currently manages only a single value for the number of MPI ranks, preventing tasks that require different numbers of MPI ranks from being combined in a single list. Improving the current version of GREASY is precisely what this internship aims at.

    The main part of the work will be the implementation of new features in GREASY's basic and MPI schedulers, two of the mechanisms GREASY uses to run lists of tasks. With the first one, the basic scheduler, an allocation of resources is requested through Slurm, and GREASY then issues an srun command for each of the tasks on the list. This can be used to run any kind of task, i.e. sequential, OpenMP, MPI or MPI/OpenMP; however, it may still overload Slurm with all the srun calls. For this reason a second scheduler, the MPI scheduler, is also used. In contrast to the basic scheduler, the MPI scheduler issues a single srun for the whole list of tasks, and once the job has been submitted it schedules all the tasks inside the single Slurm job. This significantly alleviates the stress on Slurm, as part of the work is offloaded to GREASY.

    This internship is designed to continue the development of GREASY for its use at CSCS. The most important points will be:

    • Extend the MPI scheduler to allow the execution of tasks with a custom number of MPI ranks.
    • Add a new syntax to the file containing the list of tasks to enable individual numbers of MPI ranks.

    • Enable the rescheduling of tasks that for certain reasons need to be repeated.
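    The per-task rank syntax does not exist yet; it will be designed during the internship. As a purely hypothetical illustration, each line of the task file could carry an optional rank-count prefix, parsed along these lines:

```python
def parse_task_list(lines):
    """Parse a hypothetical task-list format in which a line may start with an
    optional '[ranks]' prefix, e.g. '[8] ./solver input.dat'. Lines without a
    prefix default to a single rank; '#' starts a comment."""
    tasks = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ranks = 1
        if line.startswith("["):
            prefix, line = line[1:].split("]", 1)
            ranks = int(prefix)
        tasks.append((ranks, line.strip()))
    return tasks

example = [
    "# tasks with individual MPI rank counts",
    "[8] ./solver case_a.in",
    "[2] ./solver case_b.in",
    "./postprocess results/",
]
tasks = parse_task_list(example)
# -> [(8, './solver case_a.in'), (2, './solver case_b.in'), (1, './postprocess results/')]
```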

    As a result of this internship, a robust metascheduler will be provided that enables more efficient workflows for CSCS's users, in particular those who need to submit large numbers of small tasks. In turn, the Slurm scheduler will be significantly offloaded, as many calls to the srun command will be replaced by GREASY's scheduling mechanisms. Besides that, the intern will have the opportunity to work on an interesting real-life problem, putting their knowledge into practice and helping to ease the work of researchers from multiple fields who use CSCS facilities.

    Prerequisites: To successfully complete the internship, the candidate should have knowledge of C++ and MPI.
    Duration: 3 months
    Supervisor(s): Dr Rafael Sarmiento and Dr Victor Holanda Rusu
    Working place: Lugano

  • DCA++ CPU/GPU consolidation with Kokkos and HPX C++ Tasks

    The DCA++ code solves the physics of strongly correlated electron systems by employing modern quantum cluster methods using an implementation of the dynamical cluster approximation (DCA) and its DCA+ extension. The DCA algorithm maps the bulk lattice problem to a finite-size periodic cluster problem that is best solved by means of continuous-time Quantum Monte Carlo (QMC) methods.

    To solve the QMC integration, the code uses walkers, which generate random states in the high-dimensional space represented by the problem, and accumulators, which compute a running average of the configurations visited by the walkers and of their physically relevant properties. DCA++ may use either the CPU or the GPU for the computation of walker or accumulator steps, but currently has separate implementations of these for the CPU and the GPU (and would also require them for future accelerators). Note that the subset of the algorithms dealing with these steps is reasonably small, and familiarity with the entire code base is not necessary.
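    The division of labour between walkers and accumulators can be sketched schematically (plain Python, unrelated to the actual DCA++ classes; here the "measurement" is simply whether a random point falls inside the unit quarter circle, so the running average estimates pi/4):

```python
import random

class Walker:
    """Proposes random configurations; here, points in the unit square."""
    def __init__(self, rng):
        self.rng = rng
    def step(self):
        return self.rng.random(), self.rng.random()

class Accumulator:
    """Keeps a running average of a measured property of visited configurations."""
    def __init__(self):
        self.count, self.mean = 0, 0.0
    def accumulate(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count  # incremental mean update

rng = random.Random(42)
walker, acc = Walker(rng), Accumulator()
for _ in range(20000):
    x, y = walker.step()
    acc.accumulate(1.0 if x * x + y * y <= 1.0 else 0.0)
# acc.mean now approximates pi/4 ~ 0.785
```

    In DCA++ the analogous roles are played by the QMC walker and accumulator steps, which is why a single-source CPU/GPU implementation of just these two components already covers the performance-critical part of the code.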

    CSCS is interested in combining the portable performance capabilities of the Kokkos library with the task-based asynchronous workflows of the HPX library. We would like to create a single-source version of the walker and/or accumulator code that can be compiled with the Kokkos model and then integrated into a larger application making use of HPX tasks. The aim of the internship is to produce single-source versions of walkers/accumulators that perform as well as possible on Intel x86 (multi-socket) processors and NVIDIA (CUDA) GPUs, and to benchmark/profile them to ensure that the implementations make best use of memory bandwidth and compute resources. Kokkos uses the concept of memory spaces, specialized via templates for CPU/GPU, which allows the compiler to perform loop unrolling that best matches the memory model of the target architecture.

    Prerequisites: The applicant should be comfortable with (heavily) templated C++ code and modern C++ programming techniques, and have a working understanding of CUDA kernel writing, CMake/Git-based project management, and development on a Linux system. Knowledge of C++ template metaprogramming and parallel programming is beneficial.
    Duration: 4 months
    Supervisor(s): Dr John Biddiscombe
    Working place: Lugano
    Github References: DCAKokkosHPX

  • Enhancement of MPI MapReduce backend for ABCpy

    ABCpy is a scientific library for large scale approximate Bayesian computation (ABC).

    Approximate Bayesian computation is a set of methods for calibrating the parameters of forward models given a set of observations. ABC works even if the likelihood of the model is unknown or computationally intractable. However, this comes at the cost of very large compute requirements. The ABCpy library allows users to automatically parallelize inference computations on large clusters or supercomputers to achieve reasonable times to solution. The main means of parallelization follows the MapReduce paradigm. The current implementation of the ABCpy MapReduce MPI backend is very lightweight and does not support nested parallelization or proper fault tolerance.

    The goal of this internship is to enhance the backend such that a sophisticated reduce method is implemented using MPI. Furthermore, proper fault-tolerance mechanisms should be implemented: exceptions in the parallel execution should be handled correctly, and work from crashed compute nodes should be redistributed automatically.

    The enhanced backend would make inference computations of ABC algorithms more balanced such that the overall execution is more compute efficient.
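    The fault-tolerance requirement can be illustrated independently of MPI. The sketch below (a standard-library thread pool rather than the actual ABCpy MPI backend; all names hypothetical) resubmits work items whose execution raised an exception, mimicking the redistribution of work from crashed nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def fault_tolerant_map(func, items, workers=4, max_retries=2):
    """Apply `func` to every item in parallel; items whose execution raised an
    exception are resubmitted up to `max_retries` times."""
    results, pending = {}, list(items)
    for _ in range(max_retries + 1):
        if not pending:
            break
        retry = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [(item, pool.submit(func, item)) for item in pending]
            for item, future in futures:
                try:
                    results[item] = future.result()
                except Exception:
                    retry.append(item)  # redistribute in the next round
        pending = retry
    return results, pending  # `pending` holds items that never succeeded

# a flaky function: fails the first time it sees an odd argument
seen = set()
def flaky_square(x):
    if x % 2 and x not in seen:
        seen.add(x)
        raise RuntimeError("simulated node crash")
    return x * x

results, failed = fault_tolerant_map(flaky_square, [1, 2, 3, 4])
# -> results == {1: 1, 2: 4, 3: 9, 4: 16}, failed == []
```

    An MPI version would follow the same retry bookkeeping, with the master rank tracking which work units were acknowledged by which workers.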

    Prerequisites: We are looking for a candidate with good programming skills in Python and knowledge of parallel programming, preferably MPI. Experience in software design, object-oriented programming, and unit testing is beneficial. Further, the candidate should have general knowledge in Linux operating systems and Git.
    Duration: 3-4 months
    Mentor: Dr Marcel Schöngens
    Working place: Zurich

  • Development of an FFT test set: High Performance Computing, MPI and FFT programming

    CSCS offers computing services to the scientific community and as such relies on performance and sanity tests for existing and new machines. Besides established tests like HPL, DGEMM, ... we want to add an FFT test set to our collection of regression tests, since FFTs are part of many applications running on HPC systems; in particular, for some graphics processing unit (GPU) applications this CPU part is the bottleneck for strong scaling.

    The tests will be performed at small scale and large scale, representing material-science applications and cosmological applications, respectively. The new test will be based on FFTW and an in-house developed extension of MPI. For different sizes of the FFT, the performance will be measured and systematized. Among the questions to address are the scaling properties of FFTs for pure computation and pure communication. The intern would gain experience in the development of applications targeted at High Performance Computing, as well as MPI and FFT programming.
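    A natural correctness baseline for such a test set is that any FFT implementation must agree with the naive O(n^2) DFT. The sketch below (pure-Python radix-2 Cooley-Tukey; the real tests will of course use FFTW) shows the comparison:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

def dft(x):
    """Naive O(n^2) reference transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

signal = [complex(k % 3, 0) for k in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(signal), dft(signal)))
```

    The production test would apply the same idea per transform size, with FFTW results checked against a reference and timings recorded for computation and communication separately.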

    Duration: 3-4 months
    Supervisor(s): Dr Andreas Jocksch, Dr Victor Holanda Rusu
    Working place: Lugano

  • Metadata management and search for Persistent Identifiers service

    CSCS provides a service to create and resolve persistent identifiers (PIDs, similar to DOIs, but for data). As the next evolution of the service, CSCS should offer the functionality to associate metadata with every managed PID and to search them. The goal of the proposed project is to evaluate three possible database solutions (SQL, NoSQL, Triple Store) and to develop a prototype API service that queries the database and returns the corresponding PIDs.
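    Of the three candidate database technologies, the SQL variant can be prototyped with the Python standard library alone. The sketch below (schema, handle format, and metadata keys are all made up for illustration) stores key/value metadata per PID and answers simple searches:

```python
import sqlite3

# in-memory prototype of a PID-to-metadata store with search
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pid (handle TEXT PRIMARY KEY)")
db.execute("CREATE TABLE metadata ("
           "handle TEXT REFERENCES pid(handle), key TEXT, value TEXT)")

def register(handle, **meta):
    """Create a PID and attach arbitrary key/value metadata to it."""
    db.execute("INSERT INTO pid VALUES (?)", (handle,))
    db.executemany("INSERT INTO metadata VALUES (?, ?, ?)",
                   [(handle, k, v) for k, v in meta.items()])

def search(key, value):
    """Return all PIDs whose metadata contains key = value."""
    rows = db.execute("SELECT handle FROM metadata WHERE key = ? AND value = ?",
                      (key, value))
    return sorted(handle for (handle,) in rows)

register("21.12345/abc", creator="Alice", format="netcdf")
register("21.12345/def", creator="Bob", format="netcdf")
hits = search("format", "netcdf")  # -> ["21.12345/abc", "21.12345/def"]
```

    A NoSQL or triple-store prototype would expose the same register/search operations, which makes the three back ends directly comparable.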

    Prerequisites: Knowledge of Python or, preferably, JavaScript.
    Duration: 4 months
    Supervisor(s): Mario Valle 
    Working place: Lugano

  • Integration of persistent identifiers inside a scientific workflow engine

    The candidate will evaluate one workflow engine (or more, if time permits) in an HPC environment, then integrate the generation of persistent identifiers to support provenance recording or public data dissemination.

    Prerequisites: Knowledge of Python
    Duration: 2 months
    Supervisor(s): Mario Valle
    Working place: Lugano

  • Exploring GPGPU Computing Possibilities using ROCm/HIP-based APIs for AMD GPUs

    AMD Vega/Polaris GPUs promise an alternative to Nvidia’s GPUs and its CUDA platform. AMD’s new chips can be programmed using the ROCm/HIP (Radeon Open Compute/Heterogeneous compute Interface for Portability) toolkits. These toolkits are open source and can potentially be used on devices from multiple vendors.

    CSCS shall soon deploy a test bed system featuring AMD Polaris/Vega GPUs along with AMD’s EPYC CPUs and Nvidia’s Volta GPUs.

    This internship would evaluate the performance of the new hardware and the new APIs for certain key algorithms that we see as building blocks to more complex simulation software.

    AMD’s HIP provides an interface over multiple backends. For example, hipBLAS can use both rocBLAS and cuBLAS (Nvidia-specific). HIP thus provides tools to write more API-agnostic, performance-portable libraries.

    This project will evaluate the:

    • ease of porting codebases from CUDA to the new APIs;
    • maturity of the existing tools (compilers, libraries, etc);
    • performance relative to Nvidia GPUs.

    In service of these aims, some microbenchmarks shall be conducted and a mini-app will be ported from CUDA to HIP.


    The candidate should:

    • be comfortable with C++ development in a UNIX environment;
    • have prior exposure to HPC and CUDA programming via coursework or projects;
    • demonstrate good communication skills to write reports and explain findings.

    Duration: 3-4 months
    Supervisor(s): Dr. Benjamin Cumming, Prashanth Kanduri
    Working place: Zürich


    References:

    • cuRAND vs. rocRAND Comparative Study
    • ROCm Repository and Documentation
    • HIP: C++ Heterogeneous-Compute Interface for Portability
    • PCI-E 3.0 Features in ROCm

If you are interested in one of the above positions, please apply at the following link »

Experiences from previous internships: