“We built a system to run HPC services in a cloud-native way”

June 1, 2026 – Interview by Santina Russo

Elia and Dino, first of all, congratulations on winning the 2026 SUSE Customer Award as Virtualization Visionary. Can you tell us about the achievement that earned you the prize?

Elia Oggian: Sure. Together with colleagues here at CSCS, we found a way to deploy Kubernetes clusters on top of the Alps supercomputer. This allows us to operate hybrid clusters that run partly on virtual machines, and partly on physical high-performance Alps nodes.
Dino Conciatore: Basically, we built a system to run high-performance computing, or HPC, services in a cloud-native way. This allows our users to combine lightweight workflows, such as simple applications or web interfaces running on virtual machines, with high-performance computing workloads on Alps, like scientific simulations or AI model inference. Our innovation makes it possible to mix classic cloud-native workflows with HPC nodes and get the best possible performance.
Elia Oggian: Of course, one reason we got the award is certainly that we used SUSE software platforms to deliver these developments. But our efficient and stable blend of cloud-native Kubernetes clusters with the supercomputing nodes of Alps really got people’s attention. This is something many others are working towards as well, and we are the first to deliver it in operational quality.

Everything seems to be about virtualization and cloud-nativeness lately. Can you put your work into a bit of historical context for us—how long has virtualization been around?

EO: Originally, until around 20 to 25 years ago, there were no virtual machines. All computing power came from isolated physical machines that had to be connected manually via cables and configured individually to harness their collective resources to a certain extent. Everything was physical. Then engineers started to link individual machines together and control them through a hypervisor: a software layer that creates and runs virtual machines, or VMs, by separating them from the underlying physical hardware and sharing that hardware among them. The hypervisor serves as an abstraction layer on top of the hardware. It can combine multiple machines into one cluster, and this entity can then host virtual machines that spread across its nodes.
DC: You can think of the hypervisor as a traffic controller for processors, memory, storage, and networking. This controller assigns these resources to each VM. This way, the VMs can be used almost as if they were individual physical machines, just in a more dynamic and flexible way.

Screenshot of Dino Conciatore and Elia Oggian from CSCS, discussing in front of a whiteboard

Going hybrid: Dino Conciatore (on the left), a Senior Systems Engineer, and Elia Oggian, a Systems Engineer, are discussing their design for automating the deployment of Kubernetes clusters across virtual machines and physical HPC nodes.

What about virtualization at CSCS?

DC: At CSCS, virtual machines have been leveraged for a long time. What is new is the automated use of Kubernetes clusters. A Kubernetes cluster is an orchestrated environment for containerized applications, another control layer sitting on top of a virtualized infrastructure. The containers, in turn, hold the software applications and workflows, including their environment. This makes applications portable and enables an easy transfer from one hardware system to another. The Kubernetes cluster controls them. It decides on which compute nodes applications should run, keeps the applications in the desired state, and moves them away from failed nodes when needed.
EO: This way, everything becomes more stable and applications keep running optimally. Today, we operate around 60 Kubernetes clusters at CSCS. We needed a way to create and manage them efficiently and at scale. Doing this manually would simply have been too laborious and time-consuming.
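
To make the idea of a "desired state" more concrete, here is a minimal toy sketch in Python. It is not Kubernetes code and not the CSCS implementation; the names `Node`, `App`, and `reconcile`, the node labels "vm" and "hpc", and the node names are all hypothetical and chosen for illustration only. It shows a control loop that places lightweight applications on virtual machines and HPC workloads on physical nodes, and that moves an application when its node fails.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    kind: str             # "vm" or "hpc"; hypothetical labels for this sketch
    healthy: bool = True


@dataclass
class App:
    name: str
    needs: str            # kind of node the app should run on
    replicas: int         # desired number of running copies


def reconcile(apps, nodes, placements):
    """Return new placements (app name -> node names) matching the desired state."""
    healthy = {n.name: n.kind for n in nodes if n.healthy}

    # Keep only the copies that still sit on healthy nodes.
    new = {
        app.name: [n for n in placements.get(app.name, []) if n in healthy][: app.replicas]
        for app in apps
    }

    def load(node_name):
        # Number of app copies currently placed on this node.
        return sum(copies.count(node_name) for copies in new.values())

    for app in apps:
        candidates = sorted(name for name, kind in healthy.items() if kind == app.needs)
        # Start missing copies on the least-loaded node of the right kind.
        while candidates and len(new[app.name]) < app.replicas:
            new[app.name].append(min(candidates, key=load))
    return new


nodes = [Node("vm-1", "vm"), Node("vm-2", "vm"), Node("alps-1", "hpc"), Node("alps-2", "hpc")]
apps = [App("web-ui", "vm", 2), App("inference", "hpc", 1)]

state = reconcile(apps, nodes, {})
print(state)

nodes[2].healthy = False          # simulate a failure of alps-1
after_failure = reconcile(apps, nodes, state)
print(after_failure)
```

In this sketch, the web interface ends up on the two virtual machines and the inference workload on a physical node. After the simulated failure, the next call to `reconcile` moves the inference workload to the remaining healthy physical node, while the copies on the virtual machines stay where they are.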

How did you implement this automation of Kubernetes clusters at CSCS?

EO: We used open-source software tools by SUSE—their Rancher Kubernetes Engine, SUSE Rancher, and SUSE Virtualization—to extend our Kubernetes clusters so that they also include physical nodes from Alps. We call these “Alpernetes” clusters. As a blend of cloud-native and high-performance computing technology, they really represent something new. In practice, SUSE Rancher and SUSE Virtualization automatically deploy, manage, secure, and monitor container workloads in the CSCS hybrid environment, which consists of virtual machines and on-site physical nodes on Alps.

Your aim was to increase efficiency and scaling capability—did it work?

DC: Yes, very much so. We reduced the time needed to create a cluster by about 80 percent. Within five to six minutes, we have a Kubernetes cluster up and running. Since the configuration of the machines, both virtual and physical, is now handled by an automated workflow, deploying applications is also 70 percent faster.
EO: This means that CSCS can now have Kubernetes clusters ready in no time and offer them to the different scientific user communities. Managing Kubernetes clusters and the containerized applications running on them has become more automated, dynamic, and efficient. All of this means that the Kubernetes clusters are now unified across the complex environment of Alps and the CSCS virtual machines, supporting easier, faster, and more reliable access to supercomputing services for cutting-edge research and AI model inference.

That’s a big promise. Are CSCS users already using this approach?

DC: Yes. One team that started using this approach right away is part of the Square Kilometre Array Observatory (SKAO), which is currently being constructed in Australia and South Africa. The SKAO will be the most sensitive radio astronomy array of telescopes ever built. It will enable scientists to explore the Universe with unprecedented breadth, sensitivity, and precision, and it will generate immense amounts of data in the process. This makes an efficient and scalable hybrid computing system all the more important, as it can help process and interpret these invaluable datasets.
EO: Also, the training and serving of large-scale AI models will benefit from this development at CSCS.

How so?

DC: Large-scale pre-training of foundation models, such as the large language model Apertus, is computationally highly intensive. It is the first phase in enabling AI across diverse scientific and societal applications, and HPC facilities are indispensable for it. The subsequent phases of the AI lifecycle, however, including fine-tuning and inference (the broad serving of a model online), benefit from hybrid systems. Not all steps require or benefit from the batch-processing HPC environment; some need cloud-native services instead.
EO: Such a system truly is an “AI Factory” that supports the development of AI innovations all the way to end-to-end scientific and industrial use cases.

Speaking of a public AI Factory: are we already there, or is this still a future vision?

DC: We are on our way and, at least partly, already there. As we describe in our paper, with the automated deployment and management of hybrid Kubernetes clusters at CSCS and several other developments, such as self-managed spaces we call sandboxes for model inference, and a concept for model fine-tuning as a service, we have already assembled some crucial building blocks.
EO: Our developments and findings also offer a blueprint that other supercomputing centres can adopt to integrate AI factories and services into their existing HPC workflows—a universal recipe covering the entire lifecycle of AI models.

Reference:

D. Conciatore, E. Oggian, F. Da Forno, S. Schuppli, J. Tissieres, J. VandeVondele, M. Martinasso: Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems. Preprint (2026). DOI: 10.48550/arXiv.2604.12599

Cover image: Adobe Stock