September 15, 2025 – Interview by Santina Russo
Ana Klimovic, are concerns around the power use of AI and supercomputers warranted, or is it mostly politics?
There are political considerations, but the trend of continuously scaling up hardware and consuming ever more power for AI computing is something we need to address from a technical perspective. We can’t keep scaling up indefinitely, so research into more sustainable paths is essential.
Where can we improve? Could you give an example?
Some approaches aim to reduce complexity where it’s not needed. One avenue is to introduce degrees of sparsity into machine learning models. While a conventional model activates its entire neural network to answer a query, a sparse model activates only the most relevant parts. One way to introduce sparsity is to design the model as a “mixture of experts”. Such a model is trained to decide how different inputs should be routed to each of its “experts”. This way, only reasonably small portions of the model are active for any given query, which reduces power consumption considerably without sacrificing model quality.
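As a rough illustration of the routing idea described above, here is a minimal mixture-of-experts sketch in PyTorch. The layer sizes, number of experts, and top-k value are arbitrary choices for illustration, not a description of any particular production model.

```python
# Minimal sketch of top-k "mixture of experts" routing in PyTorch
# (illustrative only; dimensions, expert count and top_k are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)       # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = self.router(x)                            # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the k best experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue                                   # idle expert -> no compute spent
            w = weights[token_ids, slot].unsqueeze(1)
            out[token_ids] += w * expert(x[token_ids])
        return out

x = torch.randn(16, 512)
y = TinyMoE()(x)   # each token only activates 2 of the 8 expert MLPs
```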
Is this what you are working on?
My research group works mainly on designing software frameworks for training and serving models and managing their data. Our goal is to deploy ML models to run as efficiently as possible on hardware like GPUs. There is an interesting paradox: on the one hand, given the immense size of ML models, we need a lot of GPUs that each have limited memory capacity. On the other hand, when you measure the memory-bandwidth and compute usage of each GPU training or serving an ML model, utilization is in most cases rather low. This is inefficient, as GPUs still consume significant power even at low utilization.
Ana Klimovic
is an Assistant Professor in the Department of Computer Science at ETH Zürich, where she leads the Efficient Architectures and Systems Lab. She is a member of the Steering Committee of the Swiss AI Initiative. The goal of her research is to improve the performance and resource efficiency of cloud computing and AI, while making it easier for users to deploy and manage their applications.
Whoops. Why is GPU utilization low?
There are several bottlenecks that can arise in ML pipelines and stall GPUs. One is data pre-processing: before feeding data to an ML model, you typically need to pre-process it on CPUs. This pre-processing, and the link between CPUs and GPUs, can become a bottleneck if not designed properly. Another bottleneck can be the communication between the multiple GPUs that train or serve a model. In general, making a system efficient means identifying the bottlenecks and removing them to achieve balanced throughput across the stages of an ML pipeline.
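One common way to keep the GPU from waiting on CPU-side pre-processing is to overlap data loading with compute, as in this sketch using PyTorch’s DataLoader. The dataset, model, and parameter values here are placeholders chosen for illustration.

```python
# Sketch: overlap CPU-side pre-processing with GPU compute so the GPU is not
# left waiting for data (dataset, model and parameters are placeholders).
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __init__(self, n=10_000):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        x = torch.randn(3, 224, 224)   # stand-in for decoding/augmentation on the CPU
        return x, i % 10

loader = DataLoader(
    ToyDataset(),
    batch_size=64,
    num_workers=8,        # parallel CPU workers do the pre-processing
    pin_memory=True,      # pinned host memory enables asynchronous host-to-device copies
    prefetch_factor=4,    # each worker keeps a few batches ready ahead of time
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 10)).to(device)

for x, y in loader:
    x = x.to(device, non_blocking=True)   # copy can overlap with compute when memory is pinned
    y = y.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
```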
What are we talking about in numbers? How much potential is there for optimization?
A lot, in fact. The numbers vary depending on the system and the model, but it’s not uncommon to see GPUs with compute utilization below 50 percent. In addition to the bottlenecks we discussed before, the utilization gap comes from the fact that only some GPU functions in an ML workload are compute-intensive while others are memory-intensive. And each function may use the GPU for just hundreds of microseconds or a few milliseconds, which makes GPU utilization very spiky. Filling the gap completely to achieve constant 100 percent utilization of both compute and memory bandwidth is not realistic, but we can still improve substantially.
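To see this spikiness for yourself, you can periodically sample compute and memory utilization with NVIDIA’s NVML bindings. The sketch below assumes the `pynvml` package is installed and an NVIDIA GPU is present at index 0.

```python
# Sketch: sample GPU compute and memory utilization over time with NVML
# (assumes the `pynvml` package and an NVIDIA GPU at index 0).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(100):                                  # 100 samples, 10 ms apart
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    samples.append((util.gpu, util.memory))           # percent of time kernels / memory were busy
    time.sleep(0.01)

pynvml.nvmlShutdown()
avg_compute = sum(s[0] for s in samples) / len(samples)
avg_memory = sum(s[1] for s in samples) / len(samples)
print(f"avg compute util: {avg_compute:.0f}%  avg memory util: {avg_memory:.0f}%")
```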
How?
There are many ways we can make individual jobs use GPUs more efficiently. For example, a key factor is how an ML training job is parallelized across GPUs and how many of them are used. There are several different parallelization concepts, and the optimal approach depends on the model characteristics and the heterogeneity of the hardware used. To find the best solution in each case, we recently developed an automated AI training platform called Sailor that helps users optimize job parallelism configurations to improve efficiency. This way, ML developers don’t need to worry about optimizing their software system for efficiency. They just focus on developing the model architecture and training algorithm to achieve high model quality, while software like Sailor optimizes efficiency under the hood. Another way to improve is to optimize which data the model ingests for training.
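The kind of configuration space such a planner explores can be illustrated with a toy search over data-, tensor-, and pipeline-parallel degrees ranked by a crude, made-up cost model. This is not Sailor’s actual interface or algorithm, and all numbers below are invented for illustration.

```python
# Toy illustration of a parallelism planner's search space: enumerate
# (data, tensor, pipeline) parallel degrees for a fixed GPU budget and rank
# them with a crude, hypothetical cost model. NOT Sailor's actual algorithm.
from itertools import product

N_GPUS = 16
MODEL_PARAMS_B = 7          # hypothetical 7B-parameter model
GPU_MEMORY_GB = 40

def feasible(dp, tp, pp):
    if dp * tp * pp != N_GPUS:
        return False
    # very rough memory estimate: ~16 GB per billion params (fp32 Adam),
    # sharded across the tensor- and pipeline-parallel groups
    mem_per_gpu = MODEL_PARAMS_B * 16 / (tp * pp)
    return mem_per_gpu <= GPU_MEMORY_GB

def estimated_step_time(dp, tp, pp):
    compute = 1.0 / (dp * tp * pp)              # ideal speedup from more GPUs
    comm = 0.02 * (tp - 1) + 0.01 * (dp - 1)    # made-up communication penalties
    bubble = 0.05 * (pp - 1)                    # pipeline "bubble" overhead
    return compute + comm + bubble

configs = [
    (dp, tp, pp)
    for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3)
    if feasible(dp, tp, pp)
]
best = min(configs, key=lambda c: estimated_step_time(*c))
print("best (data, tensor, pipeline) parallelism:", best)
```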
Makes sense, as the data volumes used for training are so immense. How can we improve there?
Selecting high-quality data is critical. This is why we are also developing systems like Modyn and Mixtera, which let users easily express data selection policies so that a model reaches high accuracy in fewer training steps, while the system optimizes the throughput of fetching the desired data from storage. I should also mention that while I’ve talked a lot about training models, efficiency is also important when it comes to serving models, what we call inference. If you imagine the number of people around the globe who continuously interact with AI chatbots like ChatGPT, the need to make this more efficient is evident.
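The flavor of a declarative data selection policy can be sketched as follows. The metadata fields and filter below are invented for illustration; this is not the Mixtera or Modyn API.

```python
# Hypothetical sketch of declarative data selection over a metadata-tagged
# corpus (fields and policy are invented; not the Mixtera or Modyn API).
from dataclasses import dataclass
from typing import Iterator, Set

@dataclass
class Sample:
    text: str
    language: str
    quality_score: float   # e.g. from a heuristic or learned quality classifier

def select(corpus: Iterator[Sample], languages: Set[str], min_quality: float) -> Iterator[Sample]:
    """Stream only the samples that match the user's declared policy."""
    for sample in corpus:
        if sample.language in languages and sample.quality_score >= min_quality:
            yield sample

corpus = [
    Sample("Guten Tag ...", "de", 0.91),
    Sample("lorem ipsum spam spam", "en", 0.12),
    Sample("A clean English paragraph ...", "en", 0.84),
]

# "Train on high-quality English and German text" expressed as a policy:
for sample in select(iter(corpus), languages={"en", "de"}, min_quality=0.5):
    print(sample.text)
```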
What can you do to make model inference more efficient?
It’s common to fine-tune a pretrained model for different tasks by feeding it specialized input data so it learns the target domain. For instance, if you want to build a chatbot, you can feed it examples of human conversations. In this process, the model weights are adjusted, which creates a different version of the original model. Each fine-tuned model is as big as the original base model; it just has different values for its parameters. So how do you design a system to efficiently serve many model variants to different users? To be fast, the models should already be loaded on GPUs, but this is inefficient, as you may not have continuous requests for each of the model variants. You could instead load models onto GPUs on demand as requests come in, but users are rarely prepared to wait while a model is being loaded. Our approach to this conundrum was to develop an LLM serving platform called DeltaZip that is designed to efficiently serve different fine-tuned model variants. The system takes advantage of the fact that fine-tuning changes the model weights only slightly. We found that by storing only the “delta”, the difference between each fine-tuned model and the base, you can shrink the information that needs to be stored to just a few bits per parameter. That compression reduces loading times for each variant and enables us to load models on demand while maintaining high performance.
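The intuition behind storing only the delta can be illustrated with a simple quantization sketch. A real system such as DeltaZip uses far more sophisticated compression; this is only the basic idea, with arbitrary tensor sizes and bit widths.

```python
# Illustrative sketch of the "delta" idea: store only a low-bit quantized
# difference between a fine-tuned model and its base. A real system like
# DeltaZip compresses far more cleverly; this is just the intuition.
import torch

def compress_delta(base: torch.Tensor, finetuned: torch.Tensor, bits: int = 4):
    delta = finetuned - base                         # fine-tuning changes weights only slightly
    scale = delta.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    q = torch.clamp(torch.round(delta / scale),
                    -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.to(torch.int8), scale                   # a few bits per parameter instead of 16/32

def reconstruct(base: torch.Tensor, q: torch.Tensor, scale) -> torch.Tensor:
    return base + q.to(base.dtype) * scale           # approximate fine-tuned weights on demand

base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn(1024, 1024)    # small fine-tuning perturbation

q, scale = compress_delta(base, finetuned)
approx = reconstruct(base, q, scale)
print("max reconstruction error:", (approx - finetuned).abs().max().item())
```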
Last question: You are part of the Swiss AI Initiative and a member of its Steering Committee. Before, you worked on AI at Google. What are the advantages of being in academia?
Compared to industry, academics tend to have more freedom to explore research ideas on a longer time horizon, which encourages fundamentally new ideas that challenge the way things are done today. Access to large-scale compute is often a challenge for academics working on ML, but operating under more constrained resources can sometimes push people to get more creative with what they have. Another upside of academia is not being driven by business needs; you can focus on work that benefits society. What I find exciting about the Swiss AI Initiative is that it brings together a large, diverse team of hundreds of researchers across institutions, with thousands of GPUs in the CSCS “Alps” supercomputer, to develop models in a much more transparent and open way than is typically done in industry. I’m really excited about the release of the fully open Apertus model and the follow-up work to come.
The Swiss AI Initiative
was started in December 2023 and seeded with an initial investment of over 10 million GPU hours on CSCS’s nearly carbon-neutral “Alps” supercomputer and a grant of CHF 20 million from the ETH Domain. The initiative is the largest open-science and open-source effort for AI foundation models worldwide, and the first initiative of the Swiss National AI Institute, a partnership between the ETH AI Center and the EPFL AI Center. It further benefits from the critical mass of expertise of over 800 researchers (including 70 AI-focused professors) from more than 10 academic institutions across Switzerland.
Tools to help train and serve AI models more efficiently
Sailor: A distributed AI training platform that optimizes hardware and job parallelism configurations to improve efficiency.
Orion: A GPU scheduler that enables sharing GPU resources between different ML jobs. The tool co-schedules ML jobs in a way that minimizes interference between them. The key idea: co-locate work that uses complementary resources, e.g. memory-intensive and compute-intensive routines of ML workloads (a minimal sketch of this idea follows after the list).
DeltaZip: An LLM serving platform designed to efficiently serve many different model variants that are fine-tuned from the same base model.
Mixtera: A training data mixing platform with a declarative interface that lets ML developers express the characteristics of the data they want to train on and ingests this data at high throughput during training.
The open-source code of these tools is available at https://github.com/eth-easl.
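The co-location intuition behind a scheduler like Orion can be sketched with CUDA streams in PyTorch: a compute-heavy kernel and a memory-heavy kernel submitted on separate streams can overlap on the same GPU. This is only the underlying intuition, not Orion’s scheduler, and the tensor sizes are arbitrary.

```python
# Minimal illustration of co-locating complementary work on one GPU:
# a compute-bound and a memory-bound kernel on separate CUDA streams can
# overlap. This is only the intuition behind Orion, not its scheduler.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
big = torch.randn(64_000_000, device=device)    # large buffer for a memory-bound op

compute_stream = torch.cuda.Stream()
memory_stream = torch.cuda.Stream()

torch.cuda.synchronize()
with torch.cuda.stream(compute_stream):
    for _ in range(20):
        c = a @ b                                # compute-bound: large matrix multiplies
with torch.cuda.stream(memory_stream):
    for _ in range(20):
        big.add_(1.0)                            # memory-bound: streams through the buffer
torch.cuda.synchronize()                         # both streams may have executed concurrently
```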

