May 19, 2025 – by Melissa Achisi, EPFL

Unlike the grid-like streets of many American cities, European roads are often narrow, winding and irregular. Such urban environments are filled with countless intersections without clear markings, pedestrian-only zones, roundabouts and areas where bicycles and scooters share the road with cars. Designing an autonomous mobility system that can safely operate in these conditions requires more than just sophisticated sensors and cameras.

Above all, it is about tackling a tremendous challenge: predicting the dynamics of the world, that is, understanding how humans navigate urban environments. Pedestrians, for example, often make spontaneous decisions such as darting across a street, suddenly changing direction, or weaving through crowds. A kid might run after a dog. Cyclists and scooters further complicate the equation with their agile and often unpredictable manoeuvres.

“Autonomous mobility, whether in the form of self-driving cars or delivery robots, must evolve beyond merely reacting to the present moment,” says Alexandre Alahi, head of EPFL’s Visual Intelligence for Transportation Laboratory (VITA). “To navigate our complex, dynamic world, these AI-driven systems need the ability to imagine, anticipate, and simulate possible futures — just as humans do when we wonder what might happen next. In essence, AI must learn to wonder.” At the VITA laboratory, making AI wonder is indeed becoming a reality. The team recently trained several new AI models on CSCS’s ‘Alps’ supercomputer, which provided the massive computational power needed to process the vast amounts of multimodal data required for training.

Pushing the boundaries of prediction

The VITA team presented their progress in seven papers at the prestigious Conference on Computer Vision and Pattern Recognition (CVPR). Each contribution introduced a novel method to help AI systems imagine, predict, and simulate possible futures — from forecasting human motion to generating entire video sequences. All models and datasets are being released as open source, empowering the global research community and industry to build upon and extend this work.

One of the most innovative models is designed to predict video sequences from a single image captured by a camera mounted on a vehicle, or any other egocentric view. Called GEM (Generalizable Ego-Vision Multimodal World Model), it helps autonomous systems anticipate future events by learning how scenes evolve over time.

Safe and realistic autonomous system training

As part of the Swiss AI Initiative, and in collaboration with four other institutions (University of Bern, Swiss Data Science Center (SDSC), University of Zurich and ETH Zurich), the team trained their model on ‘Alps’ using 4000 hours of videos spanning autonomous driving, human activities seen from the first-person point of view, and drone footage. This way, GEM learned how people and objects move in different environments and used this knowledge to generate entirely new video sequences that imagined what might happen next in a given scene — whether it's a pedestrian crossing the street or a car turning at an intersection.

These imagined scenarios can even be controlled by adding cars and pedestrians, making GEM a powerful tool for training and testing autonomous systems in a wide range of realistic situations.

To make these predictions, the model simultaneously looks at several types of information, also called modalities. It analyses standard colour video frames to understand the visual context of a scene, and depth maps to grasp its 3D structure. Together, the two data types allow the model to interpret both what is happening and where things are in space. In addition, GEM takes into account the movement of the camera, human poses, and object dynamics over time. By learning how these signals evolve together across thousands of real-world situations, the model can generate coherent, realistic sequences that reflect how a scene might change in the next few seconds.
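
To make this more concrete, the sketch below shows, in Python with PyTorch, how a world model can fuse those modalities into a single latent state and roll it forward a few steps. It is a minimal, hypothetical illustration rather than GEM’s actual architecture or code: every module, tensor shape and the simple GRU dynamics are assumptions chosen for brevity.

# A minimal sketch (not GEM's architecture or API) of a multimodal world model:
# encode RGB frames, depth maps, camera ego-motion and human poses into one
# latent state, then roll that state forward in time to "imagine" what comes next.
# All module names, tensor shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class TinyMultimodalWorldModel(nn.Module):
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # One lightweight encoder per modality; a real model would use far
        # larger video and depth backbones.
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent_dim))
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent_dim))
        self.ego_enc = nn.Linear(6, latent_dim)        # camera translation + rotation
        self.pose_enc = nn.Linear(17 * 2, latent_dim)  # 17 body joints in 2-D
        # Temporal model: predicts how the fused latent evolves step by step.
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)
        # Decoder back to image space (here just a coarse 32x32 "imagined" frame).
        self.decoder = nn.Linear(latent_dim, 3 * 32 * 32)

    def fuse(self, rgb, depth, ego_motion, pose):
        # Sum the per-modality embeddings into one scene state.
        return (self.rgb_enc(rgb) + self.depth_enc(depth)
                + self.ego_enc(ego_motion) + self.pose_enc(pose))

    def imagine(self, rgb, depth, ego_motion, pose, horizon: int = 4):
        # Encode one observed moment, then predict `horizon` coarse future frames.
        state = self.fuse(rgb, depth, ego_motion, pose)
        hidden = torch.zeros_like(state)
        futures = []
        for _ in range(horizon):
            hidden = self.dynamics(state, hidden)
            futures.append(self.decoder(hidden).view(-1, 3, 32, 32))
            state = hidden  # feed the prediction back in, autoregressively
        return torch.stack(futures, dim=1)  # (batch, horizon, 3, 32, 32)


if __name__ == "__main__":
    model = TinyMultimodalWorldModel()
    rgb = torch.rand(1, 3, 64, 64)    # one colour frame
    depth = torch.rand(1, 1, 64, 64)  # the matching depth map
    ego = torch.rand(1, 6)            # camera motion since the previous frame
    pose = torch.rand(1, 34)          # one pedestrian's 2-D joint positions
    print(model.imagine(rgb, depth, ego, pose).shape)  # -> (1, 4, 3, 32, 32)

In the real system, each of these pieces is replaced by a large generative video model trained on thousands of hours of footage, and the output is a full video sequence rather than a handful of coarse frames.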

“GEM can function as a realistic simulator for vehicles, drones and other robots, enabling the safe testing of control policies in virtual environments before deploying them in real-world conditions,” says Mariam Hassan, PhD student at VITA lab. “It can also assist in planning by helping these robots anticipate changes in their surroundings, making decision-making more robust and context aware.”
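
As a rough picture of how such a virtual test bench can be used, the sketch below closes the loop between a control policy and a learned model of the world: the policy proposes an action, the model predicts what happens next, and the whole trial runs in simulation before any real-world deployment. The scenario and the LearnedWorldModel and BrakingPolicy classes are invented stand-ins for illustration; a real setup would plug a trained world model such as GEM into this loop.

# A hypothetical closed-loop sketch of testing a control policy inside a
# learned simulator before deploying it on a real vehicle. The "world model"
# here is a hand-written pedestrian-crossing scenario standing in for a
# trained model; only the structure of the loop is the point.


class LearnedWorldModel:
    """Stand-in for a trained world model: given the current state and an
    action, predict the next state. The state is just the vehicle speed (m/s)
    and the remaining gap (m) to a pedestrian crossing ahead."""

    def predict(self, state: dict, action: dict) -> dict:
        dt = 0.1  # seconds per simulated step
        speed = max(0.0, state["speed"] + action["accel"] * dt)
        return {"speed": speed, "gap": state["gap"] - speed * dt}


class BrakingPolicy:
    """Toy policy under test: brake hard once the pedestrian is close."""

    def act(self, state: dict) -> dict:
        return {"accel": -4.0 if state["gap"] < 12.0 else 0.0}


def rollout(model, policy, steps: int = 80) -> bool:
    """Run the policy purely inside the learned model; return True only if
    the vehicle comes to a stop before reaching the pedestrian."""
    state = {"speed": 8.0, "gap": 25.0}  # 8 m/s, pedestrian 25 m ahead
    for _ in range(steps):
        state = model.predict(state, policy.act(state))
        if state["gap"] <= 0.0:
            return False  # virtual collision: reject the policy, no harm done
        if state["speed"] <= 1e-9:
            return True   # stopped safely in simulation
    return True


if __name__ == "__main__":
    safe = rollout(LearnedWorldModel(), BrakingPolicy())
    print("policy passed the simulated scenario:", safe)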

The long road to predictions

GEM represents just one piece of the VITA Lab’s effort to predict human behaviour. Other research projects from Alexandre Alahi’s team tackle lower levels of abstraction to make predictions more robust, generalizable, and socially aware. One of them, for example, aims to certify where people will move, even when the input data is incomplete or slightly off. Another, named MotionMap, tackles the inherent unpredictability of human motion with a probabilistic approach that helps systems prepare for unexpected movements in dynamic environments.
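
The intuition behind that probabilistic approach can be pictured in a few lines of code: instead of committing to one predicted position, the forecaster outputs a probability map over many candidate futures, which a planner can rank or sample. The toy example below, with made-up numbers, only illustrates this idea of keeping several possible futures alive; it is not MotionMap’s actual formulation.

# A simplified, hypothetical illustration of probabilistic motion forecasting:
# the forecaster outputs a probability map over a grid of candidate positions
# a few seconds ahead, so rare but safety-critical futures are never discarded.
import numpy as np

rng = np.random.default_rng(0)

# A 2-D grid of candidate positions 3 seconds ahead (metres, ego-centric frame).
xs, ys = np.meshgrid(np.linspace(-5, 5, 41), np.linspace(0, 10, 41))


def mode(cx, cy, weight, sigma=0.8):
    """One plausible future (e.g. 'keeps walking straight') as a weighted bump."""
    return weight * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


# Two plausible futures for a pedestrian near a crossing: stay on the pavement
# (likely) or step onto the road (less likely, but the one that matters most).
heatmap = mode(3.0, 6.0, weight=0.7) + mode(-1.0, 4.0, weight=0.3)
heatmap /= heatmap.sum()  # normalise into a probability map

# The single most likely outcome is what a deterministic forecaster would give.
i, j = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(f"most likely future: x={xs[i, j]:+.2f} m, y={ys[i, j]:.2f} m")

# Sampling from the map keeps the rarer mode alive, so a planner can rehearse
# several futures instead of betting everything on one.
for idx in rng.choice(heatmap.size, size=5, p=heatmap.ravel()):
    i, j = np.unravel_index(idx, heatmap.shape)
    print(f"sampled future:     x={xs[i, j]:+.2f} m, y={ys[i, j]:.2f} m")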

Challenges remain: long-term consistency, high-fidelity spatial accuracy, and computational efficiency are all still works in progress. At the heart of it all lies the toughest question: how well can we predict people who do not always follow patterns? Human decisions are shaped by intent, emotion, and context, factors that are not always visible to machines.

(Cover image at the top: Adobe Stock; embedded videos: EPFL/VITA Lab)

About the Swiss AI Initiative

Launched in December 2023 by EPFL and ETH Zurich, the Swiss AI Initiative is supported by more than 10 academic institutions across Switzerland. With over 800 researchers involved and access to 10 million GPU hours on CSCS’s ‘Alps’, it stands as the world’s largest open-science and open-source effort dedicated to AI foundation models. The model developed by the VITA Lab, in collaboration with four other institutions (University of Bern, SDSC, University of Zurich and ETH Zurich), is among the first major models to emerge from this ambitious collaboration.

Autonomous mobility in Switzerland

In Switzerland, fully autonomous mobility is not yet permitted on public roads. However, as of March 2025, cars equipped with advanced driver assistance systems are allowed to steer, accelerate and brake autonomously. While drivers must remain alert and ready to take control, this marks a significant step towards everyday automation. Cantons have the authority to approve specific routes for fully autonomous vehicles operating without a human on board and monitored remotely by control centres. These routes will primarily be used by buses and delivery vans.

References

“GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control”, M. Hassan*, S. Stapf*, A. Rahimi*, P. M. B. Rezende*, Y. Haghighi, D. Brüggemann, I. Katircioglu, L. Zhang, X. Chen, S. Saha, M. Cannici, E. Aljalbout, B. Ye, X. Wang, A. Davtyan, M. Salzmann, D. Scaramuzza, M. Pollefeys, P. Favaro, A. Alahi, CVPR’25. arXiv:2412.11198

“MotionMap: Representing Multimodality in Human Pose Forecasting”, R. Hosseininejad, M. Shukla, S. Saadatnejad, M. Salzmann, A. Alahi, CVPR’25. arXiv:2412.18883

“Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation”, M. Zayene, J. Endres, A. Havolli, C. Corbière, S. Cherkaoui, A. Ben Ahmed Kontouli, A. Alahi, CVPR’25. arXiv:2411.18335

“FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching”, Z. Xia, A. Alahi, CVPR’25. DOI: https://doi.org/10.1007/978-3-031-72751-1_23

“Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting”, K. Messaoud, M. Cord, A. Alahi, CVPR’25. arXiv:2501.04815

“Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations”, A. Rahimi, P-C. Luan, Y. Liu, F. Rajic, A. Alahi, CVPR’25. arXiv:2312.04540

“Certified Human Trajectory Prediction”, M. Bahari, S. Saadatnejad, A. Askari Farsangi, S. Moosavi-Dezfooli, A. Alahi, CVPR’25. arXiv:2403.13778