Learning from BEV and Dash-Camera Views: A Multimodal Large Language Model for Quantitative Risk Assessment
- Institute
- Professur für autonome Fahrzeugsysteme
- Type
- Semester thesis / Master's thesis
- Content
- experimental / theoretical
- Description
Background
Vehicle autonomy has advanced rapidly in recent years, reaching a level where little or no human intervention is required in certain controlled environments. Leading manufacturers now offer Level 3 automated driving, within the limits of each system's design. This progress relies heavily on the development and validation of highly reliable driving functions; ensuring their safety and reliability requires extensive testing in diverse and challenging scenarios.
At the same time, Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have emerged as powerful tools for scene understanding, decision-making, and risk analysis in autonomous driving. Recent benchmarks have shown that datasets based on bird's-eye-view (BEV) images allow VLMs to learn quantitative, spatio-temporal risk reasoning. However, BEV images are not always available in real-world systems: most vehicles are equipped with front and surround dash cameras, while BEV maps require additional processing. This gap raises the question: Can a model trained with both BEV and dash-camera inputs learn to generalize, enabling quantitative risk analysis even when only dash-cam views are available?
Objective
The primary objective of this project is to develop a Multimodal Large Language Model (MLLM) agent that learns from both BEV images and dash-camera views during training, and is able to perform quantitative, spatio-temporal risk assessment from dash-camera views alone at inference.
The framework will consist of three key components:
- Cross-modal dataset construction: Extend the NuRisk pipeline to generate paired BEV and dash-camera views with consistent, quantitative risk labels.
- MLLM training with modality dropout: Train a hybrid BEV + dash-camera model in which the BEV acts as structured supervision, while the model is designed to handle missing modalities at inference (see the sketch after this list).
- Evaluation & benchmarking: Test the model’s ability to (i) perform accurate quantitative risk scoring, (ii) generalize from BEV+camera training to camera-only inference, and (iii) outperform BEV-only or camera-only baselines.
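To make the second component more concrete, below is a minimal sketch (in PyTorch, assuming a simple regression-style risk head rather than the full MLLM) of how modality dropout could work: during training, BEV features are randomly replaced by a learned "missing modality" embedding, so the same network can be queried with dash-camera input alone at inference. All module and variable names are hypothetical placeholders, not part of NuRisk or any existing codebase; the actual design is part of the thesis work.

```python
# Minimal, illustrative sketch of modality dropout for a hybrid BEV + dash-camera
# risk model. All names (HybridRiskModel, bev_encoder, cam_encoder, risk_head) are
# hypothetical placeholders.
import torch
import torch.nn as nn

class HybridRiskModel(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stand-ins for a real BEV backbone and the MLLM's vision tower.
        self.bev_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.cam_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        # Learned embedding that substitutes for a missing BEV input.
        self.missing_bev = nn.Parameter(torch.zeros(feat_dim))
        self.risk_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )  # scalar, quantitative risk score

    def forward(self, cam, bev=None, bev_dropout_p: float = 0.0):
        cam_feat = self.cam_encoder(cam)
        if bev is None:
            # Camera-only inference: use the learned "missing modality" embedding.
            bev_feat = self.missing_bev.expand(cam_feat.size(0), -1)
        else:
            bev_feat = self.bev_encoder(bev)
            if self.training and bev_dropout_p > 0:
                # Modality dropout: per sample, randomly replace BEV features so the
                # model cannot rely on the BEV branch always being present.
                keep = (torch.rand(cam_feat.size(0), 1, device=cam_feat.device)
                        > bev_dropout_p).float()
                bev_feat = keep * bev_feat + (1.0 - keep) * self.missing_bev
        return self.risk_head(torch.cat([cam_feat, bev_feat], dim=-1))

# Toy usage: hybrid training batch vs. camera-only inference (dummy shapes).
model = HybridRiskModel()
cam = torch.randn(4, 3, 64, 64)   # dash-camera frames
bev = torch.randn(4, 3, 64, 64)   # rasterised BEV maps
model.train()
risk_train = model(cam, bev, bev_dropout_p=0.5)  # BEV available, randomly dropped
model.eval()
risk_infer = model(cam)                          # dash-camera only
```

Whether the missing BEV branch is represented by a learned placeholder embedding, zeroed features, or a text-only prompt is one of the design choices to be explored in the thesis.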
We Offer
- A dynamic and future-oriented research environment
- Hands-on experience with a state-of-the-art software stack for autonomous driving (nuScenes, Waymo, CommonRoad, CARLA)
- Opportunity to publish a scientific paper (based on merit)
- The thesis can be written in either English or German
Requirements (What You Should Bring)
- Initiative and a creative, problem-solving mindset
- Excellent English or German proficiency
- Advanced knowledge of Python and deep learning frameworks (PyTorch or TensorFlow)
- Prior experience with autonomous vehicles, Vision-Language Models, or multimodal learning is an advantage
- Familiarity with common software development tools (e.g., Git, Ubuntu) is desirable
Work can begin immediately.
If you are interested in this topic, please first have a look at our recent survey paper: https://arxiv.org/abs/2506.11526
Then send an email with a brief cover letter explaining why you are fascinated by this subject, along with a current transcript of records and your resume, to: yuan_avs.gao@tum.de
- Tags
- AVS GAO
- Possible start
- immediately
- Contact
- Yuan Gao
- yuan_avs.gao@tum.de