Learning from BEV and Dash-Camera Views: A Multimodal Large Language Model for Quantitative Risk Assessment
- Institute
- Professur für autonome Fahrzeugsysteme
- Type
- Semester thesis / Master's thesis
- Content
- experimental / theoretical
- Description
Background
Vehicle autonomy has advanced rapidly in recent years, reaching a level where little or no human intervention is required in certain controlled environments. Leading manufacturers now offer Level 3 automated driving, within the limits of each system's design. This progress relies heavily on the development and validation of highly reliable driving functions; ensuring their safety and reliability requires extensive testing in diverse and challenging scenarios.
At the same time, Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have emerged as powerful tools for scene understanding, decision-making, and risk analysis in autonomous driving. Recent benchmarks have shown that datasets based on bird's-eye-view (BEV) images allow VLMs to learn quantitative, spatio-temporal risk reasoning. However, BEV images are not always available in real-world systems: most vehicles are equipped with front and surround dash cameras, while BEV maps require additional processing. This gap raises the question: Can a model trained with both BEV and dash-camera inputs learn to generalize, enabling quantitative risk analysis even when only dash-cam views are available?
Objective
The primary objective of this project is to develop a Multimodal Large Language Model (MLLM) agent that learns from both BEV images and dash-camera views during training, and is able to perform quantitative, spatio-temporal risk assessment from dash-camera views alone at inference.
The framework will consist of three key components:
- Cross-modal dataset construction: Extend the NuRisk pipeline to generate paired BEV and dash-camera views with consistent, quantitative risk labels.
- MLLM training with modality dropout: Train a hybrid BEV + dash-camera model in which the BEV acts as structured supervision, while the model is designed to handle missing modalities at inference (see the sketch after this list).
- Evaluation & benchmarking: Test the model’s ability to (i) perform accurate quantitative risk scoring, (ii) generalize from BEV+camera training to camera-only inference, and (iii) outperform BEV-only or camera-only baselines.
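To make the second component more concrete, below is a minimal sketch (in PyTorch, assuming a simple regression-style risk head rather than the full MLLM) of how modality dropout could work: during training, BEV features are randomly replaced by a learned "missing modality" embedding, so the same network can be queried with dash-camera input alone at inference. All module and variable names are hypothetical placeholders, not part of NuRisk or any existing codebase; the actual design is part of the thesis work.

```python
# Minimal, illustrative sketch of modality dropout for a hybrid BEV + dash-camera
# risk model. All names (HybridRiskModel, bev_encoder, cam_encoder, risk_head) are
# hypothetical placeholders.
import torch
import torch.nn as nn

class HybridRiskModel(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stand-ins for a real BEV backbone and the MLLM's vision tower.
        self.bev_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.cam_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        # Learned embedding that substitutes for a missing BEV input.
        self.missing_bev = nn.Parameter(torch.zeros(feat_dim))
        self.risk_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )  # scalar, quantitative risk score

    def forward(self, cam, bev=None, bev_dropout_p: float = 0.0):
        cam_feat = self.cam_encoder(cam)
        if bev is None:
            # Camera-only inference: use the learned "missing modality" embedding.
            bev_feat = self.missing_bev.expand(cam_feat.size(0), -1)
        else:
            bev_feat = self.bev_encoder(bev)
            if self.training and bev_dropout_p > 0:
                # Modality dropout: per sample, randomly replace BEV features so the
                # model cannot rely on the BEV branch always being present.
                keep = (torch.rand(cam_feat.size(0), 1, device=cam_feat.device)
                        > bev_dropout_p).float()
                bev_feat = keep * bev_feat + (1.0 - keep) * self.missing_bev
        return self.risk_head(torch.cat([cam_feat, bev_feat], dim=-1))

# Toy usage: hybrid training batch vs. camera-only inference (dummy shapes).
model = HybridRiskModel()
cam = torch.randn(4, 3, 64, 64)   # dash-camera frames
bev = torch.randn(4, 3, 64, 64)   # rasterised BEV maps
model.train()
risk_train = model(cam, bev, bev_dropout_p=0.5)  # BEV available, randomly dropped
model.eval()
risk_infer = model(cam)                          # dash-camera only
```

Whether the missing BEV branch is represented by a learned placeholder embedding, zeroed features, or a text-only prompt is one of the design choices to be explored in the thesis.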
We Offer
- A dynamic and future-oriented research environment
- Hands-on experience with a state-of-the-art software stack for autonomous driving (nuScenes, Waymo, CommonRoad, CARLA)
- Opportunity to publish a scientific paper (based on merit)
- The thesis can be written in either English or German
Requirements (What You Should Bring)
- Initiative and a creative, problem-solving mindset
- Excellent English or German proficiency
- Advanced knowledge of Python and deep learning frameworks (PyTorch or TensorFlow)
- Prior experience with autonomous vehicles, Vision-Language Models, or multimodal learning is an advantage
- Familiarity with common software development tools (e.g., Git, Ubuntu) is desirable
Work can begin immediately.
If you are interested in this topic, please first have a look at our recent survey paper: https://arxiv.org/abs/2506.11526
Then send an email with a brief cover letter explaining why you are fascinated by this subject, along with a current transcript of records and your resume, to: yuan_avs.gao@tum.de
- Tags
- AVS GAO
- Possible start
- immediately
- Contact
- Yuan Gao
- yuan_avs.gao@tum.de