BaSaMa - Fachschaft Maschinenbau: BaSaMa & HiWi

Physics-Aware Multimodal Learning for Foundation Models: Diagnosing and Preventing Modality Collapse

Institute

Professur für autonome Fahrzeugsysteme (TUM-ED)

Type

Bachelor's Thesis / Semester Thesis

Content

Theoretical

Description

Background

Foundation models such as Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs) are increasingly extended with additional modalities, for example sensor data, state variables, simulation rollouts, or explicit physical and dynamical constraints. In practice, however, a common failure mode is modality collapse: the model largely ignores the newly added modality (here: physics and dynamics) and continues to rely primarily on dominant signals from vision and language. As a result, models may achieve strong benchmark performance while failing to reliably respect physical constraints. This is particularly problematic in safety-critical applications such as robotics, manipulation, and autonomous driving.

This thesis investigates, in a theoretical and systematic manner, why modality collapse arises when adding a physical or dynamical modality, how it can be detected at an early stage, and which training and architectural strategies are effective in preventing or mitigating it. The goal of this work is to provide a well-founded analysis that can serve as a practical guideline for avoiding modality collapse in real-world multimodal systems.

Objectives

Develop a taxonomy of modality collapse mechanisms that occur when adding a physics modality to VLMs, VLAs, or general foundation models
Identify suitable indicators for the early detection of modality collapse (e.g., ablation and counterfactual tests, gradient-based and attribution-based analyses)
Conduct a systematic review and categorization of effective countermeasures, including:
- Data and task designs that make the use of physical information necessary
- Training curricula (e.g., warm-up phases, staged training, modality dropout)
- Architectures for robust multimodal fusion (e.g., gating/FiLM, dedicated cross-attention mechanisms, adapter- or LoRA-based approaches)
- Objective functions and regularization strategies (e.g., contrastive and counterfactual objectives, consistency-based and constraint-based losses)
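As a rough illustration of the ablation-style indicators mentioned in the objectives, the following minimal sketch measures how much a model's accuracy drops when the physics input is shuffled across samples. Everything in it is an assumption made for illustration: the toy dataset, the two hand-written "models" (`collapsed_model`, `fused_model`), and the function name `physics_ablation_gap` are hypothetical and not part of the thesis topic itself. A gap near zero suggests the physics modality is being ignored, even if the headline accuracy looks acceptable.

```python
import random


def make_dataset(n=2000, seed=0):
    """Toy data (assumed): the label depends on BOTH a vision cue and a physics cue."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        vision = rng.uniform(-1.0, 1.0)
        physics = rng.uniform(-1.0, 1.0)
        label = 1 if vision + physics > 0 else 0
        data.append((vision, physics, label))
    return data


def collapsed_model(vision, physics):
    """Hypothetical collapsed model: predicts from vision only, ignores physics."""
    return 1 if vision > 0 else 0


def fused_model(vision, physics):
    """Hypothetical well-fused model: uses both inputs."""
    return 1 if vision + physics > 0 else 0


def accuracy(model, data):
    """Fraction of samples for which the model prediction matches the label."""
    return sum(model(v, p) == y for v, p, y in data) / len(data)


def physics_ablation_gap(model, data, seed=1):
    """Accuracy on clean data minus accuracy with physics inputs shuffled across samples."""
    rng = random.Random(seed)
    shuffled = [p for _, p, _ in data]
    rng.shuffle(shuffled)
    ablated = [(v, p_new, y) for (v, _, y), p_new in zip(data, shuffled)]
    return accuracy(model, data) - accuracy(model, ablated)


data = make_dataset()
acc_collapsed = accuracy(collapsed_model, data)
gap_collapsed = physics_ablation_gap(collapsed_model, data)
gap_fused = physics_ablation_gap(fused_model, data)

print(f"collapsed: accuracy={acc_collapsed:.2f}, ablation gap={gap_collapsed:.2f}")
print(f"fused:     ablation gap={gap_fused:.2f}")
```

In this toy setting the collapsed model still scores well above chance, yet its ablation gap is exactly zero because its output never depends on the physics input. For a real VLM or VLA, the same comparison would presumably be run on held-out batches with the physics channel permuted or masked, which is one of several possible indicators rather than a definitive test.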

We Offer

A timely research topic at the intersection of multimodal learning, foundation models, and physics-based constraints
High practical relevance combined with solid theoretical depth
Flexible supervision and the option to conduct the work in either English or German
Possibility of publication for strong contributions

Requirements

Excellent proficiency in English or German
Solid background in machine learning and deep learning
Experience with Python (ideally using PyTorch)
Interest in multimodal learning, representation learning, and/or physics-based modeling
Self-motivation and a structured, independent working style

Start

The thesis can, and ideally should, start immediately or in the near future. If you are interested, please send me an email with a short transcript of records and your CV.

Tags

AVS Schaefer

Possible start

Immediately

Contact

Finn Rasmus Schäfer
finn.schaefer@tum.de
