Training Vision-Language-Action (VLA) Foundation Models for Embodied Control
- Institute
- Professur für autonome Fahrzeugsysteme (TUM-ED)
- Type
- Master's thesis
- Content
- experimental, theoretical, constructive
- Beschreibung
Background
Join our team to train the "brain" of our next-generation robots by developing Vision-Language-Action (VLA) foundation models, pushing the boundaries of Embodied AI! Are you fascinated by the intersection of Large Language Models, Computer Vision, and Robotics? Do you want to build AI that doesn't just generate text, but actively interacts with the physical world based on natural language commands? This project offers a unique opportunity to leverage our infrastructure to train state-of-the-art end-to-end models for our Unitree robots.

Traditional robotic pipelines rely on disjointed modules for perception, planning, and control. Inspired by recent breakthroughs like Physical Intelligence’s $\pi_0$ and OpenVLA, we are shifting toward unified foundation models. These VLAs ingest visual observations and text prompts and directly output low-level motor commands. This allows for unprecedented generalization, enabling robots to perform complex, multi-step manipulation and navigation tasks.

You will utilize open-source datasets (e.g., Open X-Embodiment) alongside custom data collected in our lab to pre-train and fine-tune VLA models. You will investigate different model architectures, flow-matching mechanisms, and action representation strategies. Your work will serve as the core intelligence driving our humanoid and quadruped robots, translating high-level human intent into physical reality.
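To make the end-to-end idea concrete, here is a minimal, purely illustrative sketch of a VLA forward pass in PyTorch: an image and a tokenized instruction go in, a chunk of continuous actions comes out. All module names and sizes are toy placeholders, not the architecture of OpenVLA, $\pi_0$, or our lab's codebase.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Illustrative VLA policy: vision + language in, action chunk out.

    A tiny CNN stands in for a ViT backbone and a mean-pooled embedding
    stands in for an LLM; real VLAs use pretrained transformer encoders.
    """
    def __init__(self, vocab_size=1000, dim=64, action_dim=7, chunk=8):
        super().__init__()
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
        )
        self.embed = nn.Embedding(vocab_size, dim)
        # Fusion + action head: regresses a chunk of continuous actions.
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, action_dim * chunk))
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, image, tokens):
        v = self.vision(image)                    # (B, dim) visual features
        t = self.embed(tokens).mean(dim=1)        # (B, dim) pooled text features
        a = self.head(torch.cat([v, t], dim=-1))  # (B, action_dim * chunk)
        return a.view(-1, self.chunk, self.action_dim)

policy = ToyVLA()
img = torch.randn(2, 3, 224, 224)         # batch of RGB observations
cmd = torch.randint(0, 1000, (2, 12))     # batch of tokenized instructions
actions = policy(img, cmd)
print(actions.shape)                      # torch.Size([2, 8, 7])
```

Predicting a short chunk of future actions instead of a single step is a common design choice in recent VLA work, since it smooths high-frequency control.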
Example Thesis Topics (subject to availability):
- Fine-Tuning Open-Source VLAs for Novel Embodiments: Adapt existing VLA base models (e.g., OpenVLA, OpenPI) to the specific action spaces and camera configurations of the Unitree G1 and Z1 using parameter-efficient fine-tuning (LoRA).
- Evaluating Action Representations for Dexterous Manipulation: Compare different output modalities (e.g., discrete tokens, continuous diffusion models, flow-matching) for representing high-frequency motor commands in end-to-end VLA architectures.
- Integrating Multimodal Prompting into Embodied Control: Investigate how VLAs can be conditioned not just on text, but on goal images, bounding boxes, or human sketches to improve the precision of manipulation tasks.
- Continual Learning for VLAs in Real-World Environments: Develop methods for VLA models to incrementally learn from real-world failures and new teleoperation data without catastrophically forgetting previously learned tasks.
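As a hedged sketch of the parameter-efficient fine-tuning mentioned in the first topic (not the actual OpenVLA or PEFT implementation), LoRA freezes the pretrained weight matrix and learns only a low-rank update, which is what makes adapting a large VLA to a new embodiment tractable on modest hardware:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA adapter: y = base(x) + (alpha/r) * x A^T B^T.

    The pretrained base layer is frozen; only the two low-rank matrices
    A (r x in_features) and B (out_features x r) receive gradients.
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

base = nn.Linear(4096, 4096)     # stand-in for one projection layer of a VLA
lora = LoRALinear(base, r=8)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(trainable, total)          # 65536 of ~16.8M parameters are trainable
```

Because B starts at zero, the adapted layer initially reproduces the pretrained behavior exactly; fine-tuning then shifts it toward the new action space with well under 1% of the layer's parameters.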
Technologies Used
Python, PyTorch, Transformers, Large Language Models (LLMs), Vision-Language Models (VLMs), Vision-Language-Action Models (VLAs), Diffusion Models, Hugging Face, Multi-GPU Training, NVIDIA DGX, Embodied AI.
Your Benefits: Join a High-Performance Robotics Team
- Impactful Research: Work on a project where your code doesn't live in a silo; it is a critical gear in an end-to-end pipeline. Your results will directly enable robots to perform complex tasks.
- Top-Tier Hardware Stack: Gain exclusive hands-on experience with NVIDIA DGX (training), Jetson Thor (inference), and Unitree Humanoids/Quadrupeds - a stack very similar to that used by industry leaders like Tesla, Figure AI, and Physical Intelligence.
- Scientific Publication: We aim for high-impact results. If your work meets the quality standards, we will co-author and submit a paper to top-tier robotics/AI conferences (e.g., ICRA, IROS, CoRL, or CVPR).
- Professional Career Launchpad: This thesis is designed to mirror the workflow of elite AI labs. We provide dedicated mentorship and professional support to help you land roles at top-tier robotics startups or Big Tech AI labs.
- Dynamic Lab Culture: You will be part of a "squad" of motivated Master’s students working in parallel, fostering a collaborative, fast-paced, and supportive environment.
Requirements
We are looking for students who see their thesis not just as a degree requirement, but as a career-defining project.
Must-Have:
- English Proficiency: High level of written and spoken English (the language of our research and documentation).
- Proactive Mindset: You embrace a "fail fast, learn fast" approach and are comfortable solving hands-on hardware/software integration challenges.
- Independence: Ability to own a technical module and drive it forward while communicating effectively with the rest of the team.
- Growth Path: A passion for Robotics/AI and an eagerness to learn new technologies.
Nice-to-Have (The "Plus"):
- Technical Foundation: Proficiency in Python and/or C++.
- Domain Experience: Prior exposure to PyTorch, ROS 2, or physics simulators (Isaac Sim/MuJoCo).
- Hardware Skills: Experience working with robotic hardware, sensors, or VR systems.
Ready to build the future of Embodied AI? Send your CV, a recent transcript, and a brief email explaining why you are the right fit for this specific "squad" and how it aligns with your career goals.
- Possible start
- immediately
- Contact
Roberto Brusnicki
roberto.brusnicki@tum.de