Multimodal Learning for Complex Visual Environments: From Multimodal Understanding to Controllable Data Synthesis

PhD Thesis Proposal Defence


Title: "Multimodal Learning for Complex Visual Environments: From Multimodal
Understanding to Controllable Data Synthesis"

by

Mr. Kai CHEN


Abstract:

Complex visual environments pose a demanding challenge for modern multimodal 
learning. Models must perceive heterogeneous contexts spanning images, 
language, and speech; reason over diverse and dynamic situations; and 
generate outputs that support interaction with both human users and the 
visual environment. A second challenge lies in the visual experience required 
for training and evaluation: long-tailed scenes are costly to collect, 
expensive to annotate, and hard to vary systematically. This raises the 
central question of the thesis: how can multimodal models be trained to 
understand complex visual environments in a unified and scalable manner?

This thesis begins by establishing the training and evaluation foundations 
for multimodal understanding. It first studies how to extend the reasoning 
and instruction-following abilities of Large Language Models (LLMs) to 
unified Multimodal LLMs (MLLMs), covering large-scale data curation, 
end-to-end training, and evaluation. It then moves to real-world scene 
reasoning, where MLLMs must perceive relevant entities, localize critical 
regions, explain their influence, and provide task-aware suggestions, with 
self-driving serving as a representative stress test.

The perception failures observed in complex scene reasoning further expose a 
data bottleneck in multimodal understanding: models need visual experience 
that covers rare objects, structured layouts, and dynamic interactions, yet 
such data is difficult to obtain via real-world collection alone. This thesis 
addresses the bottleneck through controllable data synthesis, examining how 
to generate semantically and geometrically aligned visual data for perception 
learning.

Overall, this thesis argues that unified and scalable multimodal 
understanding requires both principled model training and controllable data 
synthesis. By connecting end-to-end MLLM construction, real-world scene 
reasoning, and controllable visual data synthesis, it provides a coherent 
pathway for multimodal models to perceive, reason about, and interact with 
complex visual environments.


Date:                   Wednesday, 21 May 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2128B
                        Lift 19

Committee Members:      Prof. Dit-Yan Yeung (Supervisor)
                        Prof. Chi-Keung Tang (Chairperson)
                        Dr. Dan Xu