Multimodal Learning for Complex Visual Environments: From Multimodal Understanding to Controllable Data Synthesis
PhD Thesis Proposal Defence
Title: "Multimodal Learning for Complex Visual Environments: From Multimodal
Understanding to Controllable Data Synthesis"
by
Mr. Kai CHEN
Abstract:
Complex visual environments pose a demanding challenge for modern multimodal
learning. Models must perceive heterogeneous contexts spanning images,
language, and speech; reason over diverse and dynamic situations; and
generate outputs that support interaction with both human users and the
visual environment. A second challenge lies in the visual experience required
for training and evaluation: long-tailed scenes are costly to collect,
expensive to annotate, and hard to vary systematically. This raises the
central question of the thesis: how can multimodal models be trained to
understand complex visual environments in a unified and scalable manner?
This thesis begins by establishing the training and evaluation foundations
for multimodal understanding. It first studies how to extend the reasoning
and instruction-following abilities of Large Language Models (LLMs) to
unified Multimodal LLMs (MLLMs), covering large-scale data curation,
end-to-end training, and evaluation. It then moves to real-world scene
reasoning, where MLLMs must perceive relevant entities, localize critical
regions, explain their influence, and provide task-aware suggestions, with
self-driving serving as a representative stress test.
The perception failures observed in complex scene reasoning further expose a
data bottleneck in multimodal understanding: models need visual experience
that covers rare objects, structured layouts, and dynamic interactions, yet
such data is difficult to obtain via real-world collection alone. This thesis
addresses the bottleneck through controllable data synthesis, examining how
to generate semantically and geometrically aligned visual data for perception
learning.
Overall, this thesis argues that unified and scalable multimodal
understanding requires both principled model training and controllable data
synthesis. By connecting end-to-end MLLM construction, real-world scene
reasoning, and controllable visual data synthesis, it provides a coherent
pathway for multimodal models to perceive, reason about, and interact with
complex visual environments.
Date: Wednesday, 21 May 2026
Time: 2:00pm - 4:00pm
Venue: Room 2128B
Lift 19
Committee Members: Prof. Dit-Yan Yeung (Supervisor)
Prof. Chi-Keung Tang (Chairperson)
Dr. Dan Xu