PhD Qualifying Examination
Title: "Towards Unified Visual Intelligence: Bridging Dense Perception and
Complex Reasoning in Multimodal Foundation Models"
by
Mr. Jiazhen LIU
Abstract:
Multimodal Large Language Models (MLLMs) have achieved widespread success in
the visual domain, demonstrating exceptional high-level semantic
understanding in tasks such as visual question answering. However,
constrained by the prevailing discrete-token generation paradigm, MLLMs still
face significant bottlenecks in fine-grained dense perception tasks (e.g.,
pixel-level segmentation). Equipping MLLMs with native dense perception is
paramount for two reasons: first, it represents the ultimate pursuit of a
unified visual foundation model capable of handling all vision tasks within a
single closed-loop architecture; second, this low-level perception
fundamentally enhances the model's high-level comprehension. For instance, by
explicitly detecting or segmenting targets during Chain-of-Thought (CoT)
reasoning, models can effectively suppress background clutter and anchor
their logical deductions in precise visual cues. Consequently, this report
focuses on two core objectives: first, exploring how to optimize existing
MLLM architectures to natively and efficiently support diverse dense
perception tasks; and second, investigating how to deeply integrate these
dense perception capabilities into the intrinsic reasoning mechanisms of
MLLMs, paving the way for a unified, end-to-end foundation model capable of
Latent Visual Reasoning.
Date: Tuesday, 14 April 2026
Time: 2:00pm - 4:00pm
Venue: Room 2132C
Lift 22
Committee Members: Dr. Long Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Qifeng Chen