PhD Qualifying Examination


Title: "Towards Unified Visual Intelligence: Bridging Dense Perception and 
Complex Reasoning in Multimodal Foundation Models"

by

Mr. Jiazhen LIU


Abstract:

Multimodal Large Language Models (MLLMs) have achieved widespread success in 
the visual domain, demonstrating exceptional high-level semantic 
understanding in tasks such as visual question answering. However, 
constrained by the prevailing discrete-token generation paradigm, MLLMs still 
face significant bottlenecks in fine-grained dense perception tasks (e.g., 
pixel-level segmentation). Equipping MLLMs with native dense perception is 
critical for two reasons: first, it is a necessary step toward a unified 
visual foundation model capable of handling all vision tasks within a single 
closed-loop architecture; second, such low-level perception fundamentally 
strengthens the model's high-level comprehension. For instance, by explicitly 
detecting or segmenting targets during Chain-of-Thought (CoT) reasoning, a 
model can suppress background clutter and anchor its logical deductions in 
precise visual cues. Consequently, this report pursues two core objectives: 
first, exploring how to adapt existing MLLM architectures to natively and 
efficiently support diverse dense perception tasks; and second, 
investigating how to integrate these dense perception capabilities deeply 
into the intrinsic reasoning mechanisms of MLLMs, paving the way for a 
unified, end-to-end foundation model capable of Latent Visual Reasoning.


Date:                   Tuesday, 14 April 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Dr. Long Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Qifeng Chen