Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning"

By

Mr. Renjie PI

Abstract:

Multimodal Large Language Models (MLLMs) have significantly advanced the integration of visual and textual data, enabling applications such as image captioning, visual question answering, and interactive AI agents. Despite these advances, MLLMs face persistent challenges that limit their effectiveness. First, achieving precise alignment between visual inputs and textual representations remains difficult, often leading to misinterpretations and inconsistencies. Second, capturing fine-grained perceptual signals within images is challenging, which limits the models' usefulness in real-world applications that require visual grounding. Third, robust reasoning across modalities is hindered by the models' limited ability to integrate information from diverse sources, resulting in superficial inferences, especially in tasks requiring complex logical deduction or spatial reasoning.

To address these challenges, this thesis presents a series of methods that enhance MLLMs along these three dimensions. Through extensive experimentation and evaluation, the proposed methods demonstrate significant improvements in the alignment, perception, and reasoning capabilities of MLLMs. This work contributes to the development of more robust and versatile MLLMs, paving the way for advanced applications in areas such as autonomous driving, GUI agents, and robotics.

Date: Tuesday, 8 July 2025
Time: 9:30am - 11:30am
Venue: Room 5501 (Lifts 25/26)

Chairman: Dr. Qing CHEN (MAE)

Committee Members:
Prof. Xiaofang ZHOU (Supervisor)
Dr. Qifeng CHEN
Prof. Raymond WONG
Dr. Sirui HAN (EMIA)
Dr. Linqi SONG (CityU)