The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Enhancing Multimodal Large Language Models: From Multimodal
Alignment, Fine-Grained Perception to Robust Reasoning"
By
Mr. Renjie PI
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced the
integration of visual and textual data, enabling applications such as image
captioning, visual question answering, and interactive AI agents. Despite
these advancements, MLLMs face persistent challenges that limit their
effectiveness. First, achieving precise alignment between visual inputs and
textual representations remains difficult, often leading to
misinterpretations and inconsistencies. Second, capturing fine-grained
perceptual signals within images is challenging, which limits the models'
usefulness in real-world applications that require visual grounding. Third,
performing robust reasoning across modalities is hindered by the models'
limited ability to integrate information from diverse sources, resulting in
superficial inferences, especially in tasks requiring complex logical
deductions or spatial reasoning. To address these challenges, this thesis
presents a series of methods aimed at enhancing MLLMs from the above aspects.
Through extensive experimentation and evaluation, our proposed methods
demonstrate significant improvements in the alignment, perception, and
reasoning capabilities of MLLMs. This work contributes to the development of
more robust and versatile MLLMs, paving the way for advanced applications in
areas such as autonomous driving, GUI agents, and robotics.
Date: Tuesday, 8 July 2025
Time: 9:30am - 11:30am
Venue: Room 5501
Lifts 25/26
Chairman: Dr. Qing CHEN (MAE)
Committee Members: Prof. Xiaofang ZHOU (Supervisor)
Dr. Qifeng CHEN
Prof. Raymond WONG
Dr. Sirui HAN (EMIA)
Dr. Linqi SONG (CityU)