Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning"

By

Mr. Renjie PI

Abstract:

Multimodal Large Language Models (MLLMs) have significantly advanced the integration of visual and textual data, enabling applications such as image captioning, visual question answering, and interactive AI agents. Despite these advances, MLLMs face persistent challenges that limit their effectiveness. First, achieving precise alignment between visual inputs and textual representations remains difficult, often leading to misinterpretations and inconsistencies. Second, capturing fine-grained perceptual signals within images is challenging, which limits the models' usefulness in real-world applications that require visual grounding. Third, robust reasoning across modalities is hindered by the models' limited ability to integrate information from diverse sources, resulting in superficial inferences, especially in tasks requiring complex logical deduction or spatial reasoning.

To address these challenges, this thesis presents a series of methods that enhance MLLMs along these three dimensions. Through extensive experimentation and evaluation, the proposed methods demonstrate significant improvements in the alignment, perception, and reasoning capabilities of MLLMs. This work contributes to the development of more robust and versatile MLLMs, paving the way for advanced applications in areas such as autonomous driving, GUI agents, and robotics.

Date: Tuesday, 8 July 2025
Time: 9:30am - 11:30am
Venue: Room 5501 (Lifts 25/26)

Chairman: Dr. Qing CHEN (MAE)

Committee Members:
Prof. Xiaofang ZHOU (Supervisor)
Dr. Qifeng CHEN
Prof. Raymond WONG
Dr. Sirui HAN (EMIA)
Dr. Linqi SONG (CityU)