Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning
PhD Thesis Proposal Defence

Title: "Enhancing Multimodal Large Language Models: From Multimodal Alignment, Fine-Grained Perception to Robust Reasoning"

by

Mr. Renjie PI

Abstract:

Multimodal Large Language Models (MLLMs) have significantly advanced the integration of visual and textual data, enabling applications such as image captioning, visual question answering, and interactive AI agents. Despite these advancements, MLLMs face persistent challenges that limit their effectiveness. First, achieving precise alignment between visual inputs and textual representations remains difficult, often leading to misinterpretations and inconsistencies. Second, capturing fine-grained perceptual signals within images is challenging, making the models difficult to deploy in real-world applications that require visual grounding. Third, robust reasoning across modalities is hindered by the models' limited ability to integrate information from diverse sources, resulting in superficial inferences, especially in tasks requiring complex logical deductions or spatial reasoning.

To address these challenges, this thesis presents a series of methods aimed at enhancing MLLMs along the above three dimensions. Through extensive experimentation and evaluation, our proposed methods demonstrate significant improvements in the alignment, perception, and reasoning capabilities of MLLMs. This work contributes to the development of more robust and versatile multimodal systems, paving the way for advanced applications in areas such as visual question answering, image captioning, and interactive AI agents.

Date: Friday, 9 May 2025
Time: 4:15pm - 6:15pm
Venue: Room 2408, Lifts 17/18

Committee Members:
Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Yangqiu Song