PhD Thesis Proposal Defence
Title: "Enhancing Multimodal Large Language Models: From Multimodal
Alignment, Fine-Grained Perception to Robust Reasoning"
by
Mr. Renjie PI
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced the
integration of visual and textual data, enabling applications such as image
captioning, visual question answering, and interactive AI agents. Despite
these advancements, MLLMs face persistent challenges that limit their
effectiveness. First, achieving precise alignment between visual inputs and
textual representations remains difficult, often leading to
misinterpretations and inconsistencies. Second, capturing fine-grained
perceptual signals within images is challenging, limiting the models'
effectiveness in real-world applications that require visual grounding. Third,
performing robust reasoning across modalities is hindered by the models'
limited ability to integrate information from diverse sources, resulting in
superficial inferences, especially in tasks requiring complex logical
deductions or spatial reasoning.
To address these challenges, this thesis presents a series of methods aimed
at enhancing MLLMs along these three dimensions. Through extensive
experimentation and evaluation, the proposed methods demonstrate significant improvements in
the alignment, perception, and reasoning capabilities of MLLMs. This work
contributes to the development of more robust and versatile multimodal
systems, paving the way for advanced applications in areas such as visual
question answering, image captioning, and interactive AI agents.
Date: Friday, 9 May 2025
Time: 4:15pm - 6:15pm
Venue: Room 2408
Lifts 17/18
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Yangqiu Song