PhD Thesis Proposal Defence
Title: "Enhancing Multimodal Large Language Models: From Multimodal
Alignment, Fine-Grained Perception to Robust Reasoning"
by
Mr. Renjie PI
Abstract:
Multimodal Large Language Models (MLLMs) have significantly advanced the
integration of visual and textual data, enabling applications such as image
captioning, visual question answering, and interactive AI agents. Despite
these advancements, MLLMs face persistent challenges that limit their
effectiveness. First, achieving precise alignment between visual inputs and
textual representations remains difficult, often leading to
misinterpretations and inconsistencies. Second, capturing fine-grained
perceptual signals within images is challenging, limiting the models'
effectiveness in real-world applications that require visual grounding. Third,
performing robust reasoning across modalities is hindered by the models'
limited ability to integrate information from diverse sources, resulting in
superficial inferences, especially in tasks requiring complex logical
deductions or spatial reasoning.
To address these challenges, this thesis presents a series of methods aimed
at enhancing MLLMs along these three dimensions. Through extensive
experimentation and evaluation, the proposed methods demonstrate significant improvements in
the alignment, perception, and reasoning capabilities of MLLMs. This work
contributes to the development of more robust and versatile multimodal
systems, paving the way for advanced applications in areas such as visual
question answering, image captioning, and interactive AI agents.
Date: Friday, 9 May 2025
Time: 4:15pm - 6:15pm
Venue: Room 2408
Lifts 17/18
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Yangqiu Song