PhD Qualifying Examination


Title: "A Survey on Visual Perception Enhanced Multimodal Large Language 
Models"

by

Mr. Renjie PI


Abstract:

The integration of visual inputs with large language models (LLMs) has
catalyzed significant progress in multimodal capabilities, leading to the
emergence of multimodal large language models (MLLMs). Despite these
advancements, leveraging LLMs for complex visual perception tasks, such as
detection and segmentation, remains challenging. These difficulties hinder
the direct deployment of state-of-the-art MLLMs in applications requiring
precise localization, such as robotics and autonomous driving. In this
survey, we provide a comprehensive overview of the current MLLM paradigm,
tracing the evolution of the field and the methodologies proposed to endow
these models with visual perception capabilities. Additionally, we present
a synthesis of our recent findings and propose potential avenues for future
research, aiming to bridge the gap between visual perception and LLMs in
real-world, precision-critical tasks.


Date:                   Friday, 7 March 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 2408
                        Lifts 17/18

Committee Members:      Prof. Xiaofang Zhou (Supervisor)
                        Dr. Qifeng Chen (Chairperson)
                        Dr. May Fung
                        Dr. Yangqiu Song