A Survey on Visual Perception Enhanced Multimodal Large Language Models
PhD Qualifying Examination
Title: "A Survey on Visual Perception Enhanced Multimodal Large Language
Models"
by
Mr. Renjie PI
Abstract:
The integration of visual inputs with large language models (LLMs) has
catalyzed significant progress in multimodal capabilities, leading to the
emergence of vision large language models (VLLMs). Despite these
advancements, leveraging LLMs for complex visual perception tasks, such as
detection and segmentation, remains challenging. These difficulties make it
hard to directly apply state-of-the-art multimodal large language models
(MLLMs) to applications requiring precise localization, such as robotics and
autonomous driving. In this survey, we provide a comprehensive overview of
the current paradigm of multimodal LLMs, exploring the evolution and
methodologies that have been proposed to enhance these models with visual
perception capabilities. Additionally, we present a synthesis of our recent
findings and propose potential avenues for future research, aiming to bridge
the gap between visual perception and LLMs in real-world, precision-critical
tasks.
Date: Friday, 7 March 2025
Time: 2:00pm - 4:00pm
Venue: Room 2408 (Lifts 17/18)
Committee Members: Prof. Xiaofang Zhou (Supervisor)
                   Dr. Qifeng Chen (Chairperson)
                   Dr. May Fung
                   Dr. Yangqiu Song