A Survey on Visual Perception Enhanced Multimodal Large Language Models
PhD Qualifying Examination

Title: "A Survey on Visual Perception Enhanced Multimodal Large Language Models"

by

Mr. Renjie PI

Abstract:

The integration of visual inputs with large language models (LLMs) has catalyzed significant progress in multimodal capabilities, leading to the emergence of vision large language models (VLLMs). Despite these advancements, leveraging LLMs for complex visual perception tasks, such as detection and segmentation, remains challenging. These difficulties make it hard to directly apply state-of-the-art multimodal large language models (MLLMs) to applications requiring precise localization, such as robotics and autonomous driving. In this survey, we provide a comprehensive overview of the current paradigm of multimodal LLMs, exploring the evolution and methodologies that have been proposed to enhance these models with visual perception capabilities. Additionally, we present a synthesis of our recent findings and propose potential avenues for future research, aiming to bridge the gap between visual perception and LLMs in real-world, precision-critical tasks.

Date: Friday, 7 March 2025
Time: 2:00pm - 4:00pm
Venue: Room 2408, Lifts 17/18

Committee Members:
Prof. Xiaofang Zhou (Supervisor)
Dr. Qifeng Chen (Chairperson)
Dr. May Fung
Dr. Yangqiu Song