From Objects to Prompts: Towards Generalized Multi-modal Foundation Model
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "From Objects to Prompts: Towards Generalized Multi-modal Foundation Model"

By

Mr. Feng LI

Abstract:

The integration of vision and language has become pivotal in developing generalized multimodal foundation models, enabling AI systems to understand and interact with the world in increasingly human-like ways. This dissertation traces my research journey from object-centric perception to prompt-based multimodal understanding, focusing on scalability, generalization, and real-world applicability. The foundation of this work lies in advancing object-centric perception through novel query designs for Transformer-based architectures (e.g., DN-DETR, DINO). Beyond closed-set recognition, we expand perception to open-vocabulary language prompts (OpenSeed, Semantic-SAM) and visual prompts (DINOv), enabling versatile human-AI interaction in real-world scenarios. Further, we generalize vision-language integration by leveraging large language models (LLMs). Our proposed LLaVA-Interleave unifies text, images, video, and 3D data through multimodal interleaved processing, achieving unprecedented generalization and pushing the boundaries of multimodal AI.

Date: Friday, 25 April 2025
Time: 9:00am - 11:00am
Venue: Room 4472 (Lifts 25/26)

Chairman: Prof. Vincent Kin Nang LAU (ECE)

Committee Members:
Prof. Harry SHUM (Supervisor)
Prof. Lionel NI (Supervisor)
Dr. Qifeng CHEN
Dr. May FUNG
Prof. Ping TAN (ECE)
Dr. Song HAN (MIT)