The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "From Objects to Prompts: Towards Generalized Multi-modal Foundation Model"

By

Mr. Feng LI


Abstract:

The integration of vision and language has become pivotal in developing 
generalized multimodal foundation models, enabling AI systems to understand 
and interact with the world in increasingly human-like ways. This 
dissertation traces my research journey from object-centric perception to 
prompt-based multimodal understanding, focusing on scalability, 
generalization, and real-world applicability. The foundation of this work 
lies in advancing object-centric perception through novel query designs for 
Transformer-based architectures (e.g., DN-DETR, DINO). Beyond closed-set 
recognition, we expand perception to open-vocabulary language prompts 
(OpenSeeD, Semantic-SAM) and visual prompts (DINOv), enabling versatile 
human-AI interaction in real-world scenarios. Further, we generalize 
vision-language integration by leveraging large language models (LLMs). Our 
proposed LLaVA-Interleave unifies text, images, video, and 3D data through 
multimodal interleaved processing, achieving broad generalization across 
modalities and pushing the boundaries of multimodal AI.


Date:                   Friday, 25 April 2025

Time:                   9:00am - 11:00am

Venue:                  Room 4472
                        Lifts 25/26

Chairman:               Prof. Vincent Kin Nang LAU (ECE)

Committee Members:      Prof. Harry SHUM (Supervisor)
                        Prof. Lionel NI (Supervisor)
                        Dr. Qifeng CHEN
                        Dr. May FUNG
                        Prof. Ping TAN (ECE)
                        Dr. Song HAN (MIT)