The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "From Objects to Prompts: Towards Generalized Multi-modal Foundation
Model"
By
Mr. Feng LI
Abstract:
The integration of vision and language has become pivotal in developing
generalized multimodal foundation models, enabling AI systems to understand
and interact with the world in increasingly human-like ways. This
dissertation traces my research journey from object-centric perception to
prompt-based multimodal understanding, focusing on scalability,
generalization, and real-world applicability. The foundation of this work
lies in advancing object-centric perception through novel query designs for
Transformer-based architectures (e.g., DN-DETR, DINO). Beyond closed-set
recognition, we expand perception to open-vocabulary language prompts
(OpenSeeD, Semantic-SAM) and visual prompts (DINOv), enabling versatile
human-AI interaction in real-world scenarios. Further, we generalize
vision-language integration by leveraging large language models (LLMs). Our
proposed LLaVA-Interleave unifies text, images, video, and 3D data through
multimodal interleaved processing, achieving unprecedented generalization
and pushing the boundaries of multimodal AI.
Date: Friday, 25 April 2025
Time: 9:00am - 11:00am
Venue: Room 4472
Lifts 25/26
Chairman: Prof. Vincent Kin Nang LAU (ECE)
Committee Members: Prof. Harry SHUM (Supervisor)
Prof. Lionel NI (Supervisor)
Dr. Qifeng CHEN
Dr. May FUNG
Prof. Ping TAN (ECE)
Dr. Song HAN (MIT)