From Objects to Prompts: Towards Generalized Multimodal Foundation Model
PhD Thesis Proposal Defence
Title: "From Objects to Prompts: Towards Generalized Multimodal Foundation
Model"
by
Mr. Feng LI
Abstract:
In the rapidly evolving landscape of artificial intelligence, the
integration of vision and language models has become pivotal in developing
generalized multimodal foundation models. This talk, titled "From Objects to
Prompts," traces the evolution of my research from object-centric methods to
prompt-based approaches for building such models. Beginning with
foundational perception models (DINO, Mask DINO), the work progresses to
vision-language prompts, transitioning from text-prompted understanding
(Grounding DINO) to vision-prompted understanding (SEEM, T-Rex).
I will also highlight my contributions to vision LLMs (LLaVA-Next series),
showcasing how advancements in multimodal integration and interleaved
formatting enable seamless functionality across text, images, video, and 3D.
These efforts unify diverse tasks and domains, driving remarkable
generalization and expanding the boundaries of multimodal AI.
Short Bio:
Feng Li is a final-year PhD student at HKUST, supervised by Heung-Yeung
Shum. His research focuses on fine-grained visual understanding and
vision-language models. He is a core contributor to the 'DINO' series (DINO,
Mask DINO, Grounding DINO) for object detection and the LLaVA-Next series
(LLaVA-NeXT-Interleave, LLaVA-OneVision) for large vision-language models.
Feng's first-author and co-first-author works have earned over 15,000 stars on
GitHub and more than 6,500 citations on Google Scholar. He has also interned at
Meta AI (FAIR), Microsoft Research (Redmond), and ByteDance.
Date: Friday, 14 February 2025
Time: 11:00am - 1:00pm
Venue: Room 3523
Lifts 25/26
Committee Members: Prof. Harry Shum (Supervisor)
Prof. Lionel Ni (Co-Supervisor)
Dr. Dan Xu (Chairperson)
Dr. May Fung