From Objects to Prompts: Towards Generalized Multimodal Foundation Model
PhD Thesis Proposal Defence
Title: "From Objects to Prompts: Towards Generalized Multimodal Foundation
Model"
by
Mr. Feng LI
Abstract:
In the rapidly evolving landscape of artificial intelligence, the
integration of vision and language models has become pivotal in developing
generalized multimodal foundation models. This talk, titled "From Objects to
Prompts," traces the evolution of my research from object-centric methods to
prompt-based approaches for building such models. Beginning with
foundational perception models (DINO, Mask DINO), the work progresses to
vision-language prompts, transitioning from text-prompted understanding
(Grounding DINO) to vision-prompted understanding (SEEM, T-Rex).
I will also highlight my contributions to vision LLMs (LLaVA-Next series),
showcasing how advancements in multimodal integration and interleaved
formatting enable seamless functionality across text, images, video, and 3D.
These efforts unify diverse tasks and domains, driving remarkable
generalization and expanding the boundaries of multimodal AI.
Short Bio:
Feng Li is a final-year PhD student at HKUST, supervised by Heung-Yeung
Shum. His research focuses on fine-grained visual understanding and
vision-language models. He is a core contributor to the 'DINO' series (DINO,
Mask DINO, Grounding DINO) for object detection and the LLaVA-Next series
(LLaVA-NeXT-Interleave, LLaVA-OneVision) for large vision-language models.
Feng's first-author and co-first-author works have earned over 15,000 stars on
GitHub and more than 6,500 citations on Google Scholar. He has also interned at
Meta AI (FAIR), Microsoft Research (Redmond), and ByteDance.
Date: Friday, 14 February 2025
Time: 11:00am - 1:00pm
Venue: Room 3523
Lifts 25/26
Committee Members: Prof. Harry Shum (Supervisor)
Prof. Lionel Ni (Co-Supervisor)
Dr. Dan Xu (Chairperson)
Dr. May Fung