From Objects to Prompts: Towards Generalized Multimodal Foundation Model
PhD Thesis Proposal Defence

Title: "From Objects to Prompts: Towards Generalized Multimodal Foundation Model"

by

Mr. Feng LI

Abstract:

In the rapidly evolving landscape of artificial intelligence, the integration of vision and language models has become pivotal in developing generalized multimodal foundation models. This talk, titled "From Objects to Prompts," traces the evolution of my research from object-centric methods to prompt-based approaches for building such models. Beginning with foundational perception models (DINO, Mask DINO), the work progresses to vision-language prompts, transitioning from text-prompted understanding (Grounding DINO) to vision-prompted understanding (SEEM, T-Rex). I will also highlight my contributions to vision LLMs (the LLaVA-Next series), showcasing how advancements in multimodal integration and interleaved formatting enable seamless functionality across text, images, video, and 3D. These efforts unify diverse tasks and domains, driving remarkable generalization and expanding the boundaries of multimodal AI.

Short Bio:

Feng Li is a final-year PhD student at HKUST, supervised by Prof. Heung-Yeung Shum. His research focuses on fine-grained visual understanding and vision-language models. He is a core contributor to the DINO series (DINO, Mask DINO, Grounding DINO) for object detection and to the LLaVA-Next series (LLaVA-NeXT-Interleave, LLaVA-OneVision) for large vision-language models. His first-authored and co-first-authored works have earned over 15,000 stars on GitHub and 6,500 citations on Google Scholar. He has also interned at Meta AI (FAIR), Microsoft Research (Redmond), and ByteDance.

Date: Friday, 14 February 2025
Time: 11:00am - 1:00pm
Venue: Room 3523 (Lifts 25/26)

Committee Members:
Prof. Harry Shum (Supervisor)
Prof. Lionel Ni (Co-Supervisor)
Dr. Dan Xu (Chairperson)
Dr. May Fung