Language-Grounded Visual Intelligence
Speaker: Dr. Yiwu Zhong
Chinese University of Hong Kong
Date: Thursday, 13 March 2025
Time: 4:00pm - 5:00pm
Venue: Room 4502 (via lift 25/26), HKUST
Abstract:
In this talk, I will present my research on language-grounded visual intelligence, a field that aligns the visual world with human language to achieve unified understanding across both modalities. By aligning visual and textual representations, we can build generalizable, scalable multi-modal models capable of perceiving, reasoning, planning, and interacting with the open visual world while responding to human instructions. I will start with open-world object detection, focusing on methods that identify objects across diverse, unconstrained environments. I will then introduce my work on designing and generating structured visual representations tailored for complex multi-modal reasoning. Next, I will demonstrate my work on task planning, which learns diverse task procedures from unannotated internet videos and enables zero-shot step forecasting. With vision-language research as the foundation, the talk will conclude with an outline of my future research directions in embodied AI.
Biography:
Yiwu Zhong obtained his Ph.D. in Computer Science from the University of Wisconsin-Madison and is currently a postdoctoral fellow at the Chinese University of Hong Kong. His research focuses on language-grounded visual intelligence, which bridges the gap between the visual world and human language to achieve unified multimodal understanding. His work extends to cutting-edge fields such as multimodal large language models, embodied intelligence, and mobile computing. He has published over 10 papers at top-tier conferences, with representative contributions to open-world object detection (e.g., RegionCLIP and GLIP), structured visual representation learning (e.g., SGG-NLS and VisualTable), and action planning (e.g., ProcedureVRL and NaviLLM). He has completed 24 months of internships at leading tech companies, including Microsoft Research, Meta AI, and Tencent AI Lab (Seattle and Shenzhen). His honors include the CVPR 2023 Doctoral Consortium Award and selection as a CVPR 2022 Best Paper Finalist.