How Far Are We from Visual Language Understanding? Pitfalls and Progress in Vision-Language Models

Speaker: Yifan Hou
Department of Computer Science
ETH Zurich

Title: How Far Are We from Visual Language Understanding? Pitfalls and Progress in Vision-Language Models

Date: Monday, 25 August 2025

Time: 2:00pm - 3:30pm (followed by research discussion)

Venue: Room 2503 (via lift 25/26), HKUST

Abstract:

Visual language, comprising symbols, spatial structures, and abstract forms, is a powerful medium for conveying structured information beyond natural language. Diagrams, as a key example, require models not just to perceive, but to ground symbols, interpret structure, and reason visually. In this talk, I examine how well current vision-language models (VLMs) understand visual language. Drawing on a semiotics-inspired framework, we break down understanding into four levels: recognition, interpretation, grounding, and reasoning. Using a new benchmark focused on diagram understanding, we reveal three types of shortcuts that VLMs often exploit: visual memorization, structural patterning (the Clever Hans effect), and knowledge-based shortcuts. I will highlight our study on knowledge shortcuts, where models often rely on prior associations, such as commonsense knowledge, rather than genuine visual-symbolic reasoning. This exposes a key limitation in how current multimodal systems integrate perception and reasoning. Overall, these findings show that VLMs' progress in visual language understanding remains fragile. I will close by discussing paths toward more robust, semantically grounded multimodal reasoning.

Biography:

Yifan Hou is a final-year Ph.D. candidate in Computer Science at ETH Zurich, advised by Mrinmaya Sachan and Antoine Bosselut. His research explores the intersection of vision-language learning, interpretability, and reasoning, with the goal of building AI systems that both perceive the world like humans and reason in ways we understand. His work has been published at top venues such as NeurIPS, ICML, ICLR, ACL, and EMNLP, and recognized with distinctions including the SDSC Ph.D. Fellowship. Yifan has gained research experience at Meta AI, EPFL, Tencent, and TTIC, and actively contributes to the research community as a reviewer for major conferences.
