Aligning With the Physical World: A Comprehensive Survey on Physically Grounded VLM Agents

PhD Qualifying Examination


Title: Aligning With the Physical World: A Comprehensive Survey on Physically 
Grounded VLM Agents

by

Mr. Yongjiang LIU


Abstract:

Vision-language models are increasingly moving from passive perception and
language generation toward embodied decision making, serving as agents
that must interpret, predict, and act in the physical world. Yet reliable
action in the physical world requires more than semantic recognition or 
plausible language reasoning: agents must recover physical state, predict 
intervention-sensitive consequences, and execute actions under embodiment, 
dynamics, and safety constraints. This survey examines the challenge through
the lens of physical alignment, defined as the alignment of an agent's
representations, reasoning mechanisms, and action policies with the causal 
structure and physical constraints of the real world. We organize the 
literature around a gap-centric taxonomy with three recurring failure modes: 
representational failure, where semantic visual-language abstractions fail to 
capture geometry, affordances, physical properties, and temporal state; 
causal failure, where models rely on language or visual priors instead of 
intervention-aware physical reasoning and world models; and operational 
failure, where high-level multimodal competence does not translate into 
continuous, closed-loop, and safe embodied action. We review representative 
work on spatial grounding, affordance reasoning, physical property 
estimation, temporal perception, intuitive physics, world models, 
counterfactual reasoning, long-horizon consistency, vision-language-action
(VLA) policies, action tokenization, closed-loop interaction, and embodied
safety. We further
summarize benchmark trends and argue that physically grounded evaluation 
should be diagnostic, intervention-based, process-aware, embodiment-aware, 
robustness-oriented, and bidirectional. Finally, we identify future
directions, including scaling physical data, hybrid neural-symbolic and
physics-aware models, intervention-aware foundation models, retrospective
verification, and active experimentation. The central message of this survey
is that physical intelligence is not simply a by-product of larger multimodal 
models, but a distinct alignment problem that connects perception, causality, 
and action.


Date:                   Monday, 1 June 2026

Time:                   3:30pm - 4:30pm

Venue:                  Room 2129A
                        Lift 19

Committee Members:      Prof. Song Guo (Supervisor)
                        Dr. Shuai Wang (Chairperson)
                        Dr. Wei Wang