Aligning With the Physical World: A Comprehensive Survey on Physically Grounded VLM Agents
PhD Qualifying Examination
Title: Aligning With the Physical World: A Comprehensive Survey on Physically
Grounded VLM Agents
by
Mr. Yongjiang LIU
Abstract:
Vision-language models are increasingly moving from passive perception and
language generation toward embodied decision making, where they serve as agents
that must interpret, predict, and act in the physical world. Yet reliable
action in the physical world requires more than semantic recognition or
plausible language reasoning: agents must recover physical state, predict
intervention-sensitive consequences, and execute actions under embodiment,
dynamics, and safety constraints. This survey examines the challenge through
the lens of physical alignment, defined as the alignment of an agent's
representations, reasoning mechanisms, and action policies with the causal
structure and physical constraints of the real world. We organize the
literature around a gap-centric taxonomy with three recurring failure modes:
representational failure, where semantic visual-language abstractions fail to
capture geometry, affordances, physical properties, and temporal state;
causal failure, where models rely on language or visual priors instead of
intervention-aware physical reasoning and world models; and operational
failure, where high-level multimodal competence does not translate into
continuous, closed-loop, and safe embodied action. We review representative
work on spatial grounding, affordance reasoning, physical property
estimation, temporal perception, intuitive physics, world models,
counterfactual reasoning, long-horizon consistency, vision-language-action (VLA) policies, action
tokenization, closed-loop interaction, and embodied safety. We further
summarize benchmark trends and argue that physically grounded evaluation
should be diagnostic, intervention-based, process-aware, embodiment-aware,
robustness-oriented, and bidirectional. Finally, we identify future
directions toward scaling physical data, hybrid neural-symbolic and
physics-aware models, intervention-aware foundation models, retrospective
verification, and active experimentation. The central message of this survey
is that physical intelligence is not simply a by-product of larger multimodal
models, but a distinct alignment problem that connects perception, causality,
and action.
Date: Monday, 1 June 2026
Time: 3:30pm - 4:30pm
Venue: Room 2129A, Lift 19
Committee Members: Prof. Song Guo (Supervisor)
Dr. Shuai Wang (Chairperson)
Dr. Wei Wang