From Generation to Interaction: A Survey of Interactive Video World Models

PhD Qualifying Examination


Title: "From Generation to Interaction: A Survey of Interactive Video World 
Models"

by

Mr. Bohai GU


Abstract:

The rapid progress of video foundation models and embodied AI has accelerated 
a major transition in world modeling, from latent-state models designed for 
internal planning toward pixel-space models that directly generate visible 
interactive worlds. This survey examines the transition through a 
problem-oriented framework rather than a flat taxonomy of architectures. We 
organize the current field around three tightly coupled subproblems: 
generation, which concerns the shift from offline clip synthesis to causal 
rollout for interactive systems; action diversity, which examines how 
camera-based, instruction-based, and other action signals can be 
incorporated into video world models to induce meaningful world changes; and 
context construction, which addresses how world state can be represented, 
refreshed, and maintained across long-horizon interaction. Beyond these core 
subproblems, we discuss physics-aware interaction, off-camera state 
prediction, and self-evolution as promising future directions. Overall, this 
survey aims to clarify how interactive video world models can evolve from 
visually plausible generators into causal, actionable, and stateful world 
simulators.


Date:                   Wednesday, 27 May 2026

Time:                   10:00am - 12:00noon

Venue:                  Room 2128A
                        Lift 19

Committee Members:      Prof. Song Guo (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Long Chen