Structured Scene Representation Learning for End-to-End Navigation in Autonomous Driving

PhD Thesis Proposal Defence


Title: "Structured Scene Representation Learning for End-to-End Navigation in 
Autonomous Driving"

by

Miss Xiaodong MEI


Abstract:

Autonomous driving is transitioning from modular, rule-based systems towards
end-to-end, data-driven paradigms, which promise greater adaptability in
complex traffic environments. However, a central challenge remains for
end-to-end autonomous driving: how to effectively represent dynamic and
interactive driving scenarios in a way that supports scene understanding,
reasoning and planning.

This thesis investigates structured scene representation learning in
end-to-end autonomous driving. We address the challenge progressively along
three axes: (a) from constrained intersection scenarios to diverse urban
driving scenarios; (b) from sparse scene graphs to dense token sequences and
ultimately to a unified latent space; and (c) from intermediate vectorized
inputs to raw images and natural language instructions.

First, we begin with explicit interaction modeling of surrounding agents in
G-GIL, a branched framework that represents the unsignalized intersection
scenario as a structured graph. We employ graph convolutional networks (GCNs)
to aggregate scene features, combined with conditional imitation learning to
generate safe and reactive navigation policies from expert demonstrations.

Second, we move beyond hand-crafted graph structures with HAMF, a hybrid
Attention-Mamba framework that models the various scene elements in urban
driving scenarios as a sequence of tokens. We jointly encode the scene context
and future motion representations in a unified architecture to decode
feasible and diverse trajectories without explicit element-relation
definitions.

Finally, we extend from vectorized inputs to raw images and natural language
instructions with LVDrive, a vision-language-action (VLA) model enhanced with
latent visual representations. LVDrive jointly models future scene
representations and motion features in a shared latent space, enabling
future-aware reasoning that refines trajectory generation for end-to-end
navigation.

Collectively, this thesis traces a pathway from sparse scene graphs to dense
latent representations, and from constrained intersection navigation with
vectorized inputs to open-world end-to-end driving with raw sensor inputs.
These contributions move toward a more unified foundation for scene
representation learning, enhancing scene understanding, reasoning and
planning in autonomous driving.


Date:                   Tuesday, 23 June 2026

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lift 25/26

Committee Members:      Dr. Dan Xu (Supervisor)
                        Dr. Hao Chen (Chairperson)
                        Dr. Long Chen