Structured Scene Representation Learning for End-to-End Navigation in Autonomous Driving
PhD Thesis Proposal Defence
Title: "Structured Scene Representation Learning for End-to-End Navigation in
Autonomous Driving"
by
Miss Xiaodong MEI
Abstract:
Autonomous driving is transitioning from modular, rule-based systems towards
end-to-end, data-driven paradigms, which promise greater adaptability in
complex traffic environments. However, a central challenge remains for
end-to-end autonomous driving: how to effectively represent dynamic and
interactive driving scenarios in a way that supports scene understanding,
reasoning and planning.
This thesis investigates structured scene representation learning in
end-to-end autonomous driving. We address this challenge progressively along
three axes: (a) from constrained intersection scenarios to diverse urban
driving scenarios; (b) from sparse scene graphs to dense token sequences and
ultimately to a unified latent space; and (c) from intermediate vectorized
inputs to raw images and natural language instructions.
First, we begin with explicit interaction modeling of surrounding agents in
G-GIL, a branched framework that represents an unsignalized intersection
scenario as a structured graph. We employ graph convolutional networks (GCNs)
to aggregate scene features and combine them with conditional imitation
learning to learn safe and reactive navigation policies from expert
demonstrations.
Second, we move beyond hand-crafted graph structures with HAMF, a hybrid
Attention-Mamba framework that models the various scene elements in urban
driving scenarios as a sequence of tokens. We jointly encode scene context and
future motion representations in a unified architecture and decode feasible
and diverse trajectories without explicit element-relation definitions.
Finally, we extend from vectorized inputs to raw images and natural language
instructions with LVDrive, a vision-language-action (VLA) model enhanced with
latent visual representations. It jointly models future scene representations
and motion features in a shared latent space, enabling future-aware reasoning
that refines trajectory generation for end-to-end navigation.
Collectively, this thesis traces a pathway from sparse scene graphs to dense
latent representations, and from constrained intersection navigation with
vectorized inputs to open-world end-to-end driving with raw sensor inputs.
These contributions move toward a more unified foundation for scene
representation learning that enhances scene understanding, reasoning and
planning in autonomous driving.
Date: Tuesday, 23 June 2026
Time: 10:00am - 12:00 noon
Venue: Room 3494
Lift 25/26
Committee Members: Dr. Dan Xu (Supervisor)
Dr. Hao Chen (Chairperson)
Dr. Long Chen