The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Learning Representations for Efficient Data Processing, 3D Perception, and
Planning in Autonomous Driving"

By

Mr. Zhili CHEN


Abstract:

Enabling autonomous systems to perceive, reason, and interact safely with the 
3D world is fundamental to advancing physical intelligence. Data, 3D 
perception, and planning are the three primary pillars of building reliable 
autonomous systems.

The rapid proliferation of intelligent vehicles generates massive amounts of 
sensor data that can empower more advanced models. Yet, raw sensor data impose 
heavy storage and transmission burdens, especially for sparse, disordered 
point cloud data. While learning-based compression methods for point cloud 
data show promise, they have not fully exploited the inherent redundancies in 
the data. Perception largely determines the performance limits of an 
autonomous system. It requires finer-grained geometric modeling within limited 
computational budgets, along with more effective representations that can fuse 
rich multi-sensor scene details for diverse downstream tasks. Planning further 
demands an understanding of the spatiotemporal dynamics among traffic 
participants, map elements, and the environment, necessitating finer-grained 
modeling of interactions and game-theoretic behavior to support reliable, 
human-like decision-making.

This thesis aims to tackle challenges in data, perception, and planning 
through a progressive line of work. We first propose an octree-based 
compression framework for point cloud data. By leveraging the context of 
sibling nodes' children at a finer-grained resolution, the framework learns a 
more informative representation that enables the entropy model to encode the 
point cloud data into a more compact bitstream. 
Extending the idea of learning more representative features for point cloud 
data, we introduce an efficient plug-and-play cross-cluster shifting operation 
that improves object recognition performance by enabling information exchange 
and modeling longer-range dependencies among points. We further propose an 
efficient vector representation that fuses fine-grained features across 
sensors, in contrast to the Bird's-Eye-View (BEV) representation, which incurs 
quadratic computational costs. Finally, to improve planning for self-driving 
vehicles, we explicitly model the interactions among ego-to-agent, ego-to-map, 
and ego-to-BEV query representations by interleaving planning and prediction 
throughout the prediction horizon, rather than relying on a single sequential 
interaction.

Together, these contributions advance representation learning for 
data-efficient processing, 3D perception, and interaction-aware planning 
toward safer and more reliable autonomous driving.


Date:                   Tuesday, 19 May 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2128B
                        Lift 19

Chairman:               Prof. King Lau CHOW (LIFS)

Committee Members:      Dr. Qifeng CHEN (Supervisor)
                        Dr. May FUNG
                        Dr. Dan XU
                        Prof. Ping TAN (ECE)
                        Prof. Si LIU (Beihang University)