PhD Thesis Proposal Defence


Title: "Learned Structures: Data-Aware Primitives for State Management"

by

Mr. Siyuan HAN


Abstract:

The exponential growth of data-intensive applications has precipitated a 
fundamental hardware bottleneck known as the Storage Wall: a widening disparity 
between processor throughput and storage/memory access costs. This bottleneck is 
particularly acute in large-scale state management systems, where the working set 
can exceed main memory and performance becomes sensitive to both metadata 
footprint and tail latency.

Traditional general-purpose indexes, such as B-Trees and Log-Structured 
Merge-trees, are designed around rigid discrete invariants; when applied to massive 
or skewed datasets, they incur significant memory overheads and read/write 
amplification that hinder scalability under tight space budgets.

This dissertation advances a learned-first view of access methods via learned 
primitives: data-aware structures that make the approximation error explicit and 
preserve correctness through bounded "last-mile" mechanisms. We focus on two 
complementary primitives. First, PGM++ revisits error-bounded learned indexing 
through the PGM-Index: it sharpens the structural explanation of why PGM 
hierarchies are flat in practice (establishing a sub-logarithmic height bound of 
O(log log N)), and improves predictability by redesigning the internal error-bounded 
search operator (hybrid branchless search, layer bypassing) and introducing a calibrated 
cost model for parameter tuning under space budgets. Second, SegPQ extends 
learned principles to high-dimensional vector retrieval by losslessly compressing 
product-quantization (PQ) codebooks: it treats the codebook as a compressible 
monotone representation (via a bijective projection and reordering) and encodes it as 
an epsilon-bounded piecewise linear model plus compact residuals, with a 
closed-form OPT that targets near-entropy compression while keeping end-to-end 
approximate nearest-neighbor (ANN) processing effectively unchanged.
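The shared idea behind both primitives can be sketched in a few lines of Python. 
The sketch below is purely illustrative and not the thesis's implementation: it uses a 
naive greedy fit (the actual PGM-Index uses an optimal streaming convex-hull 
construction), and all function names are invented for this example. It shows the 
core contract of an epsilon-bounded piecewise-linear index: the fit guarantees every 
key's true position is within eps of the model's prediction, so a lookup only needs a 
bounded "last-mile" binary search.

```python
import bisect

def fit_segments(keys, eps):
    """Greedy epsilon-bounded piecewise-linear fit (illustrative, O(n^2);
    the real PGM-Index uses an optimal linear-time streaming algorithm).
    keys must be sorted and strictly increasing.
    Returns a list of (first_key, slope, start_pos) segments."""
    segments, i, n = [], 0, len(keys)
    while i < n:
        slope, j = 0.0, i
        while j + 1 < n:
            # candidate line through (keys[i], i) and (keys[j+1], j+1)
            cand = (j + 1 - i) / (keys[j + 1] - keys[i])
            # extend the segment only if every covered key stays within eps
            if all(abs(i + cand * (keys[k] - keys[i]) - k) <= eps
                   for k in range(i, j + 2)):
                slope, j = cand, j + 1
            else:
                break
        segments.append((keys[i], slope, i))
        i = j + 1
    return segments

def lookup(keys, segments, eps, q):
    """Predict q's position with its segment, then finish with a bounded
    "last-mile" binary search over at most 2 * eps + 3 candidate slots."""
    firsts = [s[0] for s in segments]            # could be cached once
    s = max(bisect.bisect_right(firsts, q) - 1, 0)
    first_key, slope, start = segments[s]
    pred = round(start + slope * (q - first_key))
    lo = max(pred - eps - 1, 0)                  # +-(eps + 1) absorbs rounding
    hi = min(pred + eps + 1, len(keys) - 1)
    pos = bisect.bisect_left(keys, q, lo, hi + 1)
    return pos if pos < len(keys) and keys[pos] == q else -1
```

Because the model's error is explicit, the final search window is fixed at 
construction time regardless of data size; this is the sense in which the 
approximation error is "made explicit" while correctness is preserved.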

Together, these contributions demonstrate that learned structures can be made 
robust enough to serve as deployable building blocks for modern systems that 
demand predictable performance under strict memory constraints. Large-scale 
state-management workloads provide the motivating setting, where tight budgets, 
skew, and high-entropy identifiers highlight the robustness requirements for learned 
primitives.


Date:                   Thursday, 5 March 2026

Time:                   2:30pm - 4:30pm

Venue:                  Room 3494
                        Lift 25/26

Committee Members:      Prof. Lei Chen (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Prof. Qian Zhang