PhD Thesis Proposal Defence
Title: "Learned Structures: Data-Aware Primitives for State Management"
by
Mr. Siyuan HAN
Abstract:
The exponential growth of data-intensive applications has precipitated a
fundamental hardware bottleneck known as the Storage Wall: a widening disparity
between processor throughput and storage/memory access costs. This bottleneck is
particularly acute in large-scale state management systems, where the working set
can exceed main memory and performance becomes sensitive to both metadata
footprint and tail latency.
Traditional general-purpose indexes, such as B-Trees and Log-Structured
Merge-Trees (LSM-Trees), are designed around rigid discrete invariants; when applied to massive
or skewed datasets, they incur significant memory overheads and amplification
effects that hinder scalability under tight space budgets.
This dissertation advances a learned-first view of access methods via learned
primitives: data-aware structures that make the approximation error explicit and
preserve correctness through bounded "last-mile" mechanisms. We focus on two
complementary primitives. First, PGM++ revisits error-bounded learned indexing
through the PGM-Index: it tightens the structural understanding of why PGM
hierarchies are flat in practice (with a sub-logarithmic height bound O(log log N)),
and improves predictability by redesigning the internal error-bounded search
operator (hybrid branchless search, layer bypassing) and introducing a calibrated
cost model for parameter tuning under space budgets. Second, SegPQ extends
learned principles to high-dimensional vector retrieval by losslessly compressing
product quantization (PQ) codebooks: it treats the codebook as a compressible
monotone representation (via a bijective projection and reordering) and represents
it as an epsilon-bounded piecewise linear model plus compact residuals, with a
closed-form OPT that targets near-entropy compression while keeping end-to-end
approximate nearest neighbor (ANN) processing effectively unchanged.
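To make the core idea concrete, here is a minimal illustrative sketch (not code from the thesis) of an error-bounded learned index: a simple greedy segmentation stands in for the optimal piecewise-linear construction used by PGM-style indexes, each segment predicts a key's rank to within +/-eps, and correctness is restored by a bounded "last-mile" search. The `EPS` value and all function names are assumptions for illustration.

```python
import bisect

EPS = 4  # illustrative error bound; in practice this is tuned under a space budget

def build_segments(keys, eps=EPS):
    """Toy one-pass segmentation over sorted, distinct keys: each segment is a
    line through its endpoints whose predicted rank stays within +/-eps of the
    true rank (a simplified stand-in for optimal PLA construction)."""
    segments, i, n = [], 0, len(keys)
    while i < n:
        j = i + 1
        while j < n:
            slope = (j - i) / (keys[j] - keys[i])
            if any(abs(i + slope * (keys[k] - keys[i]) - k) > eps
                   for k in range(i, j + 1)):
                break
            j += 1
        end = j - 1
        slope = (end - i) / (keys[end] - keys[i]) if end > i else 0.0
        segments.append((keys[i], i, slope))  # (first key, its rank, slope)
        i = j
    return segments

def lookup(keys, segments, q, eps=EPS):
    """Predict q's rank with the covering segment, then finish with a bounded
    'last-mile' binary search inside [pred - eps, pred + eps]."""
    starts = [s[0] for s in segments]
    key0, rank0, slope = segments[max(0, bisect.bisect_right(starts, q) - 1)]
    pred = round(rank0 + slope * (q - key0))
    lo = max(0, pred - eps)
    hi = max(lo, min(len(keys), pred + eps + 1))
    pos = lo + bisect.bisect_left(keys[lo:hi], q)
    return pos if pos < len(keys) and keys[pos] == q else None
```

Because every segment honors the error bound by construction, the final search window has at most 2*eps + 1 candidates, which is what makes the lookup cost predictable regardless of data skew.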
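The "branchless" search mentioned above can also be sketched. The snippet below is an illustration of the general predication technique, not the thesis's hybrid operator: each step narrows the search window with an arithmetic select (a boolean used as 0/1) rather than a data-dependent branch. Python can only mimic this; the latency benefit comes from conditional-move instructions in compiled code.

```python
def branchless_lower_bound(a, q):
    """Predication-style binary search over a sorted list: shrink the window
    with an arithmetic select instead of an if/else on the comparison.
    Returns the same index as bisect.bisect_left(a, q)."""
    if not a:
        return 0
    base, n = 0, len(a)
    while n > 1:
        half = n >> 1
        base += half * (a[base + half - 1] < q)  # select, not branch
        n -= half
    return base + (a[base] < q)
```

For example, `branchless_lower_bound([1, 3, 5, 7], 4)` returns `2`, the insertion point that `bisect.bisect_left` would report.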
Together, these contributions demonstrate that learned structures can be made
robust enough to serve as deployable building blocks for modern systems that
demand predictable performance under strict memory constraints. Large-scale
state-management workloads serve as a motivating background setting where tight
budgets, skew, and high-entropy identifiers highlight the robustness requirements for
learned primitives.
Date: Thursday, 5 March 2026
Time: 2:30pm - 4:30pm
Venue: Room 3494
Lift 25/26
Committee Members: Prof. Lei Chen (Supervisor)
Prof. Qiong Luo (Chairperson)
Prof. Qian Zhang