Fast and Scalable Indexing Methods for Metric Space Retrieval

PhD Thesis Proposal Defence


Title: "Fast and Scalable Indexing Methods for Metric Space Retrieval"

by

Mr. Zengyang GONG


Abstract:

In the era of data explosion, characterized by increasing volume, velocity, 
and variety, efficiently managing diverse data types is a critical 
challenge. Metric spaces offer a powerful and unified framework to address 
data heterogeneity, supporting complex types such as spatio-temporal data 
and high-dimensional vectors. By defining a distance metric that satisfies 
the triangle inequality, metric spaces facilitate effective information 
retrieval across a range of data representations. However, as datasets grow 
exponentially, efficient query processing becomes a significant bottleneck. 
To address this, indexing techniques are essential for accelerating 
similarity searches and enabling scalable query operations. Advanced 
indexing methods can dramatically improve retrieval performance in metric 
spaces, making it feasible to handle real-world applications in domains 
ranging from real-time location-based services to high-dimensional vector 
search for Retrieval-Augmented Generation (RAG).

In the first part of this thesis, we focus on spatio-temporal data by 
modeling real-world road networks as time-dependent graphs, in which each 
edge is associated with a time-dependent travel cost function. To accelerate 
queries under resource constraints, we propose a novel shortcut selection 
problem based on tree decomposition, which transforms the original graph 
into a tree structure. This problem is proven to be NP-hard, and we present 
a solution with theoretical performance guarantees. Experimental results 
demonstrate that the selected shortcuts significantly reduce query time in 
tree-based indexes.

In the second part of this thesis, we examine the vector data, which has 
become increasingly prominent for their abilities to support data storage 
and retrieval-augmented generation, with the rise of large language models 
(LLMs). While hierarchical graph-based indexing is widely adopted in current 
systems, query efficiency is often limited by exhaustive traversals across 
multiple levels of proximity graphs. To overcome this, we propose a novel 
indexing method that selectively skips unnecessary levels. Our approach 
integrates a hierarchical vector compression method that progressively 
reduces vector dimensionality at each level while maintaining approximation 
bounds for distance calculations. Furthermore, we introduce a novel data 
structure that learns from inter-level distance approximations to predict 
and safely skip levels during query processing. Experiments conducted on 
widely used benchmark datasets show that our method achieves significant 
speed-ups compared with state-of-the-art techniques.

In the final part of the thesis, we outline potential directions for future 
work, aiming to further enhance the efficiency and scalability of metric 
space indexing techniques in emerging applications.


Date:                   Monday, 30 June 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Lei Chen (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Dr. Binhang Yuan