More about HKUST
Fast and Scalable Indexing Methods for Metric Space Retrieval
PhD Thesis Proposal Defence
Title: "Fast and Scalable Indexing Methods for Metric Space Retrieval"
by
Mr. Zengyang GONG
Abstract:
In the era of data explosion, characterized by increasing volume, velocity,
and variety, efficiently managing diverse data types is a critical
challenge. Metric spaces offer a powerful and unified framework to address
data heterogeneity, supporting complex types such as spatio-temporal data
and high-dimensional vectors. By defining a distance metric that satisfies
the triangle inequality, metric spaces facilitate effective information
retrieval across a range of data representations. However, as datasets grow
exponentially, efficient query processing becomes a significant bottleneck.
To address this, indexing techniques are essential for accelerating
similarity searches and enabling scalable query operations. Advanced
indexing methods can dramatically improve retrieval performance in metric
spaces, making it feasible to handle real-world applications in domains
ranging from real-time location-based services to high-dimensional vector
search for Retrieval-Augmented Generation (RAG).
In the first part of this thesis, we focus on spatio-temporal data by
modeling real-world road networks as time-dependent graphs, in which each
edge is associated with a time-dependent travel cost function. To accelerate
queries under resource constraints, we propose a novel shortcut selection
problem based on tree decomposition, which transforms the original graph
into a tree structure. This problem is proven to be NP-hard, and we present
a solution with theoretical performance guarantees. Experimental results
demonstrate that the selected shortcuts significantly reduce query time in
tree-based indexes.
In the second part of this thesis, we examine the vector data, which has
become increasingly prominent for their abilities to support data storage
and retrieval-augmented generation, with the rise of large language models
(LLMs). While hierarchical graph-based indexing is widely adopted in current
systems, query efficiency is often limited by exhaustive traversals across
multiple levels of proximity graphs. To overcome this, we propose a novel
indexing method that selectively skips unnecessary levels. Our approach
integrates a hierarchical vector compression method that progressively
reduces vector dimensionality at each level while maintaining approximation
bounds for distance calculations. Furthermore, we introduce a novel data
structure that learns from inter-level distance approximations to predict
and safely skip levels during query processing. Experiments conducted on
widely used benchmark datasets show that our method achieves significant
speed-ups compared with state-of-the-art techniques.
In the final part of the thesis, we outline potential directions for future
work, aiming to further enhance the efficiency and scalability of metric
space indexing techniques in emerging applications.
Date: Monday, 30 June 2025
Time: 10:30am - 12:30pm
Venue: Room 3494
Lifts 25/26
Committee Members: Prof. Lei Chen (Supervisor)
Prof. Qiong Luo (Chairperson)
Dr. Binhang Yuan