More about HKUST
Fast and Scalable Indexing Methods for Metric Space Retrieval
PhD Thesis Proposal Defence Title: "Fast and Scalable Indexing Methods for Metric Space Retrieval" by Mr. Zengyang GONG Abstract: In the era of data explosion, characterized by increasing volume, velocity, and variety, efficiently managing diverse data types is a critical challenge. Metric spaces offer a powerful and unified framework to address data heterogeneity, supporting complex types such as spatio-temporal data and high-dimensional vectors. By defining a distance metric that satisfies the triangle inequality, metric spaces facilitate effective information retrieval across a range of data representations. However, as datasets grow exponentially, efficient query processing becomes a significant bottleneck. To address this, indexing techniques are essential for accelerating similarity searches and enabling scalable query operations. Advanced indexing methods can dramatically improve retrieval performance in metric spaces, making it feasible to handle real-world applications in domains ranging from real-time location-based services to high-dimensional vector search for Retrieval-Augmented Generation (RAG). In the first part of this thesis, we focus on spatio-temporal data by modeling real-world road networks as time-dependent graphs, in which each edge is associated with a time-dependent travel cost function. To accelerate queries under resource constraints, we propose a novel shortcut selection problem based on tree decomposition, which transforms the original graph into a tree structure. This problem is proven to be NP-hard, and we present a solution with theoretical performance guarantees. Experimental results demonstrate that the selected shortcuts significantly reduce query time in tree-based indexes. In the second part of this thesis, we examine the vector data, which has become increasingly prominent for their abilities to support data storage and retrieval-augmented generation, with the rise of large language models (LLMs). While hierarchical graph-based indexing is widely adopted in current systems, query efficiency is often limited by exhaustive traversals across multiple levels of proximity graphs. To overcome this, we propose a novel indexing method that selectively skips unnecessary levels. Our approach integrates a hierarchical vector compression method that progressively reduces vector dimensionality at each level while maintaining approximation bounds for distance calculations. Furthermore, we introduce a novel data structure that learns from inter-level distance approximations to predict and safely skip levels during query processing. Experiments conducted on widely used benchmark datasets show that our method achieves significant speed-ups compared with state-of-the-art techniques. In the final part of the thesis, we outline potential directions for future work, aiming to further enhance the efficiency and scalability of metric space indexing techniques in emerging applications. Date: Monday, 30 June 2025 Time: 10:00am - 12:00noon Venue: Room 3494 Lifts 25/26 Committee Members: Prof. Lei Chen (Supervisor) Prof. Qiong Luo (Chairperson) Dr. Binhang Yuan