More about HKUST
Towards Efficient and Scalable Retrieval in Metric Spaces
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "Towards Efficient and Scalable Retrieval in Metric Spaces" By Mr. Zengyang GONG Abstract: In the era of data explosion, characterized by increasing volume, velocity, and variety, efficiently managing diverse data types is a critical challenge. Metric spaces provide a powerful and unified framework for handling diverse data types, including spatio-temporal data and high-dimensional vectors, by defining distance metrics that satisfy the triangle inequality. This property enables effective information retrieval across various data representations. However, as data scales exponentially, efficient query processing becomes a major bottleneck, necessitating scalable retrieval methods. Advanced retrieval techniques are essential for enabling real-world applications, from real-time location-based services to high-dimensional vector search for retrieval-augmented generation (RAG). This thesis investigates retrieval methods across various representative real-world search problems, each reflecting distinct application scenarios. First, for the origin-destination search problem, we model real-world road networks as time-dependent graphs and introduce a novel shortcut selection problem based on tree decomposition. This approach transforms the graph into a tree structure, and we prove the selection problem is NP-hard. We further present a solution with theoretical performance guarantees to accelerate retrieval. Second, for the optimal routing problem in shared mobility services, we focus on minimizing additional travel time required to serve new requests (e.g. passengers or parcels) for workers (e.g. drivers or couriers). To improve efficiency, we design a novel data summarization model that reduces retrieval time from cubic to linear and enables constant-time delay queries along routes. Third, for the similarity search problem, which has gained importance with the rise of large language models (LLMs) and retrieval-augmented generation. We first propose a dimensionality reduction technique that preserves distance approximation guarantees between high-dimensional vectors. Additionally, we develop a data structure that safely prunes redundant distance calculations, further accelerating retrieval. Finally, extensive experiments on real datasets demonstrate that our proposed solutions consistently outperform state-of-the-art algorithms by significant margins. Date: Friday, 10 October 2025 Time: 9:00am - 11:00am Venue: Room 2408 Lifts 17/18 Chairman: Dr. June SHI (MARK) Committee Members: Prof. Lei CHEN (Supervisor) Prof. Qiong LUO Prof. Raymond WONG Prof. Can YANG (MATH) Prof. Jianliang XU (HKBU)