Efficient High-Dimensional Search in Modern Vector Databases

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Efficient High-Dimensional Search in Modern Vector Databases"

By

Mr. Xi ZHAO


Abstract:

The rapid expansion of vector datasets, driven by the widespread adoption of 
embedding models for processing unstructured data, has significantly amplified 
the importance of high-dimensional search and the role of vector databases. 
However, the vector search problem in high-dimensional spaces is inherently 
computationally expensive due to the curse of dimensionality. As a result, 
finding approximate solutions has become a more efficient and viable 
alternative. Modern vector databases commonly address two types of similarity 
search problems: Approximate Nearest Neighbor Search (ANNS) and Approximate 
Maximum Inner Product Search (AMIPS). Various methods have been developed for 
efficient vector search in modern vector databases, including locality-sensitive 
hashing (LSH) based approaches and graph-based techniques.

This thesis investigates the performance of LSH-based and graph-based methods, 
evaluating factors such as index cost (index size and indexing time), query 
latency, and query accuracy. To support efficient vector search in modern vector 
databases while maintaining acceptable indexing costs, we propose a series of 
vector indexing techniques that advance both the theoretical and practical 
boundaries of similarity search. Our contributions include: (i) DB-LSH, a 
query-centric LSH framework for ANNS that reduces index size while preserving 
sub-linear query complexity and improving candidate quality through dynamic 
bucketing; (ii) LSH-APG, a hybrid indexing strategy that significantly reduces the 
construction cost of approximate proximity graphs for ANNS by integrating a 
lightweight LSH framework; (iii) FARGO, an LSH-based AMIPS framework that 
improves accuracy and efficiency through asymmetric transformation, global 
multi-probing, and adaptive termination; (iv) NAP, a norm-adaptive partitioning 
strategy with hybrid index structures tailored to different norm regions for 
accurate and efficient AMIPS; and (v) SOSIA, an innovative framework for AMIPS 
over sparse vectors that converts sparse vectors into a binary space and provides 
an unbiased estimator for the inner product between any two vectors.

Together, these efforts address emerging challenges in vector databases, shaped 
by the rapid advancements in large language models (LLMs), and pave the way for 
the future of intelligent vector management, seamlessly integrated with modern AI 
workloads.



Date:                   Thursday, 9 April 2026

Time:                   9:00am - 11:00am

Venue:                  Room 2132C
                        Lift 22

Chairman:               Dr. Qiming SHAO (ECE)

Committee Members:      Prof. Xiaofang ZHOU (Supervisor)
                        Prof. Raymond WONG
                        Prof. Ke YI
                        Dr. Sirui HAN (EMIA)
                        Prof. Yunjun GAO (Zhejiang University)