From Locality-Sensitive Hashing to Similarity Search in Vector Databases

PhD Thesis Proposal Defence


Title: "From Locality-Sensitive Hashing to Similarity Search in Vector 
Databases"

by

Miss Yao TIAN


Abstract:

Workloads in the post-LLM era (e.g., RAG) frequently involve high-dimensional 
embedding vectors, creating a growing need for effective vector data 
management and efficient similarity-based retrieval capabilities, such as 
range/nearest neighbor search, multi-vector search, query with filter and so 
on. In this talk, I will discuss my recent work that pushes the theoretical 
and practical boundaries of similarity search.

First, I will present a locality-sensitive hashing (LSH) framework for 
high-dimensional approximate nearest neighbor (ANN) search problem, which 
offers the best-known query complexity in theory (ICDE '22), along with 
several optimizations to improve query quality and accelerate lookups in 
practice (TKDE '23). In the framework, ANN search is converted into several 
exact range queries in a lower-dimensional space (e.g., 15-d), which presents 
a dilemma: it is too high for low-dimensional indexes like R-trees to perform 
well, yet too low for high-dimensional indexing methods. To bridge this gap, 
I will introduce a disk-based learned index that reimages indexing as a model 
mapping keys to positions, effectively capturing data patterns to improve I/O 
efficiency and reduce memory usage (TKDE '23). Third, I will discuss a 
database solution to mitigate the high costs of using LLM (both in terms of 
monetary costs and response latency). This involves a filter that identifies 
recurring workloads over arbitrary user-defined sliding windows, enabling 
stored responses to be retrieved later and bypassing the need for repeated 
LLM calls (SIGMOD '24).

Lastly, I will outline future research directions (IEEE Data Eng. Bull. '23) 
and my ongoing work: 1) an index for multi-vector search, addressing the 
limited expressivity of single-vector embeddings; 2) an LSH-enhanced 
graph-based index to accelerate both index construction and query processing; 
and 3) a query optimizer that seamlessly integrates vectors and relational 
database features. Together, these efforts drive the advancement of databases 
to meet the evolving demands of the post-LLM era.


Date:                   Thursday, 14 November 2024

Time:                   2:00pm - 4:00pm

Venue:                  Room 5501
                        Lifts 25/26

Committee Members:      Prof. Xiaofang Zhou (Supervisor)
                        Prof. Ke Yi (Chairperson)
                        Prof. Cunsheng Ding
                        Prof. Dimitris Papadias