More about HKUST
From Locality-Sensitive Hashing to Similarity Search in Vector Databases
PhD Thesis Proposal Defence
Title: "From Locality-Sensitive Hashing to Similarity Search in Vector
Databases"
by
Miss Yao TIAN
Abstract:
Workloads in the post-LLM era (e.g., RAG) frequently involve high-dimensional
embedding vectors, creating a growing need for effective vector data
management and efficient similarity-based retrieval capabilities, such as
range/nearest neighbor search, multi-vector search, query with filter and so
on. In this talk, I will discuss my recent work that pushes the theoretical
and practical boundaries of similarity search.
First, I will present a locality-sensitive hashing (LSH) framework for
high-dimensional approximate nearest neighbor (ANN) search problem, which
offers the best-known query complexity in theory (ICDE '22), along with
several optimizations to improve query quality and accelerate lookups in
practice (TKDE '23). In the framework, ANN search is converted into several
exact range queries in a lower-dimensional space (e.g., 15-d), which presents
a dilemma: it is too high for low-dimensional indexes like R-trees to perform
well, yet too low for high-dimensional indexing methods. To bridge this gap,
I will introduce a disk-based learned index that reimages indexing as a model
mapping keys to positions, effectively capturing data patterns to improve I/O
efficiency and reduce memory usage (TKDE '23). Third, I will discuss a
database solution to mitigate the high costs of using LLM (both in terms of
monetary costs and response latency). This involves a filter that identifies
recurring workloads over arbitrary user-defined sliding windows, enabling
stored responses to be retrieved later and bypassing the need for repeated
LLM calls (SIGMOD '24).
Lastly, I will outline future research directions (IEEE Data Eng. Bull. '23)
and my ongoing work: 1) an index for multi-vector search, addressing the
limited expressivity of single-vector embeddings; 2) an LSH-enhanced
graph-based index to accelerate both index construction and query processing;
and 3) a query optimizer that seamlessly integrates vectors and relational
database features. Together, these efforts drive the advancement of databases
to meet the evolving demands of the post-LLM era.
Date: Thursday, 14 November 2024
Time: 2:00pm - 4:00pm
Venue: Room 5501
Lifts 25/26
Committee Members: Prof. Xiaofang Zhou (Supervisor)
Prof. Ke Yi (Chairperson)
Prof. Cunsheng Ding
Prof. Dimitris Papadias