More about HKUST
From Locality-Sensitive Hashing to Similarity Search in Vector Databases
PhD Thesis Proposal Defence Title: "From Locality-Sensitive Hashing to Similarity Search in Vector Databases" by Miss Yao TIAN Abstract: Workloads in the post-LLM era (e.g., RAG) frequently involve high-dimensional embedding vectors, creating a growing need for effective vector data management and efficient similarity-based retrieval capabilities, such as range/nearest neighbor search, multi-vector search, query with filter and so on. In this talk, I will discuss my recent work that pushes the theoretical and practical boundaries of similarity search. First, I will present a locality-sensitive hashing (LSH) framework for high-dimensional approximate nearest neighbor (ANN) search problem, which offers the best-known query complexity in theory (ICDE '22), along with several optimizations to improve query quality and accelerate lookups in practice (TKDE '23). In the framework, ANN search is converted into several exact range queries in a lower-dimensional space (e.g., 15-d), which presents a dilemma: it is too high for low-dimensional indexes like R-trees to perform well, yet too low for high-dimensional indexing methods. To bridge this gap, I will introduce a disk-based learned index that reimages indexing as a model mapping keys to positions, effectively capturing data patterns to improve I/O efficiency and reduce memory usage (TKDE '23). Third, I will discuss a database solution to mitigate the high costs of using LLM (both in terms of monetary costs and response latency). This involves a filter that identifies recurring workloads over arbitrary user-defined sliding windows, enabling stored responses to be retrieved later and bypassing the need for repeated LLM calls (SIGMOD '24). Lastly, I will outline future research directions (IEEE Data Eng. Bull. '23) and my ongoing work: 1) an index for multi-vector search, addressing the limited expressivity of single-vector embeddings; 2) an LSH-enhanced graph-based index to accelerate both index construction and query processing; and 3) a query optimizer that seamlessly integrates vectors and relational database features. Together, these efforts drive the advancement of databases to meet the evolving demands of the post-LLM era. Date: Thursday, 14 November 2024 Time: 2:00pm - 4:00pm Venue: Room 5501 Lifts 25/26 Committee Members: Prof. Xiaofang Zhou (Supervisor) Prof. Ke Yi (Chairperson) Prof. Cunsheng Ding Prof. Dimitris Papadias