More about HKUST
Cardinality Estimation for Similarity Search: A Locality Sensitive Hashing Approach
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "Cardinality Estimation for Similarity Search: A Locality Sensitive
Hashing Approach"
By
Mr. Zhonghan CHEN
Abstract:
In relational databases, cardinality estimation has been extensively studied
for several decades, serving as a crucial component of query optimization in
database systems. Specifically, it aims to efficiently estimate the number of
output rows produced by relational operators, such as selection or join,
without actual query execution.
With the recent advancement of vector databases and semantic operators,
cardinality estimation for high-dimensional similarity search has emerged as a
prominent research topic. In this context, the task is to efficiently
estimate the number of points that fall within a specified threshold distance
from a given query point. Existing cardinality estimators for similarity
search, particularly in high-dimensional spaces, are predominantly
learning-based frameworks, which, however, suffer from several limitations,
including unstable prediction accuracy, dependence on high quality training
data, and prohibitively long offline construction times.
To tackle disadvantages aforementioned, we design a framework that is
lightweight, easy to construct, and capable of providing accurate estimations
with satisfying online efficiency. We leverage locality-sensitive hashing
(LSH) to partition the vector space while preserving distance proximity.
Building on this, we adopt the principles of classical multi-probe LSH to
adaptively explore neighboring buckets, accounting for distance thresholds of
varying magnitudes. To improve online efficiency, we employ progressive
sampling to reduce the number of distance computations and utilize asymmetric
distance computation in product quantization to accelerate distance
calculations in high-dimensional spaces. In addition to handling static
datasets, our framework includes updating algorithm designed to efficiently
support large-scale dynamic scenarios of data updates. Experiments
demonstrate that our methods can accurately estimate the cardinality of
similarity queries, yielding satisfying efficiency, for both static and
dynamic scenarios.
Date: Wednesday, 10 December 2025
Time: 10:00am - 12:00noon
Venue: Room 2130A
Lift 22
Chairman: Prof. Raymond WONG
Committee Members: Prof. Xiaofang ZHOU (Supervisor)
Dr. May FUNG