ON EFFECTIVE AND EFFICIENT LEARNED DATA STRUCTURE DESIGN

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "ON EFFECTIVE AND EFFICIENT LEARNED DATA STRUCTURE DESIGN"

By

Mr. Qiyu LIU


Abstract

Over the past decades, machine learning techniques has achieved a breakthrough 
and make people re-examine the traditional data management challenges such as 
data indexing, cardinality estimation, query optimization, etc. In the seminal 
paper of Kraska et al., the notion of "learned index" was first introduced to 
refer to the new data structure design paradigm by using a combination of 
machine learning models (to provide fast prediction) and traditional data 
structures (to handle errors caused by ML models).

Inspired by the impressive performance, optimizations and extensions to the 
original learned indices have caught the attention from both academia and 
industry. In this thesis, we introduce three of our recent works on designing 
learned data structures to efficiently process fundamental queries in big data 
management and analytics, including the streaming membership query, approximate 
multi-dimensional range count query, and Hamming space similarity query.

First, for membership queries over data streams, we devise a novel Bloom filter 
variant called Stable Learned Bloom filter which addresses the performance 
decay issue of Bloom filters on intensive insertion workloads by combining 
classifiers with updatable backup filters. Second, we study the application of 
learned index as spatial data synopsis, which is widely adopted to speed-up 
query processing over large spatial databases. We propose a learned data 
synopsis technique named Learned Multi-dimensional Histogram (LHist). Compared 
with the traditional data synopsis techniques, LHist is fully data-driven, 
easy-to-implement, and has the potential to achieve better storage-accuracy 
trade-off. Third, we investigate the problem of indexing large-scale binary 
databases to support fast Hamming distance-based similarity queries, which is 
fundamental in image/video search applications. We propose HAP, an efficient 
ML-enhanced Hamming space index framework to support both Hamming range queries 
and kNN queries.

Extensive experimental studies on large-scale real-world datasets and synthetic 
benchmarks reveal that our learned data structures can outperform the 
corresponding state-of-the-art baselines in terms of the storage cost and query 
processing efficiency.


Date:			Friday, 28 January 2022

Time:			9:00am - 11:00am

Zoom Meeting: 
https://hkust.zoom.us/j/98996333457?pwd=YWRxMTA5S1RWeG1oSERESUd3bzRydz09

Chairperson:		Prof. Gang WANG (CIVL)

Committee Members:	Prof. Lei CHEN (Supervisor)
 			Prof. Qiong LUO
 			Prof. Xiaofang ZHOU
 			Prof. Can YANG (MATH)
 			Prof. Jianliang XU (HKBU)


**** ALL are Welcome ****