More about HKUST
ON EFFECTIVE AND EFFICIENT LEARNED DATA STRUCTURE DESIGN
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "ON EFFECTIVE AND EFFICIENT LEARNED DATA STRUCTURE DESIGN" By Mr. Qiyu LIU Abstract Over the past decades, machine learning techniques has achieved a breakthrough and make people re-examine the traditional data management challenges such as data indexing, cardinality estimation, query optimization, etc. In the seminal paper of Kraska et al., the notion of "learned index" was first introduced to refer to the new data structure design paradigm by using a combination of machine learning models (to provide fast prediction) and traditional data structures (to handle errors caused by ML models). Inspired by the impressive performance, optimizations and extensions to the original learned indices have caught the attention from both academia and industry. In this thesis, we introduce three of our recent works on designing learned data structures to efficiently process fundamental queries in big data management and analytics, including the streaming membership query, approximate multi-dimensional range count query, and Hamming space similarity query. First, for membership queries over data streams, we devise a novel Bloom filter variant called Stable Learned Bloom filter which addresses the performance decay issue of Bloom filters on intensive insertion workloads by combining classifiers with updatable backup filters. Second, we study the application of learned index as spatial data synopsis, which is widely adopted to speed-up query processing over large spatial databases. We propose a learned data synopsis technique named Learned Multi-dimensional Histogram (LHist). Compared with the traditional data synopsis techniques, LHist is fully data-driven, easy-to-implement, and has the potential to achieve better storage-accuracy trade-off. Third, we investigate the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity queries, which is fundamental in image/video search applications. We propose HAP, an efficient ML-enhanced Hamming space index framework to support both Hamming range queries and kNN queries. Extensive experimental studies on large-scale real-world datasets and synthetic benchmarks reveal that our learned data structures can outperform the corresponding state-of-the-art baselines in terms of the storage cost and query processing efficiency. Date: Friday, 28 January 2022 Time: 9:00am - 11:00am Zoom Meeting: https://hkust.zoom.us/j/98996333457?pwd=YWRxMTA5S1RWeG1oSERESUd3bzRydz09 Chairperson: Prof. Gang WANG (CIVL) Committee Members: Prof. Lei CHEN (Supervisor) Prof. Qiong LUO Prof. Xiaofang ZHOU Prof. Can YANG (MATH) Prof. Jianliang XU (HKBU) **** ALL are Welcome ****