Managing Data for Efficient Learning over Data Streams
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Managing Data for Efficient Learning over Data Streams"

By

Mr. Yiming LI

Abstract:

Deep learning over data streams is critical for many real-world applications such as social network services, personalized recommender systems, and future link prediction. This thesis delves into data management techniques aimed at enhancing the efficiency of deep learning across two distinct types of data streams: graph data (e.g., social networks) and vector data (e.g., images and tabular data). On the one hand, temporal graph neural networks (T-GNNs) are powerful representation learning methods and have achieved remarkable effectiveness on dynamic graphs. Unfortunately, T-GNNs still suffer from high time complexity, which makes them impractical to scale to large dynamic graphs. On the other hand, although many advanced models (e.g., ResNet) have been developed for vector data, it remains challenging to update model parameters efficiently and effectively over continuously arriving data.

In this thesis, we first investigate two data management techniques for improving the efficiency of T-GNNs with theoretical guarantees. In the first work, we propose Orca, a novel framework that accelerates T-GNN training by caching and reusing intermediate embeddings. We design an optimal cache replacement policy that not only improves the efficiency of T-GNN training by maximizing the cache hit ratio but also reduces approximation error by avoiding the reuse of extremely stale embeddings. We also develop a thorough theoretical analysis of the approximation error introduced by our reuse scheme and offer rigorous convergence guarantees. In the second work, we establish a theoretical connection between the temporal message passing scheme adopted by T-GNNs and temporal random walks on dynamic graphs. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence scores of nodes on evolving graphs, and we develop an efficient single-scan algorithm that answers top-k T-PPR queries with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates T-GNN computation by directly aggregating the features of the most prominent temporal neighbors returned by the top-k T-PPR query.

Second, we present Camel, a system that improves the efficiency of incremental learning over streams of vector data. Camel comprises two independent data management components: coreset selection and buffer update. To accelerate model training, Camel selects a coreset from each streaming data batch for the model update. Because selecting a coreset with worst-case guarantees is NP-hard, we reformulate coreset selection as a submodular maximization problem by deriving an upper bound on the objective function. To mitigate catastrophic forgetting, Camel maintains a buffer of representative past samples as new data arrive, and it quantizes the numerical data in the buffer via a quantile sketch to reduce the memory footprint.

We demonstrate the efficiency and effectiveness of our proposed methods against state-of-the-art approaches through extensive experiments on real-world datasets. Finally, we conclude this thesis with future research directions.
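For intuition, the submodular reformulation of coreset selection admits the standard greedy strategy with its (1 - 1/e) approximation bound. The Python sketch below is illustrative only: the abstract does not specify Camel's derived upper bound, so a classic facility-location objective with an RBF similarity stands in for it, and the helper name greedy_coreset is hypothetical.

# Illustrative sketch only: greedy submodular maximization for coreset
# selection. A facility-location objective substitutes for Camel's actual
# derived upper bound, which is not given in this abstract.
import numpy as np

def greedy_coreset(X, k):
    """Greedily select k row indices of the batch X (n x d) as a coreset."""
    n = X.shape[0]
    # Pairwise RBF similarities; any non-negative similarity matrix works here.
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sim = np.exp(-d2 / (d2.mean() + 1e-12))

    selected = []
    cover = np.zeros(n)  # best similarity of each point to the current coreset
    for _ in range(k):
        # Marginal gain of adding each candidate: total improvement in coverage.
        gains = np.maximum(sim, cover[None, :]).sum(axis=1) - cover.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        cover = np.maximum(cover, sim[best])
    return selected

# Example: reduce a 512-sample streaming batch to a 32-sample coreset.
batch = np.random.randn(512, 64)
coreset_idx = greedy_coreset(batch, k=32)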
Date: Friday, 15 December 2023

Time: 2:00pm - 4:00pm

Venue: Room 4472
       Lifts 25/26

Chairman: Prof. Xiangtong QI (IEDA)

Committee Members: Prof. Lei CHEN (Supervisor)
                   Prof. Ke YI
                   Prof. Xiaofang ZHOU
                   Prof. George YUAN (ECE)
                   Prof. Guoliang LI (Tsinghua Univ.)

**** ALL are Welcome ****