Managing Data for Efficient Learning over Data Streams

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Managing Data for Efficient Learning over Data Streams"

By

Mr. Yiming LI


Abstract:

Deep learning over data streams is critical for many real-world applications
such as social network services, personalized recommender systems, and future
link prediction. This thesis delves into data management techniques aimed at
enhancing the efficiency of deep learning across two distinct types of data
streams: graph data (e.g., social networks) and vector data (e.g., images and
tabular data). On the one hand, temporal graph neural networks (T-GNNs) are
powerful representation learning methods and have achieved remarkable
effectiveness on dynamic graphs. Unfortunately, T-GNNs continue to face the
challenge of high time complexity, which makes them impractical for scaling to
large dynamic graphs. On the other hand, although many advanced models (e.g.,
ResNet) have been developed for vector data, it remains challenging to
efficiently and effectively update model parameters over continuously arriving
data.

In this thesis, we first investigate two data management techniques for
improving the efficiency of T-GNNs with theoretical guarantees. In the first
work, we propose Orca, a novel framework that accelerates T-GNN training by
caching and reusing intermediate embeddings. We design an optimal cache
replacement policy that not only improves the efficiency of T-GNN training by
maximizing the cache hit ratio but also reduces approximation errors by
avoiding the retention and reuse of extremely stale embeddings. We also develop
a rigorous theoretical analysis of the approximation error introduced by our
reuse schemes and offer convergence guarantees. In the second
work, we build a theoretical connection between the temporal message passing
scheme adopted by T-GNNs and temporal random walks on dynamic graphs. Based on
this finding, we propose T-PPR, a parameterized metric for estimating the
influence scores of nodes on evolving graphs. We further develop an efficient
single-scan algorithm that answers top-k T-PPR queries with rigorous
approximation guarantees. Finally, we present Zebra, a scalable
framework that accelerates the computation of T-GNN by directly aggregating the
features of the most prominent temporal neighbors returned by the top-k T-PPR
query.
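The thesis's actual policies are not reproduced in this abstract, but the core caching idea behind Orca, reusing embeddings only while they remain fresh, can be illustrated with a minimal sketch. All names, the logical-timestamp interface, and the LRU-style eviction heuristic below are illustrative assumptions, not the optimal replacement policy proved in the thesis:

```python
class EmbeddingCache:
    """Staleness-bounded embedding cache (illustrative sketch).

    Embeddings older than `max_staleness` logical steps are treated as
    expired, so extremely stale embeddings are never reused; among valid
    entries, the one with the oldest hit time is evicted when full.
    """

    def __init__(self, capacity, max_staleness):
        self.capacity = capacity
        self.max_staleness = max_staleness
        self.store = {}  # node_id -> (embedding, created_at, last_hit)

    def get(self, node_id, now):
        entry = self.store.get(node_id)
        if entry is None:
            return None  # cache miss: recompute the embedding
        emb, created_at, _ = entry
        if now - created_at > self.max_staleness:
            del self.store[node_id]  # too stale to reuse safely
            return None
        self.store[node_id] = (emb, created_at, now)  # record the hit
        return emb

    def put(self, node_id, emb, now):
        if node_id not in self.store and len(self.store) >= self.capacity:
            # Evict the entry whose last hit is oldest (LRU-style heuristic).
            victim = min(self.store, key=lambda k: self.store[k][2])
            del self.store[victim]
        self.store[node_id] = (emb, now, now)
```

A replacement policy of this shape trades off two goals named in the abstract: keeping frequently reused embeddings resident (hit ratio) while bounding how stale a reused embedding can be (approximation error).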

Second, we present Camel, a system that improves the efficiency of incremental
learning over streams of vector data. Specifically, Camel includes two
independent data management components: coreset selection and buffer update. To
accelerate model training, Camel selects a coreset from each streaming data
batch for model update. Selecting a coreset with worst-case guarantees is
NP-hard. To solve this problem, we reformulate coreset selection as a
submodular maximization problem by deriving an upper bound on the objective
function. To mitigate catastrophic forgetting, Camel maintains a buffer of past
representative samples as new data arrive. Moreover, Camel quantizes numerical
data in the buffer via a quantile sketch to reduce the memory footprint.
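Camel's specific objective and upper bound are not given in this abstract; the sketch below only illustrates the standard greedy algorithm for cardinality-constrained monotone submodular maximization, which underlies such reformulations and carries a (1 - 1/e) approximation guarantee. The facility-location-style coverage objective and all names here are illustrative assumptions:

```python
def greedy_coreset(points, k, sim):
    """Greedily maximize the facility-location objective
        f(S) = sum_i max_{j in S} sim(points[i], points[j]),
    which is monotone submodular, so greedy selection achieves a
    (1 - 1/e)-approximation of the optimal size-k coreset.
    """
    n = len(points)
    selected = []
    best_cov = [0.0] * n  # how well each point is covered by S so far
    for _ in range(min(k, n)):
        # Marginal gain of adding candidate j to the current set S.
        def gain(j):
            if j in selected:
                return -1.0  # never re-select
            return sum(max(sim(points[i], points[j]) - best_cov[i], 0.0)
                       for i in range(n))
        j_star = max(range(n), key=gain)
        selected.append(j_star)
        best_cov = [max(best_cov[i], sim(points[i], points[j_star]))
                    for i in range(n)]
    return selected
```

On a toy stream batch with two nearby points and one outlier, the greedy rule picks one representative per cluster, which is the qualitative behavior a streaming coreset needs.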

We demonstrate the efficiency and effectiveness of our proposed methods against
state-of-the-art approaches through extensive experiments on real-world
datasets. Finally, we conclude this thesis with future research directions.


Date:                   Friday, 15 December 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 4472
                        Lifts 25/26

Chairman:               Prof. Xiangtong QI (IEDA)

Committee Members:      Prof. Lei CHEN (Supervisor)
                        Prof. Ke YI
                        Prof. Xiaofang ZHOU
                        Prof. George YUAN (ECE)
                        Prof. Guoliang LI (Tsinghua University)


**** ALL are Welcome ****