Learning with Differential Privacy

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Learning with Differential Privacy"

By

Mr. Peng YE


Abstract:

While machine learning has achieved remarkable advancements across various
domains in recent years, it simultaneously raises significant privacy
concerns, particularly due to the potential inclusion of sensitive
information within datasets. Differential privacy offers a rigorous
approach to quantifying privacy leakage and has emerged as the gold
standard for privacy protection. However, safeguarding privacy often comes
at the cost of utility degradation. In this thesis, we investigate the
cost of privacy in learning and present the following contributions.
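As background for readers unfamiliar with the framework, differential
privacy bounds how much any single record can shift an algorithm's output
distribution. A minimal Python sketch using the standard Laplace
mechanism for a counting query (the `private_count` helper and its
parameters are illustrative, not taken from the thesis):

```python
import math
import random

def laplace_noise(scale):
    """Sample centered Laplace noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    # log(1 - 2|u|) <= 0; copysign attaches the sign of u, giving a
    # symmetric two-sided exponential with the requested scale.
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_count(data, predicate, epsilon):
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one record
    changes the true count by at most 1, so Laplace noise of scale
    1/epsilon suffices.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller values of epsilon mean stronger privacy but noisier answers,
which is exactly the privacy/utility trade-off the abstract refers to.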

We examine two primary models of data generation: the Probably
Approximately Correct (PAC) model and the online model. In the PAC model,
the input data points are independently and identically distributed
(i.i.d.) and provided beforehand. The algorithm takes as input the entire
dataset and produces a single hypothesis. In contrast, the online model
targets the scenario where data points are no longer i.i.d. and arrive
sequentially in a stream. The learner must output an updated hypothesis
immediately after receiving each data point. Both models can operate under
two distinct settings: the realizable setting, which assumes the existence
of a hypothesis within the given concept class that perfectly classifies
the data, and the agnostic setting, which makes no such assumption.

For agnostic PAC learning, we study the problem under pure differential
privacy and propose an algorithm that achieves near-optimal sample
complexity. The improvement stems from a novel construction of the score
function used in the exponential mechanism. We also extend our algorithm
to the user-level privacy setting, where each user may hold multiple data
points, and derive improved bounds compared to prior work.
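The thesis's specific score construction is not spelled out in this
abstract; as background, a minimal sketch of the generic exponential
mechanism it builds on, which selects a candidate with probability
proportional to exp(epsilon * score / (2 * sensitivity)) (function and
variable names here are illustrative):

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity=1.0):
    """Privately select a candidate, favoring higher scores.

    Each candidate c is chosen with probability proportional to
    exp(epsilon * score(c) / (2 * sensitivity)).
    """
    scores = [score(c) for c in candidates]
    m = max(scores)
    # Shifting by the maximum score keeps the exponentials numerically
    # stable and leaves the sampling distribution unchanged.
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity))
               for s in scores]
    r = random.random() * sum(weights)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]
```

In a learning context the candidates would be hypotheses and the score
could be, say, the negated number of mistakes on the dataset, so that
low-error hypotheses are exponentially more likely to be released.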

We further introduce a general transformation that converts any private
realizable learner into a private agnostic learner with minimal increase
in sample complexity, applicable to both pure and approximate differential
privacy. This result demonstrates that there is no extra cost for privacy
in agnostic PAC learning compared to realizable PAC learning, since this
increase in sample complexity is necessary even without privacy
constraints. Additionally, we
apply our techniques to resolve the sample complexity for the private
prediction problem, where the algorithm is required to produce a single
prediction rather than a hypothesis.

For the online learning model, we first establish two hardness results
for private learning in the realizable setting, suggesting that privacy
constraints can render online learning significantly more challenging or
even intractable. These findings reveal significant separations between
non-private learning and private learning, as well as between pure
differential privacy and approximate differential privacy.

Finally, we explore online learning under approximate differential
privacy against an adaptive adversary, who generates each data point
based on the learner's previous predictions. We design a realizable
learner with a
logarithmic mistake bound and propose the first agnostic learner that
achieves sublinear regret. Our results indicate that the cost of privacy
in agnostic online learning is relatively small, with only minor
degradation compared to the optimal non-private regret.


Date:                   Monday, 5 January 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Yilong HAN (PHYS)

Committee Members:      Prof. Bo LI (Supervisor)
                        Dr. Wei WANG (Co-Supervisor)
                        Prof. Siu-Wing CHENG
                        Prof. Ke YI
                        Prof. Dan WANG (ENVR)
                        Dr. Zhiyi HUANG (HKU)