Learning with Differential Privacy

PhD Thesis Proposal Defence


Title: "Learning with Differential Privacy"

by

Mr. Peng YE


Abstract:

While machine learning has achieved remarkable advancements across various
domains in recent years, it simultaneously raises significant privacy
concerns, particularly due to the potential inclusion of sensitive
information within datasets. Differential privacy offers a rigorous approach
to quantifying privacy leakage and has emerged as the gold standard for
privacy protection. However, safeguarding privacy often comes at the cost of
utility degradation. In this thesis, we investigate the statistical cost of
privacy in learning and present the following contributions.
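
For reference, since the abstract distinguishes "pure" from "approximate"
privacy below, the standard definition is as follows: a randomized
algorithm M is (ε, δ)-differentially private if, for any two neighbouring
datasets S and S' differing in a single record and any set of outcomes T,

\[
  \Pr[M(S) \in T] \;\le\; e^{\varepsilon} \, \Pr[M(S') \in T] + \delta.
\]

Pure differential privacy refers to the case δ = 0, while approximate
differential privacy allows δ > 0.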

We consider two data-generation models: the Probably Approximately Correct
(PAC) model and the online model. In the PAC model, the input data points
are drawn independently and identically distributed (i.i.d.) from an
unknown distribution and provided up front as a batch. PAC learning, also
known as realizable learning, assumes the existence of a hypothesis within
the given class that perfectly classifies the data. In contrast, the
agnostic learning framework makes no such assumption. In the online model,
data points are no longer i.i.d.; they arrive in a stream, and the learner
must output a hypothesis immediately after receiving each data point. As in
the PAC model, online learning is categorized as realizable or agnostic
depending on whether a zero-error hypothesis exists within the given
hypothesis class.
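
In standard notation (a brief formal restatement, included for context):
given a distribution D over labelled examples and a hypothesis class H, the
error of a hypothesis h is

\[
  \mathrm{err}_D(h) = \Pr_{(x, y) \sim D}[h(x) \ne y].
\]

Realizable PAC learning assumes that some h in H achieves err_D(h) = 0,
whereas an agnostic learner with accuracy parameter α must output a
hypothesis ĥ satisfying

\[
  \mathrm{err}_D(\hat{h}) \;\le\; \min_{h \in H} \mathrm{err}_D(h) + \alpha.
\]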

First, we examine agnostic learning under pure differential privacy and
introduce an algorithm that achieves near-optimal sample complexity. Unlike
realizable learning, which assumes that some hypothesis within the given
class correctly labels all the data, agnostic learning is more general and
practical, as it makes no such assumption about the data. Prior to our
work, optimal algorithms were known only for private realizable learning,
and all existing algorithms for private agnostic learning were suboptimal.
We also study the problem under user-level privacy, where each user may
contribute multiple data points, and derive improved bounds compared to
previous work.

Second, we propose a general transformation that converts any private
realizable learner into a private agnostic learner with minimal increase in
sample complexity. This generalizes our previous result for pure private
agnostic learning, as it applies to any algorithm and works under both pure
and approximate privacy. We also apply the proposed techniques to settle the
sample complexity for the private prediction problem, where the algorithm is
tasked with producing a single prediction rather than a hypothesis.

Third, we investigate private learning in the online model, where data
points arrive sequentially and the learner must output a hypothesis after
each arrival. We derive hardness results for both pure and approximate
privacy, suggesting that privacy constraints can render online learning
significantly more challenging or even intractable.
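
For context, a standard way to measure performance in the agnostic online
setting is the regret against the best fixed hypothesis in hindsight: after
T rounds,

\[
  \mathrm{Regret}_T = \sum_{t=1}^{T} \mathbf{1}[h_t(x_t) \ne y_t]
    \;-\; \min_{h \in H} \sum_{t=1}^{T} \mathbf{1}[h(x_t) \ne y_t],
\]

where h_t is the hypothesis the learner uses to predict on the t-th
example. In the realizable setting, this reduces to counting the learner's
mistakes.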


Date:                   Monday, 27 October 2025

Time:                   3:00pm - 5:00pm

Venue:                  Room 4472
                        Lifts 25/26

Committee Members:      Prof. Bo Li (Supervisor)
                        Dr. Wei Wang (Co-supervisor)
                        Prof. Ke Yi (Chairperson)
                        Prof. Siu-Wing Cheng