Learning with Differential Privacy
PhD Thesis Proposal Defence

Title: "Learning with Differential Privacy"

by

Mr. Peng YE

Abstract:

While machine learning has achieved remarkable advances across many domains in recent years, it also raises significant privacy concerns, particularly because training datasets may contain sensitive information. Differential privacy offers a rigorous way to quantify privacy leakage and has emerged as the gold standard for privacy protection. However, safeguarding privacy often comes at the cost of degraded utility. In this thesis, we investigate the statistical cost of privacy in learning and present the following contributions.

We consider two data generation models: the Probably Approximately Correct (PAC) model and the online model. In the PAC model, the input data points are independently and identically distributed (i.i.d.) and provided up front. PAC learning, also known as realizable learning, assumes that some hypothesis in the given class perfectly classifies the data; the agnostic learning framework makes no such assumption. In the online model, data points are no longer i.i.d. and arrive in a stream, and the learner must output a hypothesis immediately after receiving each data point. As in the PAC model, online learning is classified as realizable or agnostic according to whether the hypothesis class contains a zero-error hypothesis.

First, we examine agnostic learning under pure differential privacy and introduce an algorithm that achieves near-optimal sample complexity. Agnostic learning is more general and practical than realizable learning because it imposes no assumptions on the data. Prior to our work, optimal algorithms were known only for private realizable learning; all existing algorithms for private agnostic learning were suboptimal. We also study the problem under user-level privacy, where each user may hold multiple data points, and derive improved bounds over previous work.

Second, we propose a general transformation that converts any private realizable learner into a private agnostic learner with only a minimal increase in sample complexity. This generalizes our first result, as it applies to any algorithm and works under both pure and approximate privacy. We also apply these techniques to settle the sample complexity of the private prediction problem, in which the algorithm must produce a single prediction rather than a hypothesis.

Third, we investigate private learning in the online setting. We derive hardness results under both pure and approximate privacy, suggesting that privacy constraints can make online learning significantly harder or even intractable.

Date: Monday, 27 October 2025
Time: 3:00pm - 5:00pm
Venue: Room 4472 (Lifts 25/26)

Committee Members:
Prof. Bo Li (Supervisor)
Dr. Wei Wang (Co-supervisor)
Prof. Ke Yi (Chairperson)
Prof. Siu-Wing Cheng
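As background for the abstract's central notion, here is the standard definition of (ε, δ)-differential privacy; the symbols M, S, S', and T are conventional notation for this definition, not notation taken from the thesis itself.

```latex
% A randomized algorithm M is (\varepsilon,\delta)-differentially private
% if for every pair of datasets S, S' differing in a single record and
% every measurable set of outputs T:
\Pr[M(S) \in T] \;\le\; e^{\varepsilon} \, \Pr[M(S') \in T] + \delta .
% "Pure" differential privacy is the special case \delta = 0;
% "approximate" differential privacy allows a small \delta > 0.
% This is the pure/approximate distinction drawn in the abstract.
```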
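The realizable/agnostic distinction can likewise be stated formally. The sketch below uses the standard accuracy and confidence parameters α and β, which are conventional choices rather than notation from the announcement.

```latex
% Given n i.i.d. samples from a distribution D, a learner for a
% hypothesis class H must output h such that, with probability
% at least 1 - \beta over the sample:
%
%   realizable (PAC) setting, assuming some h^* \in H has zero error:
\mathrm{err}_D(h) \le \alpha ,
%   agnostic setting, with no assumption on the data:
\mathrm{err}_D(h) \le \min_{h' \in H} \mathrm{err}_D(h') + \alpha .
% The sample complexity is the smallest n for which this is achievable.
```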
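For the online model, one standard way to measure performance is regret against the best fixed hypothesis in hindsight; the formulation below is a common textbook version and the notation is illustrative.

```latex
% At each round t = 1, ..., T the learner predicts \hat{y}_t for x_t
% and then observes the true label y_t. Regret compares the learner's
% mistakes to those of the best single hypothesis in hindsight:
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \mathbf{1}[\hat{y}_t \ne y_t]
  \;-\; \min_{h \in H} \sum_{t=1}^{T} \mathbf{1}[h(x_t) \ne y_t] .
% In the realizable case the second sum is zero for some h \in H, so
% low regret means making few mistakes in total.
```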