Towards Efficient and Secure Large-Scale Systems for Distributed Machine Learning Training
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Efficient and Secure Large-Scale Systems for Distributed Machine Learning Training"

By

Mr. Chengliang ZHANG

Abstract:

Machine learning (ML) techniques have advanced in leaps and bounds over the past decade. Their success relies critically on abundant computing power and the availability of big data: it is impractical to host ML training on a single machine, and a single data source usually does not produce a sufficiently general model. By distributing the ML workload across multiple machines and utilizing data from multiple silos, we can substantially improve the quality of ML training. As large-scale ML training is increasingly deployed in production systems involving multiple entities, improving efficiency and ensuring the confidentiality of the participants become pressing needs. First, how can we efficiently train an ML model in a cluster in the presence of heterogeneity? Second, in federated learning (FL), where multiple data owners collaboratively train a model, how can we mitigate the overhead introduced by privacy-preserving techniques? Lastly, in the more nuanced case where many organizations own data but not ML expertise and would like to pool their data and collaborate with those who have the expertise (the model owner) to train generalizable models, how can we protect the model owner's intellectual property (model privacy) while preserving the data privacy of the data owners?

General ML training solutions are inadequate under the efficiency and privacy challenges posed by distributed ML. First, traditional distributed ML systems often conduct asynchronous training to mitigate the impact of stragglers. While this maximizes training throughput, the price paid is degraded training quality due to inconsistency across workers. Second, although techniques like Homomorphic Encryption (HE) can be conveniently adopted to preserve data privacy in FL, they induce prohibitively high computation and communication overheads. Third, there is not yet a practical solution that protects the model owner's intellectual property without compromising the data owners' privacy.

To fill these gaps, we profile, analyze, and propose new strategies to improve training efficiency and privacy guarantees. To improve the efficiency of distributed asynchronous training, we first propose a new distributed synchronization scheme, termed speculative synchronization. Our scheme allows workers to speculate about recent parameter updates from others on the fly and, if necessary, abort the ongoing computation, pull fresher parameters, and start over to improve the quality of training. We implement our scheme and demonstrate that speculative synchronization achieves substantial speedups over the asynchronous parallel scheme with minimal communication overhead.

Second, we present BatchCrypt, a system solution for cross-silo FL that significantly reduces the encryption and communication overhead caused by HE. Instead of encrypting individual gradients at full precision, we encode a batch of quantized gradients into a long integer and encrypt it in one go. To allow gradient-wise aggregation to be performed on ciphertexts of the encoded batches, we develop new quantization and encoding schemes along with a novel gradient clipping technique.
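As a rough illustration of the batching idea, the Python sketch below packs quantized, clipped gradients into one large integer so that adding two packed integers adds the underlying gradients lane by lane; in BatchCrypt it is such packed integers that are encrypted with additively homomorphic encryption. The bit widths, clipping range, signed-offset lane encoding, and function names here are illustrative assumptions, not BatchCrypt's actual quantization or encoding scheme.

import numpy as np

BITS = 16                      # bits per quantized gradient (assumed)
PAD = 8                        # padding bits so lane-wise sums across a modest
LANE = BITS + PAD              #   number of parties do not overflow a lane
CLIP = 1.0                     # symmetric clipping range (assumed)
SCALE = (2 ** (BITS - 1) - 1) / CLIP

def quantize(grads):
    # Clip to [-CLIP, CLIP], then map to signed integers.
    return np.round(np.clip(grads, -CLIP, CLIP) * SCALE).astype(np.int64)

def pack(q):
    # Pack quantized gradients into one long integer, one lane per gradient.
    # Each lane stores an offset (non-negative) value, so adding two packed
    # integers adds the underlying gradients lane by lane.
    offset = 1 << (BITS - 1)
    packed = 0
    for i, v in enumerate(q):
        packed |= (int(v) + offset) << (i * LANE)
    return packed

def unpack(packed, n, parties):
    # Recover the aggregated gradients from a sum of `parties` packed integers.
    mask = (1 << LANE) - 1
    offset = (1 << (BITS - 1)) * parties   # offsets accumulate across parties
    return np.array([((packed >> (i * LANE)) & mask) - offset
                     for i in range(n)]) / SCALE

# Two clients' gradients; in BatchCrypt the packed integers would be encrypted
# with additively homomorphic encryption (e.g. Paillier), and the aggregator
# would add ciphertexts instead of plaintext integers.
g1 = np.array([0.12, -0.5, 0.9])
g2 = np.array([-0.3, 0.25, 0.4])
print(unpack(pack(quantize(g1)) + pack(quantize(g2)), n=3, parties=2))  # ~ g1 + g2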
Our evaluations confirm that BatchCrypt effectively reduces the computation and communication overhead. Lastly, to address collaborative learning scenarios where model privacy is also required, we devise Citadel, a scalable system that protects the privacy of both data owners and the model owner on untrusted infrastructure with the help of Intel SGX. Citadel performs distributed training across multiple training enclaves running on behalf of data owners and an aggregator enclave running on behalf of the model owner. It further establishes a strong information barrier between these enclaves by means of zero-sum masking and hierarchical aggregation to prevent data/model leakage during collaborative training. We deploy Citadel in the cloud to train various ML models and show that it is scalable while providing strong privacy guarantees.

Date: Wednesday, 31 March 2021
Time: 1:00pm - 3:00pm
Zoom Meeting: https://hkust.zoom.us/j/96913415039?pwd=MzV3a2Mrbk5qTS9uU05Kb3BHRVVJdz09

Chairperson: Prof. Kun XU (MATH)

Committee Members: Prof. Wei WANG (Supervisor)
Prof. Bo LI
Prof. Shuai WANG
Prof. Jiang XU (ECE)
Prof. Song GUO (PolyU)

**** ALL are Welcome ****