A Survey on Communication-Efficient Machine Learning in Multi-Cloud Environments
PhD Qualifying Examination

Title: "A Survey on Communication-Efficient Machine Learning in Multi-Cloud Environments"

by Mr. Decang SUN

Abstract:

The rapid growth of foundation models has made it hard for a single datacenter to meet the growing need for computing power. As a result, many systems now rely on large, geo-distributed multi-cluster environments. While these offer more total resources, they also introduce major challenges, especially in managing workflows and ensuring efficient communication, since wide-area networks (WANs) tend to have limited bandwidth and higher latency.

This survey reviews recent methods for training and serving machine learning models across multiple clusters. We start with basic parallel training techniques, then explore advanced systems that use dynamic pipelines, global scheduling, modality-aware parallelism, and network-aware placement strategies to improve performance. From these studies, we summarize key ideas and trade-offs that shape large-scale distributed learning. We highlight the need for better orchestration frameworks that jointly optimize system throughput, communication cost, and training efficiency. By bringing together recent advances, this survey helps inform efforts to expand model training beyond the limits of a single datacenter.

Date: Friday, 25 July 2025
Time: 9:00am - 11:00am
Venue: Room 3494 (Lifts 25/26)

Committee Members:
Prof. Kai Chen (Supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Dan Xu