A Survey on Communication-Efficient Machine Learning in Multi-Cloud Environments
PhD Qualifying Examination
Title: "A Survey on Communication-Efficient Machine Learning in Multi-Cloud
Environments"
by
Mr. Decang SUN
Abstract:
The rapid growth of foundation models has made it hard for a single
datacenter to meet the demand for computing power. As a result, many
systems now rely on large, geo-distributed multi-cluster environments. While
these offer more total resources, they also introduce major challenges,
especially in managing workflows and ensuring efficient communication, since
wide-area networks (WANs) tend to have lower bandwidth and higher latency
than links within a datacenter.
This survey reviews recent methods for training and serving machine learning
models across multiple clusters. We start with basic parallel training
techniques, then explore advanced systems that use dynamic pipelines, global
scheduling, modality-aware parallelism, and network-aware placement
strategies to improve performance.
From these studies, we summarize key ideas and trade-offs that shape
large-scale distributed learning. We highlight the need for better
orchestration frameworks that jointly optimize system throughput,
communication cost, and training efficiency. By bringing together recent
advances, this survey helps inform efforts to expand model training beyond
the limits of a single datacenter.
Date: Friday, 25 July 2025
Time: 9:00am - 11:00am
Venue: Room 3494 (Lifts 25/26)
Committee Members: Prof. Kai Chen (Supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Dan Xu