
PhD Qualifying Examination


Title: "A Survey on Communication-Efficient Machine Learning in Multi-Cloud 
Environments"

by

Mr. Decang SUN


Abstract:

The rapid growth of foundation models has made it difficult for a single 
datacenter to meet their demand for computing power. As a result, many 
systems now rely on large, geo-distributed multi-cluster environments. While 
such environments offer greater aggregate resources, they also introduce 
major challenges, particularly in workflow management and efficient 
communication, since wide-area networks (WANs) typically offer limited 
bandwidth and high latency.

This survey reviews recent methods for training and serving machine learning 
models across multiple clusters. We begin with fundamental parallel training 
techniques and then examine advanced systems that use dynamic pipelines, 
global scheduling, modality-aware parallelism, and network-aware placement 
strategies to improve performance.

From these studies, we distill key ideas and trade-offs that shape 
large-scale distributed learning. We highlight the need for better 
orchestration frameworks that jointly optimize system throughput, 
communication cost, and training efficiency. By bringing together recent 
advances, this survey aims to inform efforts to scale model training beyond 
the limits of a single datacenter.


Date:                   Friday, 25 July 2025

Time:                   9:00am - 11:00am

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Dr. Binhang Yuan (Chairperson)
                        Dr. Dan Xu