Towards Communication-Efficient Distributed Training Systems

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Communication-Efficient Distributed Training Systems"

By

Mr. Xinchen WAN


Abstract:

As the scaling law continues to hold, distributed training has become the
standard methodology for managing the exponential growth in model size and
training data. Following this trend, distributed training systems have been
developed to handle the complexity and scale of distributed training and to
harness the computational power of multiple devices. However, communication
remains one of the major challenges in these systems. Communication issues
manifest in two key domains: the varied communication patterns across
training paradigms during the model training stage, and the vendor-specific
inter-device transmission across different hardware during the data and
model management stages.

This dissertation delineates my research efforts on building 
communication-efficient distributed training systems through 
paradigm-specific optimizations for the model training stage and 
hardware-agnostic optimizations for the data and model management stages. 
The resulting optimized systems are full-stack solutions tailored for 
hardware-accelerated Graph Neural Network (GNN) and Large Language Model 
(LLM) training.

For the sampling-based GNN training paradigm, we propose DGS, a
communication-efficient graph sampling framework. Its key idea is to reduce
network communication cost by sampling neighborhood information according
to the locality of neighbor nodes in the cluster, and by sampling data at
both the node and feature levels. As a result, DGS strikes a balance between
communication efficiency and model accuracy, and integrates seamlessly with
distributed GNN training systems.
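
To illustrate the idea, the sketch below shows what locality-aware, two-level
sampling might look like. The names (sample_neighbors, sample_features,
LOCAL_BIAS, FEATURE_KEEP_RATIO) and the bias/ratio values are illustrative
assumptions, not DGS's actual interface or parameters.

    import random

    LOCAL_BIAS = 4.0          # assumed: favor neighbors on the local partition
    FEATURE_KEEP_RATIO = 0.5  # assumed: fraction of feature dimensions kept

    def sample_neighbors(node, graph, partition_of, local_rank, fanout):
        """Node-level sampling biased toward locally stored neighbors."""
        neighbors = graph[node]
        if not neighbors:
            return []
        weights = [LOCAL_BIAS if partition_of[n] == local_rank else 1.0
                   for n in neighbors]
        return random.choices(neighbors, weights=weights,
                              k=min(fanout, len(neighbors)))

    def sample_features(feature_vec):
        """Feature-level sampling: transmit only a subset of dimensions."""
        keep = random.sample(range(len(feature_vec)),
                             int(len(feature_vec) * FEATURE_KEEP_RATIO))
        return {d: feature_vec[d] for d in sorted(keep)}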

For the full-graph GNN training paradigm, we design G3, a scalable and
efficient full-graph training system. G3 incorporates GNN hybrid parallelism
to scale out full-graph training with meticulous peer-to-peer intermediate
data sharing, and accelerates the training process by balancing workloads
among workers through locality-aware iterative partitioning and by
overlapping communication with computation through a multi-level pipeline
scheduling algorithm. Although initially tailored to GNN training, we
believe the fundamental principle of peer-to-peer data sharing in hybrid
parallelism can be generalized to other training tasks.
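
As a rough illustration of the communication/computation overlap, the sketch
below prefetches the peer activations needed by the next layer while the
current layer is being computed. fetch_remote_activations and compute_layer
are hypothetical stand-ins, not G3's actual APIs, and the real system
pipelines at multiple levels rather than this single one.

    from concurrent.futures import ThreadPoolExecutor

    def train_step(layers, fetch_remote_activations, compute_layer):
        """Overlap peer-to-peer transfers with per-layer computation."""
        with ThreadPoolExecutor(max_workers=1) as comm:
            # Prefetch the peer activations needed by the first layer.
            pending = comm.submit(fetch_remote_activations, layers[0])
            hidden = None
            for i, layer in enumerate(layers):
                remote = pending.result()        # wait for this layer's data
                if i + 1 < len(layers):          # start next transfer early
                    pending = comm.submit(fetch_remote_activations,
                                          layers[i + 1])
                hidden = compute_layer(layer, hidden, remote)
            return hidden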

For the LLM training paradigm, we introduce Hermod, a near-optimal coflow
scheduler designed to manage coflows across all parallelisms in LLM
training. At its core, Hermod employs a priority-based inter-coflow
scheduling policy that prioritizes coflows based on key factors such as
microbatch order, the type of communication operator, and layer order. It
also incorporates an optimal intra-coflow scheduling policy that minimizes
coflow completion time (CCT) by maintaining line-rate transmission for flows
within the highest-priority coflow. We believe that revisiting coflow
scheduling in the LLM training context will inspire further research and
lead to more efficient LLM training techniques.
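
A minimal sketch of such a priority key is shown below. The Coflow fields,
the operator ranking, and the ordering of the three factors are illustrative
assumptions and do not reflect Hermod's exact policy.

    from dataclasses import dataclass

    # Assumed ranking of communication operators; not Hermod's actual order.
    OP_RANK = {"pp_send": 0, "tp_allreduce": 1, "dp_allreduce": 2}

    @dataclass
    class Coflow:
        microbatch: int   # earlier microbatches first
        op_type: str      # e.g. "pp_send", "tp_allreduce", "dp_allreduce"
        layer: int        # layer order within the model

    def priority_key(c: Coflow):
        # Smaller tuple = higher priority; the head coflow's flows are kept
        # at line rate so its completion time is minimized.
        return (c.microbatch, OP_RANK.get(c.op_type, len(OP_RANK)), c.layer)

    # Usage: keep pending coflows sorted and serve the head first.
    pending = [Coflow(1, "dp_allreduce", 3), Coflow(0, "pp_send", 7)]
    pending.sort(key=priority_key)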

To enable hardware-agnostic optimizations across different hardware 
platforms, we present Leo, a generic and efficient communication framework 
for distributed services. Leo offers 1) a communication path abstraction to 
describe various distributed services with predictable communication 
performance across edge accelerators, 2) unified APIs and wrappers that 
simplify the programming experience with automatic communication 
configuration, 
and 3) a built-in multi-path communication optimization strategy to enhance 
communication efficiency. We believe Leo will serve as a stepping stone for 
the development of hardware-accelerated distributed services in distributed 
training systems.
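
The sketch below suggests what a communication path abstraction and a
multi-path send might look like. The class and method names (CommPath,
MultiPathChannel, bandwidth, send) are hypothetical and do not correspond to
Leo's actual API.

    from abc import ABC, abstractmethod

    class CommPath(ABC):
        """One transport between two devices (e.g. PCIe, USB, Ethernet)."""
        @abstractmethod
        def bandwidth(self) -> float: ...   # estimated MB/s, used for planning
        @abstractmethod
        def send(self, data: bytes) -> None: ...

    class MultiPathChannel:
        """Unified send(): split a payload across the fastest paths."""
        def __init__(self, paths):
            self.paths = sorted(paths, key=lambda p: p.bandwidth(),
                                reverse=True)

        def send(self, data: bytes, k: int = 2) -> None:
            chosen = self.paths[:k]
            total = sum(p.bandwidth() for p in chosen)
            offset = 0
            for p in chosen:
                share = int(len(data) * p.bandwidth() / total)
                p.send(data[offset:offset + share])
                offset += share
            if offset < len(data):              # remainder to fastest path
                chosen[0].send(data[offset:])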


Date:                   Tuesday, 11 February 2025

Time:                   1:00pm - 3:00pm

Venue:                  Room 4472
                        Lifts 25/26

Chairman:               Prof. Christopher Kin Ying LEUNG (CIVL)

Committee Members:      Prof. Kai CHEN (Supervisor)
                        Prof. Song GUO
                        Prof. Qiong LUO
                        Prof. Wei ZHANG (ECE)
                        Dr. Hong XU (CUHK)