The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Towards Communication-Efficient Distributed Training Systems"
By
Mr. Xinchen WAN
Abstract:
As scaling laws continue to hold, distributed training has become the standard methodology for handling the exponential growth in model size and training data. Following this trend, distributed training systems have been developed to manage the complexity and scale of distributed training and to harness the computational power of multiple devices. However, communication remains one of the major challenges in these systems. Communication issues manifest in two key domains: the varied communication patterns across training paradigms during the model training stage, and the vendor-specific inter-device transmission across different hardware during the data and model management stages.
This dissertation delineates my research efforts on building communication-efficient distributed training systems through paradigm-specific optimizations for the model training stage and hardware-agnostic optimizations for the data and model management stages. The resulting systems are full-stack solutions tailored for hardware-accelerated Graph Neural Network (GNN) and Large Language Model (LLM) training.
For the sampling-based GNN training paradigm, we propose DGS, a communication-efficient graph sampling framework. Its key idea is to reduce network communication cost by sampling neighborhood information according to the locality of neighbor nodes in the cluster, and by sampling data at both the node and feature levels. As a result, DGS strikes a balance between communication efficiency and model accuracy, and integrates seamlessly with distributed GNN training systems.
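To make the two-level idea concrete, below is a minimal Python sketch of locality-biased neighbor sampling combined with feature-level sampling. The function names, the partition map, and the local-bias heuristic are illustrative assumptions, not DGS's actual interface or policy.

```python
import random

def sample_neighbors(neighbors, owner_of, local_rank, fanout, local_bias=0.7):
    """Pick `fanout` neighbors, biased toward nodes stored on this worker.

    neighbors  : list of neighbor node IDs
    owner_of   : dict mapping node ID -> worker rank holding its features
    local_rank : rank of the current worker
    local_bias : fraction of the fanout drawn from local neighbors first
    (All names and the bias heuristic are illustrative, not DGS's policy.)
    """
    local = [n for n in neighbors if owner_of[n] == local_rank]
    remote = [n for n in neighbors if owner_of[n] != local_rank]
    k_local = min(len(local), int(fanout * local_bias))
    picked = random.sample(local, k_local)
    k_remote = min(len(remote), fanout - len(picked))
    picked += random.sample(remote, k_remote)
    return picked

def sample_features(feature_vector, keep_ratio=0.5):
    """Feature-level sampling: transmit only a random subset of dimensions."""
    keep = int(len(feature_vector) * keep_ratio)
    dims = sorted(random.sample(range(len(feature_vector)), keep))
    return dims, [feature_vector[d] for d in dims]
```

Favoring local neighbors keeps most lookups on the same worker, while feature-level sampling shrinks the payload of the remaining cross-worker transfers.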
For the full-graph GNN training paradigm, we design G3, a scalable and efficient full-graph training system. G3 incorporates GNN hybrid parallelism to scale out full-graph training with meticulous peer-to-peer intermediate data sharing. It further accelerates training by balancing workloads among workers through locality-aware iterative partitioning and by overlapping communication with computation through a multi-level pipeline scheduling algorithm. Although initially tailored for GNN training, we believe the fundamental principle of peer-to-peer data sharing in hybrid parallelism can be generalized to other training tasks.
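Below is a minimal Python sketch of the communication/computation overlap idea: while one graph chunk is being computed, the peer-to-peer exchange for the next chunk is already in flight. The helper names and the single-threaded executor are assumptions for illustration; G3's actual multi-level pipeline scheduling is considerably more involved.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_remote_partials(chunk_id):
    """Stand-in for a peer-to-peer exchange of intermediate activations."""
    time.sleep(0.05)  # pretend network transfer
    return f"partials-for-chunk-{chunk_id}"

def compute_chunk(chunk_id, partials):
    """Stand-in for the local GNN computation on one graph chunk."""
    time.sleep(0.05)  # pretend accelerator work
    return f"output-{chunk_id}"

def train_step(num_chunks=4):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Prefetch communication for chunk 0, then keep one chunk in flight.
        pending = comm.submit(fetch_remote_partials, 0)
        for c in range(num_chunks):
            partials = pending.result()  # blocks only if comm is slower
            if c + 1 < num_chunks:
                pending = comm.submit(fetch_remote_partials, c + 1)
            outputs.append(compute_chunk(c, partials))
    return outputs

print(train_step())
```

When communication and computation take comparable time, this simple one-chunk-ahead pipeline roughly halves the step time compared with running them back to back.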
For the LLM training paradigm, we introduce Hermod, a near-optimal coflow scheduler designed to manage coflows across all parallelisms in LLM training. At its core, Hermod employs a priority-based inter-coflow scheduling policy that prioritizes coflows based on key factors such as microbatch order, the type of communication operator, and layer order. It also incorporates an optimal intra-coflow scheduling policy that minimizes coflow completion time (CCT) by maintaining line-rate transmission for flows within the highest-priority coflow. We believe that revisiting coflow scheduling in the LLM training context will inspire further research and lead to more efficient LLM training techniques.
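Below is a minimal Python sketch of how a coflow priority could be composed from the three factors named above. The operator ordering and the tuple layout are assumptions for illustration, not Hermod's actual policy.

```python
from dataclasses import dataclass

# Illustrative operator ordering; the real weighting may differ.
OP_RANK = {"pipeline_send": 0, "tensor_allreduce": 1, "data_allreduce": 2}

@dataclass
class Coflow:
    microbatch: int   # microbatch order within the training iteration
    op_type: str      # communication operator that produced the coflow
    layer: int        # layer order in the model

def priority_key(cf: Coflow):
    """Smaller tuple = higher priority: earlier microbatches first, then
    operator type, then earlier layers (a hypothetical composition of the
    factors named in the abstract)."""
    return (cf.microbatch, OP_RANK.get(cf.op_type, len(OP_RANK)), cf.layer)

coflows = [
    Coflow(1, "tensor_allreduce", 12),
    Coflow(0, "pipeline_send", 24),
    Coflow(0, "tensor_allreduce", 3),
]
for cf in sorted(coflows, key=priority_key):
    print(cf)
```

Flows belonging to the coflow at the head of this ordering would then be kept at line rate, which is the intra-coflow part of the policy.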
To enable hardware-agnostic optimizations across different hardware platforms, we present Leo, a generic and efficient communication framework for distributed services. Leo offers 1) a communication path abstraction to describe various distributed services with predictable communication performance across edge accelerators, 2) unified APIs and wrappers that simplify the programming experience through automatic communication configuration, and 3) a built-in multi-path communication optimization strategy to enhance communication efficiency. We believe Leo will serve as a stepping stone for the development of hardware-accelerated distributed services in distributed training systems.
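Below is a minimal Python sketch of what a path descriptor and a bandwidth-proportional multi-path split might look like. The Path class, its fields, and the splitting rule are illustrative assumptions rather than Leo's actual API.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """A hypothetical communication-path descriptor (e.g. PCIe, Ethernet, USB)."""
    name: str
    gbps: float  # estimated achievable bandwidth

def split_payload(total_bytes: int, paths: list[Path]) -> dict[str, int]:
    """Split one transfer across paths in proportion to estimated bandwidth,
    a simple stand-in for a multi-path optimization strategy."""
    total_bw = sum(p.gbps for p in paths)
    shares = {p.name: int(total_bytes * p.gbps / total_bw) for p in paths}
    # Assign any rounding remainder to the fastest path.
    fastest = max(paths, key=lambda p: p.gbps).name
    shares[fastest] += total_bytes - sum(shares.values())
    return shares

print(split_payload(1_000_000, [Path("pcie", 32.0), Path("eth", 10.0), Path("usb", 5.0)]))
```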
Date: Tuesday, 11 February 2025
Time: 1:00pm - 3:00pm
Venue: Room 4472
Lifts 25/26
Chairman: Prof. Christopher Kin Ying LEUNG (CIVL)
Committee Members: Prof. Kai CHEN (Supervisor)
Prof. Song GUO
Prof. Qiong LUO
Prof. Wei ZHANG (ECE)
Dr. Hong XU (CUHK)