PhD Thesis Proposal Defence
Title: "Towards Communication-Efficient Distributed Training Systems"
by
Mr. Xinchen WAN
Abstract:
As the scaling law continues to hold, distributed training has become the
standard methodology for coping with the exponential growth in model size and
training data. Following this trend, distributed training systems have been
developed to handle the complexity and scale of distributed training and to
harness the computational power of multiple devices. However, communication
remains one of the major challenges in these systems: the issues range from the
overheads of gradient aggregation and embedding synchronization during the
training stage to the intricate scheduling across heterogeneous hardware during
other stages.
This dissertation delineates my research efforts in building
communication-efficient distributed training systems through multi-level
optimizations of their communication stack.
At the application level, we first design DGS, a communication-efficient
graph sampling framework for distributed GNN training. Its key idea is to
reduce network communication cost by sampling neighborhood information
according to the locality of neighbor nodes in the cluster, and by sampling
data at both the node and feature levels. As a result, DGS strikes a balance
between communication efficiency and model accuracy, and integrates seamlessly
with distributed GNN training systems.
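
As a rough illustration of this idea (a minimal sketch with assumed helper
names, not DGS's actual implementation), locality-aware two-level sampling
could bias the fanout toward locally stored neighbors and fetch only a subset
of feature columns:

    import random

    # Bias neighbor sampling toward nodes stored on this worker, so fewer
    # remote node features need to cross the network.
    def sample_neighbors(neighbors, is_local, fanout, local_bias=0.8):
        local = [n for n in neighbors if is_local(n)]
        remote = [n for n in neighbors if not is_local(n)]
        n_local = min(len(local), int(fanout * local_bias))
        picked = random.sample(local, n_local)
        picked += random.sample(remote, min(len(remote), fanout - n_local))
        return picked

    # Feature-level sampling: request only a subset of feature columns.
    def sample_features(feature_ids, keep_ratio=0.5):
        k = max(1, int(len(feature_ids) * keep_ratio))
        return random.sample(feature_ids, k)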
We next propose G3, a scalable and efficient system for full-graph GNN
training. G3 incorporates GNN hybrid parallelism to scale out full-graph
training with meticulous peer-to-peer sharing of intermediate data, and
accelerates training by balancing workloads among workers through
locality-aware iterative partitioning and by overlapping communication with
computation through a multi-level pipeline scheduling algorithm. Although
initially tailored for GNN training, we believe the fundamental principle of
peer-to-peer data sharing in hybrid parallelism can be generalized to other
training tasks.
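
To make the communication-computation overlap concrete, the following sketch
(with hypothetical placeholder helpers, not G3's real interfaces) prefetches
the next partition's remote intermediate data in the background while the
current partition is being computed:

    import threading, queue

    # While partition i is computed, the remote data for partition i+1 is
    # fetched concurrently, overlapping communication with computation.
    def pipelined_epoch(partitions, fetch_remote_parts, compute_partition):
        prefetched = queue.Queue(maxsize=1)

        def prefetch(part):
            prefetched.put(fetch_remote_parts(part))

        threading.Thread(target=prefetch, args=(partitions[0],)).start()
        for i, part in enumerate(partitions):
            remote = prefetched.get()                   # wait for this step's data
            if i + 1 < len(partitions):                 # kick off next step's comm
                threading.Thread(target=prefetch, args=(partitions[i + 1],)).start()
            compute_partition(part, remote)             # compute overlaps that comm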
At the communication-library level, we present Leo, a generic and efficient
communication library for distributed training systems. Leo offers 1) a
communication path abstraction that describes the diverse distributed services
employed in these systems and provides predictable communication performance
across edge accelerators; 2) unified APIs and wrappers that simplify the
programming experience with automatic communication configuration; and 3) a
built-in multi-path communication optimization strategy that enhances
communication efficiency. We believe Leo can serve as a stepping stone for the
development of hardware-accelerated distributed services in distributed
training systems.
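
To illustrate what such an abstraction could look like (a hypothetical sketch;
none of these names are Leo's actual API), a path interface can hide the
underlying transport, and a multi-path sender can split a message across paths
in proportion to their bandwidth:

    from dataclasses import dataclass
    from typing import List, Protocol

    # Hypothetical illustration only: a communication path exposes a uniform
    # send interface plus the performance information needed for scheduling.
    class CommPath(Protocol):
        def send(self, dst: int, buf: bytes) -> None: ...
        def bandwidth_gbps(self) -> float: ...

    @dataclass
    class MultiPathSender:
        paths: List[CommPath]

        def send(self, dst: int, buf: bytes) -> None:
            total = sum(p.bandwidth_gbps() for p in self.paths)
            offset = 0
            for i, p in enumerate(self.paths):
                if i == len(self.paths) - 1:
                    chunk = buf[offset:]                # last path takes the rest
                else:
                    size = int(len(buf) * p.bandwidth_gbps() / total)
                    chunk, offset = buf[offset:offset + size], offset + size
                p.send(dst, chunk)                      # each chunk uses its own path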
Date: Monday, 29 April 2024
Time: 4:30pm - 6:00pm
Venue: Room 5506
Lifts 25/26
Committee Members: Prof. Kai Chen (Supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Yangqiu Song
Dr. Wei Wang