Towards Communication-Efficient Distributed Training Systems
PhD Thesis Proposal Defence

Title: "Towards Communication-Efficient Distributed Training Systems"

by

Mr. Xinchen WAN

Abstract:

As the scaling law continues to hold, distributed training has become the standard approach to coping with the exponential growth in model size and training data. Following this trend, distributed training systems have been developed to manage the complexity and scale of distributed training and to harness the computational power of multiple devices. However, communication remains one of the major challenges in these systems. The communication issues range from the overheads of gradient aggregation and embedding synchronization during the training stage to the intricate scheduling across different hardware during other stages. This dissertation presents my research efforts in building communication-efficient distributed training systems through multi-level optimizations of their communication stack.

At the application level, we first design DGS, a communication-efficient graph sampling framework for distributed GNN training. Its key idea is to reduce network communication cost by sampling neighborhood information based on the locality of neighbor nodes in the cluster, and by sampling data at both the node and feature levels. As a result, DGS strikes a balance between communication efficiency and model accuracy, and integrates seamlessly with distributed GNN training systems. We next propose G3, a scalable and efficient system for full-graph GNN training. G3 incorporates GNN hybrid parallelism to scale out full-graph training with meticulous peer-to-peer sharing of intermediate data, and accelerates training by balancing workloads among workers through locality-aware iterative partitioning and by overlapping communication with computation through a multi-level pipeline scheduling algorithm. Although initially tailored for GNN training, we believe the fundamental principle of peer-to-peer data sharing in hybrid parallelism can be generalized to other training tasks.

At the communication-library level, we present Leo, a generic and efficient communication library for distributed training systems. Leo offers 1) a communication path abstraction that describes the diverse distributed services employed in these systems with predictable communication performance across edge accelerators; 2) unified APIs and wrappers that simplify the programming experience with automatic communication configuration; and 3) a built-in multi-path communication optimization strategy to enhance communication efficiency. We believe Leo can serve as a stepping stone for the development of hardware-accelerated distributed services in distributed training systems.

Date: Monday, 29 April 2024

Time: 4:30pm - 6:00pm

Venue: Room 5506, Lifts 25/26

Committee Members:
Prof. Kai Chen (Supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Yangqiu Song
Dr. Weiwa Wang