The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Communication-Efficient Distributed Training Systems"

By

Mr. Xinchen WAN

Abstract:

As the scaling law continues to hold, distributed training has become the standard methodology for managing the exponential growth in model size and training data. Following this trend, distributed training systems have been developed to handle the complexity and scale of distributed training and to harness the computational power of multiple devices. However, communication remains one of the major challenges in these systems. The communication issues manifest in two key domains: the varied communication patterns across training paradigms during the model training stage, and the vendor-specific inter-device transmission across different hardware during the data and model management stages. This dissertation delineates my research efforts in building communication-efficient distributed training systems through paradigm-specific optimizations for the model training stage and hardware-agnostic optimizations for the data and model management stages. The resulting optimized systems are full-stack solutions tailored for hardware-accelerated Graph Neural Network (GNN) and Large Language Model (LLM) training.

For the sampling-based GNN training paradigm, we propose DGS, a communication-efficient graph sampling framework. Its key idea is to reduce network communication cost by sampling neighborhood information based on the locality of neighbor nodes in the cluster, and by sampling data at both the node and feature levels. As a result, DGS strikes a balance between communication efficiency and model accuracy, and integrates seamlessly with distributed GNN training systems.

For the full-graph GNN training paradigm, we design G3, a scalable and efficient full-graph training system. G3 incorporates GNN hybrid parallelism to scale out full-graph training with meticulous peer-to-peer intermediate data sharing, and accelerates the training process by balancing workloads among workers through locality-aware iterative partitioning and by overlapping communication with computation through a multi-level pipeline scheduling algorithm. Although initially tailored for GNN training, we believe the fundamental principle of peer-to-peer data sharing in hybrid parallelism can be generalized to other training tasks.

For the LLM training paradigm, we introduce Hermod, a near-optimal coflow scheduler designed to manage coflows across all parallelisms in LLM training. At its core, Hermod employs a priority-based inter-coflow scheduling policy that prioritizes coflows based on key factors such as microbatch order, type of communication operator, and layer order. It also incorporates an optimal intra-coflow scheduling policy that minimizes coflow completion time (CCT) by maintaining line-rate transmission for flows within the highest-priority coflow. We believe our revisit of coflow scheduling in the LLM training context will inspire further research and lead to more efficient LLM training techniques.
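As an illustration of this inter-coflow policy, the Python sketch below orders pending coflows by a priority key of (microbatch order, operator type, layer order) and dedicates the link to the head coflow's flows. The class names, the operator-type ranking, and the scheduler interface are illustrative assumptions for this sketch, not Hermod's actual implementation.

    import heapq
    import itertools
    from dataclasses import dataclass, field
    from enum import IntEnum


    class OpType(IntEnum):
        # Hypothetical ranking of communication operators; the ordering
        # Hermod actually uses is not specified in this abstract.
        PIPELINE_P2P = 0   # activation/gradient transfers between pipeline stages
        ALLREDUCE = 1      # data-parallel gradient synchronization
        ALLGATHER = 2      # tensor-parallel collectives


    @dataclass
    class Coflow:
        microbatch: int                 # microbatch order
        op_type: OpType                 # type of communication operator
        layer: int                      # layer order
        flows: list = field(default_factory=list)


    class PriorityCoflowScheduler:
        # Inter-coflow policy: order coflows by (microbatch, operator type, layer).
        # Intra-coflow policy (simplified): let the head coflow's flows use the
        # full line rate so its completion time is not inflated by other traffic.

        def __init__(self):
            self._queue = []
            self._seq = itertools.count()   # tie-breaker for equal priorities

        def submit(self, cf: Coflow):
            key = (cf.microbatch, int(cf.op_type), cf.layer)
            heapq.heappush(self._queue, (key, next(self._seq), cf))

        def head(self):
            # The coflow whose flows should currently transmit at line rate.
            return self._queue[0][2] if self._queue else None

        def complete_head(self):
            return heapq.heappop(self._queue)[2] if self._queue else None


    # Example: an earlier microbatch's pipeline transfer outranks a later
    # microbatch's all-reduce, regardless of submission order.
    sched = PriorityCoflowScheduler()
    sched.submit(Coflow(microbatch=2, op_type=OpType.ALLREDUCE, layer=5))
    sched.submit(Coflow(microbatch=1, op_type=OpType.PIPELINE_P2P, layer=5))
    assert sched.head().microbatch == 1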
To enable hardware-agnostic optimizations across different hardware platforms, we present Leo, a generic and efficient communication framework for distributed services. Leo offers 1) a communication path abstraction to describe various distributed services with predictable communication performance across edge accelerators, 2) unified APIs and wrappers to simplify the programming experience with automatic communication configuration, and 3) a built-in multi-path communication optimization strategy to enhance communication efficiency. We believe Leo will serve as a stepping stone for the development of hardware-accelerated distributed services in distributed training systems.

Date: Tuesday, 11 February 2025
Time: 1:00pm - 3:00pm
Venue: Room 4472 (Lifts 25/26)

Chairman: Prof. Christopher Kin Ying LEUNG (CIVL)

Committee Members:
Prof. Kai CHEN (Supervisor)
Prof. Song GUO
Prof. Qiong LUO
Prof. Wei ZHANG (ECE)
Dr. Hong XU (CUHK)