Enhancing Reliability of Large-Scale Distributed Training
PhD Qualifying Examination

Title: "Enhancing Reliability of Large-Scale Distributed Training"

Speaker: Mr. Tianyuan WU

Abstract:

Deep Learning (DL) techniques are widely adopted in both academia and industry, with large deep neural networks (DNNs) gaining significant attention in recent years for their remarkable performance, often viewed as a milestone toward artificial general intelligence. However, training these massive DNNs requires unprecedented computational power, underscoring the importance of large-scale distributed training systems. At this hyperscale, failures and performance degradations are the norm rather than the exception, presenting unique challenges to the reliability of these training systems. This paper surveys recent advancements in reliable distributed training systems, focusing on effectively addressing critical failures (i.e., fail-stop issues) and stragglers (i.e., fail-slow issues). We begin by introducing the fundamentals of distributed training, followed by an overview of basic reliability concepts. We then review prior work on managing fail-stop and fail-slow issues, including their characterization, algorithms, and system designs. We hope this survey sheds light on future research on reliable training systems.

Date: Monday, 2 December 2024
Time: 10:30am - 12:00noon
Venue: Room 5501 (Lifts 25/26)

Committee Members:
Dr. Wei Wang (Supervisor)
Dr. Binhang Yuan (Chairperson)
Prof. Mo Li
Prof. Qian Zhang