PhD Qualifying Examination


Title: "Enhancing Reliability of Large-Scale Distributed Training"

by

Mr. Tianyuan WU


Abstract:

Deep Learning (DL) techniques are widely adopted in both academia and 
industry. Large deep neural networks (DNNs) in particular have gained 
significant attention in recent years for their remarkable performance, 
often viewed as a milestone toward artificial general intelligence. 
However, training these massive DNNs requires unprecedented computational 
power, underscoring the importance of large-scale distributed training 
systems. At this hyperscale, failures and performance degradations are the 
norm rather than the exception, posing unique challenges to the 
reliability of these training systems.

This paper surveys recent advances in reliable distributed training 
systems, focusing on techniques that address critical failures (i.e., 
fail-stop issues) and stragglers (i.e., fail-slow issues). We begin by 
introducing the fundamentals of distributed training, followed by an 
overview of basic reliability concepts. We then review prior work on 
managing fail-stop and fail-slow issues, covering their characterization, 
algorithms, and system designs. We hope this survey sheds light on future 
research on reliable training systems.
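
As context for the fail-stop discussion above: the standard building block 
for recovering from critical failures in distributed training is periodic 
checkpointing. The following is a minimal sketch, assuming a PyTorch-style 
training loop; the file path, function names, and dictionary keys are 
illustrative and not drawn from the talk itself.

    import os
    import torch

    CKPT_PATH = "checkpoint.pt"  # illustrative path

    def save_checkpoint(model, optimizer, step):
        # Write to a temporary file first, then atomically rename, so a
        # crash mid-write never corrupts the last valid checkpoint.
        tmp_path = CKPT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optim": optimizer.state_dict(),
             "step": step},
            tmp_path,
        )
        os.replace(tmp_path, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        # On restart after a fail-stop event, resume from the last
        # checkpoint; fall back to step 0 if none exists yet.
        if not os.path.exists(CKPT_PATH):
            return 0
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["step"]

The atomic rename is the key design choice here: a failure during saving 
leaves the previous checkpoint intact, so lost work is bounded by the 
interval since the last save.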


Date:                   Monday, 2 December 2024

Time:                   10:30 am - 12:00 noon

Venue:                  Room 5501
                        Lifts 25/26

Committee Members:      Dr. Wei Wang (Supervisor)
                        Dr. Binhang Yuan (Chairperson)
                        Prof. Mo Li
                        Prof. Qian Zhang