PhD Qualifying Examination
Title: "Enhancing Reliability of Large-Scale Distributed Training"
by
Mr. Tianyuan WU
Abstract:
Deep Learning (DL) techniques are widely adopted in both academia and
industry, with large deep neural networks (DNNs) gaining significant
attention in recent years for their remarkable performance; these models are
often viewed as a milestone toward artificial general intelligence. However, training these
massive DNNs requires unprecedented computational power, underscoring the
importance of large-scale distributed training systems. At this hyperscale,
failures and performance degradations are the norm rather than the exception,
presenting unique challenges to the reliability of these training systems.
This paper surveys recent advancements in reliable distributed training
systems, focusing on effectively addressing critical failures (i.e.,
fail-stop issues) and stragglers (i.e., fail-slow issues). We begin by
introducing the fundamentals of distributed training, followed by an overview
of basic reliability concepts. We then review prior work on managing
fail-stop and fail-slow issues, including their characterization, algorithms,
and system designs. We hope this survey sheds light on future research on
reliable training systems.
Date: Monday, 2 December 2024
Time: 10:30am - 12:00 noon
Venue: Room 5501
Lifts 25/26
Committee Members: Dr. Wei Wang (Supervisor)
Dr. Binhang Yuan (Chairperson)
Prof. Mo Li
Prof. Qian Zhang