The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Enhancing the Reliability of Distributed Large Model Training"

by

DUAN Qinkai

Abstract:

Training Large Language Models (LLMs) on GPU cloud platforms is hindered by
fail-slow events, in which underperforming GPUs become stragglers that
bottleneck every iteration. Standard Tensor Parallelism (TP) partitions work
evenly under the assumption of uniform GPU performance, so a single slow GPU
stalls its faster peers. We propose an adaptive workload redistribution
mechanism that monitors per-GPU capacity in real time and dynamically
rebalances computation accordingly. Under moderate GPU underperformance,
this approach reduces the iteration-time increase caused by fail-slow events
by up to 40% while maintaining model accuracy, thereby enhancing the
reliability and efficiency of distributed training.
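To make the idea concrete, below is a minimal Python sketch of one way
throughput-proportional shard sizing could work. It is an illustration, not
the thesis implementation: the function name rebalance_shards, the
largest-remainder rounding, and the example numbers are all assumptions.

# A minimal sketch of throughput-proportional shard sizing for tensor
# parallelism. rebalance_shards is a hypothetical name, not the thesis code.

def rebalance_shards(total_cols: int, throughputs: list[float]) -> list[int]:
    """Split total_cols matrix columns across GPUs in proportion to their
    measured throughput, so a fail-slow GPU receives a smaller shard."""
    total = sum(throughputs)
    # Ideal (fractional) column share per GPU.
    ideal = [total_cols * t / total for t in throughputs]
    shards = [int(x) for x in ideal]
    # Hand the leftover columns to the GPUs with the largest remainders,
    # so the shard sizes still sum exactly to total_cols.
    leftover = total_cols - sum(shards)
    order = sorted(range(len(ideal)),
                   key=lambda i: ideal[i] - shards[i], reverse=True)
    for i in order[:leftover]:
        shards[i] += 1
    return shards

if __name__ == "__main__":
    # 4 GPUs; GPU 2 runs at ~50% capacity (a moderate fail-slow event).
    throughputs = [1.0, 1.0, 0.5, 1.0]
    print(rebalance_shards(4096, throughputs))  # -> [1171, 1170, 585, 1170]

In this sketch the slow GPU's shard shrinks in proportion to its measured
throughput, so all ranks finish their partial matrix products at roughly the
same time before the TP synchronization point.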


Date            : 2 May 2025 (Friday)

Time            : 15:00 - 15:40

Venue           : Room 2128B (near lift 19), HKUST

Advisor         : Dr. WANG Wei

2nd Reader      : Prof. LI Bo