Enhancing the Reliability of Distributed Large Model Training
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
Final Year Thesis Oral Defense

Title: "Enhancing the Reliability of Distributed Large Model Training"

by

DUAN Qinkai

Abstract:

Training Large Language Models (LLMs) on GPU cloud platforms is hindered by fail-slow events, in which underperforming GPUs become bottlenecks and reduce training efficiency. Standard Tensor Parallelism (TP) assumes uniform GPU performance, so it handles such events poorly. We propose an adaptive workload redistribution mechanism that dynamically adjusts each GPU's share of the computation based on real-time monitoring of GPU capacity. This approach reduces the iteration-time overhead caused by slow GPUs by up to 40% in scenarios with moderate GPU underperformance, while maintaining model accuracy, thereby enhancing the reliability and efficiency of distributed training.

Date    : 2 May 2025 (Friday)
Time    : 15:00 - 15:40
Venue   : Room 2128B (near lift 19), HKUST
Advisor : Dr. WANG Wei
2nd Reader : Prof. LI Bo
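The announcement does not describe the mechanism's implementation. As a rough illustration of the core idea only, the sketch below shows one way capacity-proportional sharding could look: instead of the uniform split that standard TP assumes, each GPU receives a share of a layer's columns proportional to its measured throughput. All names here (proportional_shards, the example throughput numbers) are hypothetical and not taken from the thesis.

```python
# Hypothetical sketch of capacity-proportional TP sharding.
# Assumption: per-GPU throughputs come from some runtime monitor;
# the thesis's actual redistribution mechanism is not public.

def proportional_shards(total_cols: int, throughputs: list[float]) -> list[int]:
    """Split `total_cols` of a layer's columns across GPUs in proportion
    to each GPU's measured throughput, so a fail-slow GPU gets less work."""
    total = sum(throughputs)
    # Provisional proportional allocation, rounded down.
    shards = [int(total_cols * t / total) for t in throughputs]
    # Hand any leftover columns to the fastest GPUs first.
    leftover = total_cols - sum(shards)
    fastest_first = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in fastest_first[:leftover]:
        shards[i] += 1
    return shards

if __name__ == "__main__":
    # Four GPUs; GPU 2 is fail-slow at roughly half speed.
    measured = [1.0, 1.0, 0.5, 1.0]
    print(proportional_shards(4096, measured))  # [1171, 1170, 585, 1170]
```

In a real system the throughputs would be refreshed by online monitoring and the shard sizes recomputed when a fail-slow event is detected, which additionally requires migrating the affected weight columns between GPUs; those details are beyond this illustration.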