Fail-slow Problems in LLM Training Cluster
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Fail-slow Problems in LLM Training Cluster"

by

SUN Zhuotao

Abstract:

Fail-slow problems, marked by performance degradation without complete failure, present significant reliability challenges in hyper-scale large language model (LLM) training clusters. Stragglers (degraded components), prevalent in clusters with thousands of GPUs, propagate delays across synchronized workflows, and reduce cluster throughput by a measurable percentage. This study investigates fail-slow issues through public publications, analyzing their definition, impacts, root causes, and existing solutions. Then, we propose the design of Super-Monitor, a novel system on the foundation of FALCON for real-time straggler detection and diagnosis, incorporating proactive health checks to address fail-slow degradations. This work highlights the urgency of mitigating fail-slow problems and provides a design framework to tackle fail-slow problems and enhance cluster reliability.

Date: 2 May 2025 (Friday)
Time: 16:00 - 16:40
Venue: Room 2128B (near lift 19), HKUST
Advisor: Dr. WANG Wei
2nd Reader: Prof. LI Bo