Fail-slow Problems in LLM Training Cluster

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Fail-slow Problems in LLM Training Cluster"

by

SUN Zhuotao

Abstract:

Fail-slow problems, marked by performance degradation without complete 
failure, present significant reliability challenges in hyper-scale large 
language model (LLM) training clusters. Stragglers (degraded components) are
prevalent in clusters with thousands of GPUs; they propagate delays across
synchronized workflows and measurably reduce cluster throughput. This study
surveys fail-slow issues in the published literature, analyzing their
definition, impacts, root causes, and existing solutions. We then propose the
design of Super-Monitor, a novel system built on FALCON for real-time
straggler detection and diagnosis that incorporates proactive health checks
to address fail-slow degradation. This work highlights the urgency of
mitigating fail-slow problems and provides a design framework for improving
cluster reliability.


Date            : 2 May 2025 (Friday)

Time            : 16:00 - 16:40

Venue           : Room 2128B (near lift 19), HKUST

Advisor         : Dr. WANG Wei

2nd Reader      : Prof. LI Bo