PhD Qualifying Examination
Title: "Survey of GPU Cluster Scheduling in the Era of Large Language Models"
by
Mr. Xinyu YANG
Abstract:
The burgeoning scale of deep learning (DL) models, particularly Large
Language Models (LLMs), necessitates efficient training on distributed GPU
clusters. However, traditional cluster schedulers are ill-equipped to handle
the dynamic and specialized demands of modern DL workloads, leading to
significant inefficiencies. This paper provides a comprehensive survey of GPU
cluster scheduling, tracing its evolution from foundational concepts to
advanced solutions. We first delineate the unique characteristics of DL
workloads, such as iterative computation, dynamic resource demands, and
sensitivity to data locality.
We then review pioneering scheduling approaches like Optimus, Gandiva, and
Tiresias, which introduced concepts such as dynamic resource management,
fine-grained time-slicing, and preemption, laying the groundwork for more
sophisticated, DL-aware resource management.
We next highlight the critical paradigm shift introduced by LLMs, emphasizing
the increased complexity due to their massive scale, diverse parallelism
strategies (e.g., pipeline, tensor, and expert parallelism), and heightened
susceptibility to
inter-GPU communication bottlenecks. To mitigate these issues, the paper
discusses the need for topology-aware and responsive scheduling mechanisms,
which have been shown to significantly improve performance by optimizing
network utilization and dynamically reallocating resources.
This survey synthesizes key motivations, technical advancements, and core
contributions of prominent research, ultimately identifying cross-cutting
challenges and outlining promising future directions for building highly
efficient, scalable, and performant GPU clusters for next-generation AI.
Date: Friday, 25 July 2025
Time: 11:00am - 1:00pm
Venue: Room 3494
Lifts 25/26
Committee Members: Prof. Kai Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Binhang Yuan