Survey of GPU Cluster Scheduling in the Era of Large Language Models

PhD Qualifying Examination


Title: "Survey of GPU Cluster Scheduling in the Era of Large Language Models"

by

Mr. Xinyu YANG


Abstract:

The burgeoning scale of deep learning (DL) models, particularly Large 
Language Models (LLMs), necessitates efficient training on distributed GPU 
clusters. However, traditional cluster schedulers are ill-equipped to handle 
the dynamic and specialized demands of modern DL workloads, leading to 
significant inefficiencies. This survey reviews GPU cluster scheduling,
tracing its evolution from foundational concepts to advanced solutions. We
first delineate the characteristics that distinguish DL workloads, such as
iterative computation, dynamic resource demands, and sensitivity to data
locality.

We then review pioneering scheduling approaches like Optimus, Gandiva, and 
Tiresias, which introduced concepts such as dynamic resource management, 
fine-grained time-slicing, and preemption, laying the groundwork for more 
sophisticated, DL-aware resource management.

Next, we highlight the paradigm shift introduced by LLMs, emphasizing the
increased complexity due to their massive scale, diverse parallelism
strategies (e.g., pipeline, tensor, and expert parallelism), and heightened
susceptibility to inter-GPU communication bottlenecks. To mitigate these
issues, the survey discusses the need for topology-aware and responsive
scheduling mechanisms, which prior research has shown to improve performance
significantly by optimizing network utilization and dynamically reallocating
resources.

This survey synthesizes the key motivations, technical advancements, and core
contributions of prominent research, identifies cross-cutting challenges, and
outlines promising future directions for building efficient and scalable GPU
clusters for next-generation AI.


Date:                   Friday, 25 July 2025

Time:                   11:00am - 1:00pm

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Binhang Yuan