Survey of GPU Cluster Scheduling in the Era of Large Language Models
PhD Qualifying Examination

Title: "Survey of GPU Cluster Scheduling in the Era of Large Language Models"

by

Mr. Xinyu YANG

Abstract:

The burgeoning scale of deep learning (DL) models, particularly Large Language Models (LLMs), necessitates efficient training on distributed GPU clusters. However, traditional cluster schedulers are ill-equipped to handle the dynamic and specialized demands of modern DL workloads, leading to significant inefficiencies. This paper provides a comprehensive survey of GPU cluster scheduling, tracing its evolution from foundational concepts to advanced solutions. We first delineate the unique characteristics of DL workloads, such as iterative computation, dynamic resource demands, and sensitivity to data locality. We then review pioneering scheduling approaches like Optimus, Gandiva, and Tiresias, which introduced concepts such as dynamic resource management, fine-grained time-slicing, and preemption, laying the groundwork for more sophisticated, DL-aware resource management. We next highlight the critical paradigm shift introduced by LLMs, emphasizing the increased complexity arising from their massive scale, diverse parallelism strategies (e.g., pipeline, tensor, and expert parallelism), and heightened susceptibility to inter-GPU communication bottlenecks. To mitigate these issues, the paper discusses the need for topology-aware and responsive scheduling mechanisms, which research has shown can significantly improve performance by optimizing network utilization and dynamically reallocating resources. This survey synthesizes the key motivations, technical advancements, and core contributions of prominent research, ultimately identifying cross-cutting challenges and outlining promising future directions for building highly efficient, scalable, and performant GPU clusters for next-generation AI.

Date: Friday, 25 July 2025
Time: 11:00am - 1:00pm
Venue: Room 3494 (Lifts 25/26)

Committee Members:
Prof. Kai Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Binhang Yuan
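
The fine-grained time-slicing with preemption mentioned in the abstract (a concept popularized by schedulers such as Gandiva) can be sketched roughly as follows. This is a minimal illustrative Python sketch under simplifying assumptions (jobs as abstract work units, uniform GPUs, a fixed quantum); all names are hypothetical and do not reflect any system's actual implementation:

```python
from collections import deque

def time_slice(jobs, num_gpus, quantum, total_time):
    """Round-robin DL jobs over a pool of GPUs: each job runs for one
    time quantum, is then suspended (preempted) and re-queued until its
    remaining work reaches zero. Returns a (time, gpu, job) trace."""
    queue = deque(jobs)  # (name, remaining_work) pairs
    schedule = []        # trace of which job ran on which GPU and when
    t = 0
    while queue and t < total_time:
        still_running = []
        # Fill every GPU that has a waiting job for this quantum.
        for gpu in range(min(num_gpus, len(queue))):
            name, remaining = queue.popleft()
            schedule.append((t, gpu, name))
            remaining -= quantum
            if remaining > 0:
                # Preempt: suspend the job and put it back in the queue,
                # giving other jobs a turn on the GPU.
                still_running.append((name, remaining))
        queue.extend(still_running)
        t += quantum
    return schedule

# Usage: three jobs of 2 work units each, sharing 2 GPUs with quantum 1.
trace = time_slice([("a", 2), ("b", 2), ("c", 2)],
                   num_gpus=2, quantum=1, total_time=10)
```

The key point this sketch illustrates is that preemption lets short or waiting jobs make progress without dedicating a GPU to one job until completion, which is the inefficiency the surveyed schedulers set out to remove.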