Resource-Efficient Model Serving in Multi-Tenant GPU Clusters
PhD Thesis Proposal Defence

Title: "Resource-Efficient Model Serving in Multi-Tenant GPU Clusters"

by

Mr. Lingyun YANG

Abstract:

The rapid development of artificial intelligence over the past decade has driven the widespread adoption of model serving workloads in data centers. To support these workloads efficiently, large technology companies have built multi-tenant GPU clusters that facilitate resource sharing and minimize waste. However, achieving resource efficiency in such clusters presents significant challenges, particularly due to GPU resource fragmentation. This dissertation tackles the problem of GPU resource fragmentation from multiple perspectives, addressing key issues such as resource underutilization due to overestimation, stranded resources due to shortages in other resource dimensions, and inefficiencies due to resource mismatches.

To address internal fragmentation caused by overprovisioning, this dissertation introduces Morphling, a fast and near-optimal auto-configuration framework for tuning hardware and runtime parameters (e.g., GPU type, GPU memory, batch size). Morphling employs an offline meta-model to capture general performance trends across configurations and adapts this model to new inference services with minimal sampling. Morphling effectively reduces overestimation by enabling precise, service-specific resource allocation.

To mitigate external fragmentation caused by stranded resources, the dissertation proposes a novel statistical measure for quantifying GPU fragmentation across resource dimensions. Building on this measure, a scheduling policy called Fragmentation Gradient Descent (FGD) is introduced, which minimizes the growth of resource fragmentation. Compared to traditional packing-based schedulers, FGD achieves higher GPU allocation efficiency.

To resolve external fragmentation caused by resource mismatches, the dissertation presents Prism, a serving system tailored for deep learning recommendation models (DLRMs). Prism leverages resource disaggregation by partitioning DLRMs into CPU- and GPU-intensive subgraphs, which are then scheduled on CPU and GPU servers for disaggregated serving. Prism reduces resource fragmentation in highly allocated GPU clusters and enables efficient capacity loaning from training clusters during peak demand (e.g., seasonal e-commerce promotions).

Date: Friday, 20 June 2025
Time: 4:00pm - 6:00pm
Venue: Room 3494 (Lifts 25/26)

Committee Members:
Dr. Wei Wang (Supervisor)
Prof. Bo Li (Chairperson)
Dr. Binhang Yuan
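
The following is a minimal, illustrative sketch of the scheduling idea behind Fragmentation Gradient Descent as summarized in the abstract: place each task on the node whose cluster fragmentation grows the least. It is not code from the dissertation; the fragmentation measure here (free GPU capacity too small to host a task drawn from a sample workload) and all names (Node, Task, node_fragmentation, place) are simplifying assumptions for illustration only.

    # Illustrative sketch only; not the dissertation's actual measure or policy.
    from dataclasses import dataclass

    @dataclass
    class Node:
        free_gpus: list[float]   # remaining fractional capacity of each GPU

    @dataclass
    class Task:
        gpu_demand: float        # fraction of one GPU requested (e.g., 0.5)
        weight: float            # share of this task type in the target workload

    def node_fragmentation(node: Node, workload: list[Task]) -> float:
        """Toy measure: expected free GPU capacity unable to serve a task
        drawn from the target workload distribution."""
        frag = 0.0
        for task in workload:
            unusable = sum(g for g in node.free_gpus if 0 < g < task.gpu_demand)
            frag += task.weight * unusable
        return frag

    def place(task: Task, nodes: list[Node], workload: list[Task]) -> Node | None:
        """Greedy placement: pick the node (and GPU) whose fragmentation
        increases the least, i.e., descend along the fragmentation gradient."""
        best, best_gpu, best_delta = None, None, float("inf")
        for node in nodes:
            before = node_fragmentation(node, workload)
            for i, g in enumerate(node.free_gpus):
                if g < task.gpu_demand:
                    continue
                node.free_gpus[i] -= task.gpu_demand       # trial placement
                delta = node_fragmentation(node, workload) - before
                node.free_gpus[i] += task.gpu_demand       # undo trial
                if delta < best_delta:
                    best, best_gpu, best_delta = node, i, delta
        if best is not None:
            best.free_gpus[best_gpu] -= task.gpu_demand    # commit placement
        return best

In contrast to a packing-based heuristic that only maximizes per-node utilization, this rule explicitly penalizes placements that leave behind GPU slivers too small for the expected workload, which is the intuition the abstract attributes to FGD.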