Resource-Efficient Model Serving in Multi-Tenant GPU Clusters

PhD Thesis Proposal Defence


Title: "Resource-Efficient Model Serving in Multi-Tenant GPU Clusters"

by

Mr. Lingyun YANG


Abstract:

The rapid development of artificial intelligence over the past decade has 
driven the widespread adoption of model serving workloads in data centers. 
To support these workloads efficiently, large technology companies have 
built multi-tenant GPU clusters that facilitate resource sharing and 
minimize waste. However, achieving resource efficiency in such clusters 
presents significant challenges, particularly due to GPU resource 
fragmentation. This dissertation tackles the problem of GPU resource 
fragmentation from multiple perspectives, addressing key issues such as 
resource underutilization due to overestimation, stranded resources due to 
shortages in other dimensions, and inefficiencies due to resource 
mismatches.

To address internal fragmentation caused by overprovisioning, this 
dissertation introduces Morphling, a fast and near-optimal 
auto-configuration framework for tuning hardware and runtime parameters 
(e.g., GPU type, GPU memory, batch size). Morphling employs an offline 
meta-model to capture general performance trends across configurations and 
adapts this model to new inference services with minimal sampling. Morphling 
effectively reduces overestimation by enabling precise, service-specific 
resource allocation.
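The adaptation idea can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the real Morphling meta-model is learned offline across many services, whereas here a fixed saturating throughput curve stands in for the shared trend, and a single least-squares scale factor stands in for few-shot adaptation.

```python
# Illustrative sketch of meta-model adaptation (not Morphling's actual model).

def meta_model(batch_size):
    """Assumed shared trend: throughput saturates as batch size grows."""
    return batch_size / (batch_size + 8.0)

def adapt(samples):
    """Fit one service-specific scale from a few (batch_size, rps) samples."""
    xs = [meta_model(b) for b, _ in samples]
    ys = [r for _, r in samples]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(scale, batch_size):
    return scale * meta_model(batch_size)

# Two measurements anchor the shared trend to a new inference service.
scale = adapt([(4, 600.0), (16, 1200.0)])

# Pick the smallest configuration whose predicted throughput meets demand,
# avoiding the overestimation that blanket provisioning would incur.
demand = 1400.0
choice = min(b for b in [1, 2, 4, 8, 16, 32, 64]
             if predict(scale, b) >= demand)
```

With only two profiled samples, the adapted curve predicts throughput across the whole configuration space, so the scheduler can allocate just enough capacity per service rather than a conservative worst-case amount.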

To mitigate external fragmentation caused by stranded resources, the 
dissertation proposes a novel statistical measure for quantifying GPU 
fragmentation across resource dimensions. Building on this measure, a 
scheduling policy called Fragmentation Gradient Descent (FGD) is introduced, 
which minimizes the growth of resource fragmentation. Compared to 
traditional packing-based schedulers, FGD achieves higher GPU allocation 
efficiency.
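The policy can be sketched in a few lines. The fragmentation measure below is a deliberately simplified stand-in (GPU capacity stranded on nodes that can no longer fit a typical task shape), not the dissertation's statistical measure; the two-dimensional node model and task shape are likewise assumptions for illustration.

```python
# Illustrative sketch of Fragmentation Gradient Descent (FGD) scheduling.
from dataclasses import dataclass

@dataclass
class Node:
    cpu: float  # free CPU cores on the server
    gpu: float  # free GPU share on the server

def fragmentation(node, task_cpu, task_gpu):
    """Toy measure: GPU left stranded because another dimension ran out."""
    if node.cpu < task_cpu or node.gpu < task_gpu:
        return node.gpu
    return 0.0

def fgd_schedule(nodes, task_cpu, task_gpu):
    """Place the task on the node where fragmentation grows the least."""
    best, best_delta = None, None
    for node in nodes:
        if node.cpu < task_cpu or node.gpu < task_gpu:
            continue  # infeasible placement
        before = fragmentation(node, task_cpu, task_gpu)
        after = fragmentation(
            Node(node.cpu - task_cpu, node.gpu - task_gpu),
            task_cpu, task_gpu)
        delta = after - before
        if best_delta is None or delta < best_delta:
            best, best_delta = node, delta
    if best is not None:
        best.cpu -= task_cpu
        best.gpu -= task_gpu
    return best

nodes = [Node(cpu=8, gpu=1.0), Node(cpu=2, gpu=1.0)]
chosen = fgd_schedule(nodes, task_cpu=2, task_gpu=0.5)
```

In this toy example, a naive best-fit packer would favour the tighter second node, leaving half a GPU stranded behind an exhausted CPU; FGD instead picks the first node because that placement adds no fragmentation.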

To resolve external fragmentation caused by resource mismatches, the 
dissertation presents Prism, a serving system tailored for deep learning 
recommendation models (DLRMs). Prism leverages resource disaggregation by 
partitioning DLRMs into CPU- and GPU-intensive subgraphs, which are then 
scheduled on CPU and GPU servers for disaggregated serving. Prism reduces 
resource fragmentation in high-allocation GPU clusters and enables efficient 
capacity loaning from training clusters during peak demand (e.g., seasonal 
e-commerce promotions).
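The disaggregation step can be illustrated with a minimal sketch. The operator names and the CPU/GPU classification below are assumptions chosen for illustration (embedding lookups are typically memory- and CPU-bound, dense layers compute-bound); Prism's actual partitioning algorithm is more involved.

```python
# Illustrative sketch of splitting a DLRM op graph into CPU- and
# GPU-intensive subgraphs for disaggregated serving.
CPU_OPS = {"embedding_lookup", "feature_transform"}  # memory/CPU-bound
GPU_OPS = {"mlp", "interaction"}                     # compute-bound

def partition(graph):
    """graph: list of (op_name, op_type) pairs in topological order."""
    cpu_part = [op for op, t in graph if t in CPU_OPS]
    gpu_part = [op for op, t in graph if t in GPU_OPS]
    return cpu_part, gpu_part

dlrm = [
    ("sparse_emb", "embedding_lookup"),
    ("dense_proj", "mlp"),
    ("pairwise", "interaction"),
    ("top_mlp", "mlp"),
]
cpu_part, gpu_part = partition(dlrm)
# cpu_part is served on CPU servers and gpu_part on GPU servers, with the
# intermediate embedding results shipped between them over the network.
```

Because the CPU-heavy subgraph no longer has to be co-located with a GPU, spare CPU capacity (including servers loaned from training clusters) can absorb it, freeing GPUs for the compute-bound portion.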


Date:                   Friday, 20 June 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Dr. Wei Wang (Supervisor)
                        Prof. Bo Li (Chairperson)
                        Dr. Binhang Yuan