PhD Thesis Proposal Defence
Title: "Resource-Efficient Model Serving in Multi-Tenant GPU Clusters"
by
Mr. Lingyun YANG
Abstract:
The rapid development of artificial intelligence over the past decade has
driven the widespread adoption of model serving workloads in data centers.
To support these workloads efficiently, large technology companies have
built multi-tenant GPU clusters that facilitate resource sharing and
minimize waste. However, achieving resource efficiency in such clusters
presents significant challenges, particularly due to GPU resource
fragmentation. This dissertation tackles GPU resource fragmentation from
multiple perspectives, addressing resource underutilization caused by
overestimated demands, resources stranded by shortages in other
dimensions, and inefficiencies arising from resource mismatches.
To address internal fragmentation caused by overprovisioning, this
dissertation introduces Morphling, a fast and near-optimal
auto-configuration framework for tuning hardware and runtime parameters
(e.g., GPU type, GPU memory, batch size). Morphling employs an offline
meta-model to capture general performance trends across configurations and
adapts this model to new inference services with minimal sampling. Morphling
effectively reduces overestimation by enabling precise, service-specific
resource allocation.
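
To make the tuning idea concrete, the minimal Python sketch below shows one
way few-shot configuration search could work: an offline prior ranks
configurations, and a handful of online profiling samples adapt that prior
to a new service. It is purely illustrative and not Morphling's
implementation; the search space, the saturation-based performance model,
and all numbers are hypothetical.

# Illustrative sketch only; not Morphling's actual algorithm or code.
from dataclasses import dataclass
from itertools import product
import random

@dataclass(frozen=True)
class Config:
    gpu_memory_gb: int   # per-replica GPU memory
    batch_size: int      # serving batch size

# Hypothetical search space of hardware/runtime knobs.
SEARCH_SPACE = [Config(m, b) for m, b in product([4, 8, 16], [1, 8, 32, 64])]

def meta_model(cfg, saturation_batch=64):
    """Offline prior: throughput grows with batch size until it saturates,
    and is throttled when GPU memory is scarce."""
    return min(cfg.batch_size, saturation_batch) * min(1.0, cfg.gpu_memory_gb / 8.0)

# The new service's true saturation point, unknown to the tuner; profiling
# a configuration online is what reveals it.
TRUE_SATURATION = 16

def profile(cfg):
    """Stand-in for benchmarking the new service on one configuration."""
    return meta_model(cfg, saturation_batch=TRUE_SATURATION)

def tune(budget=3):
    """Adapt the offline prior using only `budget` online samples."""
    sampled = random.sample(SEARCH_SPACE, budget)
    # Pick the saturation parameter that best explains the few samples.
    def error(s):
        return sum((profile(c) - meta_model(c, s)) ** 2 for c in sampled)
    best_s = min([8, 16, 32, 64], key=error)
    # Rank the full space with the adapted model and return the top config.
    return max(SEARCH_SPACE, key=lambda c: meta_model(c, best_s))

if __name__ == "__main__":
    print("chosen configuration:", tune())
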
To mitigate external fragmentation caused by stranded resources, the
dissertation proposes a novel statistical measure for quantifying GPU
fragmentation across resource dimensions. Building on this measure, a
scheduling policy called Fragmentation Gradient Descent (FGD) is introduced,
which minimizes the growth of resource fragmentation. Compared to
traditional packing-based schedulers, FGD achieves higher GPU allocation
efficiency.
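
The sketch below illustrates the scheduling idea under simplified
assumptions: a toy fragmentation measure scores each node's leftover
capacity against a hypothetical mix of typical task shapes, and a task is
placed on the feasible node whose fragmentation grows the least. It is not
the dissertation's FGD implementation or its exact statistical measure; all
node, task, and shape values are made up.

# Illustrative sketch only; not the actual FGD policy or fragmentation measure.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_gpus: float   # fractional GPUs still free
    free_cpus: float   # CPU cores still free

@dataclass(frozen=True)
class Task:
    gpus: float
    cpus: float

# A hypothetical mix of typical task shapes used to score leftover capacity.
TYPICAL_TASKS = [Task(1.0, 8.0), Task(0.5, 4.0), Task(0.25, 2.0)]

def fragmentation(node: Node) -> float:
    """Free GPU capacity on the node that no typical task shape can use."""
    usable = max((t.gpus for t in TYPICAL_TASKS
                  if t.gpus <= node.free_gpus and t.cpus <= node.free_cpus),
                 default=0.0)
    return node.free_gpus - usable

def fits(node: Node, task: Task) -> bool:
    return node.free_gpus >= task.gpus and node.free_cpus >= task.cpus

def schedule(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Place the task on the feasible node whose fragmentation grows least."""
    def delta(node: Node) -> float:
        after = Node(node.name, node.free_gpus - task.gpus,
                     node.free_cpus - task.cpus)
        return fragmentation(after) - fragmentation(node)
    feasible = [n for n in nodes if fits(n, task)]
    if not feasible:
        return None
    best = min(feasible, key=delta)
    best.free_gpus -= task.gpus
    best.free_cpus -= task.cpus
    return best

if __name__ == "__main__":
    cluster = [Node("n1", 1.0, 6.0), Node("n2", 2.0, 32.0)]
    placed = schedule(Task(gpus=0.5, cpus=4.0), cluster)
    print("placed on:", placed.name if placed else "no feasible node")
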
To resolve external fragmentation caused by resource mismatches, the
dissertation presents Prism, a serving system tailored for deep learning
recommendation models (DLRMs). Prism leverages resource disaggregation by
partitioning DLRMs into CPU- and GPU-intensive subgraphs, which are then
scheduled on CPU and GPU servers for disaggregated serving. Prism reduces
resource fragmentation in GPU clusters with high allocation rates and enables efficient
capacity loaning from training clusters during peak demand (e.g., seasonal
e-commerce promotions).
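
The following Python sketch illustrates the disaggregation idea at a high
level: operators tagged as embedding lookups go to a CPU-side subgraph and
dense layers to a GPU-side subgraph, each destined for its own server pool.
It is not Prism's actual partitioning logic; the operator names, tags, and
simple two-way split are hypothetical simplifications.

# Illustrative sketch only; not Prism's partitioning or serving code.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Op:
    name: str
    kind: str  # "embedding" (CPU/memory-bound) or "dense" (GPU compute-bound)

def partition(model: List[Op]) -> Tuple[List[Op], List[Op]]:
    """Assign embedding ops to the CPU subgraph, dense ops to the GPU subgraph."""
    cpu_graph = [op for op in model if op.kind == "embedding"]
    gpu_graph = [op for op in model if op.kind == "dense"]
    return cpu_graph, gpu_graph

def deploy(model: List[Op]) -> None:
    cpu_graph, gpu_graph = partition(model)
    # In a real system the two subgraphs would run on separate CPU and GPU
    # server pools and exchange intermediate embeddings over the network.
    print("CPU servers run:", [op.name for op in cpu_graph])
    print("GPU servers run:", [op.name for op in gpu_graph])

if __name__ == "__main__":
    dlrm = [
        Op("user_embedding", "embedding"),
        Op("item_embedding", "embedding"),
        Op("feature_interaction", "dense"),
        Op("top_mlp", "dense"),
    ]
    deploy(dlrm)
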
Date: Friday, 20 June 2025
Time: 4:00pm - 6:00pm
Venue: Room 3494
Lifts 25/26
Committee Members: Dr. Wei Wang (Supervisor)
Prof. Bo Li (Chairperson)
Dr. Binhang Yuan