Cost-Efficient Execution of Large-Scale Generative AI on Rack-Scale Infrastructure
PhD Qualifying Examination
Title: "Cost-Efficient Execution of Large-Scale Generative AI on Rack-Scale
Infrastructure"
by
Mr. Boran SUN
Abstract:
LLM inference is now power- and capital-intensive infrastructure: cost
efficiency is determined not by accelerator FLOPS alone but by how memory,
bandwidth, power, and facility amortization convert into useful tokens under
latency SLOs. The hardware response is tighter scaleup integration:
rack-scale high-bandwidth domains now bind dozens to hundreds of accelerators
into a single fabric. Inside such a fabric, overheads that were first-order
in PCIe- or RDMA-based clusters become less dominant, while new bottlenecks
emerge at fabric-attached memory tiers and rack-level power limits.
This survey re-examines recent LLM-serving systems through this hardware lens
across four layers: GPU kernels and communication primitives, KV-cache and
memory-hierarchy management, disaggregated and rack-scale serving systems,
and infrastructure-level cost modeling. Several widely used assumptions turn
out to be regime-dependent. Compute-memory imbalance is not a fixed law of
serving systems: some bottlenecks exposed by high-performance kernels are
partially absorbed by later hardware roadmaps, although software portability
remains fragile. Hierarchical KV caching is valuable for long-context,
high-reuse, multi-turn workloads, but can be over-engineered for
short-context API traffic with limited cache lifetimes. Prefill-decode
disaggregation has become a common production design, but at moderate batch
sizes a collocated chunked-prefill baseline can match its throughput at lower
energy, so the "always disaggregate" default leaves energy on the table for
many workload mixes. Scaleup procurement is becoming a joint
hardware-software decision rather than a simple accelerator comparison,
because interconnect semantics, memory tiers, power density, and software
maturity together determine realized cost per token. We close by arguing that
these findings share a common shape: the rack has become the unit of
capacity, power, and shared state, while scheduling, memory ownership, and
the communication control plane are still defined at the GPU instance.
Realigning these layers is where hardware-software co-design has the most
leverage.
Date: Tuesday, 2 June 2026
Time: 3:00pm - 4:00pm
Venue: Room 2129A
Lift 19
Committee Members: Prof. Bo Li (Supervisor)
Dr. Wei Wang (Chairperson)
Dr. Xiaomin Ouyang