Cost-Efficient Execution of Large-Scale Generative AI on Rack-Scale Infrastructure

PhD Qualifying Examination


Title: "Cost-Efficient Execution of Large-Scale Generative AI on Rack-Scale
Infrastructure"

by

Mr. Boran SUN


Abstract:

LLM inference is now power- and capital-intensive infrastructure: cost
efficiency is determined not by accelerator FLOPS alone but by how memory, 
bandwidth, power, and facility amortization convert into useful tokens under 
latency SLOs. The hardware response is tighter scale-up integration:
rack-scale high-bandwidth domains now bind dozens to hundreds of accelerators 
into a single fabric. Inside such a fabric, overheads that were first-order 
in PCIe- or RDMA-based clusters become less dominant, while new bottlenecks 
emerge at fabric-attached memory tiers and rack-level power limits.

This survey re-examines recent LLM-serving systems through this hardware lens 
across four layers: GPU kernels and communication primitives, KV-cache and 
memory-hierarchy management, disaggregated and rack-scale serving systems,
and infrastructure-level cost modeling. Several widely used assumptions turn 
out to be regime-dependent. Compute-memory imbalance is not a fixed law of 
serving systems: some bottlenecks exposed by high-performance kernels are 
partially absorbed by later hardware roadmaps, although software portability 
remains fragile. Hierarchical KV caching is valuable for long-context, 
high-reuse, multi-turn workloads, but can be over-engineered for
short-context API traffic with limited cache lifetimes. Prefill-decode 
disaggregation has become a common production design, but at moderate batch 
sizes a collocated chunked-prefill baseline can match its throughput at lower 
energy, so the "always disaggregate" default leaves energy on the table for 
many workload mixes. Scale-up procurement is becoming a joint
hardware-software decision rather than a simple accelerator comparison, 
because interconnect semantics, memory tiers, power density, and software 
maturity together determine realized cost per token. We close by arguing that 
these findings share a common shape: the rack has become the unit of 
capacity, power, and shared state, while scheduling, memory ownership, and 
the communication control plane are still defined at GPU-instance granularity.
Realigning these layers is where hardware-software co-design has the most 
leverage.


Date:                   Tuesday, 2 June 2026

Time:                   3:00pm - 4:00pm

Venue:                  Room 2129A
                        Lift 19

Committee Members:      Prof. Bo Li (Supervisor)
                        Dr. Wei Wang (Chairperson)
                        Dr. Xiaomin Ouyang