Serving Agentic Workloads: From Token-Level Inference to Agent-Level Execution
PhD Qualifying Examination
Title: "Serving Agentic Workloads: From Token-Level Inference to Agent-Level
Execution"
by
Mr. Chaokun CHANG
Abstract:
Large language model applications are shifting from single-turn chatbot
interactions to agentic workloads: dynamic executions that repeatedly call
models, invoke tools, retrieve data, run code, manipulate state, and adapt to
intermediate results. This shift challenges the abstractions used by
conventional LLM serving systems. Token-level and request-level mechanisms
such as continuous batching, prefill/decode scheduling, KV-cache management,
and request migration remain essential, but they are no longer sufficient as
the primary optimization boundary. A single user task may expand into an
irregular execution involving LLM calls, tool invocations, memory accesses,
sandbox operations, external services, and side effects.
This survey argues that agent-level execution should be treated as a
first-class schedulable and programmable systems entity. We organize recent
work into a taxonomy spanning agent-aware LLM serving; runtime component
serving for tools, retrieval, memory, and sandboxes; workflow and program
serving; and cluster-level multi-agent serving. We further discuss how
agentic workloads shift evaluation from request latency toward end-to-end
task completion, cost per successful task, recovery behavior, heterogeneous
resource contention, and multi-tenant fairness. Finally, we identify open
research challenges in defining the right schedulable entity, coordinating
across runtime layers, evaluating agentic serving systems, and designing
agent-native infrastructure primitives such as checkpoint and restore.
Date: Monday, 11 May 2026
Time: 11:30am - 1:00pm
Venue: Room 5501 (Lift 25/26)
Committee Members: Dr. Wei Wang (Supervisor)
Prof. Bo Li (Supervisor)
Prof. Song Guo (Chairperson)
Dr. Binhang Yuan