PhD Qualifying Examination


Title: "A Survey on Communication Optimization for LLM Serving"

by

Mr. Yijun SUN


Abstract:

The rise of Large Language Models (LLMs) has catalyzed a new era of 
generative AI applications. To serve increasingly large models efficiently, 
modern serving clusters widely employ model parallelism, Key-Value (KV) 
cache reuse, and Prefill-Decode (P/D) disaggregation. These techniques, 
however, introduce substantial communication overhead, which has become a 
primary performance bottleneck that degrades both latency and throughput.

This survey provides a comprehensive overview of communication optimization 
techniques for LLM serving. We first outline the fundamentals of LLM 
inference and the serving paradigms that give rise to communication overhead. 
We then systematically classify a wide range of optimization strategies into 
two primary approaches: lossy optimizations, which reduce the volume of 
transferred data at some cost to model quality, and lossless optimizations, 
which improve communication efficiency without compromising generation 
quality. By synthesizing the objectives, methodologies, and inherent 
trade-offs of existing approaches, this survey offers valuable insights into 
building quality-preserving and communication-efficient serving systems.


Date:                   Tuesday, 29 July 2025

Time:                   3:00pm - 5:00pm

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Prof. Gary Chan (Chairperson)
                        Dr. Xiaomin Ouyang