A Survey on Token-Level KV Cache Optimization for Efficient LLM Inference

PhD Qualifying Examination


Title: "A Survey on Token-Level KV Cache Optimization for Efficient LLM 
Inference"

by

Mr. Enze MA


Abstract:

Large language models (LLMs) have demonstrated remarkable capabilities across
a wide range of natural language processing tasks. However, their deployment in
long-context and low-latency scenarios is increasingly challenged by the
substantial memory and computational overhead of the key-value (KV) cache
during auto-regressive decoding. In response, token-level optimization
techniques for dynamic KV cache management have emerged as a pivotal research
direction. This survey provides a systematic review of these techniques,
focusing on core paradigms such as token eviction, compression, and
recomputation. We analyze their underlying principles, performance boundaries,
and inherent trade-offs between efficiency and model quality. Furthermore, we
examine how these token-centric methods can be combined to meet future demands
for longer contexts and faster inference. Finally, the survey identifies key
limitations in current research and outlines promising directions for future
exploration.
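
To make the cited memory overhead concrete, the short sketch below gives a
back-of-the-envelope estimate of the KV cache footprint for a single decoded
sequence; the layer count, head configuration, sequence length, and precision
are illustrative assumptions, not figures taken from the survey.

    # Back-of-the-envelope KV cache size for a single sequence.
    # All model dimensions here are illustrative assumptions.
    num_layers = 32        # decoder layers
    num_kv_heads = 8       # key/value heads (e.g., under grouped-query attention)
    head_dim = 128         # dimension per head
    seq_len = 32_768       # cached context length in tokens
    bytes_per_elem = 2     # fp16/bf16 storage

    # Both keys and values are cached, hence the leading factor of 2.
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem
    print(f"KV cache per sequence: {kv_bytes / 2**30:.1f} GiB")  # ~4.0 GiB here

Because this footprint grows linearly with context length and batch size, it
motivates the token-level eviction, compression, and recomputation techniques
reviewed in the survey.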


Date:                   Friday, 28 November 2025

Time:                   1:00pm - 3:00pm

Venue:                  Room 2612A
                        Lift 27/28

Committee Members:      Prof. Song Guo (Supervisor)
                        Dr. Zili Meng (Chairperson)
                        Dr. Chaojian Li