KV cache techniques for long context inference

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "KV cache techniques for long context inference"

by

LAM Hoi Kei

Abstract:

This project investigates Key-Value (KV) cache compression methods for 
efficient long-context inference in Large Language Models. We build upon 
ShadowKV, a state-of-the-art KV cache offloading framework, and propose two 
algorithmic variants: a low-frequency landmark method that discards the top 
50% of head-dimension channels, and a hybrid scoring method that combines 
max-pooled high-frequency dot products with mean-pooled low-frequency 
landmarks. Through a formal analysis of the Dirichlet-kernel low-pass 
filtering effect induced by mean-pooling RoPE-rotated keys, we show that the 
low-frequency key components carry the majority of the semantic information 
needed for page selection, while the high-frequency channels encode 
fine-grained positional detail that is critical for
multi-key discrimination. Evaluated on Qwen2.5-7B-Instruct-1M using the RULER 
and SCBench benchmarks, the hybrid variant achieves a RULER average score of 
86.76 and a SCBench average score of 33.45, outperforming the standard 
ShadowKV, with negligible runtime overhead due to an optimized CUTLASS-backed 
CUDA kernel. However, exact string-matching tasks remain fundamentally 
resistant to all sub-linear-memory methods, revealing a limitation of 
pooling-based KV cache compression.
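
Note on the hybrid scoring idea: a minimal, illustrative NumPy sketch is given 
below. It assumes keys are already RoPE-rotated, pages are fixed-size chunks, 
and the head dimension is split into a high-frequency half and a low-frequency 
half; the split point, function names, and shapes are illustrative assumptions, 
not the thesis's actual implementation. The low-pass intuition behind the 
landmarks: mean-pooling N consecutive RoPE-rotated keys scales each channel 
pair of rotary frequency theta by the Dirichlet-kernel factor 
sin(N*theta/2) / (N*sin(theta/2)), which stays near 1 for low frequencies and 
decays toward 0 for high ones.

    # Hypothetical sketch of hybrid page scoring (not the thesis's actual code).
    import numpy as np

    def hybrid_page_scores(query, keys, page_size=8, low_cut=None):
        """Score each page of cached keys for retrieval into the GPU cache.

        query: (d,) post-RoPE query vector
        keys : (n, d) post-RoPE cached key vectors
        Each page's score is the max-pooled dot product over its
        high-frequency channels plus the dot product of its mean-pooled
        (landmark) low-frequency channels with the query.
        """
        n, d = keys.shape
        low_cut = d // 2 if low_cut is None else low_cut  # illustrative split
        scores = []
        for start in range(0, n, page_size):
            page = keys[start:start + page_size]
            # High-frequency channels: per-token dot products, max-pooled so
            # fine positional matches inside the page are not averaged away.
            hi = np.max(page[:, :low_cut] @ query[:low_cut])
            # Low-frequency channels: mean-pool into a landmark first, then
            # score; the Dirichlet-kernel attenuation is mild at low frequency.
            lo = np.mean(page[:, low_cut:], axis=0) @ query[low_cut:]
            scores.append(hi + lo)
        return np.array(scores)

    # Usage: keep only the top-k scoring pages resident in the GPU KV cache.
    rng = np.random.default_rng(0)
    K = rng.standard_normal((64, 128)).astype(np.float32)
    q = rng.standard_normal(128).astype(np.float32)
    top_pages = np.argsort(hybrid_page_scores(q, K))[::-1][:4]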

Date            : 28 April 2026 (Tuesday)

Time            : 16:00 - 16:40

Venue           : Room 2126D (near Lift 19), HKUST

Advisor         : Prof. SONG Yangqiu

2nd Reader      : Dr. FUNG May