A Survey of Activation Steering Methods in Large Language Models

PhD Qualifying Examination


Title: "A Survey of Activation Steering Methods in Large Language Models"

by

Miss Zheng CHEN


Abstract:

Large language models (LLMs) exhibit impressive capabilities but often 
produce outputs that are untruthful, toxic, or misaligned with user intent. A 
rapidly growing paradigm called activation steering addresses this challenge 
by directly manipulating the model's internal representations, thereby 
modifying its outputs without retraining. Since any activation-level 
intervention must first decide where to intervene and then how to intervene, 
we organize the literature along these two axes into three families: (1) 
locating methods that identify behaviorally relevant directions or 
components; (2) steering methods that modify activations to control 
generation; and (3) integrated methods that jointly locate and steer. For 
each family we formalize the core mathematical operation, compare 
representative works, and discuss trade-offs. We then provide an empirical 
comparison across six application domains: truthfulness enhancement, 
behavioral and persona steering, reasoning steering, toxicity reduction and 
controlled generation, safety and refusal evaluation, and mechanistic 
interpretability benchmarks. We conclude by outlining open problems and 
future directions.


Date:                   Tuesday, 12 May 2026

Time:                   1:00pm - 2:00pm

Venue:                  Room 5501
                        Lift 25/26

Committee Members:      Prof. Bo Li (Supervisor)
                        Dr. May Fung (Chairperson)
                        Dr. Dan Xu