Jailbreak Attacks and Evaluations for Large Language Models
PhD Qualifying Examination

Title: "Jailbreak Attacks and Evaluations for Large Language Models"

by

Mr. Ruixuan HUANG

Abstract:

Large language models (LLMs) are increasingly deployed as general-purpose interfaces in daily life. Safety alignment is expected to prevent them from assisting with illegal, harmful, or high-risk requests. However, jailbreak attacks show that refusal is not a fixed boundary of model capability. Instead, refusal is a context-dependent behavior shaped by the wording of the request, the surrounding instruction format, the interaction history, and, in open-weight settings, the model's post-training state or internal representations.

This survey studies jailbreak attacks and evaluation methods. We first review the Transformer components and the safety alignment pipeline needed to understand modern jailbreak methods. We then formalize jailbreak attacks as attempts to induce unsafe compliance from a model that should refuse, separating adversary observability from adversary controllability. Under this framework, jailbreak attacks are organized into (1) input-side and (2) model-side mechanisms.

The survey further analyzes jailbreak evaluation. We review benchmark construction principles, evaluator families, and sources of ambiguity, including success ambiguity, dataset ambiguity, metric ambiguity, and experimental comparability. Finally, the survey identifies future directions for jailbreak attacks and evaluations.

Date: Wednesday, 24 June 2026
Time: 2:00pm - 4:00pm
Zoom Meeting: https://hkust.zoom.us/j/96184667833?pwd=rg9yi3hEdPkLSpRdbjpbi3c4SrDmM1.1

Committee Members:
Dr. Shuai Wang (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Binhang Yuan