Jailbreak Attacks and Evaluations for Large Language Models

PhD Qualifying Examination


Title: "Jailbreak Attacks and Evaluations for Large Language Models"

by

Mr. Ruixuan HUANG


Abstract:

Large language models (LLMs) are increasingly deployed as general-purpose
interfaces in daily life. Safety alignment is expected to prevent them from
assisting with illegal, harmful, or high-risk requests. However, jailbreak
attacks show that refusal is not a fixed boundary of model capability.
Instead, refusal is a context-dependent behavior shaped by the wording of the
request, the surrounding instruction format, the interaction history, and, in
open-weight settings, the model's post-training state or internal
representations. This survey studies jailbreak attacks and evaluation methods.
We first review the Transformer components and safety alignment pipeline
needed to understand modern jailbreak methods. We then formalize jailbreak
attacks as attempts to induce unsafe compliance from a model that should
refuse, separating adversary observability from adversary controllability.
Under this framework, jailbreak attacks are organized into (1) input-side and
(2) model-side mechanisms. The survey further analyzes jailbreak evaluation.
We review benchmark construction principles, evaluator families, and sources
of ambiguity, including success ambiguity, dataset ambiguity, metric
ambiguity, and experimental comparability. Finally, this survey identifies
future directions for jailbreak attacks and evaluations.


Date:                   Wednesday, 24 June 2026

Time:                   2:00pm - 4:00pm

Zoom Meeting:
https://hkust.zoom.us/j/96184667833?pwd=rg9yi3hEdPkLSpRdbjpbi3c4SrDmM1.1

Committee Members:      Dr. Shuai Wang (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Binhang Yuan