PhD Qualifying Examination


Title: "Jailbreak Guardrails for Large Language Models"

by

Mr. Xunguang WANG


Abstract:

Large Language Models (LLMs) have achieved remarkable progress, but their 
susceptibility to jailbreak attacks poses significant safety risks. Jailbreak 
guardrails have emerged as a crucial defense mechanism, operating as external 
safety filters that intercept jailbreak attempts without altering the model's 
behavior on benign queries. This report presents a comprehensive 
survey of jailbreak guardrails, exploring their evolution from simple 
toxicity detection to complex, context-aware safety assessment. We categorize 
existing guardrail mechanisms based on their intervention stages: 
pre-processing, intra-processing, and post-processing. Additionally, we 
review benchmark datasets and evaluation metrics used to assess the 
effectiveness and efficiency of these guardrails. Finally, we discuss future 
research directions to enhance the efficiency, versatility, and explainability 
of jailbreak guardrails. This survey aims to provide a deeper understanding 
of the current landscape and future prospects of jailbreak guardrails in 
ensuring the safety and reliability of LLM deployments.


Date:                   Monday, 19 January 2026

Time:                   2:00pm - 4:00pm

Zoom Meeting:
https://hkust.zoom.us/j/95585189208?pwd=jyJXmabfEMODic3vgzylab96FQ2AoQ.1

Committee Members:      Dr. Shuai Wang (Supervisor)
                        Dr. Binhang Yuan (Chairperson)
                        Dr. Zihan Zhang