Jailbreak Guardrails for Large Language Models
PhD Qualifying Examination

Title: "Jailbreak Guardrails for Large Language Models"

By: Mr. Xunguang WANG

Abstract:
Large Language Models (LLMs) have achieved remarkable progress, but their susceptibility to jailbreak attacks poses significant safety risks. Jailbreak guardrails have emerged as a crucial defense mechanism, operating as external safety filters that intercept harmful access attempts without altering the model's behavior on benign queries. This report presents a comprehensive survey of jailbreak guardrails, tracing their evolution from simple toxicity detection to complex, context-aware safety assessment. We categorize existing guardrail mechanisms by their intervention stage: pre-processing, intra-processing, and post-processing. Additionally, we review the benchmark datasets and evaluation metrics used to assess the effectiveness and efficiency of these guardrails. Finally, we discuss future research directions to enhance the efficiency, versatility, and explainability of jailbreak guardrails. This survey aims to provide a deeper understanding of the current landscape and future prospects of jailbreak guardrails in ensuring the safety and reliability of LLM deployments.

Date: Monday, 19 January 2026
Time: 2:00pm - 4:00pm
Zoom Meeting: https://hkust.zoom.us/j/95585189208?pwd=jyJXmabfEMODic3vgzylab96FQ2AoQ.1

Committee Members:
Dr. Shuai Wang (Supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Zihan Zhang