Towards Trustworthy Guardrails for Large Language Models

PhD Thesis Proposal Defence


Title: "Towards Trustworthy Guardrails for Large Language Models"

by

Mr. Xunguang WANG


Abstract:

The rapid advancement of large language models (LLMs) has unlocked 
unprecedented capabilities across various domains. However, their widespread 
deployment introduces critical vulnerabilities regarding content safety and 
reasoning integrity. Malicious actors frequently exploit these models through 
jailbreak attacks to bypass alignment protocols and extract harmful 
information. Furthermore, the extended deduction processes of LLMs are highly 
susceptible to reasoning hijacking and denial-of-service attacks, which can 
manipulate intermediate steps or exhaust computational resources. Although 
various safety guardrails have emerged to mitigate these threats, the current 
defense landscape remains fragmented. Existing solutions are largely ad-hoc 
and lack a unified evaluation standard, making it difficult to assess their 
practical trade-offs. To address these challenges, this 
thesis proposal presents three interconnected works that evaluate and 
construct trustworthy guardrails for LLMs.

First, to bridge the evaluation gap in the research community, we develop the 
first systematic evaluation framework for jailbreak guardrails. Based on a 
novel multi-dimensional taxonomy and a Security-Efficiency-Utility 
evaluation scheme, this framework provides a standardized 
methodology to benchmark and optimize defense deployments. Second, we propose 
a self-defense guardrail framework that operates alongside the target model. 
This dual-layer mechanism concurrently evaluates user requests to filter 
malicious queries and ensure content safety with negligible latency. Third, 
we present a reasoning safety monitor designed to safeguard the execution 
chains of reasoning models. By inspecting reasoning steps in a streaming 
fashion, this monitor dynamically interrupts compromised or redundant 
deductions without requiring task-specific fine-tuning.

Collectively, these works establish a robust foundation for building secure, 
efficient, and reliable artificial intelligence systems, guiding the 
principled advancement of model safety.


Date:                   Wednesday, 27 May 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2129A
                        Lift 19

Committee Members:      Dr. Shuai Wang (Supervisor)
                        Dr. May Fung (Chairperson)
                        Dr. Mingxun Zhou