Towards Trustworthy Guardrails for Large Language Models
PhD Thesis Proposal Defence
Title: "Towards Trustworthy Guardrails for Large Language Models"
by
Mr. Xunguang WANG
Abstract:
The rapid advancement of large language models (LLMs) has unlocked
unprecedented capabilities across various domains. However, their widespread
deployment introduces critical vulnerabilities regarding content safety and
reasoning integrity. Malicious actors frequently exploit these models through
jailbreak attacks to bypass alignment protocols and extract harmful
information. Furthermore, the extended deduction processes of LLMs are highly
susceptible to reasoning hijacking and denial-of-service attacks, which can
manipulate intermediate steps or exhaust computational resources. Although
various safety guardrails have emerged to mitigate these threats, the current
defense landscape remains fragmented. Existing solutions are largely ad-hoc
and lack a unified evaluation standard, making it difficult to assess their
practical trade-offs. To comprehensively address these challenges, this
thesis proposal introduces an interconnected paradigm to evaluate and
construct trustworthy guardrails for LLMs.
First, to bridge the evaluation gap in the research community, we develop the
first systematic evaluation framework for jailbreak guardrails. Based on a
novel multi-dimensional taxonomy and a Security-Efficiency-Utility
evaluation metric approach, this framework provides a standardized
methodology to benchmark and optimize defense deployments. Second, we propose
a self-defense guardrail framework that operates alongside the target model.
This dual-layer mechanism concurrently evaluates user requests to filter
malicious queries and ensure content safety with negligible latency. Third,
we present a reasoning safety monitor designed to safeguard the execution
chains of reasoning models. By inspecting reasoning steps in a streaming
fashion, this monitor dynamically interrupts compromised or redundant
deductions without requiring task-specific fine-tuning.
Collectively, these works establish a robust foundation for building secure,
efficient, and reliable artificial intelligence systems, guiding the
principled advancement of model safety.
Date: Wednesday, 27 May 2026
Time: 2:00pm - 4:00pm
Venue: Room 2129A, Lift 19
Committee Members: Dr. Shuai Wang (Supervisor)
Dr. May Fung (Chairperson)
Dr. Mingxun Zhou