Scalable and Reliable Oversight for Large Language Models
PhD Thesis Proposal Defence
Title: "Scalable and Reliable Oversight for Large Language Models"
by
Mr. Zeyu QIN
Abstract:
Modern deep learning systems, and especially large language models, are
increasingly deployed in settings where failures are difficult to detect and
easy to conceal, including safety-critical interaction, reasoning-intensive
tasks, and open-ended agentic environments. In such settings, progress depends
not only on model scaling or stronger optimization, but also on whether the
mechanisms used to evaluate, supervise, and reinforce model behavior remain
scalable and reliable as tasks become more complex. This dissertation
investigates this problem through the lens of scalable and reliable oversight:
the design of evaluation, supervision, critique, and reward mechanisms that
remain effective under distribution shift, data scarcity, and proxy
exploitation.
The dissertation develops this perspective through four concrete oversight
problems. First, using backdoor purification as a case study, it shows that
standard robustness metrics of deep learning models can overestimate safety by
failing to detect the persistence of harmful features, thereby motivating more
faithful evaluation and robustness analysis. Second, it revisits the
out-of-distribution generalization problem in safety alignment, shows that
pretrained models already contain substantial latent safety knowledge, and
introduces Safety Reasoning with Guidelines, a framework that improves safety
generalization by training models to reason explicitly with structured safety
guidelines.
Third, it addresses the challenge of constructing high-quality synthetic
reasoning data at scale when human supervision is scarce. It presents
SynthLLM, a framework for document-grounded question synthesis, and shows that
the resulting data follows a predictable scaling law, with downstream
performance improving consistently as data volume increases. Fourth, it
extends oversight to open-ended reinforcement learning settings without
deterministic verifiers, proposing a rubric-based framework in which
structured rubrics serve as interpretable anchors for evaluation, reward
construction, and policy optimization.
Taken together, these studies support a unified view: scalable oversight is
insufficient if it is not reliable, and reliable oversight is insufficient if
it cannot scale. The central contribution of this dissertation is to connect
faithful evaluation, structured safety supervision, scalable synthetic
reasoning data, and rubric-based reward design within a common framework for
judging, guiding, and optimizing large language models.
Date: Monday, 4 May 2026
Time: 1:30pm - 3:00pm
Venue: Room 2126A, Lift 22
Committee Members: Dr. Shuai Wang (Supervisor)
                   Dr. Yangqiu Song (Chairperson)
                   Dr. Dan Xu