Scalable and Reliable Oversight for Large Language Models

PhD Thesis Proposal Defence


Title: "Scalable and Reliable Oversight for Large Language Models"

by

Mr. Zeyu QIN


Abstract:

Modern deep learning systems, and especially large language models, are 
increasingly deployed in settings where failures are difficult to detect and 
easy to conceal, including safety-critical interaction, reasoning-intensive 
tasks, and open-ended agentic environments. In such settings, progress depends 
not only on model scaling or stronger optimization, but also on whether the 
mechanisms used to evaluate, supervise, and reinforce model behavior remain 
scalable and reliable as tasks become more complex. This dissertation 
investigates this problem through the lens of scalable and reliable oversight: 
the design of evaluation, supervision, critique, and reward mechanisms that 
remain effective under distribution shift, data scarcity, and proxy 
exploitation.

The dissertation develops this perspective through four concrete oversight 
problems. First, using backdoor purification as a case study, it shows that 
standard robustness metrics for deep learning models can overestimate safety 
because they fail to detect harmful features that persist after purification, 
motivating more faithful evaluation and robustness analysis. Second, it 
revisits the 
out-of-distribution generalization problem in safety alignment, shows that 
pretrained models already contain substantial latent safety knowledge, and 
introduces Safety Reasoning with Guidelines, a framework that improves safety 
generalization by training models to reason explicitly with structured safety 
guidelines.

Third, it addresses the challenge of constructing high-quality synthetic 
reasoning data at scale when human supervision is scarce. It presents 
SynthLLM, a framework for document-grounded question synthesis, and shows that 
the resulting data follows a predictable scaling law, with downstream 
performance improving consistently as data volume increases. Fourth, it 
extends oversight to open-ended reinforcement learning settings without 
deterministic verifiers, proposing a rubric-based framework in which 
structured rubrics serve as interpretable anchors for evaluation, reward 
construction, and policy optimization.

Taken together, these studies support a unified view: scalable oversight is 
insufficient if it is not reliable, and reliable oversight is insufficient if 
it cannot scale. The central contribution of this dissertation is to connect 
faithful evaluation, structured safety supervision, scalable synthetic 
reasoning data, and rubric-based reward design within a common framework for 
judging, guiding, and optimizing large language models.


Date:                   Monday, 4 May 2026

Time:                   1:30pm - 3:00pm

Venue:                  Room 2126A
                        Lift 22

Committee Members:      Dr. Shuai Wang (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Dr. Dan Xu