Foundations of Outcome-Based Reinforcement Learning: from Language Model Alignment to Reasoning
Speaker:
Mr. Zeyu JIA
Department of Electrical Engineering and Computer Science
MIT
Title: Foundations of Outcome-Based Reinforcement Learning: from Language Model Alignment to Reasoning
Date: Monday, 19 January 2026
Time: 9:30am - 10:30am
Join Zoom Meeting:
https://hkust.zoom.us/j/93833595480?pwd=zyGqwJJJcFpb4PDP41h9MHYABilcIg.1
Meeting ID: 938 3359 5480
Passcode: 825573
Abstract:
A central question in reinforcement learning for complex reasoning tasks is how feedback should be provided: should learning rely on fine-grained, step-by-step supervision (process supervision), or only on evaluations of final outcomes (outcome supervision)? Conventional wisdom holds that outcome-based supervision is inherently more difficult due to trajectory-level coverage challenges, motivating substantial effort to collect detailed process annotations.
In this talk, I offer two complementary perspectives that revisit this assumption. First, in the offline setting, I introduce a transformation algorithm that converts outcome-supervision data into process-supervision data, and show through its analysis that, under standard coverage assumptions, outcome supervision is statistically no more difficult than process supervision. This result suggests that observed performance gaps arise from algorithmic limitations rather than fundamental statistical barriers. In addition, our results provide a finer-grained analysis of the Direct Preference Optimization (DPO) algorithm.
Second, I turn to the online setting and present provably sample-efficient algorithms that achieve strong performance guarantees using only trajectory-level feedback. At the same time, I identify sharp separations: there exist classes of MDPs in which outcome-based feedback incurs an exponential disadvantage relative to step-level supervision. These results precisely characterize when—and why—process supervision is genuinely necessary.
I conclude by outlining my broader research vision for the role of reinforcement learning in the age of large language models.
Biography:
Zeyu Jia is a final-year PhD student in the Department of Electrical Engineering and Computer Science at MIT, where he is affiliated with the Laboratory for Information and Decision Systems (LIDS). Prior to joining MIT, he received his bachelor's degree from the School of Mathematical Sciences at Peking University. His research interests include machine learning theory, with a focus on reinforcement learning, statistics, and information theory.