Grokking and Beyond Toward Understanding Generalization in Transformer Pretraining

PhD Qualifying Examination


Title: "Grokking and Beyond Toward Understanding Generalization in 
Transformer Pretraining"

by

Mr. Hangyu LIN


Abstract:

Transformer pretraining is typically guided by a loss-minimizing objective. 
This objective is indispensable, but it does not by itself explain when 
pretraining produces reusable computation that supports target outcomes such 
as reasoning, knowledge acquisition, and the formation of reusable features, 
circuits, and internal algorithms. This survey proceeds from the phenomenon 
of grokking, or delayed generalization. In grokking, the delayed transition 
to generalization often coincides with the emergence of rules, circuits, 
representations, or internal algorithms that transfer beyond the observed 
data.

Taking grokking as the entry point, this survey asks how pretraining choices 
shape the acquisition of generalization-supporting computation. Specifically, 
this survey analyzes the literature surrounding three groups of training 
configuration variables: Training Compute Budget (scale and exposure), 
Training Interventions (data, protocol, and regularization), and 
Structural Mechanisms (architecture and learned computation). Papers are 
also categorized across five outcome dimensions, namely generalization 
quality, data efficiency, compute efficiency, reliability under distribution 
shift, and generalization dynamics. Across the analyzed variables, scale and 
exposure, as well as architecture design, determine whether targeted 
generalization behaviors are possible. However, the variables most likely to 
reduce the training compute needed to reach a given degree of generalization 
are intervention-oriented. In the 
existing literature, experiments surrounding these variables have mostly been 
conducted in simplified, controlled bridge settings, rather than in 
real-world pretraining contexts. The proposed research agenda focuses on 
pretraining-scale questions about data, protocol, and regularization with 
the objective of making target generalization behaviors more cost-effective 
in real-life settings.


Date:                   Monday, 15 June 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 5501
                        Lift 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Prof. Qiang Yang (Co-supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Dr. Long Chen