Grokking and Beyond Toward Understanding Generalization in Transformer Pretraining
PhD Qualifying Examination
Title: "Grokking and Beyond Toward Understanding Generalization in
Transformer Pretraining"
by
Mr. Hangyu LIN
Abstract:
Transformer pretraining is typically guided by a loss-minimizing objective.
This objective is indispensable, but it does not by itself explain when
pretraining produces reusable computation that supports target outcomes such
as reasoning, knowledge acquisition, and the formation of reusable features,
circuits, and internal algorithms. This survey proceeds from the phenomenon
of grokking, or delayed generalization. In grokking, the delayed transition
to generalization often coincides with the emergence of rules, circuits,
representations, or internal algorithms that transfer beyond the observed
data.
Taking grokking as the entry point, this survey asks how pretraining choices
shape the acquisition of generalization-supporting computation. Specifically,
this survey analyzes the literature surrounding three groups of training
configuration variables: Training Compute Budget covers scale and exposure;
Training Interventions covers data, protocol, and regularization; and
Structural Mechanisms covers architecture and learned computation. Papers are
also categorized across five outcome dimensions, namely generalization
quality, data efficiency, compute efficiency, reliability under distribution
shift, and generalization dynamics. Across the analyzed variables, scale and
exposure, together with architecture design, determine whether targeted
generalization behaviors are possible. However, the variables most promising
for reducing the training compute needed to reach a given degree of
generalization are intervention-oriented. In the
existing literature, experiments surrounding these variables have mostly been
conducted in simplified, controlled bridge settings, rather than in
real-world pretraining contexts. The proposed research agenda focuses on
pretraining-scale questions about data, protocol, and regularization with
the objective of making target generalization behaviors more cost-effective
in real-life settings.
Date: Monday, 15 June 2026
Time: 2:00pm - 4:00pm
Venue: Room 5501
Lift 25/26
Committee Members: Prof. Kai Chen (Supervisor)
Prof. Qiang Yang (Co-supervisor)
Dr. Yangqiu Song (Chairperson)
Dr. Long Chen