Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large Language Models from a Data-Centric Perspective

PhD Thesis Proposal Defence


Title: "Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large 
Language Models from a Data-Centric Perspective"

by

Miss Tianyi BAI


Abstract:

Multimodal large language models (MLLMs) have progressed rapidly, yet they 
remain unreliable when correctness depends on subtle visual evidence. They may 
miss a small but decisive object change, overlook a local attribute or 
relation, hallucinate content that is not present, or answer from an 
incomplete visual summary without recognizing that more evidence is needed. 
This dissertation studies that weakness from a data-centric perspective. Its 
central claim is that many fine-grained failures arise not only from limited 
model capacity, but also from supervision that is too weakly concentrated on 
the visual distinctions that actually determine correctness. In that sense, 
fine-grained reliability is both a grounding problem and a data-efficiency 
problem.

The first technical line develops the data-centric foundation through the 
Micro Edit Dataset (MED). MED converts controlled local image edits into 
paired-image supervision, aligned difference descriptions, and diagnostic 
evaluation tasks. It is coupled with consistency-aware learning objectives 
that encourage nearby images to remain nearby in representation space while 
preserving the edited factor that changes the answer. By concentrating 
supervision on answer-changing visual factors, MED turns minimal semantic 
change into an explicit training and evaluation signal rather than a hidden 
failure mode.
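
As a rough illustration of the kind of objective described above, the sketch 
below combines a closeness term on a pair of image embeddings with a margin 
term on the answer-relevant score. It is not the objective used in the 
dissertation: the function names, the squared-distance term, the hinge form, 
and the `margin` and `weight` parameters are all assumptions made for this 
example.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pair_consistency_loss(
    z_orig: Sequence[float],
    z_edit: Sequence[float],
    score_orig: float,
    score_edit: float,
    margin: float = 1.0,
    weight: float = 1.0,
) -> float:
    """Hypothetical loss for one (original, edited) image pair.

    closeness:  keeps the two embeddings near each other, since the images
                differ only by a small local edit.
    separation: hinge penalty that is zero once the answer-relevant scores
                differ by at least `margin`, so the edited factor is not
                collapsed away by the closeness term.
    """
    closeness = squared_distance(z_orig, z_edit)
    separation = max(0.0, margin - abs(score_orig - score_edit))
    return closeness + weight * separation
```

Under these assumptions, a pair with identical embeddings but clearly 
different answer scores incurs no loss, while a pair whose answer scores 
coincide is penalised by the full margin.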

The second technical line develops adaptive evidence acquisition through 
Visual Token Scaling with Verification (VTS). VTS formulates visual reasoning 
as sequential evidence acquisition with a tool-using reasoner and a learned 
verifier. The reasoner proposes plans and visual actions, while the verifier 
judges whether the newly acquired evidence meaningfully improves the reasoning 
state and whether another step is warranted. Supporting datasets, VTS-SFT and 
VTS-DPO, provide supervision for both executable reasoning trajectories and 
verifier preferences, enabling adaptive and interpretable visual reasoning at 
inference time. In this sense, VTS addresses evidence efficiency at test time: 
it allocates additional computation selectively instead of spending more 
visual tokens indiscriminately.
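
The reasoner-verifier loop described above can be sketched in a few lines. 
This is a minimal illustration under stated assumptions, not the VTS 
implementation: `ReasoningState`, `acquire_evidence`, the three callables, 
and the toy stand-ins are hypothetical names introduced for this example, 
and a real system would call a model and visual tools in their place.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ReasoningState:
    question: str
    evidence: List[str] = field(default_factory=list)
    steps: int = 0  # number of visual actions executed so far


def acquire_evidence(
    state: ReasoningState,
    propose: Callable[[ReasoningState], Optional[str]],
    execute: Callable[[str], str],
    verify: Callable[[ReasoningState, str], Tuple[bool, bool]],
    max_steps: int = 4,
) -> ReasoningState:
    """Sequential evidence acquisition with a verifier-controlled stop.

    propose: reasoner stand-in; returns the next visual action or None.
    execute: tool stand-in; runs the action and returns an observation.
    verify:  verifier stand-in; returns (useful, keep_going).
    """
    for _ in range(max_steps):
        action = propose(state)
        if action is None:
            break
        observation = execute(action)
        state.steps += 1
        useful, keep_going = verify(state, observation)
        if useful:
            state.evidence.append(observation)
        if not keep_going:
            break
    return state


# Toy stand-ins (assumptions for illustration only).
REGIONS = ["top-left", "center", "bottom-right"]
VIEWS = {
    "top-left": "background only",
    "center": "small red tag on the bag",
    "bottom-right": "floor",
}


def toy_propose(state: ReasoningState) -> Optional[str]:
    # Inspect regions in a fixed order; stop proposing when none remain.
    return REGIONS[state.steps] if state.steps < len(REGIONS) else None


def toy_execute(action: str) -> str:
    return VIEWS[action]


def toy_verify(state: ReasoningState, observation: str) -> Tuple[bool, bool]:
    useful = "tag" in observation
    return useful, not useful  # stop once decisive evidence is found


result = acquire_evidence(
    ReasoningState("What is attached to the bag?"),
    toy_propose, toy_execute, toy_verify,
)
```

In this toy run the loop stops after inspecting two of the three regions, 
which mirrors the idea of spending additional visual computation only while 
the verifier judges that more evidence is warranted.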

Taken together, these lines of work support a broader reliability claim: 
fine-grained multimodal reasoning improves when supervision efficiency, 
objective design, inference-time control, and evaluation are developed in a 
coordinated way. Under this view, MED addresses what distinctions the model is 
trained to notice, whereas VTS addresses when the currently available evidence 
is insufficient and further inspection is warranted. Experiments across 
minimal-change benchmarks, broader multimodal evaluations, and verifier-guided 
reasoning settings show that both lines improve reliability under fine-grained 
reasoning pressure and point toward a more grounded, interpretable, and 
compute-aware approach to multimodal reasoning in next-generation MLLMs.


Date:                   Thursday, 7 May 2026

Time:                   4:00pm - 6:00pm

Venue:                  Room 2131B
                        Lift 22

Committee Members:      Dr. Binhang Yuan (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Dr. Chaojian Li