Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large Language Models from a Data-Centric Perspective
PhD Thesis Proposal Defence
Title: "Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large
Language Models from a Data-Centric Perspective"
by
Miss Tianyi BAI
Abstract:
Multimodal large language models (MLLMs) have progressed rapidly, yet they
remain unreliable when correctness depends on subtle visual evidence. They may
miss a small but decisive object change, overlook a local attribute or
relation, hallucinate content that is not present, or answer from an
incomplete visual summary without recognizing that more evidence is needed.
This dissertation studies that weakness from a data-centric perspective. Its
central claim is that many fine-grained failures arise not only from limited
model capacity, but also from supervision that is only weakly concentrated on
the visual distinctions that actually determine correctness. In that sense,
fine-grained reliability is both a grounding problem and a data-efficiency
problem.
The first technical line develops the data-centric foundation through the
Micro Edit Dataset (MED). MED converts controlled local image edits into
paired-image supervision, aligned difference descriptions, and diagnostic
evaluation tasks. It is coupled with consistency-aware learning objectives
that encourage nearby images to remain nearby in representation space while
preserving the edited factor that changes the answer. By concentrating
supervision on answer-changing visual factors, MED turns minimal semantic
change into an explicit training and evaluation signal rather than a hidden
failure mode.
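As a rough illustration of what a consistency-aware objective of this kind
could look like, the sketch below penalizes an original/edited embedding pair
both when the two drift too far apart and when they collapse onto each other.
The function names, the Euclidean distance, and the `lower`/`upper` thresholds
are assumptions made for this example; the abstract does not specify the
objective actually used with MED.

```python
import math

def pair_distance(a, b):
    """Euclidean distance between two embedding vectors (plain lists)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def band_consistency_loss(orig_emb, edit_emb, lower=0.1, upper=1.0):
    """Hypothetical loss keeping a paired image 'near but distinguishable'.

    - Distances above `upper` are penalized: nearby images should stay nearby.
    - Distances below `lower` are penalized: the edited factor must not vanish.
    Both thresholds are illustrative values, not taken from the thesis.
    """
    d = pair_distance(orig_emb, edit_emb)
    too_far = max(0.0, d - upper)      # pair drifted apart
    collapsed = max(0.0, lower - d)    # edit became invisible
    return too_far ** 2 + collapsed ** 2

# A pair inside the band incurs no penalty; a collapsed or distant pair does.
in_band = band_consistency_loss([0.0, 0.0], [0.5, 0.0])
collapsed = band_consistency_loss([0.0, 0.0], [0.0, 0.0])
distant = band_consistency_loss([0.0, 0.0], [3.0, 4.0])
```

The two-sided penalty is one simple way to express the stated goal: the
representation stays stable under a small edit, yet the answer-changing
difference remains recoverable.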
The second technical line develops adaptive evidence acquisition through
Visual Token Scaling with Verification (VTS). VTS formulates visual reasoning
as sequential evidence acquisition with a tool-using reasoner and a learned
verifier. The reasoner proposes plans and visual actions, while the verifier
judges whether the newly acquired evidence meaningfully improves the reasoning
state and whether another step is warranted. Supporting datasets, VTS-SFT and
VTS-DPO, provide supervision for both executable reasoning trajectories and
verifier preferences, enabling adaptive and interpretable visual reasoning at
inference time. In this sense, VTS addresses evidence efficiency at test time:
it allocates additional computation selectively instead of spending more
visual tokens indiscriminately.
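The control flow described above can be sketched as a short loop. This is a
minimal illustration under stated assumptions, not the VTS implementation:
`propose`, `execute`, and `verify` are hypothetical callables standing in for
the reasoner, the visual tools, and the learned verifier, and the toy versions
below exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

def acquire_evidence(
    propose: Callable[[List[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[List[str], str], Tuple[bool, bool]],
    max_steps: int = 5,
) -> List[str]:
    """Collect visual evidence step by step until the verifier says stop.

    `verify` returns (useful, keep_going): whether the new observation
    improves the reasoning state, and whether another step is warranted.
    """
    evidence: List[str] = []
    for _ in range(max_steps):
        action = propose(evidence)          # reasoner picks a visual action
        observation = execute(action)       # tool returns new evidence
        useful, keep_going = verify(evidence, observation)
        if useful:
            evidence.append(observation)
        if not keep_going:                  # stop instead of spending more tokens
            break
    return evidence

# Toy stand-ins (assumptions for illustration only).
def toy_propose(evidence: List[str]) -> str:
    return f"crop_{len(evidence)}"

def toy_execute(action: str) -> str:
    return "obs:" + action

def toy_verify(evidence: List[str], observation: str) -> Tuple[bool, bool]:
    useful = observation not in evidence
    keep_going = len(evidence) + 1 < 2      # two pieces of evidence suffice here
    return useful, keep_going

collected = acquire_evidence(toy_propose, toy_execute, toy_verify)
```

The step budget `max_steps` is an added safeguard in this sketch; the point
of the loop is that extra inspection happens only while the verifier judges
it worthwhile.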
Taken together, these lines of work support a broader reliability claim:
fine-grained multimodal reasoning improves when supervision efficiency,
objective design, inference-time control, and evaluation are developed in a
coordinated way. Under this view, MED addresses what distinctions the model is
trained to notice, whereas VTS addresses when the currently available evidence
is insufficient and further inspection is warranted. Experiments across
minimal-change benchmarks, broader multimodal evaluations, and verifier-guided
reasoning settings show that both lines improve reliability under fine-grained
reasoning pressure and point toward a more grounded, interpretable, and
compute-aware approach to multimodal reasoning in next-generation MLLMs.
Date: Thursday, 7 May 2026
Time: 4:00pm - 6:00pm
Venue: Room 2131B (Lift 22)
Committee Members: Dr. Binhang Yuan (Supervisor)
Dr. Yangqiu Song (Chairperson)
Dr. Chaojian Li