Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large Language Models from a Data-Centric Perspective

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large 
Language Models from a Data-Centric Perspective"

By

Miss Tianyi BAI


Abstract:

Multimodal large language models (MLLMs) have advanced rapidly in 
image-grounded dialogue, visual question answering, document understanding, 
and general multimodal assistance. However, they remain unreliable when 
correctness depends on subtle visual evidence. A small object insertion, a 
local attribute change, a shifted spatial relation, or an overlooked counting 
difference can alter the correct answer while leaving the overall scene 
nearly unchanged. In such cases, models may still produce fluent and 
plausible responses, but the responses are not reliably grounded in the 
decisive evidence. This dissertation studies this failure mode from a 
data-centric perspective. Its central claim is that reliable fine-grained 
visual reasoning requires careful control over how evidence enters the model 
during pretraining, how it is isolated during multimodal supervision, and how 
it is acquired during inference.

The first technical line develops multi-actor collaborative data selection 
for efficient language model pretraining. Existing data-selection methods 
often rely on a single criterion, such as document quality, domain mixture, 
topic diversity, or online model influence. These criteria are individually 
useful but can conflict: high-quality data may be topically narrow, diverse 
data may be noisy, and data that look useful offline may not be the most 
beneficial for the model at a particular training stage. The proposed 
framework treats these criteria as specialized actors and uses an actor 
console to coordinate their preferences dynamically. Experiments show that 
multi-actor collaboration improves data efficiency, convergence behavior, and 
downstream performance compared with single-criterion selection and larger 
random token budgets. This chapter establishes the first thesis principle: 
reliable systems should begin by selecting the data that most efficiently 
shape the base model.
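
As a rough illustration of the coordination idea described above, the 
following minimal Python sketch combines the scores of several actors under 
shared weights and then reweights the actors from a feedback signal. It is 
not the thesis implementation: the names combine_scores, select_top_k, and 
update_weights, the per-document feature dictionary, the reward signal, and 
the multiplicative-weights update are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List

Doc = Dict[str, float]            # hypothetical per-document features
Actor = Callable[[Doc], float]    # an actor maps a document to a score


def combine_scores(doc: Doc, actors: Dict[str, Actor],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of the actors' scores for one document."""
    return sum(weights[name] * actor(doc) for name, actor in actors.items())


def select_top_k(docs: List[Doc], actors: Dict[str, Actor],
                 weights: Dict[str, float], k: int) -> List[Doc]:
    """Keep the k documents with the highest combined score."""
    ranked = sorted(docs,
                    key=lambda d: combine_scores(d, actors, weights),
                    reverse=True)
    return ranked[:k]


def update_weights(weights: Dict[str, float], rewards: Dict[str, float],
                   lr: float = 0.5) -> Dict[str, float]:
    """Assumed update rule: multiplicative weights, then renormalise to 1."""
    raw = {name: w * math.exp(lr * rewards.get(name, 0.0))
           for name, w in weights.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}
```

In this sketch, calling update_weights between training stages shifts the 
selection toward whichever actor was recently rewarded, which is one simple 
way to make the coordination depend on the current training stage.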

The second technical line develops the Micro Edit Dataset (MED) for 
fine-grained multimodal supervision. MED uses controlled local image edits to 
construct original-edited image pairs, aligned captions, concise difference 
descriptions, and diagnostic benchmark questions. Its construction pipeline 
filters editable source images, proposes category-aware edits, applies 
localized image editing, filters pairs by visual similarity, aligns original 
and edited descriptions, and verifies whether the final supervision is 
faithful to the visible change. Building on this resource, the chapter 
studies consistency-aware supervised fine-tuning, which encourages nearby 
images to remain close in representation space while preserving the semantic 
factor that changes the answer. MED turns minimal visual change into an 
explicit training and evaluation signal, making hidden grounding failures 
easier to observe and correct.
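
One step of the construction pipeline, filtering pairs by visual similarity, 
can be sketched as follows. This is a minimal sketch under stated 
assumptions, not the MED pipeline itself: it assumes each image is already 
represented by an embedding vector, uses cosine similarity as the similarity 
measure, and keeps a pair only when the similarity falls inside a band. The 
function names and the thresholds low and high are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (original, edited) embeddings


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs: List[Pair], low: float = 0.90,
                 high: float = 0.999) -> List[Pair]:
    """Keep pairs that are near-identical overall but not unchanged.

    Below `low` the edit altered too much of the scene; above `high` the
    edit is assumed to have had no visible effect. Both thresholds are
    illustrative values, not figures from the thesis.
    """
    return [(orig, edit) for orig, edit in pairs
            if low <= cosine_similarity(orig, edit) <= high]
```

The band reflects the stated goal of minimal visual change: the edited image 
should stay close to the original while still differing in the one factor 
that changes the answer.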

The third technical line develops Visual Token Scaling with Verification 
(VTS), an adaptive inference framework for multi-step visual reasoning. 
Instead of forcing the model to answer from a fixed initial visual summary, 
VTS formulates reasoning as sequential evidence acquisition. A tool-using 
reasoner proposes plans and visual actions, while a learned verifier judges 
whether the new observation improves the reasoning state and whether another 
step is warranted. The associated VTS-SFT and VTS-DPO datasets provide 
supervision for executable reasoning traces and verifier preferences. This 
design makes inference both more adaptive and more inspectable: the model can 
gather more evidence when needed, but it is also encouraged to stop when 
further computation is no longer useful.
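
The acquire-then-verify loop described above can be sketched in a few lines 
of Python. This is a simplified illustration, not the VTS implementation: 
the callables propose and verify stand in for the tool-using reasoner and 
the learned verifier, and the sketch collapses the verifier's two judgments 
(whether the observation helps, and whether to continue) into one score 
threshold. The names acquire_evidence, max_steps, and min_gain are 
hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Observation = str  # placeholder for the result of a visual action


def acquire_evidence(
    propose: Callable[[List[Observation]], Optional[Observation]],
    verify: Callable[[List[Observation]], float],
    max_steps: int = 5,
    min_gain: float = 0.0,
) -> Tuple[List[Observation], float]:
    """Gather observations until the verifier reports no further gain."""
    observations: List[Observation] = []
    best_score = verify(observations)
    for _ in range(max_steps):
        new_obs = propose(observations)      # reasoner picks a visual action
        if new_obs is None:                  # reasoner has nothing to add
            break
        candidate = observations + [new_obs]
        score = verify(candidate)            # verifier scores the new state
        if score - best_score <= min_gain:   # no improvement: stop early
            break
        observations, best_score = candidate, score
    return observations, best_score
```

The early exit is the point of the design: an observation that does not 
raise the verifier's score is discarded and the loop ends, so additional 
computation is spent only while it keeps improving the reasoning state.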

Together, these lines of work support a unified view of reliability as an 
evidence lifecycle. Multi-actor data selection studies which pretraining 
tokens should shape the base model. MED studies which visual distinctions 
should be made explicit during multimodal learning. VTS studies when 
additional visual evidence should be acquired at inference time. The 
dissertation therefore argues that fine-grained multimodal reasoning is not 
solved by scale alone. It improves when data selection, supervision design, 
objective design, inference control, and evaluation are developed as 
coordinated parts of a reliability pipeline.


Date:                   Friday, 3 July 2026

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lifts 25-26

Chairman:               Dr. Xiaojiang XIE (CHEM)

Committee Members:      Dr. Binhang YUAN (Supervisor)
                        Dr. Chaojian LI
                        Dr. Shuai WANG
                        Dr. Wenhan LUO (AMC)
                        Prof. Jiawei JIANG (Wuhan University)