The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large 
Language Models from a Data-Centric Perspective"

By

Miss Tianyi BAI


Abstract:

Multimodal large language models (MLLMs) have advanced rapidly in image-grounded
dialogue, visual question answering, document understanding, and general 
multimodal assistance. However, they remain unreliable when correctness depends 
on subtle visual evidence. A small object insertion, a local attribute change, 
a shifted spatial relation, or an overlooked counting difference can alter the 
correct answer while leaving the overall scene nearly unchanged. In such cases, 
models may still produce fluent and plausible responses, but the responses are 
not reliably grounded in the decisive evidence. This dissertation studies this 
failure mode from a data-centric perspective. Its central claim is that reliable 
fine-grained visual reasoning requires careful control over how evidence enters 
the model during pretraining, how it is isolated during multimodal supervision, 
and how it is acquired during inference.

The first technical line develops multi-actor collaborative data selection for 
efficient language model pretraining. Existing data-selection methods often rely 
on a single criterion, such as document quality, domain mixture, topic diversity, 
or online model influence. These criteria are individually useful but can 
conflict: high-quality data may be topically narrow, diverse data may be noisy, 
and data that look useful offline may not be the most beneficial for the model 
at a particular training stage. The proposed framework treats these criteria as 
specialized actors and uses an actor console to coordinate their preferences 
dynamically. Experiments show that multi-actor collaboration improves data 
efficiency, convergence behavior, and downstream performance compared with 
single-criterion selection and with random selection under larger token 
budgets. This chapter 
establishes the first thesis principle: reliable systems should begin by 
selecting the data that most efficiently shapes the base model.
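
As a rough illustration of the coordination idea, the following minimal sketch 
combines per-actor document scores with weights that a console adjusts over 
training. All names here (Actor, select, update_weights), the weighted-average 
combination, and the multiplicative reward update are assumptions made for 
illustration; the abstract does not specify the actual mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical document representation: precomputed per-criterion features.
Document = Dict[str, float]


@dataclass
class Actor:
    """One selection criterion (e.g. quality or diversity); hypothetical type."""
    name: str
    score: Callable[[Document], float]  # higher means more preferred
    weight: float = 1.0


def select(docs: List[Document], actors: List[Actor], k: int) -> List[Document]:
    """Rank documents by the weight-normalized sum of actor scores; keep top k."""
    total = sum(a.weight for a in actors)

    def combined(doc: Document) -> float:
        return sum(a.weight * a.score(doc) for a in actors) / total

    return sorted(docs, key=combined, reverse=True)[:k]


def update_weights(actors: List[Actor], rewards: Dict[str, float],
                   lr: float = 0.5) -> None:
    """Assumed console step: shift weight toward actors with positive reward."""
    for a in actors:
        factor = 1.0 + lr * rewards.get(a.name, 0.0)
        a.weight = max(1e-6, a.weight * factor)  # keep weights positive
```

In this sketch the reward passed to update_weights stands in for whatever 
signal indicates that an actor's preferences helped the model at the current 
training stage; how that signal is measured is left open.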

The second technical line develops the Micro Edit Dataset (MED) for fine-grained
multimodal supervision. MED uses controlled local image edits to construct 
original-edited image pairs, aligned captions, concise difference descriptions, 
and diagnostic benchmark questions. Its construction pipeline filters editable 
source images, proposes category-aware edits, applies localized image editing, 
filters pairs by visual similarity, aligns original and edited descriptions, and 
verifies whether the final supervision is faithful to the visible change. 
Building on this resource, the chapter studies consistency-aware supervised 
fine-tuning, which encourages the original and edited images of a pair to remain 
close in representation space while preserving the semantic factor that changes 
the answer. MED turns 
minimal visual change into an explicit training and evaluation signal, making 
hidden grounding failures easier to observe and correct.

The third technical line develops Visual Token Scaling with Verification (VTS), 
an adaptive inference framework for multi-step visual reasoning. Instead of 
forcing the model to answer from a fixed initial visual summary, VTS formulates 
reasoning as sequential evidence acquisition. A tool-using reasoner proposes 
plans and visual actions, while a learned verifier judges whether the new 
observation improves the reasoning state and whether another step is warranted. 
The associated VTS-SFT and VTS-DPO datasets provide supervision for executable 
reasoning traces and verifier preferences. This design makes inference both more 
adaptive and more inspectable: the model can gather more evidence when needed, 
but it is also encouraged to stop when further computation is no longer useful.
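
The control flow described above can be sketched as a simple loop. This is only 
a sketch under stated assumptions: the State type, the Reasoner, Tool, and 
Verifier callables, the max_steps cap, and the function name run_vts are all 
hypothetical stand-ins, not the interfaces used in the dissertation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """Reasoning state: the question plus the evidence accepted so far."""
    question: str
    observations: List[str] = field(default_factory=list)


# Hypothetical interfaces, assumed for illustration only.
Reasoner = Callable[[State], str]   # proposes the next visual action
Tool = Callable[[str], str]         # executes an action, returns an observation
# Returns (accept this observation?, take another step?).
Verifier = Callable[[State, str], Tuple[bool, bool]]


def run_vts(question: str, reasoner: Reasoner, tool: Tool,
            verifier: Verifier, max_steps: int = 8) -> State:
    """Acquire evidence step by step until the verifier says to stop."""
    state = State(question)
    for _ in range(max_steps):
        action = reasoner(state)
        observation = tool(action)
        accept, keep_going = verifier(state, observation)
        if accept:
            state.observations.append(observation)
        if not keep_going:
            break
    return state
```

Separating the accept decision from the continue decision mirrors the two 
judgments attributed to the verifier: whether the new observation improves the 
reasoning state, and whether another step is warranted.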

Together, these lines of work support a unified view of reliability as an 
evidence lifecycle. Multi-actor data selection studies which pretraining tokens 
should shape the base model. MED studies which visual distinctions should be 
made explicit during multimodal learning. VTS studies when additional visual 
evidence should be acquired at inference time. The dissertation therefore argues 
that fine-grained multimodal reasoning is not solved by scale alone. It improves 
when data selection, supervision design, objective design, inference control, 
and evaluation are developed as coordinated parts of a reliability pipeline.


Date:                   Friday, 3 July 2026

Time:                   10:00am - 12:00noon

Venue:                  Room 3494
                        Lifts 25-26

Chairman:               Dr. Xiaojiang XIE (CHEM)

Committee Members:      Dr. Binhang YUAN (Supervisor)
                        Dr. Chaojian LI
                        Dr. Shuai WANG
                        Dr. Wenhan LUO (AMC)
                        Prof. Jiawei JIANG (Wuhan University)