Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large Language Models from a Data-Centric Perspective
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Towards Reliable Fine-Grained Visual Reasoning in Multimodal Large
Language Models from a Data-Centric Perspective"
By
Miss Tianyi BAI
Abstract:
Multimodal large language models (MLLMs) have advanced rapidly in image-grounded
dialogue, visual question answering, document understanding, and general
multimodal assistance. However, they remain unreliable when correctness depends
on subtle visual evidence. A small object insertion, a local attribute change,
a shifted spatial relation, or an overlooked counting difference can alter the
correct answer while leaving the overall scene nearly unchanged. In such cases,
models may still produce fluent and plausible responses, but the responses are
not reliably grounded in the decisive evidence. This dissertation studies this
failure mode from a data-centric perspective. Its central claim is that reliable
fine-grained visual reasoning requires careful control over how evidence enters
the model during pretraining, how it is isolated during multimodal supervision,
and how it is acquired during inference.
The first technical line develops multi-actor collaborative data selection for
efficient language model pretraining. Existing data-selection methods often rely
on a single criterion, such as document quality, domain mixture, topic diversity,
or online model influence. These criteria are individually useful but can
conflict: high-quality data may be topically narrow, diverse data may be noisy,
and data that look useful offline may not be the most beneficial for the model
at a particular training stage. The proposed framework treats these criteria as
specialized actors and uses an actor console to coordinate their preferences
dynamically. Experiments show that multi-actor collaboration improves data
efficiency, convergence behavior, and downstream performance compared with
single-criterion selection and with random selection under larger token budgets. This chapter
establishes the first thesis principle: reliable systems should begin by
selecting the data that most efficiently shapes the base model.
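
As a rough illustration of the idea of coordinating several selection criteria, the sketch below combines per-actor scores with adjustable weights. It is not the framework from the thesis: the two scoring proxies, the weighted-sum rule, the reward-based weight update, and every name (quality_actor, diversity_actor, select, update_weights) are assumptions made only for this example.

```python
from typing import Callable, Dict, List

# An "actor" here is any function that scores a document in [0, 1].
Actor = Callable[[str], float]


def quality_actor(doc: str) -> float:
    # Hypothetical proxy: longer documents score higher, capped at 1.0.
    return min(len(doc.split()) / 10.0, 1.0)


def diversity_actor(doc: str) -> float:
    # Hypothetical proxy: fraction of distinct words in the document.
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0


def select(docs: List[str], actors: Dict[str, Actor],
           weights: Dict[str, float], k: int) -> List[str]:
    # Assumed coordination rule: rank by the weighted sum of actor scores.
    def combined(doc: str) -> float:
        return sum(weights[name] * actor(doc) for name, actor in actors.items())

    return sorted(docs, key=combined, reverse=True)[:k]


def update_weights(weights: Dict[str, float], rewards: Dict[str, float],
                   lr: float = 0.5) -> Dict[str, float]:
    # Assumed update: shift weight towards rewarded actors, then renormalize.
    raw = {n: max(w + lr * rewards.get(n, 0.0), 1e-6) for n, w in weights.items()}
    total = sum(raw.values())
    return {n: v / total for n, v in raw.items()}


docs = [
    "a a a a a a a a a a",
    "the cat sat",
    "one two three four five six seven eight nine ten",
]
actors = {"quality": quality_actor, "diversity": diversity_actor}
weights = {"quality": 0.5, "diversity": 0.5}

picked = select(docs, actors, weights, k=2)
# Pretend the current training stage benefited most from diverse data.
new_weights = update_weights(weights, {"diversity": 1.0})
```

With equal weights, the repetitive first document is ranked last even though the length-based quality proxy rates it highly, which is the kind of conflict between criteria the paragraph describes. Re-running update_weights during training is what would make the coordination dynamic in this toy setting.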
The second technical line develops the Micro Edit Dataset (MED) for fine-grained
multimodal supervision. MED uses controlled local image edits to construct
original-edited image pairs, aligned captions, concise difference descriptions,
and diagnostic benchmark questions. Its construction pipeline filters editable
source images, proposes category-aware edits, applies localized image editing,
filters pairs by visual similarity, aligns original and edited descriptions, and
verifies whether the final supervision is faithful to the visible change.
Building on this resource, the chapter studies consistency-aware supervised
fine-tuning, which encourages nearby images to remain close in representation
space while preserving the semantic factor that changes the answer. MED turns
minimal visual change into an explicit training and evaluation signal, making
hidden grounding failures easier to observe and correct.
The third technical line develops Visual Token Scaling with Verification (VTS),
an adaptive inference framework for multi-step visual reasoning. Instead of
forcing the model to answer from a fixed initial visual summary, VTS formulates
reasoning as sequential evidence acquisition. A tool-using reasoner proposes
plans and visual actions, while a learned verifier judges whether the new
observation improves the reasoning state and whether another step is warranted.
The associated VTS-SFT and VTS-DPO datasets provide supervision for executable
reasoning traces and verifier preferences. This design makes inference both more
adaptive and more inspectable: the model can gather more evidence when needed,
but it is also encouraged to stop when further computation is no longer useful.
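
The loop of proposing an action, observing its result, and letting a verifier decide whether to continue can be sketched as follows. This is a minimal sketch under stated assumptions, not the VTS implementation: the function acquire_evidence, its parameters, the single gain threshold that merges the two verifier judgments, and the toy reasoner, tool, and verifier are all hypothetical.

```python
from typing import Callable, List


def acquire_evidence(
    propose: Callable[[List[str]], str],        # reasoner: evidence so far -> next action
    observe: Callable[[str], str],              # tool: action -> new observation
    verify: Callable[[List[str], str], float],  # verifier: estimated gain of the observation
    max_steps: int = 5,
    min_gain: float = 0.1,
) -> List[str]:
    evidence: List[str] = []
    for _ in range(max_steps):
        action = propose(evidence)
        observation = observe(action)
        gain = verify(evidence, observation)
        if gain < min_gain:
            # The verifier judges the step unhelpful, so stop acquiring evidence.
            break
        evidence.append(observation)
    return evidence


# Toy stand-ins, invented for this example only.
regions = {
    "crop region 0": "red cup on table",
    "crop region 1": "cup count: 2",
}


def toy_propose(evidence: List[str]) -> str:
    return f"crop region {len(evidence)}"


def toy_observe(action: str) -> str:
    return regions.get(action, "nothing new")


def toy_verify(evidence: List[str], observation: str) -> float:
    is_new = observation != "nothing new" and observation not in evidence
    return 1.0 if is_new else 0.0


trace = acquire_evidence(toy_propose, toy_observe, toy_verify)
```

In this toy run the loop stops on the third step, when the tool returns nothing new, even though the step budget allows five. The returned list doubles as an inspectable record of which observations were accepted.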
Together, these lines of work support a unified view of reliability as an
evidence lifecycle. Multi-actor data selection studies which pretraining tokens
should shape the base model. MED studies which visual distinctions should be
made explicit during multimodal learning. VTS studies when additional visual
evidence should be acquired at inference time. The dissertation therefore argues
that fine-grained multimodal reasoning is not solved by scale alone. It improves
when data selection, supervision design, objective design, inference control,
and evaluation are developed as coordinated parts of a reliability pipeline.
Date: Friday, 3 July 2026
Time: 10:00am - 12:00noon
Venue: Room 3494
Lifts 25-26
Chairman: Dr. Xiaojiang XIE (CHEM)
Committee Members: Dr. Binhang YUAN (Supervisor)
Dr. Chaojian LI
Dr. Shuai WANG
Dr. Wenhan LUO (AMC)
Prof. Jiawei JIANG (Wuhan University)