Evaluation-Driven Intelligence: From Visual Captioning Metrics to Agentic Workflow Optimization
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
MPhil Thesis Defence
Title: "Evaluation-Driven Intelligence: From Visual Captioning Metrics to
Agentic Workflow Optimization"
By
Mr. Tony Cheng TONG (former name: Zheng TANG)
Abstract:
This thesis advances a unifying paradigm of evaluation-driven
intelligence for multimodal systems: robust evaluation is not merely a
way to measure vision-language models (VLMs), but a signal that can
train and continually improve agentic pipelines. We operationalize this
paradigm in two complementary contributions.
First, we introduce G-VEval, a chain-of-thought (CoT) multimodal evaluator
for image and video captioning that aligns closely with human judgments.
G-VEval supports reference-free, reference-only, and combined modes, produces
calibrated scalar scores together with natural-language rationales, and
demonstrates substantially higher correlation with human evaluation than
traditional n-gram and embedding metrics. By making criteria such as
accuracy, completeness, conciseness, and relevance explicit in its reasoning,
G-VEval yields interpretable, trustworthy assessments of vision
understanding.
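
As a rough illustration of what "correlation with human evaluation" means for a captioning metric, the sketch below computes a Spearman rank correlation between metric scores and human ratings. It is a minimal sketch, not the thesis's evaluation protocol: the function names and the toy numbers are assumptions made up for this example, and the abstract does not say which correlation statistic the thesis reports.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Toy, made-up numbers: human ratings and one metric's scores for 5 captions.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
rho = spearman(human_ratings, metric_scores)
```

A metric that ranks captions the way human raters do has a correlation near 1, and one that ranks them unrelatedly has a correlation near 0.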
Second, we present VisionGrad, a semantic backpropagation framework that
treats an agentic vision-language workflow as a trainable program whose
"parameters" are the prompts of its constituent modules. VisionGrad uses
evaluator feedback, in the form of CoT rationales and scores, as
semantic gradients to propose targeted prompt edits via a dual-LLM trainer
(global critic, local editors). Without updating model weights, VisionGrad
consistently improves visual question answering (VQA) accuracy, transferring
learned prompt policies across model scales and achieving state-of-the-art
results (e.g., up to 84.6% on knowledge-intensive VQA), while smaller
backbones gain double-digit absolute points.
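
The loop described above (evaluator feedback, a global critic, local prompt editors, no weight updates) can be sketched roughly as follows. This is a hypothetical illustration only: `optimize_prompts`, `Feedback`, the toy evaluator, critic, and editor, and the accept-only-if-the-score-improves rule are all assumptions made for this example and are not taken from the thesis. In a real system each stub would wrap a call to a language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Prompts = Dict[str, str]  # module name -> prompt text (the "parameters")


@dataclass
class Feedback:
    score: float    # scalar score from the evaluator
    rationale: str  # natural-language rationale (the "semantic gradient")


def optimize_prompts(
    prompts: Prompts,
    evaluate: Callable[[Prompts], Feedback],
    critic: Callable[[Prompts, Feedback], Dict[str, str]],
    editors: Dict[str, Callable[[str, str], str]],
    steps: int = 3,
) -> Tuple[Prompts, float]:
    """Iteratively edit prompts; keep an edit only if the score improves."""
    best, best_fb = dict(prompts), evaluate(prompts)
    for _ in range(steps):
        # Global critic: turn evaluator feedback into per-module critiques.
        critiques = critic(best, best_fb)
        candidate = dict(best)
        # Local editors: each module's prompt is rewritten from its critique.
        for name, critique in critiques.items():
            candidate[name] = editors[name](candidate[name], critique)
        fb = evaluate(candidate)
        if fb.score > best_fb.score:  # assumed acceptance rule
            best, best_fb = candidate, fb
    return best, best_fb.score


# Toy stand-ins (hypothetical): reward prompts that ask for grounding.
HINT = "cite the image region"


def toy_evaluate(prompts: Prompts) -> Feedback:
    missing = [n for n, p in prompts.items() if HINT not in p]
    score = 1.0 - len(missing) / len(prompts)
    rationale = ("no grounding instruction in: " + ", ".join(missing)) if missing else "ok"
    return Feedback(score, rationale)


def toy_critic(prompts: Prompts, fb: Feedback) -> Dict[str, str]:
    return {n: "add a grounding instruction" for n in prompts if n in fb.rationale}


def toy_editor(prompt: str, critique: str) -> str:
    return prompt + " Always cite the image region you rely on."


start = {"perceiver": "Describe the image.", "answerer": "Answer the question."}
tuned, score = optimize_prompts(
    start, toy_evaluate, toy_critic, {n: toy_editor for n in start}
)
```

In this sketch only the prompt strings change between iterations, which mirrors the claim that accuracy improves without updating model weights.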
Together, G-VEval and VisionGrad close the loop from measuring to optimizing
multimodal cognition. The resulting framework provides a general, modular,
and interpretable route to cultivate foundational vision understanding and
its extension to higher-level vision reasoning in multi-agent systems.
Date: Wednesday, 8 October 2025
Time: 9:30am – 11:30am
Venue: Room 5501
Lifts 25-26
Chairman: Dr. Dan XU
Committee Members: Prof. Dit-Yan YEUNG (Supervisor)
Prof. Raymond WONG