Evaluation-Driven Intelligence: From Visual Captioning Metrics to Agentic Workflow Optimization

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Evaluation-Driven Intelligence: From Visual Captioning Metrics to
Agentic Workflow Optimization"

By

Mr. Tony Cheng TONG (former name: Zheng TANG)


Abstract:

This thesis advances a unifying paradigm of evaluation-driven
intelligence for multimodal systems: robust evaluation is not merely a
way to measure vision-language models (VLMs), but a signal that can
train and continually improve agentic pipelines. We operationalize this
paradigm in two complementary contributions.

First, we introduce G-VEval, a chain-of-thought (CoT) multimodal evaluator
for image and video captioning that aligns closely with human judgments.
G-VEval supports reference-free, reference-only, and combined evaluation
modes, produces calibrated scalar scores together with natural-language
rationales, and correlates substantially better with human evaluation than
traditional n-gram and embedding-based metrics. By making criteria such as
accuracy, completeness, conciseness, and relevance explicit in its reasoning,
G-VEval yields interpretable, trustworthy assessments of vision
understanding.
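
To make the three evaluation modes concrete, the following minimal Python
sketch shows how such an evaluation prompt might be assembled. The function
name, criteria wording, and 0-100 scoring scale are illustrative assumptions
rather than the thesis implementation, and the multimodal LLM call itself is
left abstract.

    # Minimal sketch of a G-VEval-style evaluation prompt (illustrative, not thesis code).
    # Modes: reference-free (visual only), reference-only (text only), combined (both).

    CRITERIA = "accuracy, completeness, conciseness, and relevance"

    def build_eval_prompt(candidate: str,
                          reference: str | None = None,
                          use_visual_input: bool = True) -> str:
        """Compose a chain-of-thought evaluation prompt for a multimodal LLM judge."""
        parts = [
            f"Evaluate the candidate caption for {CRITERIA}.",
            "Reason step by step, then give a single score from 0 to 100.",
        ]
        if use_visual_input:
            parts.append("Ground your judgement in the attached image or video frames.")
        if reference is not None:
            parts.append(f"Reference caption: {reference}")
        parts.append(f"Candidate caption: {candidate}")
        return "\n".join(parts)

    # Combined mode: the prompt is sent, with the visual input, to a multimodal LLM,
    # which replies with a natural-language rationale and a scalar score.
    prompt = build_eval_prompt("A dog chases a red ball on the beach.",
                               reference="A dog plays fetch by the sea.")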

Second, we present VisionGrad, a semantic backpropagation framework that
treats an agentic vision-language workflow as a trainable program whose
"parameters" are the prompts of its constituent modules. VisionGrad uses
evaluator feedback, in the form of CoT rationales and scores, as semantic
gradients to propose targeted prompt edits via a dual-LLM trainer (a global
critic and local editors). Without updating any model weights, VisionGrad
consistently improves visual question answering (VQA) accuracy, transfers
learned prompt policies across model scales, and achieves state-of-the-art
results (e.g., up to 84.6% on knowledge-intensive VQA), while smaller
backbones gain double-digit absolute improvements.
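
The sketch below illustrates, under stated assumptions, what such a
semantic-backpropagation loop could look like: module prompts act as the
trainable parameters, evaluator rationales act as the "semantic gradient",
a global critic summarizes failures, and local editors revise each prompt.
All names and the loop structure are hypothetical placeholders, not the
thesis implementation.

    # Illustrative sketch of a VisionGrad-style prompt-optimization loop (assumed
    # structure and names, not thesis code). The workflow, evaluator, critic, and
    # editor are injected callables so the loop itself stays self-contained.

    def optimize_prompts(prompts, batch, workflow, evaluator, critic, editor, steps=5):
        """Treat module prompts as trainable parameters of the agentic workflow.

        prompts   : dict mapping module name -> prompt string
        batch     : list of training examples (e.g. VQA question/answer pairs)
        workflow  : fn(prompts, example) -> predicted answer
        evaluator : fn(answer, example) -> (score, CoT rationale)
        critic    : fn(list of rationales) -> global diagnosis string
        editor    : fn(module name, prompt, diagnosis) -> revised prompt string
        """
        def batch_score(p):
            results = [evaluator(workflow(p, ex), ex) for ex in batch]
            scores, rationales = zip(*results)
            return sum(scores) / len(scores), list(rationales)

        best_score, rationales = batch_score(prompts)
        for _ in range(steps):
            diagnosis = critic(rationales)                 # "backward pass": global feedback
            candidate = {name: editor(name, p, diagnosis)  # targeted local prompt edits
                         for name, p in prompts.items()}
            score, cand_rationales = batch_score(candidate)
            if score > best_score:                         # keep edits only if they help
                prompts, best_score, rationales = candidate, score, cand_rationales
        return prompts, best_score

The accept-if-improved check reflects the property stated in the abstract:
no model weights change between iterations, only the textual prompts do.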

Together, G-VEval and VisionGrad close the loop from measuring to optimizing 
multimodal cognition. The resulting framework provides a general, modular, 
and interpretable route to cultivate foundational vision understanding and 
its extension to higher-level vision reasoning in multi-agent systems.


Date:                   Wednesday, 8 October 2025

Time:                   9:30am – 11:30am

Venue:                  Room 5501
                        Lifts 25-26

Chairman:               Dr. Dan XU

Committee Members:      Prof. Dit-Yan YEUNG (Supervisor)
                        Prof. Raymond WONG