Evaluation-Driven Intelligence: From Visual Captioning Metrics to Agentic Workflow Optimization
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

MPhil Thesis Defence

Title: "Evaluation-Driven Intelligence: From Visual Captioning Metrics to Agentic Workflow Optimization"

By

Mr. Tony Cheng TONG (former name: Zheng TANG)

Abstract:

This thesis advances a unifying paradigm of evaluation-driven intelligence for multimodal systems: robust evaluation is not merely a way to measure vision-language models (VLMs), but a signal that can train and continually improve agentic pipelines. We operationalize this paradigm in two complementary contributions.

First, we introduce G-VEval, a chain-of-thought (CoT) multimodal evaluator for image and video captioning that aligns closely with human judgments. G-VEval supports reference-free, reference-only, and combined modes, produces calibrated scalar scores together with natural-language rationales, and demonstrates substantially higher correlation with human evaluation than traditional n-gram and embedding metrics. By making criteria such as accuracy, completeness, conciseness, and relevance explicit in its reasoning, G-VEval yields interpretable, trustworthy assessments of vision understanding.

Second, we present VisionGrad, a semantic backpropagation framework that treats an agentic vision-language workflow as a trainable program whose "parameters" are the prompts of its constituent modules. VisionGrad uses evaluator feedback, in the form of CoT rationales and scores, as semantic gradients to propose targeted prompt edits via a dual-LLM trainer (global critic, local editors). Without updating model weights, VisionGrad consistently improves visual question answering (VQA) accuracy, transferring learned prompt policies across model scales and achieving state-of-the-art results (e.g., up to 84.6% on knowledge-intensive VQA), while smaller backbones gain double-digit absolute points.

Together, G-VEval and VisionGrad close the loop from measuring to optimizing multimodal cognition.
The resulting framework provides a general, modular, and interpretable route to cultivate foundational vision understanding and its extension to higher-level vision reasoning in multi-agent systems.

Date: Wednesday, 8 October 2025
Time: 9:30am – 11:30am
Venue: Room 5501, Lifts 25-26

Chairman: Dr. Dan XU
Committee Members:
Prof. Dit-Yan YEUNG (Supervisor)
Prof. Raymond WONG