From Intent to Action: Modeling Tool-Using Capabilities in Language Agents
PhD Qualifying Examination
Title: "From Intent to Action: Modeling Tool-Using Capabilities in Language
Agents"
by
Mr. Zhaoyang LIU
Abstract:
Recent advancements in Vision-Language Models (VLMs) have rapidly expanded
their applicability beyond traditional static language generation, fostering
significant progress in multimodal interaction tasks. This survey highlights
the emergence and importance of multimodal tool use in Large Language Models
(LLMs), with particular emphasis on its specialized vertical application in
Computer Use Agents (CUAs).
Our survey begins by comprehensively reviewing the state-of-the-art in
multimodal tool-augmented LLMs. We categorize existing approaches into
training-free and instruction-tuned methods. Training-free methods primarily
leverage prompt engineering and in-context learning, demonstrating strong
capabilities in simpler multimodal scenarios but frequently suffering from
inaccurate tool invocation in complex settings. In contrast,
instruction-tuned methods such as GPT4Tools, ModelScope-Agent, and ControlLLM
rely on fine-tuned instruction-following capabilities, significantly
enhancing precision in multimodal tool usage. Notably, ControlLLM introduces
the Thoughts-on-Graph (ToG) paradigm, which explicitly models tool
dependencies and thereby systematically mitigates tool-selection inaccuracies
and execution inefficiencies.
Furthermore, this survey extensively examines Computer Use Agents, which
automate interactions with Graphical User Interfaces (GUIs). Despite recent
advances from proprietary systems such as OpenAI's Operator and open-source
initiatives such as UI-TARS, CUAs continue to face substantial challenges
stemming from data scarcity, platform heterogeneity, and generalization
difficulties. Our investigation addresses these limitations by developing a
novel cross-platform interactive data pipeline that integrates automated
agent exploration with expert-curated annotations. Additionally, we propose a
unified action space and train robust base agents, significantly advancing
performance on GUI grounding, sequential decision-making, and end-to-end task
completion across multiple benchmarks.
Overall, this survey identifies key challenges and promising solutions in
multimodal tool-augmented LLMs and CUAs, advocating a closer synergy among
multimodal perception, understanding, and sequential planning to enhance the
generalizability and effectiveness of future tool-using agents.
Date: Thursday, 24 July 2025
Time: 10:00am - 11:00am
Venue: Room 3494
Lifts 25/26
Committee Members: Dr. Qifeng Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Junxian He