From Intent to Action: Modeling Tool-Using Capabilities in Language Agents

PhD Qualifying Examination


Title: "From Intent to Action: Modeling Tool-Using Capabilities in Language 
Agents"

by

Mr. Zhaoyang LIU


Abstract:

Recent advancements in Vision-Language Models (VLMs) have rapidly expanded 
their applicability beyond traditional static language generation, fostering 
significant progress in multimodal interaction tasks. This survey highlights 
the emergence and importance of multimodal tool use in Large Language Models 
(LLMs), with particular emphasis on its specialized vertical application, 
Computer Use Agents (CUAs).

Our survey begins by comprehensively reviewing the state-of-the-art in 
multimodal tool-augmented LLMs. We categorize existing approaches into 
training-free and instruction-tuned methods. Training-free methods primarily 
leverage prompt engineering and in-context learning, demonstrating strong 
capabilities in simpler multimodal scenarios but frequently suffering from 
inaccurate tool invocation in complex settings. In contrast, 
instruction-tuned methods such as GPT4Tools, ModelScope-Agent, and 
ControlLLM rely on fine-tuned instruction following, substantially improving 
the precision of multimodal tool invocation. Notably, ControlLLM introduces 
the Thoughts-on-Graph (ToG) paradigm, which explicitly models tool 
dependencies on a graph, thereby systematically mitigating inaccurate tool 
selection and inefficient execution.

Furthermore, this survey extensively examines Computer Use Agents, which 
automate interactions with Graphical User Interfaces (GUIs). Despite recent 
progress from proprietary systems such as OpenAI's Operator and open-source 
initiatives such as UI-TARS, CUAs still face substantial challenges stemming 
from data scarcity, platform heterogeneity, and limited generalization. Our 
investigation addresses these limitations by constructing a novel 
cross-platform interactive data pipeline that integrates automated agent 
exploration with expert-curated annotations. Additionally, we propose a 
unified action space and train robust base agents, significantly advancing 
performance on GUI grounding, sequential decision-making, and end-to-end 
task completion across multiple benchmarks.

Overall, this survey identifies key challenges and promising solutions in 
multimodal tool-augmented LLMs and CUAs, advocating for a closer synergy 
between multimodal perception, understanding, and sequential planning to 
enhance the generalizability and effectiveness of future tool-using agents.


Date:                   Thursday, 24 July 2025

Time:                   10:00am - 11:00am

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Dr. Qifeng Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Junxian He