From Intent to Action: Modeling Tool-Using Capabilities in Language Agents
PhD Qualifying Examination

Title: "From Intent to Action: Modeling Tool-Using Capabilities in Language Agents"

by

Mr. Zhaoyang LIU

Abstract:

Recent advancements in Vision-Language Models (VLMs) have rapidly expanded their applicability beyond traditional static language generation, fostering significant progress in multimodal interaction tasks. This survey highlights the emergence and importance of multimodal tool use in Large Language Models (LLMs), particularly emphasizing its specialized vertical application in Computer Use Agents (CUAs). Our survey begins by comprehensively reviewing the state of the art in multimodal tool-augmented LLMs. We categorize existing approaches into training-free and instruction-tuned methods. Training-free methods primarily leverage prompt engineering and in-context learning; they perform well in simpler multimodal scenarios but frequently suffer from inaccurate tool invocation in complex settings. In contrast, instruction-tuned methods such as GPT4Tools, ModelScope-Agent, and ControlLLM rely on fine-tuned instruction-following capabilities, significantly improving the precision of multimodal tool usage. Notably, ControlLLM introduces the Thoughts-on-Graph (ToG) paradigm, which explicitly models tool dependencies and thereby systematically mitigates tool-selection inaccuracies and execution inefficiencies.

Furthermore, this survey extensively examines Computer Use Agents, which automate interactions with Graphical User Interfaces (GUIs). Despite recent progress from proprietary systems such as OpenAI's Operator and open-source initiatives such as UI-TARS, CUAs continue to face substantial challenges stemming from data scarcity, platform heterogeneity, and generalization difficulties. Our investigation addresses these limitations by building a novel cross-platform interactive data pipeline that integrates automated agent exploration with expert-curated annotations. Additionally, we propose a unified action space and train robust base agents, significantly advancing performance on GUI grounding, sequential decision-making, and end-to-end task completion across multiple benchmarks. Overall, this survey identifies key challenges and promising solutions in multimodal tool-augmented LLMs and CUAs, advocating a closer synergy between multimodal perception, understanding, and sequential planning to enhance the generalizability and effectiveness of future tool-using agents.

Date: Thursday, 24 July 2025
Time: 10:00am - 11:00am
Venue: Room 3494, Lifts 25/26

Committee Members:
Dr. Qifeng Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Junxian He