Towards Data Agents at Scale: A Survey of Benchmarking, Workflow Design, and Serving Infrastructure

PhD Qualifying Examination


Title: "Towards Data Agents at Scale: A Survey of Benchmarking, Workflow
Design, and Serving Infrastructure"

by

Mr. You PENG


Abstract:

Data agents are emerging as a new class of intelligent systems that interact
with structured, semi-structured, and unstructured data through natural
language. Early work focused on semantic parsing, table question answering,
and Text-to-SQL, aiming to retrieve correct answers or generate executable
queries over databases. Recent advances in large language models have
substantially broadened this paradigm: modern data agents can decompose
requests into multi-step workflows, invoke external tools such as SQL engines
and code interpreters, interact with users over multiple turns, and produce
higher-level analytical output.

This survey reviews the development of data agents from early structured-data
interfaces to contemporary agentic systems and the serving infrastructure
required to deploy them at scale. It organizes the literature around three
themes: benchmarking, workflow design, and serving infrastructure. We
argue that the field is moving from one-shot factual query answering toward
mixed-initiative analytical assistance, where success depends not only on
correctness but also on insightfulness, robustness, latency, and cost.

This survey highlights three central gaps that define promising future
research directions. First, current benchmarks emphasize factual correctness
more than insight generation and data storytelling. Second, current
workflows only partially support multi-turn exploration, persistent memory,
and iterative collaboration with human experts. Third, current serving
systems still optimize individual model inference more often than end-to-end
execution of tool-using workflows. By synthesizing these threads within a
unified framework, this survey clarifies the design space of data agents and
identifies the technical foundations needed to build reliable and scalable
systems in real-world settings.


Date:                   Thursday, 23 April 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2131B
                        Lift 22

Committee Members:      Dr. Binhang Yuan (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Prof. Ke Yi