PhD Qualifying Examination


Title: "Towards Precise and Controllable Co-creation: Multi-modal Interaction in
Visual Generative Models"

by

Mr. Zichen LIU


Abstract:

Visual generative models have made significant progress, yet text prompts
alone often fail to convey complex user intent, creating an "Intention Gap."
To address this, the field is shifting from text-only generation to
multi-modal Human-AI co-creation. This survey provides a comprehensive review
of controllable visual generation and interaction paradigms. We first present
a taxonomy of explicit control mechanisms, including spatial and structural
constraints for images and temporal and dynamic guidance for videos. Next, we
explore implicit high-level control, where Multi-modal Large Language Models
(MLLMs) and agentic workflows translate ambiguous user inputs into precise
generation plans. We then examine these advances from a Human-Computer
Interaction (HCI) perspective, discussing how layered architectures and
mixed-initiative systems can reduce users' cognitive load and restore
creative agency. Finally, we outline open challenges and future directions
toward unified spatiotemporal simulators and human-aligned evaluation.


Date:                   Wednesday, 1 April 2026

Time:                   3:00pm - 5:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Dr. Qifeng Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Prof. Raymond Wong