PhD Qualifying Examination
Title: "A Survey of Unified Multimodal Models for Text and Image
Understanding and Generation"
by
Mr. Zian QIAN
Abstract:
A pronounced schism exists between the architectural foundations of
multimodal understanding and image generation. The former is predominantly
governed by autoregressive transformers, whereas the latter is almost
exclusively built upon diffusion frameworks. Although both domains have
achieved substantial individual progress, their separate development
trajectories have created a significant technical barrier. Consequently,
the pursuit of a unified model that seamlessly integrates both capabilities
represents a major frontier in artificial intelligence, exemplified by
systems such as GPT-4o. The core challenge in this integration effort is
reconciling the inherent architectural discrepancies between these paradigms.
In this paper, we present a comprehensive survey of unified multimodal
models for the understanding and generation of text and images. We begin by
establishing the essential theoretical groundwork and surveying the latest
developments in multimodal understanding and generative image modeling. We
then provide an overview of early approaches to unified multimodal models.
Afterwards, we classify the state-of-the-art approaches into two primary
paradigms, series-connected and hybrid-parallel methods, followed by a
comparative analysis of their respective strengths and
limitations. Finally, we discuss the remaining challenges in this area and
suggest potential avenues for future research.
Date: Tuesday, 14 October 2025
Time: 2:00pm - 4:00pm
Venue: Room 4472
Lifts 25/26
Committee Members: Dr. Qifeng Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Junxian He