PhD Qualifying Examination


Title: "A Survey of Unified Multimodal Models for Text and Image 
Understanding and Generation"

by

Mr. Zian QIAN


Abstract:

A pronounced schism exists between the architectural foundations of 
multimodal understanding and image generation. The former is predominantly 
governed by autoregressive transformers, whereas the latter is almost 
exclusively built upon diffusion frameworks. Although both domains have 
achieved substantial individual progress, their separate development 
trajectories have created a significant technical barrier to unification. 
Consequently, 
the pursuit of a unified model that seamlessly integrates both capabilities 
represents a major frontier in artificial intelligence, exemplified by 
systems such as GPT-4o. The core challenge in this integration effort is 
reconciling the inherent architectural discrepancies between these paradigms. 
In this paper, we present a comprehensive survey of unified multimodal 
models for text and image understanding and generation. We begin by 
establishing the essential theoretical groundwork and surveying the latest 
developments in multimodal understanding and generative image modeling. We 
then provide an overview of early approaches to unified multimodal models. 
Next, we classify state-of-the-art approaches into two primary paradigms, 
series-connected and hybrid parallel methods, and provide a comparative 
analysis of their respective strengths and limitations. Finally, we discuss 
the remaining challenges in this area and suggest potential avenues for 
future research.


Date:                   Tuesday, 14 October 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 4472
                        Lifts 25/26

Committee Members:      Dr. Qifeng Chen (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Dr. Junxian He