A Survey of Unified Multimodal Models for Text and Image Understanding and Generation
PhD Qualifying Examination

Title: "A Survey of Unified Multimodal Models for Text and Image Understanding and Generation"

by

Mr. Zian QIAN

Abstract:

A pronounced schism exists between the architectural foundations of multimodal understanding and image generation. The former is predominantly governed by autoregressive transformers, whereas the latter is almost exclusively built upon diffusion frameworks. Although both domains have achieved substantial individual progress, their separate development trajectories have created a significant technical barrier. Consequently, the pursuit of a unified model that seamlessly integrates both capabilities represents a major frontier in artificial intelligence, exemplified by systems such as GPT-4o. The core challenge in this integration effort is reconciling the inherent architectural discrepancies between these paradigms. In this paper, we present a comprehensive survey of unified multimodal models for the understanding and generation of text and images. We begin by establishing the groundwork of essential theory and surveying the latest developments in multimodal understanding and generative image modeling. We then provide an overview of early approaches to unified multimodal models. Next, we classify the state-of-the-art approaches into two primary paradigms, series-connected and hybrid parallel methods, followed by a comparative analysis of their respective strengths and limitations. Finally, we discuss the remaining challenges in this area and suggest potential avenues for future research.

Date: Tuesday, 14 October 2025
Time: 2:00pm - 4:00pm
Venue: Room 4472 (Lifts 25/26)

Committee Members:
Dr. Qifeng Chen (Supervisor)
Dr. Dan Xu (Chairperson)
Dr. Junxian He