Efficient Training Strategy for Aesthetic Text-to-Image Generation Diffusion Model

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Efficient Training Strategy for Aesthetic Text-to-Image Generation 
Diffusion Model"

By

Mr. Jincheng YU


Abstract:

In this thesis, we address the high training cost of recent large 
text-to-image (T2I) generative models. We propose a three-stage training 
strategy with stage-specific datasets to reduce the training resources and 
time. i) Pixel dependency learning, where our model learns low-level pixel 
dependencies from the ImageNet dataset. This stage focuses on understanding the 
intrinsic pixel relationships in natural images. ii) Text-image alignment 
learning, where our model learns textual concepts from the SAM dataset, whose 
captions are refined by a large vision language model. This stage aims to align 
textual concepts with their visual representations. iii) High-resolution and 
aesthetic image generation, where our model is fine-tuned to generate 
high-resolution and aesthetic images. For this purpose, we utilize an internal 
dataset similar to JourneyDB. When we combine our three-stage training strategy 
with an existing parameter-efficient transformer-based diffusion model, 
experimental results demonstrate that our approach achieves comparable or even 
superior image quality and semantic control compared to the SOTA T2I model 
Stable Diffusion XL, while our training strategy requires only 10.8% of 
its training time.


Date:                   Tuesday, 13 August 2024

Time:                   3:00pm - 5:00pm

Venue:                  Room 5506
                        Lifts 25/26

Chairman:               Dr. Dan XU

Committee Members:      Prof. James KWOK (Supervisor)
                        Dr. Qifeng CHEN