Towards Trustworthy Visual Generative Models: Reliable and Controllable Generation of Diffusion Models

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Towards Trustworthy Visual Generative Models: Reliable and Controllable 
Generation of Diffusion Models"

By

Mr. Sen LI


Abstract:

Visual generative models, especially diffusion models, have demonstrated 
remarkable performance in high-quality visual generation, attracting 
increasing attention in both academia and industry. Representative models and 
tools such as DALLE-3 and MidJourney have been widely used in daily life to 
facilitate the creation of artworks and pictures. However, these powerful 
tools also bring potential risks, since they can be maliciously used to 
generate and disseminate unsafe content such as pornographic and violent 
images, which may have severe consequences. In this thesis, we discuss how to 
make visual generative models more reliable and controllable from different 
aspects. In particular, we focus on diffusion models, as they are the most 
widely used visual generative models.

Firstly, we uncover potential risks in diffusion models, showing that 
invisible (malicious) backdoors can easily be inserted into them during 
training, resulting in unreliable and harmful behaviors. To this end, we 
propose a novel bi-level optimization framework to formulate the backdoor 
training process, which can be instantiated with different algorithms for 
unconditional and conditional diffusion models, respectively. Extensive 
experiments show that backdoors can be effectively inserted without affecting 
the benign performance of the models, making the backdoors both stealthy and 
robust. We also empirically find that various existing defense methods cannot 
mitigate the proposed invisible backdoors, demonstrating their practicality. 
Moreover, the proposed invisible backdoors can be directly applied to model 
watermarking for model ownership verification in the black-box setting, 
further enhancing the significance of the proposed framework.
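
As a rough illustration, a bi-level formulation of this kind of backdoor 
training might be written as follows (a minimal sketch only; the exact 
objectives, trigger parameterization, and notation are not given in this 
abstract):

    \min_{\delta} \; \mathcal{L}_{\mathrm{bd}}\!\left(\theta^{*}(\delta), \delta\right)
    \quad \text{s.t.} \quad
    \theta^{*}(\delta) = \arg\min_{\theta} \;
    \mathcal{L}_{\mathrm{benign}}(\theta) + \lambda \, \mathcal{L}_{\mathrm{bd}}(\theta, \delta)

Here \theta denotes the diffusion-model parameters, \delta the invisible 
trigger, \mathcal{L}_{\mathrm{benign}} the standard denoising objective on 
clean data, \mathcal{L}_{\mathrm{bd}} a loss measuring how reliably the 
trigger activates the target behavior, and \lambda a trade-off weight; these 
symbols are illustrative assumptions rather than the thesis's actual notation.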

Then, we focus on the controllable generation of text-to-image diffusion 
models. We introduce MuLan, a multimodal-LLM agent that progressively 
generates objects given a text prompt. MuLan first decomposes the prompt into 
several sub-prompts, each of which focuses on only one object. Each object is 
then generated conditioned on the previously generated objects. With a VLM 
(Vision-Language Model) checker, MuLan monitors the process after each 
generation stage and adaptively corrects possible mistakes in a timely 
manner. MuLan greatly boosts generation performance in terms of the object 
attributes and spatial relationships specified in text prompts. Extensive 
experiments, evaluated by both GPT-4V and human raters, show the superior 
performance of MuLan. In addition, we show that MuLan enables human-agent 
interaction during generation, further enhancing the flexibility and 
effectiveness of the generation process.
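
To make the described pipeline concrete, the sketch below outlines one 
possible progressive, checker-guided generation loop in Python; all 
interfaces (decompose, revise, generate, check) are hypothetical placeholders 
assumed for illustration and are not MuLan's actual API.

    from typing import Optional, Protocol, Tuple

    class Planner(Protocol):
        """LLM planner: splits a prompt and revises sub-prompts from feedback."""
        def decompose(self, prompt: str) -> list[str]: ...
        def revise(self, sub_prompt: str, feedback: str) -> str: ...

    class T2IModel(Protocol):
        """Text-to-image backbone: generates conditioned on the partial image."""
        def generate(self, sub_prompt: str, condition: Optional[object]) -> object: ...

    class Checker(Protocol):
        """VLM checker: verifies attributes/layout and returns feedback."""
        def check(self, image: object, sub_prompt: str) -> Tuple[bool, str]: ...

    def progressive_generation(prompt: str, llm: Planner, t2i: T2IModel,
                               vlm: Checker, max_retries: int = 3) -> object:
        # Decompose the prompt so each sub-prompt covers a single object.
        sub_prompts = llm.decompose(prompt)
        image: Optional[object] = None
        for sub_prompt in sub_prompts:
            for _ in range(max_retries):
                # Generate the current object conditioned on what exists so far.
                candidate = t2i.generate(sub_prompt, condition=image)
                # Let the VLM checker verify this stage and produce feedback.
                ok, feedback = vlm.check(candidate, sub_prompt)
                if ok:
                    break
                # Adaptively correct the sub-prompt before retrying the stage.
                sub_prompt = llm.revise(sub_prompt, feedback)
            image = candidate
        return image

The loop mirrors the stage-wise structure described above: one object per 
stage, conditioning on the partial result, and a check-and-correct step after 
each stage.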


Date:                   Monday, 19 August 2024

Time:                   12:00 noon - 2:00 pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Dan XU

Committee Members:      Dr. Shuai WANG (Supervisor)
                        Dr. Long CHEN