

2023 International Conference on Machine Learning and Automation IOP Publishing
Journal of Physics: Conference Series 2711 (2024) 012005 doi:10.1088/1742-6596/2711/1/012005

Review: Recent advances for the diffusion model

Yufeng Wei
East China University of Science and Technology, Shanghai, 200237, China

19001782@ecust.edu.cn

Abstract. As generative model technology becomes increasingly popular, more and more researchers have turned to the current state-of-the-art (SOTA) generative model: the diffusion model. This paper reviews the SOTA text-to-image generation models built on the diffusion model since its emergence, including the denoising diffusion probabilistic model (DDPM), the DALL·E 2 model, the Imagen model, the Stable Diffusion model, and the diffusion transformer (DiT) model. The theoretical section reviews the basic principles behind the diffusion model in detail through mathematical derivation, covering both the training process of the model and the mathematics behind the sampling process. Moreover, this paper focuses on the technical characteristics of these models and the improvements made as the models iterated, such as model structure optimization, more efficient and accurate training methods, and the application to diffusion models of other optimization techniques widely used in deep learning. Finally, the technical route of the diffusion model's development is summarized, and some predictions are made.

Keywords: generative model, diffusion model, structure of the model, deep neural network.

1. Introduction
The diffusion model is one of the most popular generative models today, and it has served as the generator for many state-of-the-art (SOTA) generative models published in recent years because of the excellent samples it produces and its comparatively low training difficulty. The diffusion model generates new image instances with high accuracy and efficiency, and it has a wide range of applications, such as upscaling pictures to obtain super-resolution images or producing entirely new pictures and paintings.
In contrast to other deep machine learning models designed for clustering, classification, or regression, a generative model can 'create' new instances that the corpus or dataset does not contain. For generative models, the training process essentially builds an expert knowledge system that supplies enough references for the model to generate samples from other forms of media input, such as keywords or even music. However, large-scale datasets and high-efficiency computing are needed to feed the model and thoroughly train this deep neural network [1].
It is necessary to provide a review of recent advances in the diffusion model because SOTA models with new structures and new uses of the diffusion model keep appearing. This article focuses on how this theory became the most popular model in the image-generation area, from theory to application, and on the advantages and disadvantages of the model.


2. Theory
The diffusion model consists of parameterized Markov chains trained by variational inference (VI). A Markov chain can be described by a matrix of probabilities that expresses the likelihood of particular transitions, where each transition depends only on the current state. A set $\{S_1, S_2, \dots, S_k\}$ can be used to describe $k$ states, so a $k \times k$ matrix can be built to define the transition probabilities among these states. As shown in Figure 1, a $4 \times 4$ matrix is used as an example.

Figure 1. The schematic diagram of Markov matrix.
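As a minimal illustration (a hypothetical matrix, not taken from the paper), such a chain can be simulated in a few lines of Python/NumPy; each row of the matrix holds the probabilities of moving from one state to each of the four states, so every row sums to 1:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 transition matrix: entry P[i, j] is the probability
# of moving from state i to state j, so every row sums to 1.
P = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.3, 0.5, 0.2],
    [0.0, 0.0, 0.4, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)

# Run the chain: the next state depends only on the current one.
state = 0
for _ in range(10):
    state = rng.choice(4, p=P[state])
print("final state:", state)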

The diffusion model can be viewed as a single Markov chain (Figure 2), and the states of the model are defined as $\{x_0, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T\}$. The original image is defined as the state $x_0$, and a known amount of Gaussian noise (whose waveform follows a normal distribution) is then added so that the image becomes blurred, equivalent to a process of increasing information entropy. Noise is added again at each step, with the image of the previous state as the reference, until at the state $x_T$ the entire picture becomes pure noise. Finally, the process is reverse-engineered to remove the noise. The forward recurrence formula (1) for $x_t$ is given below. $\beta_t$ is defined as a monotonically increasing function so that the step from $x_{t-1}$ to $x_t$ adds more noise once there is already noise interference, making the difference between states more obvious.

x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}, \qquad \epsilon \sim \mathcal{N}(\epsilon;\, 0, \mathbf{I})   (1)

\alpha_t = 1 - \beta_t   (2)

\beta_t = (0.02 - 0.0001)\, t/T + 0.0001   (3)

Figure 2. Schematic diagram of the diffusion model as a Markov chain [2].
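To make the forward process concrete, the following minimal Python/NumPy sketch iterates the recurrence (1) with the schedule (2)-(3); the "image" is a stand-in random array:

import numpy as np

rng = np.random.default_rng(0)
T = 1000
t_range = np.arange(1, T + 1)
beta = (0.02 - 0.0001) * t_range / T + 0.0001   # eq. (3)
alpha = 1.0 - beta                              # eq. (2)

x = rng.standard_normal((32, 32))  # stand-in for the original image x_0
for t in range(T):
    eps = rng.standard_normal(x.shape)
    # eq. (1): one noising step of the Markov chain
    x = np.sqrt(alpha[t]) * x + np.sqrt(1.0 - alpha[t]) * eps
# after T steps, x is (approximately) pure Gaussian noise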


Because recursive calculation is inefficient, a general closed-form expression for $x_t$ can be derived. The derivation process is as follows:

x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon^*_{t-1}   (4)

= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon^*_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon^*_{t-1}   (5)

= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t\alpha_{t-1}}\,\epsilon^*_{t-2} + \sqrt{1-\alpha_t}\,\epsilon^*_{t-1}   (6)

= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\left(\sqrt{\alpha_t - \alpha_t\alpha_{t-1}}\right)^2 + \left(\sqrt{1-\alpha_t}\right)^2}\;\epsilon_{t-2}   (7)

= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t\alpha_{t-1} + 1 - \alpha_t}\;\epsilon_{t-2}   (8)

= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t\alpha_{t-1}}\;\epsilon_{t-2}   (9)

= \cdots   (10)

= \sqrt{\prod_{i=1}^{t}\alpha_i}\; x_0 + \sqrt{1 - \prod_{i=1}^{t}\alpha_i}\;\epsilon_0   (11)

= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_0   (12)

\sim \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\,\mathbf{I}\right)   (13)

where $\bar\alpha_t = \prod_{i=1}^{t}\alpha_i$ and the $\epsilon^*$ terms denote independent standard Gaussian samples.

What needs to be explained here is that the sum of the two independent, identically distributed Gaussian noises in (6) is still Gaussian. In (5), $\sqrt{1-\alpha_t}\,\epsilon^*_{t-1}$ can be viewed as a sample from $\mathcal{N}(0, (1-\alpha_t)\mathbf{I})$; similarly, $\sqrt{\alpha_t - \alpha_t\alpha_{t-1}}\,\epsilon^*_{t-2}$ can be viewed as a sample from $\mathcal{N}(0, (\alpha_t - \alpha_t\alpha_{t-1})\mathbf{I})$. Their sum is therefore a sample from $\mathcal{N}(0, (1-\alpha_t + \alpha_t - \alpha_t\alpha_{t-1})\mathbf{I}) = \mathcal{N}(0, (1-\alpha_t\alpha_{t-1})\mathbf{I})$, which can be written with a single standard-normal noise term $\epsilon_{t-2}$ in (7). Repeating this argument yields the general term formula (12), where $\bar\alpha_t$ can be computed from the schedule $\beta_i$ via (2)-(3). Since $\epsilon_0$ is random Gaussian noise, only $x_0$ is needed to generate $x_t$ directly, which is equivalent to obtaining $q(x_t|x_0)$ for any $t$, as the sketch below illustrates.
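Because (12) gives $q(x_t|x_0)$ in closed form, $x_t$ can be sampled in a single step rather than through $t$ recursive applications of (1). A minimal sketch, reusing the schedule from (3):

import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = (0.02 - 0.0001) * np.arange(1, T + 1) / T + 0.0001
alpha_bar = np.cumprod(1.0 - beta)   # \bar{alpha}_t = prod_i alpha_i

def q_sample(x0, t, eps):
    # eq. (12): x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((32, 32))
xt = q_sample(x0, t=500, eps=rng.standard_normal(x0.shape))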
This process of generating $x_t$ is called the diffusion process. In order to train a model, the diffusion model also includes a reverse diffusion process, which predicts $x_0$ from the known $x_T$. To find $x_0$, one first needs to derive $q(x_{t-1}|x_t, x_0)$ from the known $q(x_t|x_0)$ and $q(x_{t-1}|x_0)$, which can be done using Bayes' formula, as follows:
q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)}   (14)

= \frac{\mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)\mathbf{I}\right)\,\mathcal{N}\left(x_{t-1};\ \sqrt{\bar\alpha_{t-1}}\,x_0,\ (1-\bar\alpha_{t-1})\mathbf{I}\right)}{\mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\right)}   (15)

\propto \exp\left(-\left[\frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{2(1-\alpha_t)} + \frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}}\,x_0)^2}{2(1-\bar\alpha_{t-1})} - \frac{(x_t - \sqrt{\bar\alpha_t}\,x_0)^2}{2(1-\bar\alpha_t)}\right]\right)   (16)


= \exp\left(-\frac{1}{2}\left[\frac{-2\sqrt{\alpha_t}\,x_t x_{t-1} + \alpha_t x_{t-1}^2}{1-\alpha_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar\alpha_{t-1}}\,x_{t-1} x_0}{1-\bar\alpha_{t-1}} + C(x_t, x_0)\right]\right)   (17)

\propto \exp\left(-\frac{1}{2}\left[\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar\alpha_{t-1}}\right)x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\,x_0}{1-\bar\alpha_{t-1}}\right)x_{t-1}\right]\right)   (18)

= \exp\left(-\frac{1}{2}\left[\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\,x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\,x_0}{1-\bar\alpha_{t-1}}\right)x_{t-1}\right]\right)   (19)

= \exp\left(-\frac{1}{2}\,\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\left[x_{t-1}^2 - 2\,\frac{\frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\,x_0}{1-\bar\alpha_{t-1}}}{\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}}\,x_{t-1}\right]\right)   (20)

= \exp\left(-\frac{1}{2}\,\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\left[x_{t-1}^2 - 2\,\frac{\left(\frac{\sqrt{\alpha_t}\,x_t}{1-\alpha_t} + \frac{\sqrt{\bar\alpha_{t-1}}\,x_0}{1-\bar\alpha_{t-1}}\right)(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_{t-1}\right]\right)   (21)

= \exp\left(-\frac{1}{2}\,\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\left[x_{t-1}^2 - 2\,\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,x_0}{1-\bar\alpha_t}\,x_{t-1}\right]\right)   (22)

= \exp\left(-\frac{1}{2}\,\frac{1-\bar\alpha_t}{(1-\alpha_t)(1-\bar\alpha_{t-1})}\left[x_{t-1}^2 - 2\,\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,x_0}{1-\bar\alpha_t}\,x_{t-1} + C\right]\right)   (23)

\propto \mathcal{N}\left(x_{t-1};\ \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,x_0}{1-\bar\alpha_t},\ \frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{I}\right)   (24)

\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,x_0}{1-\bar\alpha_t}   (25)

It is noted that $C(x_t, x_0)$ is treated as a constant in (17) because it does not contain $x_{t-1}$. This constant is not ignored; it is what completes the perfect square in (23). From (23), it can be seen that the expression takes the standard form of a normal distribution, so the mean and variance of $x_{t-1}$ can be easily read off. One can then compute the relative information entropy (KL divergence) between the two probability distributions (13) and (24) for the forward and reverse processes. If the reverse-process distribution is parameterized as a normal distribution of the same form, the direction of steepest decrease in KL divergence is the direction of smallest difference between the two probability distributions. Therefore, in theory, the neural network can easily estimate $x_0$; rather than going further into the math here, the author will continue to show how to obtain the parameters for training.
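In code, the posterior parameters (24)-(25) are simple functions of the noise schedule; a minimal sketch, with alpha and alpha_bar defined as in the earlier snippets:

import numpy as np

def q_posterior(xt, x0, t, alpha, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0), eqs. (24)-(25)."""
    mean = (np.sqrt(alpha[t]) * (1 - alpha_bar[t - 1]) * xt
            + np.sqrt(alpha_bar[t - 1]) * (1 - alpha[t]) * x0) / (1 - alpha_bar[t])
    var = (1 - alpha[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])
    return mean, var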


It can be seen in (24) that $x_0$ is an unknown quantity at sampling time and should not appear in the final expression, so the forward result (12) can be slightly transformed to express $x_0$ in terms of $x_t$, as follows:

x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_0}{\sqrt{\bar\alpha_t}}   (26)

Plugging (26) into $\mu_q(x_t, x_0)$ (25) yields the following:

\mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_0}{\sqrt{\bar\alpha_t}}}{1-\bar\alpha_t}   (27)

= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\,x_t + (1-\alpha_t)\,\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_0}{\sqrt{\alpha_t}}}{1-\bar\alpha_t}   (28)

= \left(\frac{\alpha_t(1-\bar\alpha_{t-1}) + (1-\alpha_t)}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\right)x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\,\epsilon_0   (29)

= \frac{1-\bar\alpha_t}{(1-\bar\alpha_t)\sqrt{\alpha_t}}\,x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\,\epsilon_0   (30)

= \frac{1}{\sqrt{\alpha_t}}\,x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\,\epsilon_0   (31)

However, $\epsilon_0$ in (31) is not known at sampling time, so $x_0$ cannot be recovered by direct inversion. In the forward process, $\epsilon_t$ is random noise drawn from a known standard normal distribution; therefore, a simple neural network can be trained to drive the error between the estimator $\hat\epsilon(x_t, t)$ and $\epsilon_t$ toward zero. This yields an estimate $\mu_\theta$ of $\mu_q(x_t, x_0)$. To make the model's estimate more accurate, the time parameter is added as an input to strengthen the model's understanding of the data. Finally, formula (32) is obtained as follows, and Table 1 shows the training process and the sampling process.

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\,x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}\,\sqrt{\alpha_t}}\,\hat\epsilon_\theta(x_t, t)   (32)

Table 1. The diffusion model training and sampling process [3].

Training process:
1: A timestep t is randomly chosen from the states between the original image and pure noise
2: Let \epsilon \sim \mathcal{N}(0, \mathbf{I}) and form x_t from x_0 by (12)
3: Set the loss function Loss = \lVert \epsilon_t - \hat\epsilon(x_t, t) \rVert
4: Train until the loss approaches 0

Sampling process:
1: Start from pure noise x_T \sim \mathcal{N}(0, \mathbf{I}); sample z \sim \mathcal{N}(0, \mathbf{I}) at each step
2: Use the trained estimator \hat\epsilon(x_t, t) to compute \mu_\theta(x_t, t) by (32)
3: Set x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z
4: Repeat until x_0 is obtained
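Both columns of Table 1 can be sketched in a few lines of Python; the noise predictor below is a stand-in for a real trained network, and the choice $\sigma_t = \sqrt{\beta_t}$ is one common option, both assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = (0.02 - 0.0001) * np.arange(1, T + 1) / T + 0.0001
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def eps_hat(xt, t):
    # Stand-in for the trained noise predictor eps_hat_theta(x_t, t).
    return np.zeros_like(xt)

# Training step (left column of Table 1):
x0 = rng.standard_normal((32, 32))
t = rng.integers(1, T)
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps  # eq. (12)
loss = np.mean((eps - eps_hat(xt, t)) ** 2)  # minimized with respect to theta

# Sampling loop (right column of Table 1):
x = rng.standard_normal((32, 32))            # x_T ~ N(0, I)
for t in range(T - 1, 0, -1):
    z = rng.standard_normal(x.shape) if t > 1 else 0.0
    mu = (x / np.sqrt(alpha[t])
          - (1 - alpha[t]) / (np.sqrt(1 - alpha_bar[t]) * np.sqrt(alpha[t]))
          * eps_hat(x, t))                   # eq. (32)
    sigma = np.sqrt(beta[t])                 # one common choice of sigma_t
    x = mu + sigma * z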


3. Model

3.1. DALL·E 2
The diffusion model was proposed in 2015 or even earlier, but the first SOTA model based on it is DALL·E 2, proposed in April 2022. DALL·E 2 [4] was the first to make the diffusion model beat the GAN to become the SOTA model. DALL·E 2 is divided into two parts.
In the prior part, two encoders are used during training to extract the features of the text and of the image corresponding to that text, respectively, so that the trained model can identify the required image features from text semantics. This part is built on the Contrastive Language-Image Pre-training (CLIP) model, a large model proposed by OpenAI that captures the relationship between the semantics of an input text and the features of an image. Because DALL·E 2 must generate images zero-shot from the image features provided by the prior, the prior part of the DALL·E 2 model is trained with the aid of CLIP. The prior implements the function of converting text semantics into the corresponding image features.
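As a schematic of what CLIP supplies (a sketch, not OpenAI's implementation), the two encoders map text and image into a shared embedding space where matching pairs score high:

import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for the outputs of CLIP's trained text and image encoders.
text_emb = np.random.default_rng(0).standard_normal(512)
img_emb = np.random.default_rng(1).standard_normal(512)

# Contrastive training maximizes this similarity for matching text-image
# pairs and minimizes it for mismatched pairs; the prior then learns to
# map a text embedding to a plausible image embedding in the same space.
score = cosine_sim(text_emb, img_emb)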
In the second part, the decoder is realized by the diffusion model, with the U-Net network structure implementing the reverse diffusion stage. As shown in Figure 3, U-Net is a fully convolutional network with an encoder-decoder structure; it takes images with added noise as input and is trained to output the noise that was added. Compared with the traditional U-Net structure, additional time information is injected here so that the model produces different outputs according to the progress of the reverse process, as shown in Figure 4.
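As an illustrative sketch of this idea (the block layout and names are assumptions, not the DALL·E 2 implementation), time information can be injected into a convolutional block as follows:

import torch
import torch.nn as nn

class TimeConditionedBlock(nn.Module):
    """Conv block that adds a learned embedding of the timestep t."""
    def __init__(self, channels: int, time_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.time_mlp = nn.Sequential(nn.Linear(time_dim, channels), nn.SiLU())
        self.norm = nn.GroupNorm(8, channels)

    def forward(self, x, t_emb):
        h = self.conv(x)
        # Broadcast the time embedding over spatial positions so the
        # block's output depends on the progress of the reverse process.
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        return torch.relu(self.norm(h))

x = torch.randn(1, 64, 32, 32)
t_emb = torch.randn(1, 128)
out = TimeConditionedBlock(64)(x, t_emb)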

Figure 3. Schematic diagram of DALL·E 2 [4].

Figure 4. Network structure of DALL·E 2 [5].


3.2. Imagen
Except for the text encoder of the model, the Imagen model can be split into three parts, and all three
parts are based on the diffusion model. The Google team froze the text encoder to train the subsequent
diffusion model step by step. The reason for this is to reduce the number of variables in the experiment,
and the step training saves the training cost of the big model and reduces the difficulty of training. The
second reason is to prevent the input of the diffusion model out of control because the action of freezing
the text encoder establishes the fixed input-output correspondence of the encoder, which is the same
input text that must output the same text embedding. Thus, it is difficult to prevent the under-fitting of
the subsequent diffusion model. The quality of the image generated by this training mode will also lead
to the loss of the diversity of images generated by the diffusion model.
The Imagen decoder is similar to that of DALL·E 2, but compared with DALL·E 2, Imagen reduces the resolution of the image output by the base model, generating only a 64×64 image and then applying super-resolution operations. The original one-step model is divided into a series of three models that together implement the same function. This reduces the number of parameters per model, which lowers both the difficulty and the cost of training. The reason the base diffusion model outputs only small pictures is to prevent the model from focusing on fine details when generating the picture. The process of producing a low-resolution picture and then gradually refining it is also more in line with the human creative process. As it turns out, this approach allows the model to produce better-quality pictures. The super-resolution process uses two further diffusion models, as shown in Figure 5: the first increases the resolution from 64×64 to 256×256, and the second from 256×256 to 1024×1024, reducing the difficulty of model training and hyperparameter tuning.
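The cascade can be summarized by the following sketch, where the three sampler functions are hypothetical stand-ins for Imagen's three diffusion models:

import numpy as np

# Hypothetical stand-ins for the three diffusion models, each conditioned
# on the frozen text encoder's embedding; in practice each is a full sampler.
def base_sample(text_emb):         return np.zeros((64, 64, 3))
def sr_sample_256(img, text_emb):  return np.zeros((256, 256, 3))
def sr_sample_1024(img, text_emb): return np.zeros((1024, 1024, 3))

def imagen_generate(text_emb):
    img = base_sample(text_emb)           # stage 1: 64x64 base image
    img = sr_sample_256(img, text_emb)    # stage 2: 64 -> 256 super-resolution
    img = sr_sample_1024(img, text_emb)   # stage 3: 256 -> 1024 super-resolution
    return img

image = imagen_generate(np.zeros(512))    # stand-in text embedding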

Figure 5. Schematic diagram of Imagen [6].

3.3. Stable diffusion


Stable diffusion is also called latent diffusion. As shown in Figure 6, stable diffusion uses U-Net as the model structure implementing the diffusion principle. However, a cross-attention structure is added to the convolutional layers, which allows the diffusion model to take conditions from other modalities into the generation process. Stable diffusion likewise uses the diffusion model to generate highly abstract images and then refines the details of the generated images. During inference, stable diffusion has the diffusion model generate highly condensed information in latent space and then decodes it through a Variational AutoEncoder (VAE) into pixel space before any super-resolution processing. This reduces the training cost of the model and improves its flexibility.
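Schematically, inference proceeds as in the sketch below, where the denoiser and VAE decoder are stand-ins for the real trained networks and all shapes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the trained components of latent diffusion.
def denoise_step(z, t, cond): return z * 0.99          # U-Net with cross-attention
def vae_decode(z):            return np.zeros((256, 256, 3))

T = 50                                  # sampling steps in latent space
z = rng.standard_normal((4, 32, 32))    # latent is much smaller than pixel space
cond = rng.standard_normal(768)         # stand-in conditioning embedding
for t in range(T, 0, -1):
    z = denoise_step(z, t, cond)        # reverse diffusion runs in latent space
image = vae_decode(z)                   # VAE maps the latent back to pixel space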

Figure 6. Schematic diagram of Stable Diffusion [7].

3.4. The diffusion transformer architecture (DiT)


Before DiT, all the diffusion models above used the U-Net network structure to implement diffusion; DiT replaces U-Net with a Transformer to achieve better performance [8]. Similar to Stable Diffusion, DiT works in latent space, converting images into 32×32×4 latents that are split into patches, which reduces the training cost of the model. Experiments found that adding a cross-attention mechanism or in-context conditioning to the Transformer performs worse than adding an adaptive normalization layer, as shown in Figure 7 [9,10]. DiT is the current SOTA generation model.
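A minimal sketch of this patchify step (patch size 2 is one of the configurations reported for DiT; the helper below is illustrative):

import numpy as np

def patchify(latent, p=2):
    """Split a (H, W, C) latent into a sequence of flattened p x p patches."""
    H, W, C = latent.shape
    patches = latent.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape((H // p) * (W // p), p * p * C)

tokens = patchify(np.zeros((32, 32, 4)), p=2)   # -> (256, 16) token sequence
# Each token is then embedded and processed by standard Transformer blocks,
# with timestep/class conditioning injected via adaptive layer norm (adaLN).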

Figure 7. Schematic diagram of DiT [8].

4. Conclusion
This paper briefly introduces the development of diffusion models through the DALL·E 2, Imagen, Stable Diffusion, and DiT models. Only one year has passed since the diffusion model replaced the GAN as the SOTA model, and the diffusion model is still being heavily optimized. Moreover, judging by the time other newly proposed models have needed to fully exploit their potential, the diffusion model still has a lot of room for development. Development so far has moved from direct generation with the original U-Net of DALL·E 2, to Imagen generating only abstract small-scale images for further super-resolution operations. After that, Stable Diffusion re-adopted VAE technology to project images from pixel space to latent space, which further improved the performance of the diffusion model and reduced its training difficulty and cost. Most recently, the current SOTA generation model DiT implements diffusion with the Transformer architecture, which has performed well across many fields, giving the diffusion model a higher starting point. At present, there is still much room to optimize the diffusion model. Many techniques already used in image processing have the potential to further improve its performance, such as various transformer structures and other normalization methods. These techniques have proved very effective in other fields and are worth trying.

References
[1] Gozalo-Brizuela R, Garrido-Merchan E C. (2023). ChatGPT is not all you need. A State of the
Art Review of large Generative AI models. arXiv preprint arXiv:2301.04655.
[2] Luo C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint
arXiv:2208.11970.
[3] Ho J, Jain A, Abbeel P. (2020). Denoising diffusion probabilistic models. Advances in neural
information processing systems, 33, 6840-6851.
[4] Ramesh A, Dhariwal P, Nichol A, et al. (2022). Hierarchical text-conditional image generation
with CLIP latents. arXiv preprint arXiv:2204.06125.
[5] Saharia C, Chan W, Chang H, et al. (2022). Palette: Image-to-image diffusion models. ACM
SIGGRAPH 2022 Conference Proceedings, 1-10.
[6] Saharia C, Chan W, Saxena S, et al. (2022). Photorealistic text-to-image diffusion models with
deep language understanding. Advances in Neural Information Processing Systems, 35,
36479-36494.
[7] Rombach R, Blattmann A, Lorenz D, et al. (2022). High-resolution image synthesis with latent
diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, 10684-10695.
[8] Peebles W, Xie S. (2022). Scalable diffusion models with transformers. arXiv preprint
arXiv:2212.09748.
[9] Vaswani A, Shazeer N, Parmar N, et al. (2017). Attention is all you need. Advances in neural
information processing systems, 30.
[10] Hou R, Chang H, Ma B, et al. (2019). Cross attention network for few-shot classification.
Advances in neural information processing systems, 32.
