
DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION
[Poole et al., ICLR 2023]

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim

CS380: Introduction to Computer Graphics


Poole et al., DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023

Motivation: Text-to-3D using 2D Diffusion


Background
• Text-to-image generation succeeded by training directly on large text-image pair datasets

Examples: Stable Diffusion, UPainting

https://forums.fast.ai/t/new-paper-upainting-unified-text-to-image-diffusion-generation-with-cross-modal-guidance/101669



Motivation: Text-to-3D using 2D Diffusion


Q. Can we train a text-to-3D model directly on a text-3D object pair dataset?
A. No. Text-3D datasets are not large enough to train a 3D generative model directly.

Text-3D pair dataset (Objaverse, ~800K objects) vs. text-image pair dataset (LAION-5B, ~5B pairs)

https://paperswithcode.com/dataset/laion-5b
https://blog.allenai.org/objaverse-a-universe-of-annotated-3d-objects-718ef3d61fd6


Solution: Train the 3D model with 2D images! → One approach is to use NeRF.


Background: NeRF

• Optimize a 3D model via gradient descent so that its 2D renderings from random angles achieve a low loss
• Only needs 2D image data (no 3D data required); a training-loop sketch follows below
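A minimal sketch of this standard NeRF training loop, assuming hypothetical `nerf.render` and `dataset.sample_random_view` helpers (not DreamFusion's actual code):

```python
import torch

def train_nerf(nerf, dataset, steps=10_000, lr=5e-4):
    # nerf.render(pose) is assumed to be a differentiable volume renderer;
    # dataset.sample_random_view() is assumed to return (camera_pose, ground_truth_image).
    optimizer = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(steps):
        pose, gt_image = dataset.sample_random_view()
        rendering = nerf.render(pose)                              # 2D rendering from a random angle
        loss = torch.nn.functional.mse_loss(rendering, gt_image)   # photometric loss
        optimizer.zero_grad()
        loss.backward()        # gradients flow through the renderer into the 3D model
        optimizer.step()
    return nerf
```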


Background: NeRF
Issue with using NeRF for text-to-3D
• Training NeRF requires multiple images from various viewpoints.
• In text-to-3D, however, we have no ground-truth images, only a single text prompt
  (e.g., "yellow lego bulldozer") → we can't train NeRF in the usual way.

Q. How, then, can we train NeRF without ground-truth images, using only a single text prompt?


Solution: DreamFusion (NeRF + text-to-image diffusion model)

Optimize NeRF with gradient descent using a loss derived from a text-to-image diffusion model.

NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Simple diagram of DreamFusion

Text prompt: "A yellow lego bulldozer"

[Diagram: NeRF renders an image; the rendering and the text prompt are fed to the text-to-image diffusion model, which produces the SDS loss. The SDS loss contains information on how to adjust the rendered image to align with the provided text, and backpropagating it optimizes NeRF.]

Initially, NeRF renders a random image; repeated updates push it toward the ideal image.

Analogy: the diffusion model is like an AI artist (think Bob Ross) capable of generating any image from the provided text, and it tells NeRF how to create a better image.

https://en.wikipedia.org/wiki/Bob_Ross

Simple diagram of DreamFusion

But NeRF requires multi-view images to be trained...
Q. How can the diffusion model provide multi-view image information to NeRF?
A. Append view-dependent text to the input prompt based on the camera location.

• High elevation angle → append "overhead view": "A yellow lego bulldozer, overhead view"
• Otherwise → append "front view", "side view", or "back view" based on the azimuth: e.g., "A yellow lego bulldozer, front view"

A sketch of this prompt augmentation follows below.
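A minimal sketch of view-dependent prompt augmentation; the angle thresholds here are illustrative assumptions, not the paper's exact values:

```python
def view_dependent_prompt(prompt: str, elevation_deg: float, azimuth_deg: float) -> str:
    """Append a coarse view description based on the camera angles (illustrative thresholds)."""
    if elevation_deg > 60:                       # camera looking down from above
        return f"{prompt}, overhead view"
    azimuth = azimuth_deg % 360
    if azimuth < 45 or azimuth >= 315:
        return f"{prompt}, front view"
    if 135 <= azimuth < 225:
        return f"{prompt}, back view"
    return f"{prompt}, side view"

# view_dependent_prompt("A yellow lego bulldozer", 75, 10)
# -> "A yellow lego bulldozer, overhead view"
```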


Text-to-Image Diffusion model

Text prompt: "A yellow lego bulldozer"

Starting from pure noise x_T, the model denoises step by step: x_T → x_(T−1) → … → x_2 → x_1 → x_0, with every denoising step conditioned on the text prompt.
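A minimal sketch of this reverse (denoising) process, assuming a hypothetical `model.denoise_step` that performs one text-conditioned denoising step:

```python
import torch

def sample_image(model, text_emb, T=1000, shape=(1, 3, 64, 64)):
    # model.denoise_step(x, t, text_emb) is a hypothetical one-step denoiser
    # conditioned on the text embedding; it maps x_t to x_{t-1}.
    x = torch.randn(shape)                     # x_T: pure Gaussian noise
    for t in reversed(range(1, T + 1)):
        x = model.denoise_step(x, t, text_emb)
    return x                                   # x_0: the generated image
```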


Text-to-Image Diffusion model

Text prompt: "A yellow lego bulldozer"

[Diagram: the noisy image x_t is fed to the noise-prediction model (U-Net); the predicted noise ε̂ is subtracted to produce the one-step denoised image x_(t−1).]

Key point!
The diffusion model does not predict the denoised image directly; it first predicts the noise and then subtracts it to denoise the image (see the sketch below).

DreamFusion Pipeline
Step by step: how NeRF is optimized using the SDS loss

1. Render an image x_0 from NeRF at a random camera pose.
2. Generate random noise ε.
3. Select a random denoising timestep t = random(1, T).
4. Add the noise to the rendering to obtain the noisy image x_t at timestep t.
5. Predict the noise ε̂_t with the text-to-image diffusion model, conditioned on the text prompt ("A yellow lego bulldozer").
6. Subtract the injected noise ε from the predicted noise ε̂_t. The difference ε̂_t − ε is the update direction that tells NeRF how to render a better image.

Backpropagate this update direction through the rendering to optimize NeRF.

Gradient of the SDS loss: in practice, the U-Net Jacobian term is expensive to compute, and omitting it still yields an effective update direction, so DreamFusion simply omits it.
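For reference, the gradient of the SDS loss as given in the DreamFusion paper, written in the deck's notation (x = g(θ) is the NeRF rendering, ε̂ the predicted noise, ε the injected noise, w(t) a weighting):

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x = g(\theta)) \triangleq
\mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right]
$$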

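A minimal sketch of one SDS update following steps 1-6 above; the helpers (`nerf.render`, `diffusion.predict_noise`, `sample_camera`, the `alphas_bar` schedule) are assumed APIs, not the paper's code:

```python
import torch

def sds_update(nerf, diffusion, text_emb, optimizer, alphas_bar, sample_camera, T=1000, w=1.0):
    # Hypothetical APIs: nerf.render(pose) -> differentiable rendering,
    # diffusion.predict_noise(x_t, t, text_emb) -> predicted noise,
    # sample_camera() -> random camera pose, alphas_bar -> DDPM schedule tensor.
    pose = sample_camera()                                       # random view
    x0 = nerf.render(pose)                                       # 1. render image from NeRF
    eps = torch.randn_like(x0)                                   # 2. random noise
    t = int(torch.randint(1, T, ()))                             # 3. random timestep
    a_bar = alphas_bar[t]
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps   # 4. noisy image
    with torch.no_grad():                                        # 5. predict noise; no U-Net Jacobian
        eps_hat = diffusion.predict_noise(x_t, t, text_emb)
    grad = w * (eps_hat - eps)                                   # 6. update direction
    optimizer.zero_grad()
    x0.backward(gradient=grad)   # treat grad as dL/dx0 and backprop through the renderer only
    optimizer.step()
```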

Visualization of optimizing NeRF


DreamFusion Pipeline in the paper

NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Result: Qualitative Experiment

CLIP Encoder
• Extracts semantic information from the input image.
• Encodes the image into a feature vector that lives in the same space as the text feature vector.


Result: Qualitative Experiment

Compared methods:
• Dream Fields: optimize NeRF with CLIP
• Dream Fields, reimplemented with the enhanced NeRF settings used in DreamFusion
• CLIP-Mesh: optimize a mesh with CLIP
• DreamFusion

DreamFusion generated the highest-quality 3D objects!


Result: Quantitative Experiment

(R-Precision measures how accurately CLIP can retrieve the correct text caption when shown a rendering of the scene.)

Even though Dream Fields and CLIP-Mesh are trained with CLIP, DreamFusion outperforms them on CLIP R-Precision.
This means DreamFusion generates the 3D objects that are best aligned with the text! A sketch of the R-Precision computation follows below.
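A hedged sketch of how CLIP R-Precision could be computed; `encode_image`/`encode_text` mirror the usual CLIP API but are treated as assumptions here, and this is not the paper's evaluation code:

```python
import torch

def clip_r_precision(clip_model, renderings, caption_tokens, true_idx):
    # renderings: preprocessed image batch (N, 3, H, W); caption_tokens: M tokenized candidate captions;
    # true_idx: LongTensor of length N giving the correct caption index per rendering.
    with torch.no_grad():
        img = clip_model.encode_image(renderings)          # (N, d)
        txt = clip_model.encode_text(caption_tokens)       # (M, d)
    img = img / img.norm(dim=-1, keepdim=True)             # unit vectors -> dot product = cosine similarity
    txt = txt / txt.norm(dim=-1, keepdim=True)
    top1 = (img @ txt.T).argmax(dim=-1)                    # best-matching caption per rendering
    return (top1 == true_idx).float().mean().item()        # fraction retrieved correctly
```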


Result: Examples

Explore more examples in the DreamFusion 3D gallery!


Limitation

• The SDS loss is not a perfect loss function; it often produces oversmoothed results.
• DreamFusion uses the 64×64 Imagen model, so the image resolution is limited to 64×64.

Oversmoothed example (64×64)


Limitation
Janus Problem
DreamFusion approximates the view direction by categorizing camera angles into four rough categories:
"overhead", "front", "side", and "back".
However, this coarse conditioning can lead to issues such as the same features
(e.g., faces, eyes) appearing at multiple angles.

Example of Janus problem


Contribution
Papers inspired by Dreamfusion

Magic3D (CVPR 2023 highlight) Prolific Dreamer (NeurIPS 2023 highlight)

- As the first work to use a 2D diffusion model to create 3D objects, DreamFusion inspired many
subsequent studies that improved the SDS loss, resulting in better text-to-3D models.

- This approach offers a revolutionary methodology for solving 3D-related tasks: instead of relying on
scarce 3D data, it uses abundant 2D data alone.


Thank You
Team 15
20190156 Yun Kim
20190063 Ki Nam Kim


Quiz
https://forms.gle/HG67Nz3DrawxLVkq7

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim
