
DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION
[Poole et al., ICLR 2023]

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim

CS380: Introduction to Computer Graphics


Poole et al., DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023

Motivation: Text-to-3D using 2D Diffusion


Background
• Text-to-image generation succeeded by training directly on large text-image pair datasets

Examples: Stable Diffusion, UPainting

https://forums.fast.ai/t/new-paper-upainting-unified-text-to-image-diffusion-generation-with-cross-modal-guidance/101669



Motivation: Text-to-3D using 2D Diffusion


Q. Can we train a text-to-3D model directly on a text-3D object pair dataset?
A. No. Text-3D datasets are not large enough to train a 3D generative model directly.

Text-3D pair dataset (Objaverse, ~800K objects) vs. text-image pair dataset (LAION-5B, ~5B pairs)

https://paperswithcode.com/dataset/laion-5b
https://blog.allenai.org/objaverse-a-universe-of-annotated-3d-objects-718ef3d61fd6


Solution: Train the 3D model with 2D images! → One approach is to use NeRF.


Background: NeRF

• Optimize a 3D model via gradient descent so that its 2D renderings from random angles achieve a low loss
• Only needs 2D image data (no 3D data required); a training-loop sketch follows below
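A minimal sketch of this standard NeRF training loop, assuming hypothetical `nerf.render` and `dataset.sample_random_view` helpers (not DreamFusion's actual code):

```python
import torch

def train_nerf(nerf, dataset, steps=10_000, lr=5e-4):
    # nerf.render(pose) is assumed to be a differentiable volume renderer;
    # dataset.sample_random_view() is assumed to return (camera_pose, ground_truth_image).
    optimizer = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(steps):
        pose, gt_image = dataset.sample_random_view()
        rendering = nerf.render(pose)                              # 2D rendering from a random angle
        loss = torch.nn.functional.mse_loss(rendering, gt_image)   # photometric loss
        optimizer.zero_grad()
        loss.backward()        # gradients flow through the renderer into the 3D model
        optimizer.step()
    return nerf
```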


Background: NeRF
Issue with using NeRF for text-to-3D
• Training NeRF requires multiple images from various viewpoints.
• In text-to-3D, however, we have no ground-truth images, only a single text prompt
  (e.g., "yellow lego bulldozer") → we can't train NeRF in the usual way.

Q. How, then, can we train NeRF without ground-truth images, using only a single text prompt?


Solution: DreamFusion (NeRF + text-to-image diffusion model)

Optimize NeRF with gradient descent using a loss derived from a text-to-image diffusion model.

NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Simple diagram of DreamFusion

Text prompt: "A yellow lego bulldozer"

[Diagram: NeRF renders an image; the rendering and the text prompt are fed to the text-to-image diffusion model, which produces the SDS loss. The SDS loss contains information on how to adjust the rendered image to align with the provided text, and backpropagating it optimizes NeRF.]

Initially, NeRF renders a random image; repeated updates push it toward the ideal image.

Analogy: the diffusion model is like an AI artist (think Bob Ross) capable of generating any image from the provided text, and it tells NeRF how to create a better image.

https://en.wikipedia.org/wiki/Bob_Ross

Simple diagram of DreamFusion

But NeRF requires multi-view images to be trained...
Q. How can the diffusion model provide multi-view image information to NeRF?
A. Append view-dependent text to the input prompt based on the camera location.

• High elevation angle → append "overhead view": "A yellow lego bulldozer, overhead view"
• Otherwise → append "front view", "side view", or "back view" based on the azimuth: e.g., "A yellow lego bulldozer, front view"

A sketch of this prompt augmentation follows below.
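A minimal sketch of view-dependent prompt augmentation; the angle thresholds here are illustrative assumptions, not the paper's exact values:

```python
def view_dependent_prompt(prompt: str, elevation_deg: float, azimuth_deg: float) -> str:
    """Append a coarse view description based on the camera angles (illustrative thresholds)."""
    if elevation_deg > 60:                       # camera looking down from above
        return f"{prompt}, overhead view"
    azimuth = azimuth_deg % 360
    if azimuth < 45 or azimuth >= 315:
        return f"{prompt}, front view"
    if 135 <= azimuth < 225:
        return f"{prompt}, back view"
    return f"{prompt}, side view"

# view_dependent_prompt("A yellow lego bulldozer", 75, 10)
# -> "A yellow lego bulldozer, overhead view"
```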


Text-to-Image Diffusion model

Text prompt: "A yellow lego bulldozer"

Starting from pure noise x_T, the model denoises step by step: x_T → x_(T−1) → … → x_2 → x_1 → x_0, with every denoising step conditioned on the text prompt.
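A minimal sketch of this reverse (denoising) process, assuming a hypothetical `model.denoise_step` that performs one text-conditioned denoising step:

```python
import torch

def sample_image(model, text_emb, T=1000, shape=(1, 3, 64, 64)):
    # model.denoise_step(x, t, text_emb) is a hypothetical one-step denoiser
    # conditioned on the text embedding; it maps x_t to x_{t-1}.
    x = torch.randn(shape)                     # x_T: pure Gaussian noise
    for t in reversed(range(1, T + 1)):
        x = model.denoise_step(x, t, text_emb)
    return x                                   # x_0: the generated image
```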


Text-to-Image Diffusion model

Text prompt: "A yellow lego bulldozer"

[Diagram: the noisy image x_t is fed to the noise-prediction model (U-Net); the predicted noise ε̂ is subtracted to produce the one-step denoised image x_(t−1).]

Key point!
The diffusion model does not predict the denoised image directly; it first predicts the noise and then subtracts it to denoise the image (see the sketch below).

DreamFusion Pipeline
Step by step: how NeRF is optimized using the SDS loss

1. Render an image x_0 from NeRF at a random camera pose.
2. Generate random noise ε.
3. Select a random denoising timestep t = random(1, T).
4. Add the noise to the rendering to obtain the noisy image x_t at timestep t.
5. Predict the noise ε̂_t with the text-to-image diffusion model, conditioned on the text prompt ("A yellow lego bulldozer").
6. Subtract the injected noise ε from the predicted noise ε̂_t. The difference ε̂_t − ε is the update direction that tells NeRF how to render a better image.

Backpropagate this update direction through the rendering to optimize NeRF.

Gradient of the SDS loss: in practice, the U-Net Jacobian term is expensive to compute, and omitting it still yields an effective update direction, so DreamFusion simply omits it.
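For reference, the gradient of the SDS loss as given in the DreamFusion paper, written in the deck's notation (x = g(θ) is the NeRF rendering, ε̂ the predicted noise, ε the injected noise, w(t) a weighting):

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x = g(\theta)) \triangleq
\mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right]
$$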

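A minimal sketch of one SDS update following steps 1-6 above; the helpers (`nerf.render`, `diffusion.predict_noise`, `sample_camera`, the `alphas_bar` schedule) are assumed APIs, not the paper's code:

```python
import torch

def sds_update(nerf, diffusion, text_emb, optimizer, alphas_bar, sample_camera, T=1000, w=1.0):
    # Hypothetical APIs: nerf.render(pose) -> differentiable rendering,
    # diffusion.predict_noise(x_t, t, text_emb) -> predicted noise,
    # sample_camera() -> random camera pose, alphas_bar -> DDPM schedule tensor.
    pose = sample_camera()                                       # random view
    x0 = nerf.render(pose)                                       # 1. render image from NeRF
    eps = torch.randn_like(x0)                                   # 2. random noise
    t = int(torch.randint(1, T, ()))                             # 3. random timestep
    a_bar = alphas_bar[t]
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps   # 4. noisy image
    with torch.no_grad():                                        # 5. predict noise; no U-Net Jacobian
        eps_hat = diffusion.predict_noise(x_t, t, text_emb)
    grad = w * (eps_hat - eps)                                   # 6. update direction
    optimizer.zero_grad()
    x0.backward(gradient=grad)   # treat grad as dL/dx0 and backprop through the renderer only
    optimizer.step()
```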

Visualization of optimizing NeRF


DreamFusion Pipeline in the paper

NeRF (mip-NeRF 360) + text-to-image diffusion model (Imagen)


Result: Qualitative Experiment

CLIP Encoder
• Extracts semantic information from the input image.
• Encodes the image into a feature vector that lives in the same space as the text feature vector.


Result: Qualitative Experiment

Compared methods:
• Dream Fields: optimize NeRF with CLIP
• Dream Fields, reimplemented with the enhanced NeRF settings used in DreamFusion
• CLIP-Mesh: optimize a mesh with CLIP
• DreamFusion

DreamFusion generated the highest-quality 3D objects!


Result: Quantitative Experiment

(R-Precision measures how accurately CLIP can retrieve the correct text caption when shown a rendering of the scene.)

Even though Dream Fields and CLIP-Mesh are trained with CLIP, DreamFusion outperforms them on CLIP R-Precision.
This means DreamFusion generates the 3D objects that are best aligned with the text! A sketch of the R-Precision computation follows below.
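A hedged sketch of how CLIP R-Precision could be computed; `encode_image`/`encode_text` mirror the usual CLIP API but are treated as assumptions here, and this is not the paper's evaluation code:

```python
import torch

def clip_r_precision(clip_model, renderings, caption_tokens, true_idx):
    # renderings: preprocessed image batch (N, 3, H, W); caption_tokens: M tokenized candidate captions;
    # true_idx: LongTensor of length N giving the correct caption index per rendering.
    with torch.no_grad():
        img = clip_model.encode_image(renderings)          # (N, d)
        txt = clip_model.encode_text(caption_tokens)       # (M, d)
    img = img / img.norm(dim=-1, keepdim=True)             # unit vectors -> dot product = cosine similarity
    txt = txt / txt.norm(dim=-1, keepdim=True)
    top1 = (img @ txt.T).argmax(dim=-1)                    # best-matching caption per rendering
    return (top1 == true_idx).float().mean().item()        # fraction retrieved correctly
```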


Result: Examples

Explore more examples in the DreamFusion 3D gallery!


Limitation

• The SDS loss is not a perfect loss function; it often produces oversmoothed results.
• DreamFusion uses the 64×64 Imagen model, so the image resolution is limited to 64×64.

Oversmoothed example (64×64)


Limitation
Janus Problem
DreamFusion approximates the view direction by categorizing camera angles into four rough categories:
"overhead", "front", "side", and "back".
However, this coarse conditioning can lead to issues such as the same features
(e.g., faces, eyes) appearing at multiple angles.

Example of Janus problem


Contribution
Papers inspired by Dreamfusion

Magic3D (CVPR 2023 highlight) Prolific Dreamer (NeurIPS 2023 highlight)

- As the first work to use a 2D diffusion model to create 3D objects, DreamFusion inspired many
subsequent studies that improved the SDS loss, resulting in better text-to-3D models.

- This approach offers a revolutionary methodology for solving 3D-related tasks: instead of relying on
scarce 3D data, it uses abundant 2D data alone.


Thank You
Team 15
20190156 Yun Kim
20190063 Ki Nam Kim


Quiz
https://forms.gle/HG67Nz3DrawxLVkq7

Team 15
20190156 Yun Kim
20190063 Ki Nam Kim
