Jihoon Reading Group


DreamFusion: Text-to-3D using 2D Diffusion

ICLR 2023, Outstanding paper award

Ben Poole¹, Ajay Jain², Jonathan T. Barron¹, Ben Mildenhall¹

¹Google Research, ²UC Berkeley


Intro: Generative models show remarkable performance
Generative models show remarkable performance on multiple modalities

(Examples: image, video, and text generation)

Next challenge? More complex modalities and tasks


• 3D, long video (e.g., movie), multi-modal

[Saharia et al., 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, NeurIPS 2022
[Ho et al., 2022] Imagen Video: High Definition Video Generation with Diffusion Models, arXiv 2022
[OpenAI, 2023] GPT-4 Technical Report, arXiv 2023
Intro: 3D generation is typically hard
Unlike other modalities, well-curated 3D datasets are especially hard to collect
• E.g., text, images, video, and audio can be found via Google Search or YouTube

[Choi et al., 2016] A Large Dataset of Object Scans, arXiv 2016


[Deitke et al., 2022] Objaverse: A Universe of Annotated 3D Objects, arXiv 2022
Summary of DreamFusion
TL;DR: Text-to-3D generation without 3D datasets, by combining NeRF with a pre-trained text-to-2D diffusion model
• Specifically? Optimize NeRF parameters with the diffusion training loss

Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Goal of NeRF…? Novel view synthesis / rendering
Reconstructing a 3D scene from a set of 2D images (+ camera poses)
Example…? Turning Google Street View into a 3D world

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
What is a 2D image? A projection of a 3D object onto a 2D plane or… your eyes
For a given eye coordinate 𝒐 and viewing direction 𝒅, estimate the color 𝑪

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
What is a 2D image? A projection of a 3D object onto a 2D plane or… your eyes
For a given eye coordinate 𝒐 and viewing direction 𝒅, estimate the color 𝑪
Novel view synthesis: predicting the color 𝑪′ for a new viewing direction 𝒅′ at a new eye position 𝒐′

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
TL;DR of NeRF: train an MLP that predicts the color seen from a given view direction

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
TL;DR of NeRF: train an MLP that predicts the color seen from a given view direction
Two key contributions
• Injecting 3D prior
• Modeling high-frequency details

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

Camera ray: 𝒓(t) = 𝒐 + t𝒅

Training objective: min_Θ ‖Ĉ(𝒓) − C(𝒓)‖²₂, where C(𝒓) is the ground-truth color and Ĉ(𝒓) is the color rendered by compositing 𝜎 and 𝒄 along the ray

NeRF: Let's parametrize (or learn) 𝜎(⋅) and 𝒄(⋅,⋅) with an MLP 𝚯
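To make this concrete, here is a minimal NumPy sketch (not the official NeRF code) of how the density 𝜎 and color 𝒄 along one ray are composited into a pixel color and compared against the ground truth; the `sigma_c_mlp` callable, the near/far bounds, and the sample count are illustrative assumptions.

```python
import numpy as np

def render_ray(sigma_c_mlp, o, d, t_near=2.0, t_far=6.0, n_samples=64):
    """Quadrature approximation of the volume-rendering integral:
    C_hat(r) = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    t = np.linspace(t_near, t_far, n_samples)            # depths sampled along the ray
    pts = o[None, :] + t[:, None] * d[None, :]           # r(t) = o + t * d, shape (n_samples, 3)
    sigma, color = sigma_c_mlp(pts, d)                   # density (n_samples,), color (n_samples, 3)
    delta = np.append(np.diff(t), 1e10)                  # distance between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each segment
    T = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))     # transmittance: light surviving so far
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0)        # predicted pixel color C_hat(r)

def photometric_loss(sigma_c_mlp, rays_o, rays_d, gt_colors):
    """min_Theta sum_r || C_hat(r) - C(r) ||_2^2 over the training pixels."""
    pred = np.stack([render_ray(sigma_c_mlp, o, d) for o, d in zip(rays_o, rays_d)])
    return float(np.mean(np.sum((pred - gt_colors) ** 2, axis=-1)))
```

In the actual method, rays are batched, a coarse/fine hierarchical sampling is used, and the MLP is queried at positionally encoded inputs (next slide).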
Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

• 𝜎(𝒙): occupancy at coordinate 𝒙 (should depend only on 𝒙)
• 𝒄(𝒙, 𝒅): emitted color at 𝒙 when the view direction is 𝒅

Key idea 2: Modeling high-frequency details
Key idea 2: Use the positional encoding (PE)

Without PE: F_Θ : (𝒙, 𝒅) → (𝒄, 𝜎)
With PE: F_Θ : (γ(𝒙), γ(𝒅)) → (𝒄, 𝜎)
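For reference, the encoding γ from the NeRF paper applies sinusoids of exponentially increasing frequency to each input coordinate p (L = 10 for positions and L = 4 for directions in the original setup):

```latex
\gamma(p) = \big(\sin(2^{0}\pi p),\ \cos(2^{0}\pi p),\ \ldots,\ \sin(2^{L-1}\pi p),\ \cos(2^{L-1}\pi p)\big)
```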

Some recent advanced NeRFs: Block-NeRF

[Tancik et al., 2022] Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022
Some recent advanced NeRFs: Zip-NeRF

[Barron et al., 2023] Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields, arXiv 2023
Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Diffusion model
Denoising Diffusion Probabilistic Model (DDPM)

• Reverse process (sampling / denoising): p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_t)
  Σ_t is treated as a constant 𝜎_t² I

• Forward process (diffusion / adding noise): q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)
  where {β_t} is a fixed variance schedule

Diffusion model: Training / Sampling

Use the evidence lower bound (ELBO) to model the data distribution

This loss can be simplified to a (weighted) noise-prediction / score-matching objective:
L_simple = E_{t, x₀, ε} ‖ε − ε_θ(√ᾱ_t x₀ + √(1 − ᾱ_t) ε, t)‖²

ε_θ: the U-Net that predicts the added noise
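As a hedged illustration (PyTorch-style, assuming a noise-prediction network `unet(x_t, t)` and a precomputed `alpha_bar` schedule, both hypothetical names), one training step of this simplified objective looks roughly like:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(unet, x0, alpha_bar, optimizer):
    """One step of the simplified DDPM objective:
    L_simple = E_{t, x0, eps} || eps - eps_theta(sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, t) ||^2"""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # a random timestep per sample
    eps = torch.randn_like(x0)                                   # the noise the U-Net must predict
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps           # forward (diffusion) process in one shot
    loss = F.mse_loss(unet(x_t, t), eps)                         # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```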


Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Motivation: Text-to-3D generation without 3D datasets
Recap: Hard to collect well-curated 3D datasets... (even more costly for text-3D pairs)

[Choi et al., 2016] A Large Dataset of Object Scans, arXiv 2016


[Deitke et al., 2022] Objaverse: A Universe of Annotated 3D Objects, arXiv 2022
Motivation: Text-to-3D generation without 3D datasets
Motivation:
• Text, 2D images, or even text-image pairs are relatively cheap to collect
• Every 2D image is a projection of a 3D scene
→ Can we use a pre-trained text-to-2D diffusion foundation model for text-to-3D generation…?

[Schuhmann et al., 2022] LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models, NeurIPS 2022
[Niemeyer et al., 2021] GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, CVPR 2021
Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles

Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles
Definition of a good image? An image that has a low training loss under the diffusion model

Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles
Definition of a good image? An image that has a low training loss under the diffusion model

z_t ≔ α_t x + σ_t ε, a noised version of the rendered image x = g(θ)

Idea: Perform gradient descent with the training loss

Update the NeRF parameters 𝜃 while keeping the text-to-2D diffusion model frozen


New sampling scheme: Score Distillation Sampling (SDS)
However, the U-Net Jacobian term is quite expensive to compute

Exact gradient: ∇_θ L_Diff = E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂ε̂_φ(z_t; y, t)/∂z_t) (∂x/∂θ) ]

SDS: ∇_θ L_SDS = E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂x/∂θ) ]
≈ remove the U-Net Jacobian term ∂ε̂_φ/∂z_t
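A rough PyTorch-style sketch of one SDS update, under the assumptions that `render_nerf` differentiably renders an image from the NeRF parameters, `diffusion_unet` is the frozen text-conditioned noise predictor, and `alphas`, `sigmas`, `w` are the diffusion schedules (all names are illustrative, not the DreamFusion code):

```python
import torch

def sds_step(render_nerf, nerf_params, diffusion_unet, text_emb, camera,
             alphas, sigmas, w, optimizer):
    """One Score Distillation Sampling (SDS) update of the NeRF parameters theta."""
    x = render_nerf(nerf_params, camera)             # x = g(theta): differentiable rendering
    t = int(torch.randint(0, len(alphas), (1,)))     # random diffusion timestep
    eps = torch.randn_like(x)
    z_t = alphas[t] * x + sigmas[t] * eps            # z_t := alpha_t * x + sigma_t * eps
    with torch.no_grad():                            # the text-to-2D diffusion model stays frozen
        eps_hat = diffusion_unet(z_t, t, text_emb)   # text-conditioned noise prediction
    grad = w[t] * (eps_hat - eps)                    # SDS drops the U-Net Jacobian term
    optimizer.zero_grad()
    x.backward(gradient=grad)                        # chain rule only through dx/dtheta
    optimizer.step()
```

The key point is that `eps_hat - eps` is injected directly as the gradient of the rendered image, so backpropagation only traverses ∂x/∂θ and never the U-Net.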

Using more 3D priors is important
For successful training, one needs additional 3D priors beyond the density modeling (e.g., lighting/shading augmentations and view-dependent prompts)
We will not cover the details of this part

Experimental results: Generation
https://dreamfusion3d.github.io/

Experimental results: Iterative refinement

Experimental results: Comparison with baselines
CLIP R-Precision: measures the 3D consistency of the generated scene with the input text
• Retrieves the correct caption among a set of distractors given a rendering of the scene
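As a sketch of how this metric can be computed (assuming hypothetical helpers `clip_image_emb` / `clip_text_emb` that return L2-normalized CLIP embeddings), a rendering counts as correct when its true caption is ranked first among all candidate captions:

```python
import numpy as np

def clip_r_precision(renderings, true_captions, all_captions, clip_image_emb, clip_text_emb):
    """Fraction of renderings whose ground-truth caption is ranked first
    among all candidate captions by CLIP similarity."""
    text_embs = np.stack([clip_text_emb(c) for c in all_captions])   # (num_captions, dim), L2-normalized
    correct = 0
    for image, caption in zip(renderings, true_captions):
        image_emb = clip_image_emb(image)                            # (dim,), L2-normalized
        sims = text_embs @ image_emb                                 # cosine similarity to every caption
        retrieved = all_captions[int(np.argmax(sims))]
        correct += int(retrieved == caption)
    return correct / len(renderings)
```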

Limitations
Requires more than 6 GPU hours to generate one 3D scene (15,000 optimization steps)
The generation quality is somewhat low
Also not scalable: uses a 64×64 diffusion model due to the computational overhead
+ Janus problem (a.k.a. the multi-head problem)

Key takeaway message

Every 2D image is a projection of a 3D scene

We can reconstruct/generate 3D scenes by using 2D generative models

Appendix: Meta-learning DreamFusion
Limitation of DreamFusion: Sampling time
Currently DreamFusion requires 6 GPU hours to generate one 3D scene
Key idea: Use amortized meta-learning to reduce the generation time

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Amortized meta-learning?
Predicting a task-specific model from a given task context set

y = f_θ(x, z), where z = g_φ(C)

• g_φ: set encoder (e.g., a Transformer) that maps the context set C to a latent z (or modulation parameters)
• f_θ: base model evaluated on query points (x, y) ∈ Q, i.e., a query set disjoint from the context set

In text-to-3D generation? Predicting the NeRF parameters/latent from a given text context

High-level design
Predict the NeRF parameters with a language model (LM) from the given text
The predicted NeRF should minimize the SDS loss → update the LM parameters with the SDS loss

Training: [TEXT] → Language model (LM) → NeRF parameters 𝜃, trained with ∇_LM L_SDS(φ, x = g(θ))

Testing: [TEXT] → Language model (LM) → NeRF parameters 𝜃 (a single forward pass)

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
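A hedged sketch of this amortized training loop, under the slide's framing: a text-conditioned mapping network predicts the NeRF parameters, and the SDS gradients flow back into that network rather than into per-prompt NeRF weights. `text_to_nerf_params`, `render_nerf`, and `sds_loss` (a surrogate whose gradient matches the SDS gradient) are assumed names, not the ATT3D code.

```python
import random
import torch

def amortized_text_to_3d_step(text_to_nerf_params, text_encoder, render_nerf,
                              sds_loss, prompts, optimizer):
    """One training step of amortized text-to-3D: update the mapping network phi
    so that the predicted NeRF theta = g_phi(text) minimizes the SDS loss."""
    prompt = random.choice(prompts)              # sample a training prompt
    text_emb = text_encoder(prompt)
    theta = text_to_nerf_params(text_emb)        # predicted NeRF parameters / modulations
    image = render_nerf(theta)                   # render from a (random) viewpoint
    loss = sds_loss(image, text_emb)             # surrogate loss; its gradient equals the SDS gradient
    optimizer.zero_grad()
    loss.backward()                              # gradients reach phi through theta
    optimizer.step()
    return loss.item()
```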
Experimental results: Interpolation between context text

Much faster than DreamFusion (i.e., per-prompt training), but low diversity

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Experimental results: Interpolation between context text

Interpolate the text embedding

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
