Jihoon Reading Group


DreamFusion: Text-to-3D using 2D Diffusion

ICLR 2023, Outstanding paper award

Ben Poole¹, Ajay Jain², Jonathan T. Barron¹, Ben Mildenhall¹

¹Google Research, ²UC Berkeley


Intro: Generative models show remarkable performance
Generative models show remarkable performance on multiple modalities

(Examples: image, video, and text generation)

Next challenge? More complex modalities and tasks


• 3D, long video (e.g., movie), multi-modal

[Saharia et al., 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, NeurIPS 2022
[Ho et al., 2022] Imagen Video: High Definition Video Generation with Diffusion Models, arXiv 2022
[OpenAI, 2023] GPT-4 Technical Report, arXiv 2023
Intro: 3D generation is typically hard
Unlike other modalities, well-curated 3D datasets are especially hard to collect
• E.g., text, images, video, and audio can be found via Google Search or YouTube

[Choi et al., 2016] A Large Dataset of Object Scans, arXiv 2016


[Deitke et al., 2022] Objaverse: A Universe of Annotated 3D Objects, arXiv 2022
Summary of DreamFusion
TL;DR: Text-to-3D generation without 3D datasets, by combining NeRF with a pre-trained text-to-2D diffusion model
• Specifically? Optimize NeRF parameters with the diffusion training loss

Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Goal of NeRF…? Novel view synthesis / rendering
Reconstructing a 3D scene from a set of 2D images (+ camera poses)
Example…? Turning Google Street View into a 3D world

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
What is a 2D image? A projection of a 3D object onto a 2D plane or… your eyes
For a given eye coordinate 𝒐 and viewing direction 𝒅, estimate the color 𝑪

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
What is a 2D image? A projection of a 3D object onto a 2D plane or… your eyes
For a given eye coordinate 𝒐 and viewing direction 𝒅, estimate the color 𝑪
Novel view synthesis: predicting the color 𝑪′ for a new viewing direction 𝒅′ at a new eye position 𝒐′

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
TL;DR of NeRF: train an MLP that predicts the color seen from a given view direction

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Rendering as color prediction with a given view point
TL;DR of NeRF: train an MLP that predicts the color seen from a given view direction
Two key contributions
• Injecting 3D prior
• Modeling high-frequency details

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

Camera ray: 𝒓(t) = 𝒐 + t𝒅

Training objective: min_Θ ‖Ĉ(𝒓) − C(𝒓)‖²₂, where C(𝒓) is the ground-truth color and Ĉ(𝒓) is the color rendered by compositing 𝜎 and 𝒄 along the ray

NeRF: Let's parametrize (or learn) 𝜎(⋅) and 𝒄(⋅,⋅) with an MLP 𝚯
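To make this concrete, here is a minimal NumPy sketch (not the official NeRF code) of how the density 𝜎 and color 𝒄 along one ray are composited into a pixel color and compared against the ground truth; the `sigma_c_mlp` callable, the near/far bounds, and the sample count are illustrative assumptions.

```python
import numpy as np

def render_ray(sigma_c_mlp, o, d, t_near=2.0, t_far=6.0, n_samples=64):
    """Quadrature approximation of the volume-rendering integral:
    C_hat(r) = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    t = np.linspace(t_near, t_far, n_samples)            # depths sampled along the ray
    pts = o[None, :] + t[:, None] * d[None, :]           # r(t) = o + t * d, shape (n_samples, 3)
    sigma, color = sigma_c_mlp(pts, d)                   # density (n_samples,), color (n_samples, 3)
    delta = np.append(np.diff(t), 1e10)                  # distance between adjacent samples
    alpha = 1.0 - np.exp(-sigma * delta)                 # opacity of each segment
    T = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))     # transmittance: light surviving so far
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0)        # predicted pixel color C_hat(r)

def photometric_loss(sigma_c_mlp, rays_o, rays_d, gt_colors):
    """min_Theta sum_r || C_hat(r) - C(r) ||_2^2 over the training pixels."""
    pred = np.stack([render_ray(sigma_c_mlp, o, d) for o, d in zip(rays_o, rays_d)])
    return float(np.mean(np.sum((pred - gt_colors) ** 2, axis=-1)))
```

In the actual method, rays are batched, a coarse/fine hierarchical sampling is used, and the MLP is queried at positionally encoded inputs (next slide).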
Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to have 3D view consistency (3D prior injection)
• 𝜎: ℝ³ → ℝ₊, with 𝜎(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in 3D space

• 𝜎(𝒙): occupancy at coordinate 𝒙 (should depend only on 𝒙)
• 𝒄(𝒙, 𝒅): emitted color at 𝒙 when the view direction is 𝒅

Key idea 2: Modeling high-frequency details
Key idea 2: Use the positional encoding (PE)

Without PE: F_Θ : (𝒙, 𝒅) → (𝒄, 𝜎)
With PE: F_Θ : (γ(𝒙), γ(𝒅)) → (𝒄, 𝜎)
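For reference, the encoding γ from the NeRF paper applies sinusoids of exponentially increasing frequency to each input coordinate p (L = 10 for positions and L = 4 for directions in the original setup):

```latex
\gamma(p) = \big(\sin(2^{0}\pi p),\ \cos(2^{0}\pi p),\ \ldots,\ \sin(2^{L-1}\pi p),\ \cos(2^{L-1}\pi p)\big)
```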

Some recent advanced NeRFs: Block-NeRF

[Tancik et al., 2022] Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022
Some recent advanced NeRFs: Zip-NeRF

[Barron et al., 2023] Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields, arXiv 2023
Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Diffusion model
Denoising Diffusion Probabilistic Model (DDPM)

• Reverse process (sampling / denoising): p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_t)
  Σ_t is treated as a constant 𝜎_t² I

• Forward process (diffusion / adding noise): q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)
  where {β_t} is a fixed variance schedule

Diffusion model: Training / Sampling

Use the evidence lower bound (ELBO) to model the data distribution

This loss can be simplified to a (weighted) noise-prediction / score-matching objective:
L_simple = E_{t, x₀, ε} ‖ε − ε_θ(√ᾱ_t x₀ + √(1 − ᾱ_t) ε, t)‖²

ε_θ: the U-Net that predicts the added noise
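As a hedged illustration (PyTorch-style, assuming a noise-prediction network `unet(x_t, t)` and a precomputed `alpha_bar` schedule, both hypothetical names), one training step of this simplified objective looks roughly like:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(unet, x0, alpha_bar, optimizer):
    """One step of the simplified DDPM objective:
    L_simple = E_{t, x0, eps} || eps - eps_theta(sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, t) ||^2"""
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # a random timestep per sample
    eps = torch.randn_like(x0)                                   # the noise the U-Net must predict
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps           # forward (diffusion) process in one shot
    loss = F.mse_loss(unet(x_t, t), eps)                         # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```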


Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]

2) Brief review of diffusion models [Ho et al., 2020]

3) DreamFusion [Poole et al., 2023]

Appendix) Meta-learning DreamFusion [Lorraine et al., 2023]

[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Motivation: Text-to-3D generation without 3D datasets
Recap: Hard to collect well-curated 3D datasets... (even more costly for text-3D pairs)

[Choi et al., 2016] A Large Dataset of Object Scans, arXiv 2016


[Deitke et al., 2022] Objaverse: A Universe of Annotated 3D Objects, arXiv 2022
Motivation: Text-to-3D generation without 3D datasets
Motivation:
• Text, 2D images, or even text-image pairs are relatively cheap to collect
• Every 2D image is a projection of a 3D scene
→ Can we use a pre-trained text-to-2D diffusion foundation model for text-to-3D generation…?

[Schuhmann et al., 2022] LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models, NeurIPS 2022
[Niemeyer et al., 2021] GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, CVPR 2021
Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles

Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles
Definition of a good image? An image that has a low training loss under the diffusion model

Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like a good image when rendered from random angles
Definition of a good image? An image that has a low training loss under the diffusion model

z_t ≔ α_t x + σ_t ε, a noised version of the rendered image x = g(θ)

Idea: Perform gradient descent with the training loss

Update the NeRF parameters 𝜃 while keeping the text-to-2D diffusion model frozen


New sampling scheme: Score Distillation Sampling (SDS)
However, the U-Net Jacobian term is quite expensive to compute

Exact gradient: ∇_θ L_Diff = E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂ε̂_φ(z_t; y, t)/∂z_t) (∂x/∂θ) ]

SDS: ∇_θ L_SDS = E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂x/∂θ) ]
≈ remove the U-Net Jacobian term ∂ε̂_φ/∂z_t
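A rough PyTorch-style sketch of one SDS update, under the assumptions that `render_nerf` differentiably renders an image from the NeRF parameters, `diffusion_unet` is the frozen text-conditioned noise predictor, and `alphas`, `sigmas`, `w` are the diffusion schedules (all names are illustrative, not the DreamFusion code):

```python
import torch

def sds_step(render_nerf, nerf_params, diffusion_unet, text_emb, camera,
             alphas, sigmas, w, optimizer):
    """One Score Distillation Sampling (SDS) update of the NeRF parameters theta."""
    x = render_nerf(nerf_params, camera)             # x = g(theta): differentiable rendering
    t = int(torch.randint(0, len(alphas), (1,)))     # random diffusion timestep
    eps = torch.randn_like(x)
    z_t = alphas[t] * x + sigmas[t] * eps            # z_t := alpha_t * x + sigma_t * eps
    with torch.no_grad():                            # the text-to-2D diffusion model stays frozen
        eps_hat = diffusion_unet(z_t, t, text_emb)   # text-conditioned noise prediction
    grad = w[t] * (eps_hat - eps)                    # SDS drops the U-Net Jacobian term
    optimizer.zero_grad()
    x.backward(gradient=grad)                        # chain rule only through dx/dtheta
    optimizer.step()
```

The key point is that `eps_hat - eps` is injected directly as the gradient of the rendered image, so backpropagation only traverses ∂x/∂θ and never the U-Net.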

Using more 3D priors is important
For successful training, one needs additional 3D priors beyond the density modeling (e.g., lighting/shading augmentations and view-dependent prompts)
We will not cover the details of this part

Experimental results: Generation
https://dreamfusion3d.github.io/

Experimental results: Iterative refinement

Experimental results: Comparison with baselines
CLIP R-Precision: measures the 3D consistency of the generated scene with the input text
• Retrieves the correct caption among a set of distractors given a rendering of the scene
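As a sketch of how this metric can be computed (assuming hypothetical helpers `clip_image_emb` / `clip_text_emb` that return L2-normalized CLIP embeddings), a rendering counts as correct when its true caption is ranked first among all candidate captions:

```python
import numpy as np

def clip_r_precision(renderings, true_captions, all_captions, clip_image_emb, clip_text_emb):
    """Fraction of renderings whose ground-truth caption is ranked first
    among all candidate captions by CLIP similarity."""
    text_embs = np.stack([clip_text_emb(c) for c in all_captions])   # (num_captions, dim), L2-normalized
    correct = 0
    for image, caption in zip(renderings, true_captions):
        image_emb = clip_image_emb(image)                            # (dim,), L2-normalized
        sims = text_embs @ image_emb                                 # cosine similarity to every caption
        retrieved = all_captions[int(np.argmax(sims))]
        correct += int(retrieved == caption)
    return correct / len(renderings)
```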

Limitations
Requires more than 6 GPU hours to generate one 3D scene (15,000 optimization steps)
The generation quality is somewhat low
Also not scalable: uses a 64×64 diffusion model due to the computational overhead
+ Janus problem (a.k.a. the multi-head problem)

Key takeaway message

Every 2D image is a projection of a 3D scene

We can reconstruct/generate 3D scenes by using 2D generative models

Appendix: Meta-learning DreamFusion
Limitation of DreamFusion: Sampling time
Currently DreamFusion requires 6 GPU hours to generate one 3D scene
Key idea: Use amortized meta-learning to reduce the generation time

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Amortized meta-learning?
Predicting a task-specific model from a given task context set

y = f_θ(x, z), where z = g_φ(C)

• g_φ: set encoder (e.g., a Transformer) that maps the context set C to a latent z (or modulation parameters)
• f_θ: base model evaluated on query points (x, y) ∈ Q, i.e., a query set disjoint from the context set

In text-to-3D generation? Predicting the NeRF parameters/latent from a given text context

High-level design
Predict the NeRF parameters with a language model (LM) from the given text
The predicted NeRF should minimize the SDS loss → update the LM parameters with the SDS loss

Training: [TEXT] → Language model (LM) → NeRF parameters 𝜃, trained with ∇_LM L_SDS(φ, x = g(θ))

Testing: [TEXT] → Language model (LM) → NeRF parameters 𝜃 (a single forward pass)

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
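A hedged sketch of this amortized training loop, under the slide's framing: a text-conditioned mapping network predicts the NeRF parameters, and the SDS gradients flow back into that network rather than into per-prompt NeRF weights. `text_to_nerf_params`, `render_nerf`, and `sds_loss` (a surrogate whose gradient matches the SDS gradient) are assumed names, not the ATT3D code.

```python
import random
import torch

def amortized_text_to_3d_step(text_to_nerf_params, text_encoder, render_nerf,
                              sds_loss, prompts, optimizer):
    """One training step of amortized text-to-3D: update the mapping network phi
    so that the predicted NeRF theta = g_phi(text) minimizes the SDS loss."""
    prompt = random.choice(prompts)              # sample a training prompt
    text_emb = text_encoder(prompt)
    theta = text_to_nerf_params(text_emb)        # predicted NeRF parameters / modulations
    image = render_nerf(theta)                   # render from a (random) viewpoint
    loss = sds_loss(image, text_emb)             # surrogate loss; its gradient equals the SDS gradient
    optimizer.zero_grad()
    loss.backward()                              # gradients reach phi through theta
    optimizer.step()
    return loss.item()
```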
Experimental results: Interpolation between context text

Much faster than DreamFusion (i.e., per-prompt training), but low diversity

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Experimental results: Interpolation between context text

Interpolate the text embedding

[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
