Jihoon Reading Group
[Saharia et al., 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, NeurIPS 2022
[Ho et al., 2022] Imagen Video: High Definition Video Generation with Diffusion Models, arXiv 2022
[OpenAI, 2023] GPT-4 Technical Report, arXiv 2023
Intro: 3D generation is typically hard
Unlike other modalities, well-curated 3D datasets are especially hard to collect
• E.g., text, images, video, and audio are easy to find via Google search or YouTube; 3D assets are not
Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]
2) Review of diffusion models [Ho et al., 2020]
3) DreamFusion [Poole et al., 2023]
4) ATT3D [Lorraine et al., 2023]
[Mildenhall et al., 2020] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020
[Ho et al., 2020] Denoising Diffusion Probabilistic Models, NeurIPS 2020
[Poole et al., 2023] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023
[Lorraine et al., 2023] ATT3D: Amortized Text-to-3D Object Synthesis, arXiv 2023
Goal of NeRF…? Novel view synthesis / rendering
Reconstructing a 3D scene from a set of 2D images (+ camera poses)
Example…? Turning Google Street View into a 3D world
Rendering as color prediction with a given view point
What is a 2D image? A projection of a 3D object onto a 2D plane… or your eyes
For a given eye coordinate 𝒐 + viewing direction 𝒅, estimate the color 𝑪
Novel view synthesis: predicting the color 𝑪′ for a given viewing direction 𝒅′ at a novel eye position 𝒐′
Rendering as color prediction with a given view point
TL;DR of NeRF: train an MLP that can predict the color for a given view direction
Two key contributions
• Injecting 3D prior
• Modeling high-frequency details
Key idea 1: Injecting 3D prior
Key idea 1: Explicitly model the density 𝝈 to enforce 3D view consistency (3D prior injection)
• σ: ℝ³ → ℝ₊ with σ(𝒓(t)) ∈ [0, 1], where 𝒓(t) is a coordinate in the 3D space
• Camera ray: 𝒓(t) = 𝒐 + t𝒅
• Rendered color: Ĉ(𝒓) = ∫ T(t) σ(𝒓(t)) 𝒄(𝒓(t), 𝒅) dt, with transmittance T(t) = exp(−∫ σ(𝒓(s)) ds)
• Training objective: min_𝚯 ‖Ĉ(𝒓) − C(𝒓)‖₂², where C(𝒓) is the ground-truth color
NeRF: Let’s parametrize (or learn) σ(⋅) and 𝒄(⋅,⋅) with an MLP 𝚯
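The rendering step above is usually evaluated by numerical quadrature along the ray. A minimal sketch follows, with toy density and color functions passed as callables; this is an illustration, not the paper's implementation (which additionally uses hierarchical sampling):

```python
import numpy as np

def render_ray(sigma_fn, color_fn, o, d, t_near=0.0, t_far=4.0, n_samples=64):
    """Estimate the pixel color along the ray r(t) = o + t*d via quadrature:
    C ≈ sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    t = np.linspace(t_near, t_far, n_samples)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))   # spacing between samples
    pts = o[None, :] + t[:, None] * d[None, :]         # sample points r(t_i)
    sigma = sigma_fn(pts)                              # density at each sample
    c = color_fn(pts, d)                               # view-dependent color
    alpha = 1.0 - np.exp(-sigma * delta)               # per-segment opacity
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = T * alpha
    return (weights[:, None] * c).sum(axis=0)
```

With a dense, uniformly red medium the weights sum to nearly one and the returned color approaches pure red; with zero density the ray returns black.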
Key idea 2: Modeling high-frequency details
Key idea 2: Use the positional encoding (PE)
• F_𝚯: (𝒙, 𝒅) → (𝒄, σ) becomes F_𝚯: (γ(𝒙), γ(𝒅)) → (𝒄, σ)
• γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^{L−1}πp), cos(2^{L−1}πp)), applied to each coordinate
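The encoding γ can be sketched in a few lines; L, the number of frequency bands, is a hyperparameter (the paper uses L = 10 for positions and L = 4 for directions):

```python
import numpy as np

def positional_encoding(p, L=10):
    """NeRF positional encoding gamma(p): map each coordinate of p to
    (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p))."""
    freqs = 2.0 ** np.arange(L) * np.pi        # 2^k * pi for k = 0..L-1
    angles = p[..., None] * freqs              # shape (..., dim, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)      # shape (..., dim * 2L)
```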
Some recent advanced NeRFs: BlockNeRF
[Barron et al., 2022] Block-NeRF: Scalable Large Scene Neural View Synthesis, CVPR 2022
Some recent advanced NeRFs: Zip-NeRF
[Barron et al., 2023] Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields, arXiv 2023
Outline of today’s presentation
1) Review of NeRF [Mildenhall et al., 2020]
2) Review of diffusion models [Ho et al., 2020]
3) DreamFusion [Poole et al., 2023]
4) ATT3D [Lorraine et al., 2023]
Diffusion model
Denoising Diffusion Probabilistic Model (DDPM)
• Forward process: q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)
• Training objective: L_simple = E_{t, x₀, ε}[ ‖ε − ε_θ(√ᾱ_t x₀ + √(1 − ᾱ_t) ε, t)‖² ], where ᾱ_t = ∏_{s=1}^{t} α_s and α_t = 1 − β_t
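The training objective can be computed numerically as a sketch below, using a linear β schedule as in Ho et al. (2020); `eps_model` is a placeholder for the actual noise-prediction network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule beta_1..beta_T (values from Ho et al., 2020)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)                # abar_t = prod_{s<=t} alpha_s

def ddpm_loss(eps_model, x0):
    """One Monte Carlo sample of the simple DDPM objective:
    ||eps - eps_theta(sqrt(abar_t) x0 + sqrt(1 - abar_t) eps, t)||^2."""
    t = rng.integers(0, T)                     # uniform timestep
    eps = rng.standard_normal(x0.shape)        # true injected noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```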
Diffusion model: Training / Sampling
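Sampling runs the learned reverse process from pure noise. A minimal sketch of ancestral sampling (Algorithm 2 of Ho et al., 2020), again with `eps_model` standing in for the trained network:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: x_{t-1} = (x_t - beta_t/sqrt(1-abar_t) * eps_theta)
    / sqrt(alpha_t) + sqrt(beta_t) * z, starting from x_T ~ N(0, I)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_model(x, t)) / np.sqrt(alphas[t]) \
            + np.sqrt(betas[t]) * z                      # one denoising step
    return x
```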
Motivation: Text-to-3D generation without 3D datasets
Recap: Hard to collect well-curated 3D datasets... (even more costly for text-3D pairs)
[Beaumont et al., 2022] LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets, NeurIPS 2022
[Niemeyer et al., 2021] GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, CVPR 2021
Key idea: Use a pre-trained diffusion model to train NeRF
Goal: Create a NeRF that looks like good images when rendered from random angles
Definition of a good image? An image that has a low training loss under the pre-trained diffusion model
• z_t ≔ α_t x + σ_t ε, where x = g(θ) is the rendered image
• Exact gradient: ∇_θ L_Diff = E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂ε̂_φ(z_t; y, t)/∂z_t) (∂x/∂θ) ]
• SDS: ∇_θ L_SDS ≔ E_{t,ε}[ w(t) (ε̂_φ(z_t; y, t) − ε) (∂x/∂θ) ] ← remove the U-Net Jacobian ∂ε̂_φ/∂z_t
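The effect of the SDS update can be illustrated on a toy problem where the "NeRF" is just a 3-vector and the frozen teacher's noise prediction is a hypothetical linear pull toward a target image. Everything here (the target, the teacher, the schedule constants) is an assumed stand-in, not the actual DreamFusion setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical image that the frozen teacher assigns low diffusion loss to.
target = np.array([1.0, -2.0, 0.5])

def eps_hat(z_t, eps, x):
    """Toy stand-in for the frozen diffusion model: its noise prediction
    deviates from the true noise eps in the direction (x - target)."""
    return eps + 0.1 * (x - target)

# Toy "NeRF": theta IS the rendered image, so dx/dtheta = I.
theta = np.zeros(3)
for step in range(500):
    x = theta                               # rendered image x = g(theta)
    eps = rng.standard_normal(3)            # sampled noise
    z_t = 0.8 * x + 0.6 * eps               # z_t = alpha_t x + sigma_t eps
    # SDS gradient: w(t) * (eps_hat - eps) * dx/dtheta, U-Net Jacobian dropped
    grad = eps_hat(z_t, eps, x) - eps       # here w(t) = 1 and dx/dtheta = I
    theta = theta - 0.5 * grad              # gradient descent on theta
```

Gradient descent with this surrogate drives the rendering toward the image the teacher prefers, which is the intuition behind optimizing NeRF through a frozen diffusion model.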
Using more 3D priors is important
For successful training, one needs to consider additional 3D priors (beyond the density prior)
We will not cover the details of this part
Experimental results: Generation
https://dreamfusion3d.github.io/
Experimental results: Iterative refinement
Experimental results: Comparison with baselines
CLIP R-Precision: measures 3D consistency
• Given a rendering of the scene, retrieve the correct caption from a set of distractors
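The retrieval step can be sketched as follows (a single-sample version; the embeddings are assumed to come from a CLIP image/text encoder, which is not reproduced here):

```python
import numpy as np

def clip_r_precision(render_emb, caption_embs, true_idx):
    """Per-sample CLIP R-Precision: 1 if the true caption is the most similar
    caption (by cosine similarity) to the rendering's embedding, else 0."""
    sims = caption_embs @ render_emb / (
        np.linalg.norm(caption_embs, axis=1) * np.linalg.norm(render_emb))
    return int(np.argmax(sims) == true_idx)
```

Averaging this indicator over renderings from several viewpoints gives the reported score.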
Limitations
Requires more than 6 GPU-hours to generate one 3D scene (15,000 optimization steps)
The generation quality is somewhat low
Also, not scalable: restricted to a 64×64 diffusion model due to the computational overhead
+ the Janus problem (a.k.a. the multi-head problem)
Key takeaway message
Appendix: Meta-learning DreamFusion
Limitation of DreamFusion: Sampling time
Currently DreamFusion requires 6 GPU hours to generate one 3D scene
Key idea: Use amortized meta-learning to accelerate generation time
Amortized meta-learning?
Predicting a task-specific model from a given task context set
[Figure: the set encoder g_φ (e.g., a Transformer) maps the context set C to a latent z or modulation parameters; the base model f_θ, conditioned on z, predicts (x, y) ∈ Q on the query set, which is disjoint from the context set]
In text-to-3D generation? Predicting the NeRF parameter/latent with a given text context
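The context-to-model pipeline above can be sketched with toy linear/tanh layers; the FiLM-style per-feature scale below is one assumed realization of the "modulation parameter" route, not a specific published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def set_encoder(context, W_enc):
    """Permutation-invariant set encoder g_phi: embed each context point,
    then mean-pool into a single latent z."""
    return np.tanh(context @ W_enc).mean(axis=0)

def base_model(x, z, W1, W2, W_mod):
    """Base model f_theta conditioned on z via a FiLM-style per-feature scale
    derived from the latent (the 'modulation parameter' route)."""
    scale = 1.0 + W_mod @ z            # modulation parameters from z
    h = np.tanh(W1 @ x) * scale        # modulated hidden features
    return W2 @ h
```

Mean-pooling makes the encoder order-invariant, so shuffling the context set leaves the predicted latent (and hence the task-specific model) unchanged.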
High-level design
Predict the NeRF parameters from the given text with a language model (LM)
The predicted NeRF should minimize the SDS loss → update the LM parameters with the SDS loss
[Figure: Training: [TEXT] → language model (LM) → NeRF parameter θ, with the LM updated by ∇_LM L_SDS(φ, x = g(θ)); Testing: [TEXT] → LM → NeRF parameter θ, with no further optimization]
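The amortized training loop can be illustrated on a toy problem: a linear "language model" maps hypothetical text embeddings to NeRF parameters, and its weights are updated through a toy SDS-style gradient on those parameters. All names, embeddings, and targets here are illustrative assumptions, not ATT3D's actual components:

```python
import numpy as np

# Hypothetical text embeddings and the image each prompt's teacher "prefers".
prompts = {"a red cube": np.array([1.0, 0.0]),
           "a blue sphere": np.array([0.0, 1.0])}
targets = {"a red cube": np.array([2.0, -1.0, 0.5]),
           "a blue sphere": np.array([-1.0, 1.5, 0.0])}

# Toy "language model": a single linear map from text embedding to NeRF params.
W = np.zeros((3, 2))
for step in range(2000):
    text = list(prompts)[step % 2]                 # alternate over the prompt set
    e = prompts[text]
    theta = W @ e                                  # predicted NeRF parameters
    grad_theta = 0.1 * (theta - targets[text])     # toy SDS gradient on theta
    W = W - 0.5 * np.outer(grad_theta, e)          # chain rule through theta = W e
```

After training, one forward pass through the shared map yields a good θ for every prompt it was trained on, which is the amortization payoff over per-prompt optimization.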
Experimental results: Interpolation between context text
Much faster than DreamFusion (i.e., per-prompt training), but lower diversity