Shape, Reflectance, and Illumination From Appearance
by
Xiuming Zhang
B.Eng., National University of Singapore (2015)
S.M., Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2021
© Massachusetts Institute of Technology 2021. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
August 27, 2021
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
William T. Freeman
Thomas and Gerd Perkins Professor of Electrical Engineering and
Computer Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Shape, Reflectance, and Illumination From Appearance
by
Xiuming Zhang
Abstract
The image formation process describes how light interacts with the objects in a scene
and eventually reaches the camera, forming an image that we observe. Inverting this
process is a long-standing, ill-posed problem in computer vision, which involves es-
timating shape, material properties, and/or illumination passively from the object’s
appearance. Such “inverse rendering” capabilities enable 3D understanding of our
world (as desired in autonomous driving, robotics, etc.) and computer graphics appli-
cations such as relighting, view synthesis, and object capture (as desired in Extended
Reality [XR], etc.).
In this dissertation, we study inverse rendering by recovering three-dimensional
(3D) shape, reflectance, illumination, or everything jointly under different setups.
The input across different setups varies from single images to multi-view images lit
by multiple known lighting conditions, then to multi-view images under one unknown
illumination. Across the setups, we explore optimization-based recovery that exploits
multiple observations of the same object, learning-based reconstruction that heavily
relies on data-driven priors, and a mixture of both. Depending on the problem, we
perform inverse rendering at three different levels of abstraction: I) At a low level of
abstraction, we develop physically-based models that explicitly solve for every term
in the rendering equation, II) at a middle level, we utilize the light transport function
to abstract away intermediate light bounces and model only the final “net effect,”
and III) at a high level, we treat rendering as a black box and directly invert it
with learned data-driven priors. We also demonstrate how higher-level abstraction leads to models that are simpler and applicable to single images, but that also possess fewer capabilities.
This dissertation discusses four instances of inverse rendering, gradually ascending
in the level of abstraction. In the first instance, we focus on the low-level abstraction
where we decompose appearance explicitly into shape, reflectance, and illumination.
To this end, we present a physically-based model capable of such full factorization
under one unknown illumination and another that handles one-bounce indirect illumi-
nation. In the second instance, we ascend to the middle level of abstraction, at which
we model appearance with the light transport function, demonstrating how this level
of modeling easily supports relighting with global illumination, view synthesis, and
both tasks simultaneously. Finally, at the high level of abstraction, we employ deep
learning to directly invert the rendering black box in a data-driven fashion. Specif-
ically, in the third instance, we recover 3D shapes from single images by learning
data-driven shape priors and further make our reconstruction generalizable to novel
shape classes unseen during training. Also relying on data-driven priors, the fourth
instance concerns how to recover lighting from the appearance of the illuminated
object, without explicitly modeling the image formation process.
Acknowledgments
These five years at MIT have been truly an amazing journey: I learned “like taking
a drink from a fire hose” (former MIT President Wiesner) and made lifelong friends
with whom I can share the ups and downs in taking that drink.
I owe a debt of gratitude to my advisors who got me started in research during
my undergraduate time: Thomas Yeo, Mert Sabuncu, and Beth Mormino. Thomas
was my Bachelor’s thesis advisor, with whom I worked on a daily basis for around two years. Technically rigorous yet attentive to detail, he showed me what top-notch research looked like when I did not yet know much about machine learning. Since
my graduation, he has continued to support me in many ways, from graduate school applications to writing recommendation letters. Even though the collaboration
with Mert had been mostly online, he generously offered many helpful suggestions on
graduate research in our first (and probably only) in-person interaction back in 2016.
Without the rigorous research training from them, I would not be here writing this
dissertation today.
Besides those already mentioned, I was fortunate to have worked with many in-
telligent collaborators during my Ph.D. (in approximately chronological order): Ji-
ajun Wu*, Zhoutong Zhang*, Chengkai Zhang*, Josh Tenenbaum*, Tianfan Xue*,
Xingyuan Sun*, Charles He, Tali Dekel, Stefanie Mueller, Andrew Owens, Yikai
Li, Jiayuan Mao, Noah Snavely, Cecilia Zhang, Ren Ng, David E. Jacobs, Sergio
Orts-Escolano*, Rohit Pandey*, Christoph Rhemann*, Sean Fanello*, Yun-Ta Tsai*,
Tiancheng Sun*, Zexiang Xu*, Ravi Ramamoorthi*, Paul Debevec*, Boyang Deng*,
Pratul Srinivasan*, Matt Tancik*, Ben Mildenhall*, Steven Liu, Richard Zhang, Jun-
Yan Zhu, and Bryan Russell. This dissertation would not have been possible without
the input from the co-authors marked with an asterisk. I want to particularly thank
two labmates from this list: Jiajun and Zhoutong. As a senior student in the Lab, Jia-
jun provided valuable advice and help in bootstrapping my computer vision research;
the knowledge I gained from exploring 3D vision with Jiajun laid the foundation for
this dissertation. Zhoutong, despite being my peer, constantly amazes me with his
breadth of knowledge in vision and graphics; a “walking Visionpedia” is what I call
him. It was my privilege to have learned so much from everyone listed above.
I would not have been able to get through these challenging years without the
support from the staff in EECS and CSAIL (in no particular order): Janet Fischer,
Alicia Duarte, Kathy McCoy, Katrina LaCurts, Roger White, Sheila Sharbetian, Fern
Keniston, Rachel Gordon, Adam Conner-Simons, Garrett Wollman, Steve Ruggiero,
Jay Sekora, Tom Buehler, Jason Dorfman, Jon Proulx, Mark Pearrow, etc. Janet
and Katrina provided so much helpful advice as I navigated my way to where I am today. I worked
with Rachel, Adam, Jason, and Tom on the MoSculp news article. They were such
a supportive and strong team that made MoSculp a hit. I would also like to thank
everyone in The Infrastructure Group, without whose solid technical skills and prompt help I would not have been able to do any of the research presented in this dissertation.
I am thankful to everyone in the Vision Graphics Neighborhood and beyond at
MIT. I enjoyed every conversation we had in our (tiny) kitchen. We went hiking,
watched several musicals and plays, and witnessed the total solar eclipse together.
Thanks to all of you for making my MIT life colorful. To all my friends scattered
around the world, thank you, too, for the support and friendship.
I want to thank my entire family for their unwavering love and support, especially
my parents, Yanbin Sun and Chunmin Zhang, who have always been supporting my
decisions unconditionally (even though some came with great financial costs). I hope
we all agree that we have made the right calls. Being far away from home (and now
trapped by COVID), I was not able to go back home as often as I wanted during my
Ph.D.; as their only child, I wish I had done more. If only Nai-Nai and Lao-Ye were
still around with us today, they would be so proud to see their grandson graduating
with a Ph.D. Thank you, and I love you all.
Lastly, I thank my girlfriend, Hanzheng Li, for always being there for me. I owe a
lot of my success to her. Despite being a classical pianist, she knows all about nonsense
like Reviewer #2, weak reject, etc. She always manages to cheer me up when bad
things happen and to calm me down before overexcitement becomes sorrow – just the
perfect other half for me. My quality of life has risen significantly since I met her,
and I look forward to the next chapter of life together with her.
To Yanbin, Chunmin, and Hanzheng
Brief Contents
1 Introduction 25
1.1 Image Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.2 Inverting the Image Formation Process . . . . . . . . . . . . . . . . . 39
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 High-Level Abstraction: Data-Driven Shape Reconstruction 169
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.3 Method: Learning & Using Shape Priors . . . . . . . . . . . . . . . . 179
4.4 Method: Generalizing to Unseen Classes . . . . . . . . . . . . . . . . 183
4.5 Method: Building a Real-World Dataset . . . . . . . . . . . . . . . . 187
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
D Supplement: Generalizable Reconstruction (GenRe) 265
D.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
D.2 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Contents
1 Introduction 25
1.1 Image Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.1.1 Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.1.2 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.1.3 Lighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.1.4 Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.1.5 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2 Inverting the Image Formation Process . . . . . . . . . . . . . . . . . 39
1.2.1 Joint Estimation of Shape, Reflectance, & Illumination . . . . 40
1.2.2 Interpolating the Light Transport Function . . . . . . . . . . . 41
1.2.3 Shape Reconstruction . . . . . . . . . . . . . . . . . . . . . . . 42
1.2.4 Lighting Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.2 Neural Reflectance Fields . . . . . . . . . . . . . . . . . . . . 62
2.3.3 Light Transport via Neural Visibility Fields . . . . . . . . . . 63
2.3.4 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.3.5 Training & Implementation Details . . . . . . . . . . . . . . . 68
2.4 Method: One Unknown Illumination . . . . . . . . . . . . . . . . . . 70
2.4.1 Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.4.2 Reflectance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4.3 Illumination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.4.4 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 79
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.5.2 Shape Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.5.3 Joint Estimation of Shape, Reflectance, & Illumination . . . . 83
2.5.4 Free-Viewpoint Relighting . . . . . . . . . . . . . . . . . . . . 85
2.5.5 Material Editing . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6.1 Baseline Comparisons: Multiple Known Illuminations . . . . . 91
2.6.2 Baseline Comparisons: One Unknown Illumination . . . . . . 95
2.6.3 Ablation Studies: Multiple Known Illuminations . . . . . . . . 99
2.6.4 Ablation Studies: One Unknown Illumination . . . . . . . . . 100
2.6.5 Estimation Consistency Across Different Illuminations . . . . 103
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.2.4 Multiple Views & Illuminants . . . . . . . . . . . . . . . . . . 118
3.3 Method: Precise, High-Frequency Relighting . . . . . . . . . . . . . . 119
3.3.1 Active Set Construction . . . . . . . . . . . . . . . . . . . . . 121
3.3.2 Alias-Free Pooling . . . . . . . . . . . . . . . . . . . . . . . . 123
3.3.3 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.4 Loss Functions & Training Strategy . . . . . . . . . . . . . . . 126
3.4 Method: Free-Viewpoint Relighting . . . . . . . . . . . . . . . . . . . 127
3.4.1 Texture-Space Inputs . . . . . . . . . . . . . . . . . . . . . . . 129
3.4.2 Query & Observation Networks . . . . . . . . . . . . . . . . . 132
3.4.3 Residual Learning of High-Order Effects . . . . . . . . . . . . 133
3.4.4 Simultaneous Relighting & View Synthesis . . . . . . . . . . . 136
3.4.5 Network Architecture, Losses, & Other Details . . . . . . . . . 136
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.5.1 Hardware Setup & Data Acquisition . . . . . . . . . . . . . . 139
3.5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 142
3.5.3 Precise Directional Relighting . . . . . . . . . . . . . . . . . . 143
3.5.4 High-Frequency Image-Based Relighting . . . . . . . . . . . . 144
3.5.5 Lighting Softness Control . . . . . . . . . . . . . . . . . . . . 144
3.5.6 Geometry-Free Relighting . . . . . . . . . . . . . . . . . . . . 146
3.5.7 Geometry-Based Relighting . . . . . . . . . . . . . . . . . . . 148
3.5.8 Changing the Viewpoint . . . . . . . . . . . . . . . . . . . . . 151
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.6.1 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.6.2 Image-Based Relighting Under Varying Light Frequency . . . 159
3.6.3 Subsampling the Light Stage . . . . . . . . . . . . . . . . . . . 161
3.6.4 Degrading the Input Geometry Proxy . . . . . . . . . . . . . . 163
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.2.1 3D Shape Completion . . . . . . . . . . . . . . . . . . . . . . 174
4.2.2 Single-Image 3D Reconstruction . . . . . . . . . . . . . . . . . 175
4.2.3 2.5D Sketch Recovery . . . . . . . . . . . . . . . . . . . . . . . 176
4.2.4 Perceptual Losses & Adversarial Learning . . . . . . . . . . . 176
4.2.5 Spherical Projections . . . . . . . . . . . . . . . . . . . . . . . 177
4.2.6 Zero- & Few-Shot Recognition . . . . . . . . . . . . . . . . . . 177
4.2.7 3D Shape Datasets . . . . . . . . . . . . . . . . . . . . . . . . 178
4.3 Method: Learning & Using Shape Priors . . . . . . . . . . . . . . . . 179
4.3.1 Shape Naturalness Network . . . . . . . . . . . . . . . . . . . 181
4.3.2 Training Paradigm . . . . . . . . . . . . . . . . . . . . . . . . 182
4.4 Method: Generalizing to Unseen Classes . . . . . . . . . . . . . . . . 183
4.4.1 Single-View Depth Estimator . . . . . . . . . . . . . . . . . . 184
4.4.2 Spherical Map Inpainting Network . . . . . . . . . . . . . . . 184
4.4.3 Voxel Refinement Network . . . . . . . . . . . . . . . . . . . . 185
4.4.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.5 Method: Building a Real-World Dataset . . . . . . . . . . . . . . . . 187
4.5.1 Collecting Image-Shape Pairs . . . . . . . . . . . . . . . . . . 189
4.5.2 Image-Shape Alignment . . . . . . . . . . . . . . . . . . . . . 190
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.6.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.6.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.6.4 Single-View Shape Completion . . . . . . . . . . . . . . . . . . 199
4.6.5 Single-View Shape Reconstruction . . . . . . . . . . . . . . . . 203
4.6.6 Estimating Depth for Novel Shape Classes . . . . . . . . . . . 208
4.6.7 Reconstructing Novel Objects From Training Classes . . . . . 208
4.6.8 Reconstructing Objects From Unseen Classes . . . . . . . . . 209
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.7.1 Network Visualization . . . . . . . . . . . . . . . . . . . . . . 212
4.7.2 Training With the Naturalness Loss Over Time . . . . . . . . 213
4.7.3 Failure Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.7.4 Effects of Viewpoints on Generalization . . . . . . . . . . . . . 214
4.7.5 Generalizing to Non-Rigid Shapes . . . . . . . . . . . . . . . . 214
4.7.6 Generalizing to Highly Regular Shapes . . . . . . . . . . . . . 215
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
B Supplement: Light Stage Super-Resolution (LSSR) 255
B.1 Progressive Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.2 Baseline Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
List of Figures
2-18 NeRV with analytic vs. MLP-predicted normals. . . . . . . . . . . . 100
2-19 Qualitative ablation studies of NeRFactor. . . . . . . . . . . . . . . . 102
2-20 Albedo estimation of NeRFactor across different illuminations. . . . 103
3-27 A failure case of NLT’s view synthesis. . . . . . . . . . . . . . . . . . 166
5-4 EarthGAN model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5-5 Different Earth appearances at similar timestamps. . . . . . . . . . . 233
5-6 Earth recovery by EarthGAN. . . . . . . . . . . . . . . . . . . . . . 236
5-7 Continuous Earth rotation learned by EarthGAN. . . . . . . . . . . 238
5-8 Smooth evolution of the Earth appearance learned by EarthGAN. . . 239
5-9 How EarthGAN learns to model the clouds. . . . . . . . . . . . . . . 241
5-10 Ablation studies of EarthGAN’s design choices. . . . . . . . . . . . . 243
List of Tables
Chapter 1
Introduction
algorithm capable of estimating facial geometry and reflectance would enable “magic”
portrait relighting features on consumer mobile phones, such as Google’s Portrait
Light [Tsai and Pandey, 2020].
the subject’s appearance observed under various lighting conditions (also from differ-
ent views for Zhang et al. [2021b]), without further factorizing the function into the
underlying shape and reflectance. This mid-level abstraction allows our models to
easily include global illumination effects, but it does not support shape or material
editing (which the low-level abstraction permits) and requires multiple images of the
object (in contrast to the high-level abstraction that is applicable to single images).
Finally, at a high level of abstraction, we aim to directly regress the inter-
mediate factors (e.g., shape, lighting, etc.) from their resultant appearance, without
modeling the actual image formation process. This level of abstraction treats render-
ing as a black box to be inverted and usually involves training end-to-end machine
learning models on large datasets to learn data-driven priors directly on the inter-
mediate factors. Specifically, in this dissertation, we explore two instances of such
methods: 3D shape reconstruction from single images (Chapter 4) and lighting recov-
ery from the appearance of the illuminated object (Chapter 5). In the first problem of
shape reconstruction, we train neural networks to directly regress 3D shapes from sin-
gle images thereof, leveraging the data-driven shape priors learned from a large-scale
shape dataset [Sun et al., 2018b, Wu et al., 2018]. We further make such networks
generalizable to novel shape classes unseen during training, by wiring geometric pro-
jections (which we understand well and can specify exactly) as inductive bias into
our model [Zhang et al., 2018b]. In the second problem of lighting recovery, we train
a conditional generative model to learn regularities in our lighting conditions, such
that, when given the appearance of the illuminated object, the model generates a plausible
lighting condition responsible for the observation. With this high-level abstraction,
we ignore the physics of the image formation process and take data-driven approaches
that accept single-image input, leveraging the power of machine learning.
In this section, we briefly introduce the image formation process in nature or computer
graphics. Figure 1-1 shows a cartoon visualization of the relationships among the
object, light, and camera, an example real photo of the scene, and a computer graphics
render aiming to reproduce that real photo. We present only a simplified process that
is sufficient for what this dissertation concerns. In this simplified framework, there
are four key scene elements—shape, materials, lighting, and the camera—and the
rendering process that combines these elements into an RGB image of the scene. In
the following subsections, we elaborate on each of these four scene elements and the
rendering process.
Figure 1-1: Relationships among the object, light, and camera. Top: Light travels
from the source to the scene, interacts with the objects therein, and reaches the
camera. Bottom (left): A real photo contains complex light transport effects such as
specular highlights and soft shadows. Bottom (right): With careful scene modeling
and physically-based rendering, one can reproduce the real photo with a synthetic
render, thanks to computer graphics.
1.1.1 Shape
the representation must be powerful enough to represent high-frequency structures,
descriptive enough to extract information from, and fast enough to perform operations
on. Unsurprisingly, there is no single optimal shape representation that is omnipotent.
We visualize the popular shape representations in Figure 1-2.
Figure 1-2: Popular shape representations. (A) Mesh. (B) Point clouds. (C) Voxels. (D) SDF (0 level set). (E) 2.5D maps.
Mesh The computer graphics community has used the mesh as its shape representation for decades. Briefly, a mesh is a compact representation that describes a shape as a list of vertices and faces (i.e., connectivity among the vertices). See Figure 1-2
(A) for an example. Besides being compact, a mesh is powerful enough to represent complex geometry of any topology (by simply adding more vertices and faces) and supports efficient ray-intersection computation [Möller and Trumbore, 1997], a particularly important feature since millions of ray-mesh intersections are common in ray tracing. Despite being universal and efficient, meshes are less amenable to neural networks than the other representations to be discussed below.
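To make the ray-mesh intersection step concrete, here is a minimal NumPy sketch of the Möller–Trumbore ray-triangle test cited above; the function name and the unit-triangle example are illustrative, not code from this dissertation.

```python
import numpy as np

def ray_triangle_intersect(orig, direc, v0, v1, v2, eps=1e-8):
    """Moller-Trumbore ray-triangle intersection.

    Returns the distance t along the ray, or None if there is no hit.
    """
    e1, e2 = v1 - v0, v2 - v0
    pvec = np.cross(direc, e2)
    det = e1.dot(pvec)
    if abs(det) < eps:           # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    tvec = orig - v0
    u = tvec.dot(pvec) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = np.cross(tvec, e1)
    v = direc.dot(qvec) * inv_det        # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = e2.dot(qvec) * inv_det           # distance along the ray
    return t if t > eps else None

# A ray shot down the -z axis at a unit triangle lying in the z = 0 plane:
orig = np.array([0.25, 0.25, 1.0])
direc = np.array([0.0, 0.0, -1.0])
v0 = np.array([0.0, 0.0, 0.0])
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
t = ray_triangle_intersect(orig, direc, v0, v1, v2)  # hits at t = 1.0
```

Millions of calls like this, one per ray-triangle pair (accelerated with spatial data structures), are the workhorse of mesh ray tracing.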
Point Clouds If we ablate the mesh faces (or equivalently, the vertex connectivity), we arrive at the point cloud representation, where a collection of 3D points describes the
surface geometry. See Figure 1-2 (B) for an example. A point cloud of size 𝑁 is just
an 𝑁 × 3 array of unordered 3D coordinates and therefore can be easily processed
by network architectures such as PointNet [Qi et al., 2017a]. The major drawback,
though, is its lack of surface semantics, since there is no face information. As such,
a ray that should have hit the surface would travel through the unconnected points,
and the concept of being inside or outside of the shape is undefined.
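The order-invariance that makes point clouds amenable to PointNet-style processing can be sketched in a few lines; the tanh "layer" and max pooling below are a toy stand-in for PointNet's learned per-point features, not its actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))   # a point cloud: N x 3 unordered coordinates
W = rng.standard_normal((3, 16))         # a toy shared per-point "layer"

def readout(points):
    """PointNet-style readout: a shared per-point transform followed by a
    permutation-invariant pool (max), so point order does not matter."""
    feats = np.tanh(points @ W)
    return feats.max(axis=0)

shuffled = cloud[rng.permutation(len(cloud))]   # same points, different order
# readout(cloud) and readout(shuffled) are identical
```

The symmetric (max) pooling is what lets such architectures consume an unordered 𝑁 × 3 array directly.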
this dissertation explores using MLPs to represent geometry in two ways: The first
half of the chapter maintains a volumetric representation using MLPs [Srinivasan
et al., 2021], while the second half opts for a surface representation but also using
MLPs [Zhang et al., 2021c].
2.5D Maps Besides 3D representations, there are also 2D representations that can
describe 3D shape. With 3D semantics such as depth or normals, these 2D images are
often referred to as “2.5D” maps or buffers [Marr, 1982]. More specifically, a depth
map has its pixel values indicating how far the camera rays travel before hitting the
objects in the scene; a normal map has its pixel values specifying the 3D orientations
of the surface points visible from this view. See Figure 1-2 (E) for an example. Unlike
3D representations, these 2.5D maps are dependent on the view: Different views of
the same scene lead to different 2.5D maps since different 3D points fall onto the
image plane.
Because these maps are essentially 2D images exploiting the sparseness of 3D surfaces,
they are amenable to image CNNs and other network architectures designed for im-
ages. We use depth maps and other custom 2.5D maps (such as spherical maps in
Section 4.4) in recovering 3D shape in Chapter 4. In addition, Chapter 2 also vi-
sualizes many geometric properties such as surface normals and light visibility using
these 2.5D representations.
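As a small illustration of how a depth map encodes view-dependent 3D shape, the sketch below back-projects a depth map into a camera-space point cloud; the pinhole model with focal length f (in pixels) and a centered principal point is an assumption here, and the function name is ours, not from this dissertation.

```python
import numpy as np

def depth_to_points(depth, f):
    """Back-project a depth map (z-buffer convention) into a camera-space
    point cloud, assuming a pinhole camera with focal length f (pixels)
    and the principal point at the image center."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    x = (u - w / 2) / f * depth   # pixel column -> camera-space x
    y = (v - h / 2) / f * depth   # pixel row    -> camera-space y
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# A flat wall 2 units in front of the camera back-projects to points at z = 2.
pts = depth_to_points(np.full((4, 4), 2.0), f=100.0)
```

Note the view dependence: the recovered points are in this camera's frame, so a different viewpoint of the same scene yields a different set of points.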
1.1.2 Materials
With the shape defined, one next specifies the material properties for the object,
possibly in a spatially-varying way. The simplest material description is reflectance,
concerning only a local surface point where the light ray lands. Because this type of
reflectance depends on only the incoming and outgoing directions (𝜔i and 𝜔o ) w.r.t.
the local surface normal 𝑛 at that point (i.e., no non-local information is required), it can be conveniently expressed using a Bidirectional Reflectance Distribution Function
(BRDF): 𝑓 (𝜔i , 𝜔o ). Intuitively, 𝑓 (·) describes how the outgoing energy is distributed
over all possible 𝜔o ’s given every 𝜔i , as visualized in Figure 1-3. The fact that 𝜔i
and 𝜔o are often defined in the local frame with 𝑛 as the 𝑧-axis demonstrates why we
often require the shape be defined before considering materials (not to mention that
we need geometry to find the ray-surface intersection too).
Figure 1-3: Example BRDFs. A perfectly reflective BRDF reflects the incoming light
to the mirrored direction. A glossy BRDF reflects light to a lobe of directions centered
around the mirror direction. A diffuse BRDF reflects light equally to all directions.
A general BRDF reflects light into all directions non-uniformly.
With this formulation, one can describe a diffuse material using the Lambertian
BRDF. Because a perfectly Lambertian material reflects the incoming light to all
outgoing directions equally, the Lambertian BRDF simply returns the same constant
for all 𝜔o ’s given any 𝜔i . Other commonly used BRDFs include the Blinn-Phong
reflection model [Blinn, 1977] and the microfacet BRDF by Walter et al. [2007],
both of which are capable of describing glossy materials with specular highlights (like
those shown in Figure 1-1). If an object has different BRDFs at different surface
locations, Spatially-Varying BRDFs (SVBRDFs) are necessary to specify its material
properties. In Chapter 2, we use the microfacet BRDF by Walter et al. [2007] as the
main reflection model [Srinivasan et al., 2021] and as an analytic alternative to our
learned BRDF [Zhang et al., 2021c].
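For concreteness, here are toy evaluations of the two simplest reflection models mentioned above: the constant Lambertian BRDF and the specular lobe of the Blinn-Phong model. Normalization factors are omitted, so this is illustrative rather than physically calibrated.

```python
import numpy as np

def lambertian_brdf(albedo):
    """A Lambertian BRDF returns the same constant for all (w_i, w_o) pairs."""
    return albedo / np.pi

def blinn_phong_specular(w_i, w_o, n, k_s, shininess):
    """Specular lobe of the Blinn-Phong model (normalization omitted):
    largest when the half vector between w_i and w_o aligns with n."""
    h = (w_i + w_o) / np.linalg.norm(w_i + w_o)
    return k_s * max(float(np.dot(n, h)), 0.0) ** shininess

n = np.array([0.0, 0.0, 1.0])
w_i = np.array([0.0, 0.0, 1.0])                    # light from straight above
w_off = np.array([1.0, 0.0, 1.0]) / np.sqrt(2.0)   # viewing 45 degrees off
peak = blinn_phong_specular(w_i, w_i, n, 1.0, 64)  # mirror direction: h == n
off = blinn_phong_specular(w_i, w_off, n, 1.0, 64) # the lobe falls off quickly
```

A larger shininess exponent tightens the lobe around the mirror direction, producing the concentrated specular highlights discussed above.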
Despite being easy to use, these surface reflectance models deal only with local light
transport happening right at the ray-surface contact points. Therefore, they are un-
able to express non-local light transport such as subsurface scattering (SSS) as com-
monly observed on human skin [Hanrahan and Krueger, 1993] or transmitting light
transport as observed in translucent materials. As such, researchers have developed
more general material-describing functions such as the Bidirectional Scattering-Surface Reflectance Distribution Function (BSSRDF) of Jensen et al. [2001]. The first half of Chapter 2
computes local radiance values with BRDFs only (i.e., no scattering or transmittance)
but then employs volume rendering to alpha composite the resultant radiance values
along a camera ray [Srinivasan et al., 2021]. On the other hand, the second half opts
for an entirely surface-based treatment: Radiance is computed locally with BRDFs
only, and that local radiance directly arrives at the camera, with no volume rendering
or attenuation along the path.
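The volumetric treatment just described (computing radiance locally with a BRDF, then alpha compositing along the camera ray) can be sketched as standard NeRF-style compositing; this is a generic sketch, not the implementation of Srinivasan et al. [2021].

```python
import numpy as np

def composite_along_ray(radiance, density, deltas):
    """NeRF-style alpha compositing of per-sample radiance along one ray.

    radiance: (S, 3) RGB radiance at S samples along the ray
    density:  (S,)   volume density (sigma) at each sample
    deltas:   (S,)   spacing between adjacent samples
    """
    alpha = 1.0 - np.exp(-density * deltas)                        # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance up to each sample
    weights = alpha * trans                                        # each sample's contribution
    return (weights[:, None] * radiance).sum(axis=0)

# A nearly opaque red sample in front of a green one: the red sample dominates.
radiance = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rgb = composite_along_ray(radiance, np.array([1e4, 1e4]), np.array([1.0, 1.0]))
```

The surface-based treatment in the second half of Chapter 2 skips this compositing entirely: the locally computed radiance reaches the camera directly.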
Besides what has been discussed, there are other important BXDF (“X” being
a wildcard for “R,” “S,” etc.) topics that this dissertation does not touch on. For
instance, while many BXDF models are designed to look realistic, others are carefully
crafted to be physically correct, with properties that a naturally existing material
would possess such as energy conservation and the Helmholtz reciprocity. Our learned
BRDF in Chapter 2 [Zhang et al., 2021c] falls into the former category, with no
guarantee to be physically accurate.
Another essential BXDF topic is importance sampling: the technique of sampling Monte Carlo paths according to the BXDF to enable efficient, low-variance rendering [Lawrence et al., 2004]. Incorporating such techniques into the BRDFs in Chapter 2 could be interesting but also challenging, because the BRDFs there are unknown and estimated jointly [Srinivasan et al., 2021, Zhang et al., 2021c].
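As a simple instance of BXDF importance sampling, the sketch below draws cosine-weighted hemisphere directions, the matching importance distribution for a Lambertian BRDF; the BXDF-specific samplers of Lawrence et al. [2004] are considerably more sophisticated.

```python
import numpy as np

def cosine_sample_hemisphere(rng, n_samples):
    """Draw unit directions on the local upper hemisphere with
    pdf(w) = cos(theta) / pi, which importance-samples a diffuse BRDF."""
    u1, u2 = rng.random(n_samples), rng.random(n_samples)
    r, phi = np.sqrt(u1), 2.0 * np.pi * u2
    x, y = r * np.cos(phi), r * np.sin(phi)   # disk sample, projected up
    z = np.sqrt(1.0 - u1)                     # z = cos(theta)
    return np.stack([x, y, z], axis=-1)

dirs = cosine_sample_hemisphere(np.random.default_rng(0), 10000)
# All samples lie on the unit upper hemisphere; E[cos(theta)] = 2/3 under this pdf.
```

Because the sampling density matches the cosine factor in the rendering equation, the Monte Carlo weights stay near-constant and the estimator's variance drops.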
1.1.3 Lighting
Shape and materials are intrinsic properties of an object. Extrinsic to the object are
lighting and the camera.
Broadly, lighting can be categorized into direct or indirect illumination. Di-
rect illumination is the light arriving at the object directly from the light source,
while indirect illumination is the light bounced to the object from another object
in the scene rather than from the light source, as illustrated in Figure 1-4. Taking indirect illumination into account as well is crucial to photorealism: Figure 1-4 shows a comparison between a scene rendered with direct illumination only vs. with global
illumination. It is clear that simulation of many light transport phenomena, such
as the green tint cast by the right green wall (pictured only indirectly via the mir-
ror ball) onto the other wall and the diffuse ball, requires modeling of the indirect
illumination. In Chapter 2, we study inverse rendering for both setups: considering
one-bounce indirect illumination [Srinivasan et al., 2021] and considering just direct
illumination [Zhang et al., 2021c].
terize” the 3D visibility into visibility maps associated with different views, or the
scene representation is volumetric as in Chapter 2 [Srinivasan et al., 2021]. Given a
light direction, the visibility map (as visualized in Figure 1-5) can be thought of as
a “shadow map,” informing us which pixels in this particular view are in shadow. If
we average these per-light visibility maps over all incoming light directions, we get
the ambient occlusion map (as visualized in Figure 1-5) that encodes how “exposed”
each point is to all light directions. We use these maps extensively in Chapter 2 for
visualization.
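The averaging that turns per-light visibility maps into an ambient occlusion map is a one-liner; the toy maps below are purely illustrative.

```python
import numpy as np

# Per-light visibility maps: (n_lights, h, w), with 1 = lit and 0 = in shadow.
visibility = np.stack([
    np.ones((2, 2)),    # a light that reaches every pixel in this view
    np.zeros((2, 2)),   # a light that is fully occluded
])

# Averaging over the light dimension yields the ambient occlusion map:
# how "exposed" each visible point is across all incoming light directions.
ambient_occlusion = visibility.mean(axis=0)   # 0.5 everywhere in this toy case
```

Each per-light slice is exactly the "shadow map" described above; the mean collapses them into a single exposure score per pixel.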
Many of the image features that we observe depend on the incoming light direc-
tion: When this direction varies, those image features such as shadows and specular
highlights change in the same fixed viewpoint. Consider the apple scene in Figure 1-1.
When the light bulb moves around, we will see the shadows and specular highlights
moving accordingly. Other light-dependent effects that are more subtle include
shadow softness and specularity spread: Still in the apple scene, if the light bulb
shrinks in size, approaching a point light, the cast shadows will become harder with
the penumbra gradually disappearing, and the specular highlights will become more
concentrated. Relighting is the problem of synthesizing such light-dependent effects
under novel lighting, addressed by Chapter 3 and Chapter 2 of this dissertation [Sun
et al., 2020, Zhang et al., 2021b, Srinivasan et al., 2021, Zhang et al., 2021c].
1.1.4 Cameras
Cameras record a 2D projection of the 3D world onto the image plane. The projection
is governed by camera extrinsics and intrinsics.
Camera extrinsics describes the rigid-body transformation from the world co-
ordinate system to the camera’s local coordinate system, usually in the form of a 3D
rotation matrix 𝑅 ∈ ℝ^{3×3} and a 3D translation vector 𝑡 ∈ ℝ^3. Camera extrinsics
can then be expressed as a 3 × 4 matrix 𝐸 = [𝑅 | 𝑡]. Therefore, given a 3D point
in homogeneous coordinates, 𝑥_w^homo, in the world space, 𝑥_c = 𝐸 𝑥_w^homo produces the
non-homogeneous coordinates of the same point in the camera's local frame.
Camera intrinsics, on the other hand, specifies how the 3D-to-2D projection is
performed in the camera's local space. In this dissertation (as in most computer
vision projects), we assume zero skew, square pixels, and an optical center at the
image center. These assumptions lead to the 3 × 3 intrinsics matrix
\[
K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix},
\]
where 𝑓 is the focal length in pixels¹, and (ℎ, 𝑤) is the image resolution. With both extrinsics
and intrinsics specified, the "one-stop" projection matrix mapping 𝑥_w^homo to its 2D
homogeneous coordinates in the image space is given by 𝑥_i^homo = 𝐾 𝐸 𝑥_w^homo. We refer the
reader to Szeliski [2010] for more on camera models and to Hartley and Zisserman
[2004] for in-depth mathematics on projective geometry.
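The projection pipeline above can be sketched concisely. This is a minimal illustration under the assumptions stated (zero skew, square pixels, centered optical center); the function name and toy numbers are not from the dissertation.

```python
import numpy as np

def project(x_world, R, t, f, h, w):
    """Project a 3D world-space point to 2D pixel coordinates with the
    pinhole model described above: x_i^homo = K E x_w^homo."""
    E = np.hstack([R, t.reshape(3, 1)])          # extrinsics, 3 x 4
    K = np.array([[f, 0, w / 2],                 # intrinsics: zero skew,
                  [0, f, h / 2],                 # square pixels, centered
                  [0, 0, 1]], dtype=np.float64)  # optical center
    x_homo = np.append(x_world, 1.0)             # homogeneous coordinates
    u, v, z = K @ E @ x_homo                     # (u*z, v*z, z)
    return np.array([u / z, v / z])              # perspective divide

# A point 2 m straight ahead of an identity-pose camera lands at the
# image center:
uv = project(np.array([0.0, 0.0, 2.0]), np.eye(3), np.zeros(3),
             f=500.0, h=480, w=640)  # -> [320., 240.]
```

Per the footnote, a focal length given in mm converts to pixels by comparing the image width in pixels with the sensor width in mm, e.g., f_px = f_mm × 640 / 36 for a 640-px-wide image on a 36 mm-wide sensor (numbers illustrative).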
We use these 3D-to-2D projections, their inversions (as simple as matrix inversion),
and their extensions heavily throughout this dissertation. Specifically, in Chapter 2,
we cast camera rays to the scene by inverting the aforementioned 3D-to-2D camera
projection [Srinivasan et al., 2021, Zhang et al., 2021c]. In Chapter 3, we resample
pixels from the camera space to the UV texture space and back [Zhang et al., 2021b].
Finally, we estimate the extrinsics and intrinsics parameters [Sun et al., 2018b] and
backproject 2.5D depth maps to the 3D space (and to “spherical maps”) [Zhang et al.,
2018b] in Chapter 4.
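The backprojection used in Chapter 4 inverts this mapping: given a pixel and its known depth, we recover the camera-space 3D point. A minimal sketch under the same intrinsics assumptions; the function name and numbers are illustrative, not from the dissertation.

```python
import numpy as np

def backproject(u, v, depth, f, h, w):
    """Recover the camera-space 3D point for pixel (u, v) whose depth
    (z along the optical axis) is known, by inverting the intrinsics."""
    K = np.array([[f, 0, w / 2],
                  [0, f, h / 2],
                  [0, 0, 1]], dtype=np.float64)
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

# The image-center pixel at depth 2 m backprojects onto the optical axis:
p = backproject(320.0, 240.0, 2.0, f=500.0, h=480, w=640)  # -> [0., 0., 2.]
```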
¹To convert a mm focal length to pixels, one needs to compare the image resolution (which is
in pixels) with the effective sensor size (which is in mm), then compute how many pixels 1 mm
translates to, and finally scale the mm focal length accordingly.

There is a surprising (and perhaps unintuitive) duality between cameras and
lights, as shown by the Dual Photography work of Sen et al. [2005], where they
successfully synthesize the scene appearance from the projector’s perspective and
also relight the scene as if the camera were the projector (light). Similar to the
light-dependent effects discussed above, view-dependent effects are the appearance
variations due to viewpoint changes. Unsurprisingly, specularity moves as you view
it from different viewpoints, e.g., by swaying your head left and right. Shadows,
however, are seldom view-dependent: Shadows do not move w.r.t. the rest of the
3D scene as the viewpoint varies. This is a distinction between cameras and lights
despite their similarities in other aspects. The task of view synthesis is about
synthesizing the view-dependent effects for a novel viewpoint, and we address this
task in Chapter 3 and Chapter 2 [Zhang et al., 2021b, Srinivasan et al., 2021, Zhang
et al., 2021c].
1.1.5 Rendering
We have defined the four essential scene aspects—shape, materials, lighting, and
cameras—and introduced their commonly used representations. The final missing
piece of the puzzle is rendering, the process of “combining” the four elements into an
RGB image.
To figure out the appearance for a 3D point 𝑥, one solves the rendering equa-
tion [Kajiya, 1986, Immel et al., 1986] often using Monte Carlo methods. In this
dissertation where no object emits light, we simplify the full equation to:
\[
L_{\mathrm{o}}(x, \omega_{\mathrm{o}}) = \sum_{\omega_{\mathrm{i}}} R(x, \omega_{\mathrm{i}}, \omega_{\mathrm{o}})\, L_{\mathrm{i}}(x, \omega_{\mathrm{i}})\, (\omega_{\mathrm{i}} \cdot n)\, \Delta\omega_{\mathrm{i}}, \tag{1.1}
\]
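A minimal sketch of evaluating this discrete sum at one surface point, assuming a known BRDF 𝑅 and known incident radiance samples 𝐿i. The Lambertian BRDF, the function names, and the toy directions here are assumptions for illustration, not code from the dissertation.

```python
import numpy as np

def render_point(brdf, L_i, dirs_i, omega_o, n, d_omega):
    """Evaluate Equation 1.1 as a discrete sum over incoming directions,
    clamping back-facing directions to zero contribution."""
    total = 0.0
    for w_i, radiance in zip(dirs_i, L_i):
        cos = max(float(np.dot(w_i, n)), 0.0)  # the (w_i . n) term
        total += brdf(w_i, omega_o) * radiance * cos * d_omega
    return total

# Toy example: a Lambertian BRDF (constant albedo / pi), lit by two
# unit-radiance directional samples; w_o is unused for a diffuse surface.
albedo = 0.8
lambertian = lambda w_i, w_o: albedo / np.pi
normal = np.array([0.0, 0.0, 1.0])
dirs = [np.array([0.0, 0.0, 1.0]),                    # head-on
        np.array([0.0, np.sqrt(0.5), np.sqrt(0.5)])]  # 45 degrees off
L_o = render_point(lambertian, [1.0, 1.0], dirs, None, normal, d_omega=1.0)
```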
In the first half of Chapter 2, there is such recursion: 𝐿i is the sum of a light probe
pixel and the one-bounce indirect illumination from a nearby point [Srinivasan et al.,
2021], whereas in the second half, 𝐿i directly takes values from the light probe pixels
since we consider only direct illumination [Zhang et al., 2021c].
Although the rendering equation is expressive and general, one may not be able
to, or may not need to, fully decompose Equation 1.1 into every term. For instance, it is
error-prone, if possible at all, to explicitly find 𝑅 from samples of 𝐿o in the setup
of Chapter 3. Moreover, it is unnecessary to solve for every term in Equation 1.1
just for relighting and view synthesis in that setup since we do not plan to edit the
materials 𝑅. In such cases, a middle level of abstraction, such as the light transport
function, proves useful. Formally, we reparameterize Equation 1.1 at a higher level
of abstraction:
\[
L_{\mathrm{o}}(x, \omega_{\mathrm{o}}) = \sum_{\omega_{\mathrm{i}}} T(x, \omega_{\mathrm{i}}, \omega_{\mathrm{o}})\, L'_{\mathrm{i}}(\omega_{\mathrm{i}})\, \Delta\omega_{\mathrm{i}}, \tag{1.2}
\]
where 𝑇(𝑥, 𝜔i, 𝜔o) is the light transport function that subsumes the BRDF, the cosine
term, light visibility, and the recursive nature of 𝐿i, and 𝐿′i(𝜔i) is the light intensity
from 𝜔i . Crucially, unlike 𝐿i , 𝐿′i (𝜔i ) bears no dependency on 𝑥, thereby eliminating
the recursive nature of Equation 1.1. Intuitively, 𝑇 directly returns the “net radi-
ance” at 𝑥 when lit from 𝜔i and viewed from 𝜔o , concealing the actual recursion of
intermediate light bounces.
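Once lighting is discretized into a set of directions, Equation 1.2 reduces to a matrix-vector product: stacking 𝑇 over all pixels of a fixed view gives a transport matrix, and relighting is one multiplication. A sketch with a random stand-in matrix (the real 𝑇 in Chapter 3 is interpolated from sparse samples); all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_lights = 6, 4

# Transport matrix: entry (p, i) holds T(x_p, w_i, w_o) * dw_i for pixel
# x_p of one fixed view w_o; random here purely for illustration.
T = rng.uniform(size=(n_pixels, n_lights))

# Equation 1.2 then becomes a matrix-vector product: no recursion and
# no explicit simulation of BRDF, visibility, or light bounces.
light_a = np.array([1.0, 0.0, 0.0, 0.0])   # only light 0 on
light_b = np.array([0.0, 1.0, 0.0, 0.0])   # only light 1 on
img_a = T @ light_a                        # equals column 0 of T
img_ab = T @ (light_a + light_b)           # light transport is linear
```

The linearity shown in the last line is what makes light-stage data (one light on at a time) sufficient for relighting under arbitrary lighting.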
1.2 Inverting the Image Formation Process
Although the image formation process is so well understood that we can render im-
ages indistinguishable from real photographs, inverting this process—recovering from
images the scene properties that we discussed—is still highly challenging because the
forward process incurs a huge loss of information: 3D shape gets projected onto the 2D
image plane; reflectance and lighting get convolved together and only then observed. In
other words, inverting the image formation process is ill-posed: There are multiple
sets of scene elements that could have caused the images that we observe.
The four subproblems also represent three different levels of abstraction for the
inverse rendering problem. At a low level of abstraction, Chapter 2 attempts to solve
for every term in our (simplified) rendering equation (Equation 1.1) by re-rendering
all the estimated elements back to RGB images, which then get compared against the
observed images for loss computation. Though challenging, this low-level approach
allows us to export the estimated shape and edit the estimated reflectance, in ad-
dition to what a mid-level abstraction would support. Ascending to a middle level
of abstraction, Chapter 3 explores interpolating the light transport function (as in
Equation 1.2) given sparse samples thereof. Our models based on this mid-level ab-
straction enable relighting, view synthesis, and both tasks simultaneously while easily
including global illumination effects. Finally, at a high level of abstraction, Chapter 4
and Chapter 5 recover shape or lighting from single images, without modeling the
other scene elements or the rendering process. Relying on large datasets of shapes
or lighting patterns, these two chapters train deep learning models that directly map
the appearance observations to the underlying shape or lighting.
In this subsection, we introduce the problem that Chapter 2 attempts to solve: es-
timating shape, reflectance, and illumination from the object’s appearance. This
amounts to explicitly solving for every term in Equation 1.1 and then re-rendering
these estimated factors into RGB images in a physically-based manner. As such, this
low level of abstraction supports operations on the estimated factors, such as lighting
editing (i.e., relighting), reflectance editing (i.e., material change), and shape export
(e.g. into a graphics engine).
Note that the well-known problem of Intrinsic Image Decomposition (IID) [Barrow
and Tenenbaum, 1978] solves only part of this factorization problem. In terms of
shape, the IID methods recover depth or surface normal maps only for the input
view, rather than a full 3D shape [Weiss, 2001, Tappen et al., 2003, Bell et al.,
2014, Barron and Malik, 2014, Janner et al., 2017]. This makes view synthesis with
these approaches impossible. Material-wise, these IID methods mostly assume
Lambertian reflectance and tend to fail on more complicated materials. Finally,
lighting recovered by the IID approaches is also in the space of the input view (e.g., a
“lighting image”), making relighting with arbitrary lighting difficult. The appearance
factorization approaches that we propose in Chapter 2 address all of these issues that
the IID methods suffer from.
In Chapter 2, we study full appearance decomposition under two setups. In the
first setup, we assume that we observe the object under multiple arbitrary but known
lighting conditions [Srinivasan et al., 2021]. Note that “arbitrary” means that the
lighting does not have to be of a certain form such as one point light in the dark.
We also model first-bounce indirect illumination in this setup. In the second setup,
we relax the requirement for input lighting: We observe the object under only one
unknown lighting condition [Zhang et al., 2021c]. This relaxation allows us to apply
our method to a user capture under a natural, unknown lighting condition, such as
a capture of a car on the street.
relighting (i.e., not view synthesis), this approach has the advantages of being purely
image-based and requiring no 3D modeling. With the additional input of geometry
proxy, we continue exploring the interpolation of 𝑇 in both light and view directions,
thereby enabling simultaneous relighting and view synthesis [Zhang et al., 2021b].
the reconstruction network tends to produce blurry “mean shapes” that satisfy the
ℓ2 loss but do not look realistic [Wu et al., 2018]. The second problem addressed by
Chapter 4 is the generalizability of these reconstruction networks: They work well
only on the shape categories seen during training but generalize poorly to novel shape
classes, still “retrieving” shapes from the training classes [Zhang et al., 2018b].
Operating at a high level of abstraction, all solutions proposed in Chapter 4 treat
rendering as a black box and invert it directly with deep learning models that learn
data-driven priors. These models based on the high level of abstraction rely on data
rather than physics and have the advantage of being applicable to single images (cf.
multiple images as required by the mid- and low-level abstractions).
remains future work.
As alluded to previously, Chapter 5 continues to stay at the high level of abstrac-
tion. Specifically, we train a conditional generative model to directly “regress” the
Earth image from the Moon observation and the timestamp. Our data-driven solu-
tion circumvents the need to model the image formation process for this extreme case
and enables lighting recovery from single images.
The overarching theme of this dissertation is recovering shape, reflectance, and illumi-
nation from appearance. We study four instances of inverse rendering: I) “shape,
reflectance, and illumination from appearance” in Chapter 2, II) “light transport func-
tion from appearance” in Chapter 3, III) “shape from appearance” in Chapter 4, and
IV) “lighting from appearance” in Chapter 5.
These four subtopics represent three levels of abstraction to tackle inverse
rendering: I) the low level of abstraction where we explicitly solve for every term—
shape, reflectance, and illumination—in the rendering equation (Equation 1.1) in a
physically-based manner, achieving full editability and exportability that a mid- or
high-level solution is incapable of, II) the middle level where we utilize the light
transport function (𝑇 in Equation 1.2) to abstract away intermediate light transport
and focus on just the final “net effect,” delivering high-quality relighting results with
global illumination effects for challenging reflectance (such as that of human skin),
and III) the high level where we treat rendering as a black box and invert it with
data-driven priors, supporting single-image input at test time.
In Chapter 1 (this chapter), we have introduced the image formation process by
explaining the four main scene elements (i.e., shape, materials, lighting, and cameras),
their representations in computer vision and graphics, and the rendering process that
“combines” these elements into images that we see. We then defined the problem of in-
verting the image formation process, where we aim to recover the scene elements from
image observations passively. Specifically, we have provided the problem statements
for the aforementioned four instances of inverse rendering.
At the middle level of abstraction, Chapter 3 utilizes the light transport func-
tion, 𝑇 in Equation 1.2, to abstract away intermediate light bounces and model
directly the “net effect” radiance. Specifically, we attempt to interpolate 𝑇 from the
sparse samples thereof. This is the right abstraction level for our problem, at which we
can perform high-quality relighting and view synthesis including global illumination
effects, without having to explicitly solve for geometry and reflectance or simulate all
light bounces. We first interpolate the light transport function in just the light direc-
tion, achieving precise, high-frequency portrait relighting with a model that we call
Light Stage Super-Resolution (LSSR) [Sun et al., 2020]. With the additional input of
a geometry proxy, we then develop Neural Light Transport (NLT) that interpolates
𝑇 in both the light and view directions, enabling simultaneous relighting and view
synthesis of humans with complex geometry and reflectance [Zhang et al., 2021b].
high-quality reconstruction with an adversarially learned perceptual loss [Wu et al.,
2018]. Tackling the generalization problem of ShapeHD and similar learning mod-
els, we then propose Generalizable Reconstruction (GenRe) capable of generalizing
to novel shape categories unseen during training [Zhang et al., 2018b]. Finally, we
briefly discuss how Pix3D—our own real-world dataset of image-shape pairs with
pixel-level alignment—is constructed [Sun et al., 2018b] and facilitates the evaluation
of ShapeHD and GenRe.
Staying at the high level of abstraction, Chapter 5 presents the current progress
of our work on data-driven lighting recovery from appearance, where we train a con-
ditional generative model of possible lighting patterns given various appearances of
the object illuminated. We frame this problem in a special Moon-Earth setup where
the Earth, as the light source, illuminates the dark side of the Moon. Our model,
Generative Adversarial Networks for the Earth (EarthGAN), aims to recover the
Earth appearance given a single-pixel Moon appearance and the corresponding times-
tamp. This is the proper level of abstraction that circumvents the need to model this
extreme image formation process and makes EarthGAN applicable to a “backyard
image” taken by a mobile phone camera.
Finally, Chapter 6 concludes the dissertation and discusses future directions.
Shape and Reflectance Under an Unknown Illumination. ACM Transactions
on Graphics (TOG), TBA, 2021.
• Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhe-
mann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, and Ravi Ramamoor-
thi. Light Stage Super-Resolution: Continuous High-Frequency Relighting.
ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.
• Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Ro-
hit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul
Debevec, Jonathan T. Barron, Ravi Ramamoorthi, and William T. Freeman.
Neural Light Transport for Relighting and View Synthesis. ACM Transactions
on Graphics (TOG), 40(1):1–17, 2021.
• Xingyuan Sun*, Jiajun Wu*, Xiuming Zhang, Zhoutong Zhang, Chengkai Zh-
ang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3D:
Dataset and Methods for Single-Image 3D Shape Modeling. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Chapter 2
Low-Level Abstraction:
Physically-Based Appearance
Factorization
factorizes the object appearance into shape, SVBRDFs, and direct illumination from
multi-view images of an object lit by just one unknown lighting condition (Section 2.4)
[Zhang et al., 2021c].
In Section 2.5, we describe our experiments that evaluate how well NeRV and
NeRFactor perform appearance decomposition (and subsequently free-viewpoint re-
lighting), and how they compare with the existing solutions to our tasks, under two
setups: multiple arbitrary but known lighting conditions (for NeRV) and one un-
known lighting condition (for NeRFactor). We also perform additional analyses, in
Section 2.6, to study the importance of each major component of the NeRV and NeR-
Factor models and analyze whether NeRFactor predicts albedo consistently for the
same object when lit by different lighting conditions.
2.1 Introduction
Recovering an object’s geometry and material properties from captured images, such
that it can be rendered from arbitrary viewpoints under novel lighting conditions,
is a longstanding problem within computer vision and graphics. In addition to its
importance for recognition and robotics, a solution to this could democratize 3D
content creation and allow anyone to use real-world objects in Extended Reality (XR)
applications, film-making, and game development. The difficulty of this problem
stems from its fundamentally underconstrained nature, and prior work has typically
addressed this either by using additional observations such as scanned geometry or
images of the object under controlled laboratory lighting conditions, or by making
restrictive assumptions such as assuming a single material for the entire object or
ignoring self-shadowing.
The vision and graphics communities have recently made substantial progress
towards the novel view synthesis portion of this goal. Neural Radiance Fields (NeRF)
has shown that it is possible to synthesize photorealistic images of scenes by training
a simple neural network to map 3D locations in the scene to a continuous field of
volume density and color [Mildenhall et al., 2020]. Volume rendering is trivially
differentiable, so the parameters of a NeRF can be optimized for a single scene by
using gradient descent to minimize the difference between renderings of the NeRF
and a set of observed images. Although NeRF produces compelling results for view
synthesis, it does not provide a solution for relighting. This is because NeRF models
just the amount of outgoing light from a location – the fact that this outgoing light
is the result of interactions between incoming light and the material properties of an
underlying surface is ignored.
At first glance, extending NeRF to enable relighting appears to require only chang-
ing the image formation model: Instead of modeling scenes as fields of density and
view-dependent color, we can model surface normals and material properties (e.g.,
the parameters of a Bidirectional Reflectance Distribution Function [BRDF]), and
simulate the transport of the scene’s light sources (which we first assume are known)
according to the rules of physically-based rendering [Pharr et al., 2016]. However,
simulating the attenuation and reflection of light by particles is fundamentally chal-
lenging in NeRF’s neural volumetric representation because content can exist any-
where within the scene, and determining the density at any location requires querying
a neural network.
Consider the naïve procedure for computing the radiance along a single camera
ray due to direct illumination, as illustrated in Figure 2-1: First, we query NeRF’s
Multi-Layer Perceptron (MLP) for the volume density at samples along the cam-
era ray to determine the amount of light reflected by particles at each location that
reaches the camera. For each location along the camera ray, we then query the MLP
for the volume density at densely sampled points between the location and every light
source to estimate the attenuation of light before it reaches that location. This proce-
dure quickly becomes prohibitively expensive if we want to model environment light
sources or global illumination, in which case scene points may be illuminated from
all directions. Prior methods for estimating relightable volumetric representations
from images have not overcome this challenge and can only simulate direct illumina-
tion from a single point light source when training. This is what we refer to as the
“computational complexity problem” of extending NeRF for relighting.
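The cost gap can be made concrete with back-of-the-envelope arithmetic: the naïve procedure needs a dense march of the density MLP along a shadow ray per light source and per camera-ray sample, while NeRV replaces each shadow-ray march with a single visibility-MLP lookup. The specific sample counts below are illustrative assumptions.

```python
# Back-of-the-envelope count of MLP queries needed to shade one camera
# ray under direct illumination (all numbers are illustrative).
n = 128         # density samples along the camera ray
n_lights = 256  # light sources (e.g., environment-map pixels)
m = 128         # density samples along each shadow ray

naive = n * n_lights * m  # densely march one shadow ray per light
nerv = n * n_lights       # one visibility-MLP lookup per light instead

print(naive, nerv, naive // nerv)  # 4194304 32768 128
```

Even for this modest setting, the naïve scheme needs over four million density queries per camera ray, which is why prior methods were limited to a single point light.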
Figure 2-1: How NeRV reduces the computational complexity (panels: "Naïve" vs.
"Ours", for direct and one-bounce indirect illumination). Brute-force light transport
simulation through NeRF's volumetric representation with naïve raymarching (left)
is intractable. By approximating visibility with a neural visibility field (right) that is
optimized alongside the shape MLP, we are able to make optimization with complex
illumination tractable. 𝑛 is the number of samples along each ray, ℓ is the number
of light sources, and 𝑑 is the number of indirect illumination directions sampled.
Black dots represent evaluating a shape MLP for volume density at a position, red
arrows represent evaluating the visibility MLP at a position along a direction, and
the blue arrow represents evaluating the visibility MLP for the expected termination
depth of a ray. Output visibility multipliers and termination depths from the
visibility MLP are displayed as text.
In the first half of this chapter, we present Neural Reflectance and Visibility
Fields (NeRV), an approach for estimating a volumetric 3D representation from
images of a scene under multiple arbitrary but known lighting conditions [Srinivasan
et al., 2021], such that novel images can be rendered from arbitrary unseen viewpoints
and under novel unobserved lighting conditions, as shown in Figure 2-2.
[Figure 2-2(a): Input images of the scene under unconstrained varying (known) lighting conditions.]
NeRV can simulate realistic environment lighting and global illumination. Our
key insight is to train an MLP to act as a lookup table into a visibility field during
rendering. Instead of estimating light or surface visibility at a given 3D position
along a given direction by densely evaluating an MLP for the volume density along
the corresponding ray (which would be prohibitively expensive), we simply query
this visibility MLP to estimate visibility and expected termination depth in any di-
rection (see Figure 2-1). This visibility MLP is optimized alongside the MLP that
represents volume density and supervised to be consistent with the volume density
samples observed during optimization. Using this neural approximation of the true
visibility field significantly eases the computational burden of estimating volume ren-
dering integrals while training. NeRV enables the recovery of a NeRF-like model
that supports relighting in addition to view synthesis. While previous solutions for
relightable NeRFs are limited to controlled settings that require the input images be
illuminated by a single point light [Bi et al., 2020a], NeRV supports training with
arbitrary environment lighting and one-bounce indirect illumination.
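The quantity the visibility MLP learns to predict is the transmittance obtained by densely marching the density field along a ray. The sketch below computes that brute-force supervision target for a stand-in analytic density field; in NeRV, both the density and the visibility lookup are MLPs, and the visibility MLP is supervised to match this marched value. All function names and constants here are illustrative assumptions.

```python
import numpy as np

def density(x):
    """Stand-in for NeRF's density MLP: a soft sphere of radius 1."""
    return 10.0 * float(np.linalg.norm(x) < 1.0)

def marched_visibility(x, d, t_max=4.0, n_samples=64):
    """Brute-force transmittance exp(-sum sigma * dt) along ray x + t*d.
    This expensive quantity is what the visibility MLP approximates
    with a single query at render time."""
    dt = t_max / n_samples
    ts = (np.arange(n_samples) + 0.5) * dt
    sigmas = np.array([density(x + t * d) for t in ts])
    return np.exp(-np.sum(sigmas * dt))

# A ray toward the sphere is fully occluded; a ray away from it is clear.
x = np.array([0.0, 0.0, -2.0])
v_blocked = marched_visibility(x, np.array([0.0, 0.0, 1.0]))   # ~0
v_clear = marched_visibility(x, np.array([0.0, 0.0, -1.0]))    # 1.0
```

During optimization, a visibility network v(x, d) would be trained so that v(x, d) ≈ marched_visibility(x, d); at render time, one cheap query to v replaces the n_samples density evaluations above.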
In the second half of this chapter, we continue to investigate whether we can
achieve what NeRV accomplishes but with just one unknown illumination, a setup
often encountered when the user wants to capture daily in-the-wild objects. To this
end, we develop Neural Factorization of Shape and Reflectance (NeRFactor),
a model capable of recovering convincing relightable representations from images of
an object captured under one unknown natural illumination condition [Zhang et al.,
2021c], as shown in Figure 2-3.
Figure 2-3: NeRFactor overview. Given a set of posed images of an object captured
from multiple views under just one unknown illumination condition (left), NeRFactor
is able to factorize the scene into 3D neural fields of surface normals, light visibil-
ity, albedo, and material (center), which enables applications such as free-viewpoint
relighting and material editing (right).
Our key insight is that we can first optimize a NeRF [Mildenhall et al., 2020] from
the input images to initialize our model’s surface normals and light visibility, and then
jointly optimize these initial estimates along with the spatially-varying reflectance and
the lighting condition, such that these estimates, when re-rendered, match the ob-
served images. The use of NeRF to produce a high-quality geometry initialization
helps break the inherent ambiguities among shape, reflectance, and lighting, thereby
allowing us to recover a full 3D model for convincing view synthesis and relight-
ing using just a re-rendering loss, simple spatial smoothness priors for each of these
components, and a novel data-driven BRDF prior. Because NeRFactor models light
visibility explicitly and efficiently, it is capable of removing shadows from albedo esti-
mation and synthesizing realistic soft or hard shadows under arbitrary novel lighting
conditions.
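The shape of the objective described above (a re-rendering loss plus a spatial smoothness prior per factor) can be sketched schematically. This is a toy illustration of the loss structure only, not NeRFactor's implementation; the function names, the jitter mechanism, and the weights are assumptions.

```python
import numpy as np

def smoothness_prior(values, jittered_values):
    """Penalize a factor for differing between a surface point and a
    nearby jittered point (a simple spatial smoothness prior)."""
    return np.mean((values - jittered_values) ** 2)

def total_loss(rendered, observed, components, jittered, weights):
    """Schematic NeRFactor-style objective: re-rendering loss plus one
    smoothness term per factor (normals, visibility, albedo, BRDF)."""
    loss = np.mean((rendered - observed) ** 2)
    for key, w in weights.items():
        loss += w * smoothness_prior(components[key], jittered[key])
    return loss

# Toy usage: perfect smoothness, unit re-rendering error.
obs = np.zeros(4)
rend = np.ones(4)
comps = {"albedo": np.full(3, 0.5)}
jit = {"albedo": np.full(3, 0.5)}
loss = total_loss(rend, obs, comps, jit, {"albedo": 0.1})  # -> 1.0
```

In the actual model, the data-driven BRDF prior additionally constrains the reflectance factor toward real measured BRDFs.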
(with normals and visibility) to use as an initialization when improving the
geometry and recovering reflectance, and
• novel data-driven BRDF priors based on training a latent code model on real
measured BRDFs.
Input & Output The input to NeRV is a set of multi-view images of an object
illuminated under multiple arbitrary but known lighting conditions, while NeRFactor
requires only one unknown lighting condition. Both methods require the camera poses
of these images, which can be obtained with an off-the-shelf Structure From Motion
(SFM) package, such as COLMAP [Schönberger and Frahm, 2016]. Both methods
jointly estimate a plausible collection of surface normals, light visibility, albedo, and
spatially-varying BRDFs, which together explain the observed views. NeRFactor
additionally estimates the environment lighting. We then use the recovered geome-
try and reflectance to synthesize images of the object from novel viewpoints under
arbitrary lighting. Modeling visibility explicitly, both methods are able to remove
shadows from albedo and synthesize soft or hard shadows under arbitrary lighting.
2.2 Related Work
NeRV and NeRFactor both tackle the problem of inverse rendering, whose literature is
reviewed in Section 2.2.1. We also review the coordinate-based neural object or scene
representation, which is fundamental to both works, in Section 2.2.2. Section 2.2.3
surveys precomputation in computer graphics, which motivates the fast “visibility
lookup” in both NeRV and NeRFactor. Finally, because our models can be applied
to perform object capture for downstream graphics applications, we also review prior
art on material capture in Section 2.2.4.
2.2.1 Inverse Rendering

Intrinsic image decomposition aims to attribute what aspects of an image are due to
material, lighting, or geometric variation [Horn, 1970, Land and McCann, 1971, Horn,
1974, Barrow and Tenenbaum, 1978]. The more general problem that additionally
involves non-Lambertian reflectance, global illumination, etc. is often referred to as
inverse rendering [Sato et al., 1997, Marschner, 1998, Yu et al., 1999, Ramamoorthi
and Hanrahan, 2001]. In other words, the goal of inverse rendering is to factorize the
appearance of an object in observed images into the underlying geometry, material
properties, and lighting conditions. It is a longstanding problem in computer vision
and graphics, the difficulty of which (a consequence of its underconstrained nature) is
typically addressed using one of the following strategies: I) learning priors on shape,
illumination, and reflectance, II) assuming known geometry, or III) using multiple
input images of the scene under one or multiple lighting conditions.
Most recent single-image inverse rendering methods [Barron and Malik, 2014, Li
et al., 2018, Yu and Smith, 2019, Sengupta et al., 2019, Li et al., 2020c, Wei et al.,
2020, Sang and Chandraker, 2020] belong to the first category and use large datasets
of images with labeled geometry and materials to train machine learning models to
predict these properties. Most prior works in inverse rendering that recover full 3D
models for graphics applications [Weinmann and Klein, 2015] fall under the second
category and use 3D geometry obtained from active scanning [Park et al., 2020,
Schmitt et al., 2020, Zhang et al., 2021b], proxy models [Dong et al., 2014, Chen
et al., 2020, Gao et al., 2020], silhouette masks [Oxholm and Nishino, 2014, Xia et al.,
2016], or multi-view stereo [Nam et al., 2018] as a starting point before recovering
reflectance and refining geometry.
Both NeRV and NeRFactor belong to the third category: We only require as input
posed images of an object under one unknown or multiple known lighting conditions.
The most relevant prior works are Deep Reflectance Volumes (DRV) that estimates
voxel geometry and BRDF parameters [Bi et al., 2020b], and the follow-up work
Neural Reflectance Fields that replaces DRV’s voxel grid with a continuous volume
represented by a Multi-Layer Perceptron (MLP) [Bi et al., 2020a]. Neural Reflectance
Fields requires that scenes be illuminated by only a single point light at a time (due
to its brute-force visibility computation strategy, visualized in Figure 2-1) and models
only direct illumination; NeRV extends it to work with arbitrary lighting and
global illumination.
2.2.2 Coordinate-Based Neural Representations

We build upon a recent trend within the computer vision and graphics communities
that replaces traditional shape representations such as polygon meshes or discretized
voxel grids with MLPs that represent geometry as parametric functions. These MLPs
are optimized to approximate continuous 3D geometry by mapping 3D coordinates
to properties of an object or scene (such as volume density, occupancy, or signed dis-
tance) at that location. This strategy has been explored for the tasks of representing
shape [Genova et al., 2019, Mescheder et al., 2019, Park et al., 2019a, Deng et al.,
2020, Sitzmann et al., 2020, Tancik et al., 2020] and scenes under fixed lighting for
view synthesis [Niemeyer et al., 2019, Sitzmann et al., 2019b, Mildenhall et al., 2020,
Liu et al., 2020, Yariv et al., 2020].
As one such coordinate-based representation, Neural Radiance Fields (NeRF) has
been particularly successful for optimizing volumetric geometry and appearance from
observed images for the purpose of rendering photorealistic novel views [Mildenhall
et al., 2020]. It can be thought of as a modern neural reformulation of the clas-
sic problem of scene reconstruction: given multiple images of a scene, inferring the
underlying geometry and appearance that best explain those images. While classic
approaches have largely relied on discrete representations such as textured meshes
[Hartley and Zisserman, 2004, Snavely et al., 2006] and voxel grids [Seitz and Dyer,
1999], NeRF has demonstrated that a continuous volumetric function, parameterized
as an MLP, is able to represent complex scenes and render photorealistic novel views.
NeRF works well for view synthesis, but it does not enable relighting because it has
no mechanism to disentangle the outgoing radiance of a surface into an incoming
radiance and an underlying surface material.
One technique that has been used for extending NeRF to support relighting is
conditioning the MLP’s output appearance on a latent code that encodes a per-image
lighting, as in NeRF in the Wild [Martin-Brualla et al., 2021] (and previously with
discretized scene representations [Meshry et al., 2019, Li et al., 2020b]). Although
this strategy can effectively explain the appearance variation of training images, it
cannot be used to render the same scene under new lighting conditions not observed
during training (Figure 2-13), since it does not utilize the physics of light transport.
NeRV uses the same volumetric shape representation as NeRF. On the other hand,
NeRFactor continues with the coordinate-based neural representation, but shows that
starting with the NeRF volume and then optimizing a surface representation enables
us to recover a fully-factorized and high-quality 3D model using just images captured
under one unknown illumination. Crucially, using a neural volumetric representation
to estimate the initial geometry enables us to recover factored models for objects that
have proven to be challenging for traditional geometry estimation methods.
A large body of work within the computer graphics community has focused on the
specific subproblem of material acquisition, where the goal is to estimate BRDF
properties from images of materials with known (typically planar) geometry. These
methods have traditionally utilized a signal processing reconstruction strategy, and
used complex controlled camera and lighting setups to adequately sample the BRDF
[Foo, 2015, Matusik et al., 2003, Nielsen et al., 2015]. More recent methods have en-
abled material acquisition from more casual smartphone setups [Aittala et al., 2015,
Hui et al., 2017]. However, this line of work generally requires the geometry be simple
and fully known, while we focus on a more general problem where our only observa-
tions are images of an object with complex shape and spatially-varying reflectance
(plus the environment lighting for NeRV).
2.3 Method: Multiple Known Illuminations
We extend Neural Radiance Fields (NeRF) [Mildenhall et al., 2020] to include the
simulation of light transport, which allows NeRFs to be rendered under arbitrary
novel illumination conditions. Instead of modeling a scene as a continuous 3D field of
particles that absorb and emit light as in NeRF, we represent a scene as a 3D field of
oriented particles that absorb and reflect the light emitted by external light sources
(Section 2.3.2). Naïvely simulating light transport through this model is inefficient
and unable to scale to simulate realistic lighting conditions or global illumination. We
remedy this by introducing a neural visibility field representation (optimized alongside
NeRF’s volumetric representation) that allows us to efficiently query the point-to-light
and point-to-point visibilities needed to simulate light transport (Section 2.3.3). The
resulting Neural Reflectance and Visibility Fields (NeRV) [Srinivasan et al., 2021] are
visualized in Figure 2-4.
Figure 2-4: Visualization of NeRV outputs. (a) Our rendered image (novel view and lighting); (b) light visibility; (c) incident direct illumination; (d) incident indirect illumination; (e) BRDF; (f) normals; (g) albedo; (h) roughness; (i) shadow map; (j) direct; (k) indirect.
2.3.1 Neural Radiance Fields (NeRF)
A NeRF representation does not separate the effect of incident light from the material
properties of surfaces. This means that NeRF is only able to render views of a
scene under the fixed lighting condition presented in the input images – a NeRF
cannot be relit. Modifying NeRF to enable relighting is straightforward, as initially
demonstrated by the Neural Reflectance Fields work of Bi et al. [2020a]. Instead
of representing a scene as a field of particles that emit light, it is represented as a
field of particles that reflect incoming light. With this, given an arbitrary lighting
condition, we can simulate the transport of light through the volume as it is reflected
by particles until it reaches the camera with a standard volume rendering integral
[Kajiya and Herzen, 1984]:
\[
L(\mathbf{c}, \omega_o) = \int_0^\infty V\big(\mathbf{x}(t), \mathbf{c}\big)\, \sigma\big(\mathbf{x}(t)\big)\, L_r\big(\mathbf{x}(t), \omega_o\big)\, dt , \tag{2.3}
\]
\[
L_r(\mathbf{x}, \omega_o) = \int_{\mathcal{S}} L_i(\mathbf{x}, \omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i , \tag{2.4}
\]
where the view-dependent emission term 𝐿𝑒 (x, 𝜔𝑜 ) in Equation 2.1 is replaced with
an integral, over the sphere 𝒮 of incoming directions, of the product of the incoming
radiance 𝐿𝑖 from any direction and a reflectance function 𝑅 (often called a phase
function in volume rendering) that describes how much light arriving from direction
𝜔𝑖 is reflected towards direction 𝜔𝑜 .
We follow Bi et al. [2020a] and use the standard microfacet Bidirectional Re-
flectance Distribution Function (BRDF) described by Walter et al. [2007] as the re-
flectance function, so 𝑅 at any 3D location is parameterized by a diffuse RGB albedo,
a scalar specular roughness, and a surface normal. We replace NeRF’s radiance MLP
with two MLPs: a “shape MLP” that outputs volume density 𝜎 and a “reflectance
MLP” that outputs BRDF parameters (3D diffuse albedo a and scalar roughness 𝛾)
for any input 3D point: MLP𝜃 : x → 𝜎, MLP𝜓 : x → (a, 𝛾).
Instead of parameterizing the 3D surface normal n as a normalized output of
the shape MLP, as in Bi et al. [2020a], we compute n as the negative normalized
gradient vector of the shape MLP’s output 𝜎 w.r.t. x, computed using automatic
differentiation. We further discuss this choice in Section 2.6.3.
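As a sketch of this normal computation, consider a hypothetical analytic density field standing in for the shape MLP; central finite differences stand in here for the automatic differentiation used in the actual implementation, and all names are ours:

```python
import numpy as np

def density(x):
    # Hypothetical stand-in for the shape MLP: density falls off with
    # distance from the origin, so the "surface" wraps the origin.
    return np.exp(-np.linalg.norm(x))

def normal_from_density(x, eps=1e-4):
    # n = -grad(sigma) / ||grad(sigma)||; the thesis computes the gradient
    # with automatic differentiation, approximated here by central
    # finite differences.
    grad = np.zeros(3)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        grad[i] = (density(x + d) - density(x - d)) / (2.0 * eps)
    return -grad / np.linalg.norm(grad)

# Density decays away from the origin, so the negative gradient (the
# normal) points outward, away from the dense interior.
n = normal_from_density(np.array([1.0, 0.0, 0.0]))
```

For this decaying density, the recovered normal at (1, 0, 0) points along +x, away from the dense interior, as expected for an outward surface normal.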
Figure 2-5: The geometry of an indirect illumination path in NeRV. The light ray departs its source, hits 𝑥′ first, gets reflected to 𝑥, and eventually reaches the camera.
ficult. Even if we only consider direct illumination from light sources to a scene
point, a brute-force solution is already challenging for more than a single point light
source as it requires repeatedly querying the shape MLP for volume density along
paths from each scene point to each light source. Moreover, general scenes can be
illuminated by light arriving from all directions, and addressing this is imperative
to recovering relightable representations in unconstrained scenarios. Simulating even
simple global illumination in a brute-force manner is intractable: Rendering a single
ray in our scenes under one-bounce indirect illumination with brute-force sampling
would require a petaflop of computation, and we need to render roughly a billion rays
over the course of training.
In Section 2.3.5 we place losses on the visibility MLP outputs $(\tilde{V}_\varphi, \tilde{D}_\varphi)$ to encourage them to resemble the $(V, D)$ corresponding to the current state of the shape MLP.
Below, we provide a detailed walkthrough of how our Neural Visibility Field ap-
proximations simplify the volume rendering integral computation. Figure 2-5 is pro-
vided for reference. We first decompose the reflected radiance 𝐿𝑟 (x, 𝜔𝑜 ) into its direct
and indirect illumination components. Let us define 𝐿𝑒 (x, 𝜔𝑖 ) as radiance due to a
light source arriving at point x from direction 𝜔𝑖 . As defined in Equation 2.3, 𝐿(x, 𝜔𝑖 )
is the estimated incoming radiance at location x from direction 𝜔𝑖 . This means the
incident illumination 𝐿𝑖 decomposes into 𝐿𝑒 + 𝐿 (direct plus indirect light). The
shading calculation for 𝐿𝑟 then becomes:
\[
L_r(\mathbf{x}, \omega_o) = \int_{\mathcal{S}} \big(L_e(\mathbf{x}, \omega_i) + L(\mathbf{x}, -\omega_i)\big)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i \tag{2.7}
\]
\[
= \underbrace{\int_{\mathcal{S}} L_e(\mathbf{x}, \omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i}_{\text{component due to direct lighting}} + \underbrace{\int_{\mathcal{S}} L(\mathbf{x}, -\omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i}_{\text{component due to indirect lighting}} .
\]
To calculate incident direct lighting $L_e$ we must account for the attenuation of the (known) environment map $E$ due to the volume density along the incident illumination ray $\omega_i$:
\[
L_e(\mathbf{x}, \omega_i) = V(\mathbf{x}, \omega_i)\, E(\mathbf{x}, -\omega_i) . \tag{2.8}
\]
Instead of evaluating $V$ as another line integral through the volume, we use the visibility MLP's approximation $\tilde{V}_\varphi$. With this, our full calculation for the direct lighting component of camera ray radiance $L(\mathbf{c}, \omega_o)$ simplifies to:
\[
\int_0^\infty V\big(\mathbf{x}(t), \mathbf{c}\big)\, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_\varphi\big(\mathbf{x}(t), \omega_i\big)\, E\big(\mathbf{x}(t), -\omega_i\big)\, R\big(\mathbf{x}(t), \omega_i, \omega_o\big)\, d\omega_i\, dt . \tag{2.9}
\]
By approximating the integrals along rays from each point on the camera ray toward
each environment direction when computing the color of a pixel due to direct lighting,
we have reduced the complexity of rendering with direct lighting from quadratic in
the number of samples per ray to linear.
Next, we focus on the more difficult task of accelerating the computation of ren-
dering with indirect lighting, for which a brute force approach would scale cubically
with the number of samples per ray. We make two approximations to reduce this
intractable computation. Our first approximation is to replace the outermost integral
(the accumulated radiance reflected towards the camera at each point along the ray)
with a single point evaluation by treating the volume as a hard surface located at the
expected termination depth 𝑡′ = 𝐷(c, −𝜔𝑜 ). Note that we do not use the visibility
MLP’s approximation of 𝑡′ here, since we are already sampling 𝜎(x) along the camera
ray. This reduces the indirect contribution of 𝐿(c, 𝜔𝑜 ) to a spherical integral at a
single point x(𝑡′ ):
\[
\int_{\mathcal{S}} L\big(\mathbf{x}(t'), -\omega_i\big)\, R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i . \tag{2.10}
\]
To simplify the recursive evaluation of 𝐿 inside this integral, we limit the indirect
contribution to a single bounce, and use the hard surface approximation a second
time to replace the integral along a ray for each incoming direction:
\[
L\big(\mathbf{x}(t'), -\omega_i\big) \approx \int_{\mathcal{S}} L_e\big(\mathbf{x}'(t''), \omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i' , \tag{2.11}
\]
where $t'' = \tilde{D}_\varphi(\mathbf{x}(t'), \omega_i)$ is the expected intersection depth along the ray $\mathbf{x}'(t'') = \mathbf{x}(t') + t''\omega_i$ as approximated by the visibility MLP. Thus the expression for the
component of camera ray radiance 𝐿(c, 𝜔𝑜 ) due to indirect lighting is:
\[
\int_{\mathcal{S}} \left( \int_{\mathcal{S}} L_e\big(\mathbf{x}'(t''), \omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i' \right) R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i , \tag{2.12}
\]
and fully expanding the direct radiance $L_e(\mathbf{x}'(t''), \omega_i')$ incident at each secondary intersection point gives us:
\[
\int_{\mathcal{S}} \left( \int_{\mathcal{S}} \tilde{V}_\varphi\big(\mathbf{x}'(t''), \omega_i'\big)\, E\big(\mathbf{x}'(t''), -\omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i' \right) R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i . \tag{2.13}
\]
Finally, we can write out the complete volume rendering equation used by NeRV as
the sum of Equations 2.9 and 2.13:
\[
\begin{aligned}
L(\mathbf{c}, \omega_o) ={} & \int_0^\infty V\big(\mathbf{x}(t), \mathbf{c}\big)\, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_\varphi\big(\mathbf{x}(t), \omega_i\big)\, E\big(\mathbf{x}(t), -\omega_i\big)\, R\big(\mathbf{x}(t), \omega_i, \omega_o\big)\, d\omega_i\, dt \\
&+ \int_{\mathcal{S}} \left( \int_{\mathcal{S}} \tilde{V}_\varphi\big(\mathbf{x}'(t''), \omega_i'\big)\, E\big(\mathbf{x}'(t''), -\omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i' \right) R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i .
\end{aligned} \tag{2.14}
\]
Figure 2-1 illustrates how the approximations made by NeRV reduce the computa-
tional complexity of computing direct and indirect illumination from quadratic and
cubic (respectively) to linear. This enables the simulation of direct illumination from
environment lighting and one-bounce indirect illumination within the training loop
of optimizing a continuous relightable volumetric scene representation.
2.3.4 Rendering
To render a camera ray x(𝑡) = c − 𝑡𝜔𝑜 passing through a NeRV, we estimate the
volume rendering integral in Equation 2.14 using the following procedure:
1. We draw 256 stratified samples along the ray and query the shape and re-
flectance MLPs for the volume densities, surface normals, and BRDF parame-
ters at each point: 𝜎 = MLP𝜃 (x(𝑡)), n = ∇x MLP𝜃 (x(𝑡)), (a, 𝛾) = MLP𝜓 (x(𝑡)).
2. We shade each point along the ray with direct illumination by estimating the
integral in Equation 2.9. First, we generate 𝐸(x(𝑡), −𝜔𝑖 ) by sampling the known
environment lighting on a 12 × 24 grid of directions 𝜔𝑖 on the sphere around
each point. We then multiply this by the predicted visibility 𝑉˜𝜑 (x(𝑡), 𝜔𝑖 ) and
microfacet BRDF values 𝑅(x(𝑡), 𝜔𝑖 , 𝜔𝑜 ) at each sampled 𝜔𝑖 , and integrate this
product over the sphere to produce the direct illumination contribution.
3. We shade each point along the ray with indirect illumination by estimating the
integral in Equation 2.13. First, we compute the expected camera ray termi-
nation depth 𝑡′ = 𝐷(c, −𝜔𝑜 ) using the density samples from Step 1. Next,
we sample 128 random directions on the upper hemisphere at x(𝑡′ ) and query
the visibility MLP for the expected termination depths along each of these rays $t'' = \tilde{D}_\varphi(\mathbf{x}(t'), \omega_i)$ to compute the secondary surface intersection points $\mathbf{x}'(t'') = \mathbf{x}(t') + t''\omega_i$. We then shade each of these points with direct illu-
mination by following the procedure in Step 2. This estimates the indirect
illumination incident at x(𝑡′ ), which we then multiply by the microfacet BRDF
values 𝑅(x(𝑡′ ), 𝜔𝑖 , 𝜔𝑜 ) and integrate over the sphere to produce the indirect
illumination contribution.
4. The total reflected radiance at each point along the camera ray 𝐿𝑟 (x(𝑡), 𝜔𝑜 ) is
the sum of the quantities from Steps 2 and 3. We composite these along the ray
to compute the pixel color using the same quadrature rule [Max, 1995] used in
NeRF:
\[
L(\mathbf{c}, \omega_o) = \sum_t V\big(\mathbf{x}(t), \mathbf{c}\big)\, \alpha\big(\sigma(\mathbf{x}(t))\,\delta\big)\, L_r\big(\mathbf{x}(t), \omega_o\big) , \tag{2.15}
\]
\[
V\big(\mathbf{x}(t), \mathbf{c}\big) = \exp\left( -\sum_{s<t} \sigma(\mathbf{x}(s))\,\delta \right) , \qquad \alpha(z) = 1 - \exp(-z) , \tag{2.16}
\]
where $\delta$ is the distance between adjacent samples along the ray.
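A minimal NumPy sketch of this quadrature (function and variable names are ours, not the thesis's):

```python
import numpy as np

def composite(sigmas, radiances, delta):
    # Quadrature of Equations 2.15-2.16 [Max, 1995]: alpha from each
    # sample's density, transmittance from the densities of all
    # strictly earlier samples along the ray.
    alphas = 1.0 - np.exp(-sigmas * delta)      # alpha(sigma * delta)
    before = np.cumsum(sigmas) - sigmas         # sum over samples s < t
    trans = np.exp(-before * delta)             # V(x(t), c)
    weights = trans * alphas
    return (weights[:, None] * radiances).sum(axis=0)

# One nearly opaque sample followed by another: the first sample's
# radiance dominates the composited color.
color = composite(np.array([1e9, 1.0]),
                  np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                  delta=0.01)
```

The opaque first sample absorbs essentially all transmittance, so the second sample contributes nothing to the pixel.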
these pixels from the current NeRV model. We additionally sample 256 random rays
ℛ′ per training iteration that intersect the volume, and we compute the visibility and
expected termination depth at each location, into either direction along each ray for
use as supervision for the visibility MLP. We minimize the sum of three losses:
\[
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \tau\big(\tilde{L}(\mathbf{r})\big) - \tau\big(L(\mathbf{r})\big) \big\|_2^2 + \lambda \sum_{\mathbf{r}' \in \mathcal{R}' \cup \mathcal{R},\, t} \Big( \big\| \tilde{V}_\varphi(\mathbf{r}'(t)) - V_\theta(\mathbf{r}'(t)) \big\|_2^2 + \big\| \tilde{D}_\varphi(\mathbf{r}'(t)) - D_\theta(\mathbf{r}'(t)) \big\|_2^2 \Big) , \tag{2.17}
\]
where $\tau(x) = x/(1+x)$ is a tone mapping operator [Gharbi et al., 2019], $L(\mathbf{r})$ and $\tilde{L}(\mathbf{r})$ are the ground truth and predicted camera ray radiance values (ground-truth values are simply the colors of input image pixels), $\tilde{V}_\varphi(\mathbf{r})$ and $\tilde{D}_\varphi(\mathbf{r})$ are the predicted visibility
and expected termination depth from our visibility MLP given its current weights 𝜑,
𝑉𝜃 (r) and 𝐷𝜃 (r) are the estimates of visibility and termination depth implied by the
shape MLP given its current weights 𝜃, and 𝜆 = 20 is the weight of the loss terms
encouraging the visibility MLP to be consistent with the shape MLP.
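The structure of this objective can be sketched in plain NumPy (the stop-gradient on the shape-MLP targets is omitted here since there is no autodiff, and all names are ours):

```python
import numpy as np

def tonemap(x):
    # tau(x) = x / (1 + x): compresses HDR radiance before the squared error.
    return x / (1.0 + x)

def nerv_loss(pred_rgb, gt_rgb, vis_mlp, vis_shape, depth_mlp, depth_shape,
              lam=20.0):
    # Tone-mapped reconstruction term plus lambda-weighted consistency
    # terms tying the visibility MLP outputs to the values implied by
    # the shape MLP (the structure of Equation 2.17).
    recon = np.sum((tonemap(pred_rgb) - tonemap(gt_rgb)) ** 2)
    consistency = (np.sum((vis_mlp - vis_shape) ** 2)
                   + np.sum((depth_mlp - depth_shape) ** 2))
    return recon + lam * consistency

# Perfect predictions on every term incur zero loss.
loss = nerv_loss(np.ones(3), np.ones(3),
                 np.zeros(4), np.zeros(4),
                 np.ones(4), np.ones(4))
```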
Note that the visibility MLP is not supervised using any “ground truth” visibility
or termination depth: It is only optimized to be consistent with the NeRV’s current
estimate of scene geometry, by evaluating Equation 2.5 and Equation 2.6 using the
densities 𝜎 emitted by the shape MLP𝜃 . We apply a stop_gradient to 𝑉𝜃 and 𝐷𝜃
in the last two terms of the loss, so the shape MLP is not encouraged to degrade its
own performance to better match the output from the visibility MLP. We implement
our model in JAX [Bradbury et al., 2018] and optimize it using Adam [Kingma
and Ba, 2015] with a learning rate that begins at 10−5 and decays exponentially to
10−6 over the course of optimization (the other Adam hyperparameters are default
values: 𝛽1 = 0.9, 𝛽2 = 0.999, and 𝜖 = 10−8 ). Each model is trained for a million
iterations using 128 Tensor Processing Unit (TPU) cores, which takes around one
day to converge.
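The decay schedule amounts to log-linear interpolation between the initial and final learning rates; a sketch (the actual JAX training code may structure this differently):

```python
def learning_rate(step, total_steps=1_000_000, lr_init=1e-5, lr_final=1e-6):
    # Exponential decay from lr_init to lr_final: linear interpolation
    # in log-space over the course of optimization.
    t = min(step / total_steps, 1.0)
    return lr_init * (lr_final / lr_init) ** t
```

At step 0 this returns 1e-5, at the final step 1e-6, and every intermediate rate lies between the two.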
2.4 Method: One Unknown Illumination
2.4.1 Shape
The input to NeRFactor is the same as what is used by Neural Radiance Fields (NeRF)
[Mildenhall et al., 2020], so we can apply NeRF to our input images to compute
initial geometry. NeRF optimizes a neural radiance field: an MLP that maps from
any 3D spatial coordinate and 2D viewing direction to the volume density at that
3D location and color emitted by particles at that location along the 2D viewing
direction. NeRFactor leverages NeRF’s estimated geometry by “distilling” it into a
continuous surface representation that we use to initialize NeRFactor’s geometry. In
particular, we use the optimized NeRF to compute the expected surface location
along any camera ray, the surface normal at each point on the object’s surface, and
the visibility of light arriving from any direction at each point on the object’s surface.
This subsection describes how we derive these functions from the optimized NeRF
and how we re-parameterize them with MLPs so that they can be finetuned after this
initialization step to improve the full re-rendering loss (Figure 2-7).
¹In this section, vectors and matrices (as well as functions that return them) are in bold; scalars and scalar functions are not.
Figure 2-6: NeRFactor model and its example output. NeRFactor is an all-MLP surface model that predicts the surface normal 𝑛, light visibility 𝑣, albedo 𝑎, and BRDF latent code 𝑧BRDF for each surface location 𝑥surf, as well as the lighting condition. It factorizes, in an unsupervised manner, the appearance of a scene observed under one unknown lighting condition. It tackles this severely ill-posed problem by using a reconstruction loss, simple smoothness regularization, and data-driven BRDF priors. Modeling visibility explicitly, NeRFactor is a physically-based model that supports hard and soft shadows under arbitrary lighting.
Surface Points Given a camera and a trained NeRF, we compute the location
at which a ray 𝑟(𝑡) = 𝑜 + 𝑡𝑑 from that camera 𝑜 along direction 𝑑 is expected to
terminate according to NeRF’s optimized volume density 𝜎:
\[
x_{\text{surf}} = o + \left( \int_0^\infty T(t)\, \sigma\big(r(t)\big)\, t\, dt \right) d , \qquad T(t) = \exp\left( -\int_0^t \sigma\big(r(s)\big)\, ds \right) , \tag{2.18}
\]
where 𝑇 (𝑡) is the probability that the ray travels distance 𝑡 without being blocked.
Instead of maintaining a full volumetric representation, we fix the geometry to lie
on this surface distilled from the optimized NeRF. This enables much more efficient
relighting during both training and inference because we can compute the outgoing
radiance just at each camera ray’s expected termination location instead of every
point along each camera ray.
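Discretizing Equation 2.18 with stratified samples gives a short routine (a sketch; names and the test geometry are ours):

```python
import numpy as np

def expected_surface_point(o, d, t_vals, sigmas):
    # Discrete version of Equation 2.18: the expected termination depth
    # is the alpha-composited average of the sample depths t_i.
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])
    alphas = 1.0 - np.exp(-sigmas * deltas)
    before = np.cumsum(sigmas * deltas) - sigmas * deltas  # integral up to t
    trans = np.exp(-before)                                # T(t)
    depth = np.sum(trans * alphas * t_vals)
    return o + depth * d

# A dense "wall" starting at t = 2 along the ray: the expected surface
# point lands on the wall.
t_vals = np.linspace(0.0, 4.0, 401)
sigmas = np.where(t_vals >= 2.0, 1e6, 0.0)
p = expected_surface_point(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                           t_vals, sigmas)
```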
\[
\ell_{\mathrm{n}} = \sum_{x_{\text{surf}}} \left( \frac{\lambda_1}{3} \big\| f_{\mathrm{n}}(x_{\text{surf}}) - n_{\mathrm{a}}(x_{\text{surf}}) \big\|_2^2 + \frac{\lambda_2}{3} \big\| f_{\mathrm{n}}(x_{\text{surf}}) - f_{\mathrm{n}}(x_{\text{surf}} + \epsilon) \big\|_1 \right) , \tag{2.19}
\]
𝑥 on the expected surface increases the robustness of the MLP by providing a “safe
margin” where the output remains well-behaved even when the input is slightly dis-
placed from the surface. As shown in Figure 2-7, NeRFactor’s normal MLP produces
normals that are significantly higher-quality than those produced by NeRF and are
smooth enough to be used for relighting (Figure 2-9).
Light Visibility We compute the visibility 𝑣a to each light source from any point
by marching through NeRF’s 𝜎-volume from the point to each light location, as in Bi
et al. [2020a]. However, similarly to the surface normal estimation described above,
the visibility estimates derived directly from NeRF’s 𝜎-volume are too noisy to be used
directly and result in rendering artifacts (see the supplemental video). We address this
by re-parameterizing the visibility function as another MLP that maps from a surface
location 𝑥surf and a light direction 𝜔i to the light visibility 𝑣: 𝑓v : (𝑥surf , 𝜔i ) ↦→ 𝑣. We
optimize the weights of 𝑓v to encourage the recovered visibility field I) to be close to
the visibility traced from the NeRF, II) to be spatially smooth, and III) to reproduce
the observed appearance. Specifically, the loss function implementing I) and II) is:
\[
\ell_{\mathrm{v}} = \sum_{x_{\text{surf}}} \sum_{\omega_{\mathrm{i}}} \Big( \lambda_3 \big( f_{\mathrm{v}}(x_{\text{surf}}, \omega_{\mathrm{i}}) - v_{\mathrm{a}}(x_{\text{surf}}, \omega_{\mathrm{i}}) \big)^2 + \lambda_4 \big| f_{\mathrm{v}}(x_{\text{surf}}, \omega_{\mathrm{i}}) - f_{\mathrm{v}}(x_{\text{surf}} + \epsilon, \omega_{\mathrm{i}}) \big| \Big) , \tag{2.20}
\]
where 𝜖 is the random displacement defined above, and 𝜆3 and 𝜆4 are hyperparameters
set to 0.1 and 0.05, respectively.
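A sketch of this loss in NumPy (names are ours; `f_v` may be any callable standing in for the visibility MLP, and the jitter is shared per point so that smoothness is measured under the same light direction):

```python
import numpy as np

def visibility_loss(f_v, points, omegas, v_nerf, lam3=0.1, lam4=0.05,
                    eps_scale=0.01, seed=0):
    # Sketch of Equation 2.20: a data term pulling the visibility MLP
    # toward the values traced through NeRF's sigma-volume, plus a
    # smoothness term comparing each point against a jittered neighbor
    # under the SAME light direction (spatial smoothness only, never
    # blurring across light directions).
    rng = np.random.default_rng(seed)
    eps = eps_scale * rng.standard_normal(points.shape)
    loss = 0.0
    for i, x in enumerate(points):
        for j, w in enumerate(omegas):
            loss += lam3 * (f_v(x, w) - v_nerf[i, j]) ** 2
            loss += lam4 * abs(f_v(x, w) - f_v(x + eps[i], w))
    return loss

# A spatially constant visibility field that matches the NeRF-traced
# values incurs zero loss.
const_vis = lambda x, w: 1.0
loss = visibility_loss(const_vis,
                       np.zeros((2, 3)),   # surface points
                       np.eye(3)[:2],      # two light directions
                       np.ones((2, 2)))    # NeRF-traced visibility
```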
As the equation shows, smoothness is encouraged across spatial locations given
the same 𝜔i , not the other way around. This is by design, to avoid the visibility at a
certain location getting blurred over different light locations. Note that this is similar
to the visibility fields in NeRV, but in NeRFactor we optimize the visibility MLP
parameters to denoise the visibility derived from a pretrained NeRF and minimize
the re-rendering loss. For computing the NeRF visibility, we use a fixed set of 512
light locations given a predefined illumination resolution (to be discussed later). After
optimization, 𝑓v produces spatially smooth and realistic estimates of light visibility,
as can be seen in Figure 2-7 (II) and Figure 2-8 (C) where we visualize the average
visibility over all light directions (i.e., ambient occlusion).
In practice, before the full optimization of our model, we independently pretrain
the visibility and normal MLPs to just reproduce the visibility and normal values from
the NeRF 𝜎-volume without any smoothness regularization or re-rendering loss. This
provides a reasonable initialization of the visibility maps, which prevents the albedo or
BRDF MLP from mistakenly attempting to explain away shadows as being modeled
as “painted on” reflectance variation (see “w/o geom. pretrain.” in Figure 2-19 and
Table 2.1).
2.4.2 Reflectance
\[
R(x_{\text{surf}}, \omega_{\mathrm{i}}, \omega_{\mathrm{o}}) = \frac{a(x_{\text{surf}})}{\pi} + f_{\mathrm{r}}(x_{\text{surf}}, \omega_{\mathrm{i}}, \omega_{\mathrm{o}}) . \tag{2.21}
\]
Prior art in neural rendering has explored the use of parameterizing 𝑓r with analytic
BRDFs, such as microfacet models [Bi et al., 2020a, Srinivasan et al., 2021], within
a NeRF-like setting. Although these analytic models provide an effective BRDF pa-
rameterization for optimization to explore, no prior is imposed upon the parameters
themselves: All materials that are expressible within a microfacet model are consid-
ered equally likely a priori. Additionally, the use of an explicit analytic model limits
the set of materials that can be recovered, and this is insufficient for modeling all
real-world reflectance functions.
Instead of assuming an analytic BRDF, NeRFactor starts with a learned re-
flectance function that is pretrained to reproduce a wide range of empirically observed
real-world reflectance functions while also learning a latent space for those real-world
reflectance functions. By doing so, we learn data-driven priors on real-world BRDFs
that encourage the optimization procedure to recover plausible reflectance functions.
The use of such priors is crucial: Because all of our observed images are taken under
one (unknown) illumination, our problem is highly ill-posed, so priors are necessary
to disambiguate the most likely factorization of the scene from the set of all possible
factorizations.
\[
\ell_{\mathrm{a}} = \lambda_5 \sum_{x_{\text{surf}}} \frac{1}{3} \big\| f_{\mathrm{a}}(x_{\text{surf}}) - f_{\mathrm{a}}(x_{\text{surf}} + \epsilon) \big\|_1 , \tag{2.22}
\]
Learning Priors From Real-World BRDFs For the specular components of the
BRDF, we seek to learn a latent space of real-world BRDFs and a paired “decoder”
that translates each latent code in the learned space 𝑧BRDF to a full 4D BRDF. To this
end, we adopt the Generative Latent Optimization (GLO) approach by Bojanowski
et al. [2018], which has been previously used by other coordinate-based models such
as Park et al. [2019a] and Martin-Brualla et al. [2021].
The 𝑓r component of our model is pretrained using the MERL dataset [Matusik et al., 2003]. Because the MERL dataset assumes isotropic materials, we parameterize the incoming and outgoing directions for 𝑓r using Rusinkiewicz coordinates [Rusinkiewicz, 1998] (𝜑d , 𝜃h , 𝜃d ) (three degrees of freedom) instead of 𝜔i and
𝜔o (four degrees of freedom). Denote this coordinate conversion by 𝑔 : (𝑛, 𝜔i , 𝜔o ) ↦→
(𝜑d , 𝜃h , 𝜃d ), where 𝑛 is the surface normal at that point. We train a function 𝑓r′ (a
re-parameterization of 𝑓r ) that maps from a concatenation of a latent code 𝑧BRDF
(which represents a BRDF identity) and a set of Rusinkiewicz coordinates (𝜑d , 𝜃h , 𝜃d )
to an achromatic reflectance 𝑟:
\[
f_{\mathrm{r}}' : \big(z_{\text{BRDF}}, (\varphi_{\mathrm{d}}, \theta_{\mathrm{h}}, \theta_{\mathrm{d}})\big) \mapsto r . \tag{2.23}
\]
To train this model, we optimize both the MLP weights and the set of latent codes
𝑧BRDF to reproduce the real-world BRDFs. Simple mean squared errors are computed
on the log of the High-Dynamic-Range (HDR) reflectance values to train 𝑓r′ .
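One way to implement the conversion 𝑔 is sketched below, under the convention that both directions are expressed in a local shading frame with the normal at (0, 0, 1); the thesis's exact implementation may differ:

```python
import numpy as np

def rot_y(a):
    # Rotation matrix about the y axis.
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    # Rotation matrix about the z axis.
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rusinkiewicz(wi, wo):
    # g: (wi, wo) -> (phi_d, theta_h, theta_d), with the normal n = (0, 0, 1).
    # Half vector and its spherical angles relative to the normal.
    h = (wi + wo) / np.linalg.norm(wi + wo)
    theta_h = np.arccos(np.clip(h[2], -1.0, 1.0))
    phi_h = np.arctan2(h[1], h[0])
    # Rotate wi so that the half vector becomes the pole; what remains is
    # the "difference" direction.
    d = rot_y(-theta_h) @ rot_z(-phi_h) @ wi
    theta_d = np.arccos(np.clip(d[2], -1.0, 1.0))
    phi_d = np.arctan2(d[1], d[0])
    return phi_d, theta_h, theta_d

# Mirror configuration about the normal: the half vector coincides with
# the normal (theta_h = 0) and theta_d equals the incidence angle.
wi = np.array([np.sin(np.pi / 4), 0.0, np.cos(np.pi / 4)])
wo = np.array([-np.sin(np.pi / 4), 0.0, np.cos(np.pi / 4)])
phi_d, theta_h, theta_d = rusinkiewicz(wi, wo)
```

For isotropic BRDFs the azimuth of the half vector (𝜑h) is discarded, which is why only three coordinates remain.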
After this pretraining, the weights of this BRDF MLP are frozen during the joint
optimization of our entire model, and we predict only 𝑧BRDF for each 𝑥surf by training
from scratch a BRDF identity MLP (Figure 2-6): 𝑓z : 𝑥surf ↦→ 𝑧BRDF . This can
be thought of as predicting spatially-varying BRDFs for all the surface points in
the plausible space of real-world BRDFs. We optimize the BRDF identity MLP to
minimize the re-rendering loss and the same spatial smoothness prior as in albedo:
\[
\ell_{\mathrm{z}} = \lambda_6 \sum_{x_{\text{surf}}} \frac{ \big\| f_{\mathrm{z}}(x_{\text{surf}}) - f_{\mathrm{z}}(x_{\text{surf}} + \epsilon) \big\|_1 }{ \dim(z_{\text{BRDF}}) } , \tag{2.24}
\]
²In principle, one should be able to perform diffuse-specular separation on the MERL BRDFs and then learn priors on just the specular lobes. We experimented with this idea by using the separation provided by Sun et al. [2018a], but this yielded qualitatively worse end results.
where 𝜆6 is a hyperparameter set to 0.01, and dim(𝑧BRDF ) denotes the dimension-
ality of the BRDF latent code (3 in our implementation because there are only 100
materials in the MERL dataset).
The final BRDF is the sum of the Lambertian component and the learned non-
diffuse reflectance (subscript of 𝑥surf dropped for brevity):
\[
R(x, \omega_{\mathrm{i}}, \omega_{\mathrm{o}}) = \frac{f_{\mathrm{a}}(x)}{\pi} + f_{\mathrm{r}}'\Big( f_{\mathrm{z}}(x),\, g\big(f_{\mathrm{n}}(x), \omega_{\mathrm{i}}, \omega_{\mathrm{o}}\big) \Big) , \tag{2.25}
\]
2.4.3 Illumination
We adopt a simple and direct representation of lighting: an HDR light probe image
[Debevec, 1998] in the latitude-longitude format. In contrast to spherical harmonics
or a mixture of spherical Gaussians, this representation allows our model to represent
detailed high-frequency lighting and therefore to support hard cast shadows. That
said, the challenges of using this representation are clear: It contains a large number
of parameters, and every pixel/parameter can vary independently of all other pix-
els/parameters. This issue can be ameliorated by our use of the light visibility MLP,
which allows us to quickly evaluate a surface point’s visibility to all pixels of the light
probe. Empirically, we use a 16 × 32 resolution for our lighting environments as we do
not expect to recover higher-frequency content in the light probe image beyond that
resolution (the environment is effectively low-pass filtered by each object’s BRDFs as
discussed by Ramamoorthi and Hanrahan [2004], and the objects in our datasets are
not shiny or mirror-like).
To encourage smoother lighting, we apply a simple ℓ2 gradient penalty on the
pixels of the light probe 𝐿 along both the horizontal and vertical directions:
\[
\ell_{\mathrm{i}} = \lambda_7 \left( \Big\| \begin{bmatrix} -1 & 1 \end{bmatrix} * L \Big\|_2^2 + \Big\| \begin{bmatrix} -1 \\ 1 \end{bmatrix} * L \Big\|_2^2 \right) , \tag{2.26}
\]
(given that there are 512 pixels with HDR values). During the joint optimization,
these probe pixels get updated directly by the final reconstruction loss and the gra-
dient penalty.
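This gradient penalty amounts to squared finite differences over the probe pixels; a sketch (the value of 𝜆7 below is a placeholder, not the thesis's setting):

```python
import numpy as np

def probe_smoothness(L, lam7=1.0):
    # Equation 2.26: l2 penalty on horizontal and vertical finite
    # differences ([-1 1] and its transpose) of the HDR light probe.
    dh = L[:, 1:] - L[:, :-1]   # horizontal gradient (longitude)
    dv = L[1:, :] - L[:-1, :]   # vertical gradient (latitude)
    return lam7 * (np.sum(dh ** 2) + np.sum(dv ** 2))

# A constant 16x32 probe is perfectly smooth and incurs zero penalty.
smooth = probe_smoothness(np.ones((16, 32)))
```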
2.4.4 Rendering
Given the surface normal, visibility of all light directions, albedo, and BRDF at each
point 𝑥surf , as well as the estimated environment lighting, the final physically-based,
non-learnable renderer renders an image that is then compared against the observed
image. The errors in this rendered image are backpropagated up to, but excluding,
the 𝜎-volume of the pretrained NeRF, thereby driving the joint estimation of surface
normals, light visibility, albedo, BRDFs, and illumination.
Given the ill-posed nature of the problem (largely due to our only observing one
unknown illumination), we expect the majority of useful information to be from
direct illumination rather than global illumination and therefore consider only single-
bounce direct illumination (i.e., from the light source to the object surface then to
the camera). This assumption also reduces the computational cost of evaluating our
model. Mathematically, the rendering equation in our setup is (subscript of 𝑥surf
dropped again for brevity):
\[
L_{\mathrm{o}}(x, \omega_{\mathrm{o}}) = \int_\Omega R(x, \omega_{\mathrm{i}}, \omega_{\mathrm{o}})\, L_{\mathrm{i}}(x, \omega_{\mathrm{i}})\, \big(\omega_{\mathrm{i}} \cdot n(x)\big)\, d\omega_{\mathrm{i}} \tag{2.27}
\]
\[
= \sum_{\omega_{\mathrm{i}}} R(x, \omega_{\mathrm{i}}, \omega_{\mathrm{o}})\, L_{\mathrm{i}}(x, \omega_{\mathrm{i}})\, \big(\omega_{\mathrm{i}} \cdot f_{\mathrm{n}}(x)\big)\, \Delta\omega_{\mathrm{i}} \tag{2.28}
\]
\[
= \sum_{\omega_{\mathrm{i}}} \left( \frac{f_{\mathrm{a}}(x)}{\pi} + f_{\mathrm{r}}'\Big( f_{\mathrm{z}}(x),\, g\big(f_{\mathrm{n}}(x), \omega_{\mathrm{i}}, \omega_{\mathrm{o}}\big) \Big) \right) L_{\mathrm{i}}(x, \omega_{\mathrm{i}})\, \big(\omega_{\mathrm{i}} \cdot f_{\mathrm{n}}(x)\big)\, \Delta\omega_{\mathrm{i}} , \tag{2.29}
\]
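The discretized sum of Equation 2.28 can be evaluated per surface point once visibility is known. A NumPy sketch under assumed array shapes (the names and shapes are illustrative, not the thesis implementation):

```python
import numpy as np

def render_direct(brdf, li, light_dirs, normal, visibility, delta_omega):
    """Single-bounce direct illumination for one surface point (Eq. 2.28).
    brdf:        (S, 3) BRDF value per incoming direction (including the
                 diffuse albedo/pi term of Eq. 2.29)
    li:          (S, 3) incident radiance from each light-probe pixel
    light_dirs:  (S, 3) unit incoming directions omega_i
    normal:      (3,)   unit surface normal
    visibility:  (S,)   0/1 visibility of each light direction
    delta_omega: (S,)   solid angle subtended by each probe pixel"""
    cos = np.clip(light_dirs @ normal, 0.0, None)   # (omega_i . n), clamped
    w = (visibility * cos * delta_omega)[:, None]   # (S, 1) per-light weight
    return np.sum(brdf * li * w, axis=0)            # outgoing radiance, (3,)
```

For a single unit-radiance light aligned with the normal and a Lambertian BRDF of albedo 1, the weights integrate to exactly 1, which is a useful sanity check.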
The loss function for the joint optimization is the summation of all the previously defined losses: ℓrecon + ℓn + ℓv + ℓa + ℓz + ℓi .
NeRFactor is implemented in TensorFlow 2 [Abadi et al., 2016]. All training uses the
Adam optimizer [Kingma and Ba, 2015] with the default hyperparameters.
Staged Training There are three stages in training NeRFactor. First, we optimize
a NeRF using the input posed images (once per scene) and train a BRDF MLP on
the MERL dataset (only once for all scenes). Both of these MLPs are frozen during
the final joint optimization since the NeRF only provides a shape initialization, and
the BRDF MLP provides a latent space of real-world BRDFs for the optimization
to explore. Future shape refinement happens in NeRFactor’s normal and visibility
MLPs, and the actual material prediction happens in NeRFactor’s albedo and BRDF
identity MLPs. Second, we use this trained NeRF to initialize our geometry by
optimizing the normal and visibility MLPs to simply reproduce the NeRF values,
without any additional smoothness loss or regularization. Finally, we jointly optimize
the albedo MLP, BRDF identity MLP, and light probe pixels from scratch, along with
the pretrained normal and visibility MLPs. Fine-tuning the normal and visibility
MLPs along with the reflectance and lighting allows the errors in NeRF’s initial
geometry to be improved in order to minimize the re-rendering loss (Figure 2-7).
Runtime We train NeRF for 2,000 epochs, which takes 6–8 hours when distributed
over four NVIDIA TITAN RTX GPUs. Prior to the final joint optimization, computing the initial surface normals and light visibility from the trained NeRF takes 30 min
per view on a single GPU for a 16 × 32 light probe (i.e., 512 light source locations).
This step can be trivially parallelized because each view is processed independently.
Geometry pretraining is performed for 200 epochs, which takes around 20 min on one
TITAN RTX. The final joint optimization is performed for 100 epochs, which takes
only 20 min on one TITAN RTX.
2.5 Results
In this section, we first explain how the datasets are constructed for our tasks (Sec-
tion 2.5.1). Next, we present the high-quality geometry achieved by NeRFactor
(Section 2.5.2) and then its joint estimation of shape, reflectance, and direct illu-
mination (Section 2.5.3). Finally, we showcase the applications of NeRFactor’s ap-
pearance factorization: free-viewpoint relighting (Section 2.5.4) and material editing
(Section 2.5.5).
2.5.1 Data
NeRV uses synthetic multi-view images of an object and their ground-truth camera
poses. NeRFactor additionally uses three sources of data: real multi-view images
of an object and their estimated camera poses, real-world measured Bidirectional
Reflectance Distribution Functions (BRDFs), and captured light probes.
Synthetic Renderings We use the synthetic Blender scenes (hotdog, drums, lego,
and ficus) released by Mildenhall et al. [2020], construct a new Blender scene
(armadillo), and replace the illumination used there with our own arbitrary, man-
made illuminations (for NeRV) or natural illuminations (for NeRFactor) taken from
real light probe images (we use publicly available light probes from hdrihaven.com,
Stumpfel et al. [2004], and Blender). This yields significantly more natural input
illumination conditions.
We also disable all non-standard post-rendering effects used by Blender Cycles
when rendering the images, such as “filmic” tone mapping, and retain only the stan-
dard linear-to-sRGB tone mapping. We render all images directly to PNGs instead
of EXRs to simulate real-world mobile phone captures where raw High-Dynamic-
Range (HDR) pixel intensities may not be available; this indeed facilitates applying
NeRFactor directly to real scenes as shown in Figure 2-10.
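The retained linear-to-sRGB tone mapping is the standard piecewise transfer function; a NumPy sketch of the conversion applied before quantizing renders to 8-bit PNGs (for illustration only):

```python
import numpy as np

def linear_to_srgb(x):
    """Standard linear-to-sRGB transfer function (IEC 61966-2-1):
    a linear segment near black, then a 1/2.4-power gamma curve."""
    x = np.clip(x, 0.0, 1.0)
    return np.where(x <= 0.0031308,
                    12.92 * x,
                    1.055 * np.power(x, 1.0 / 2.4) - 0.055)
```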
Real Captures NeRFactor uses mobile phone captures of two real scenes released
by Mildenhall et al. [2020]: vasedeck and pinecone. These scenes are captured
by inwards-facing cameras on the upper hemisphere. There are close to 100 images
per scene, and the camera poses are obtained by COLMAP Structure From Motion
(SFM) [Schönberger and Frahm, 2016]. NeRFactor is directly applicable because it
is designed to work with PNGs instead of EXRs.
Measured BRDFs NeRFactor uses real measured BRDFs from the MERL dataset
by Matusik et al. [2003]. The MERL dataset consists of 100 real-world BRDFs mea-
sured by a conventional gonioreflectometer. Because the color components of BRDFs
are not used by our model, we convert the RGB reflectance values to be achromatic
by converting linear RGB values to relative luminance.
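This conversion is a weighted sum of the linear RGB channels; a sketch using the Rec. 709 relative-luminance coefficients (the function name is illustrative):

```python
import numpy as np

def achromatic_brdf(rgb):
    """Convert linear-RGB reflectance to relative luminance (Rec. 709
    coefficients), yielding an achromatic BRDF value."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ np.array([0.2126, 0.7152, 0.0722])
```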
NeRFactor jointly estimates an object’s shape in the form of surface points and their
associated surface normals as well as their visibility to each light location. Figure 2-7
visualizes these geometric properties. To visualize light visibility, we take the mean
of the 512 visibility maps corresponding to each pixel of a 16 × 32 light probe and
visualize that average map (i.e., ambient occlusion) as a grayscale image. See the
supplemental video for movies of per-light visibility maps (i.e., shadow maps). As
Figure 2-7 shows, our surface normals and light visibility are smooth and resemble
the ground truth, thanks to the joint estimation procedure that optimizes normals
and visibility to minimize re-rendering errors and encourage spatial smoothness.
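The ambient-occlusion visualization described above is simply the mean over the per-light visibility maps; for illustration:

```python
import numpy as np

def ambient_occlusion(visibility_maps):
    """Average the per-light visibility maps (e.g., 512 maps of shape
    (H, W), one per pixel of a 16 x 32 probe) into a single grayscale
    ambient-occlusion image."""
    return np.mean(visibility_maps, axis=0)
```

Higher values indicate points that are, on average, more exposed to the environment lighting.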
If we ablate the spatial smoothness constraints and rely on only the re-rendering loss, we end up with noisy geometry that is insufficient for rendering.

[Figure 2-7: Surface normals (A) derived from NeRF, (B) jointly optimized, and (C) jointly optimized with smoothness constraints (NeRFactor), compared against (D) the ground truth.]

Although these
geometry-induced artifacts may not show up under low-frequency lighting, harsh light-
ing conditions (such as a single point light with no ambient illumination, i.e., One-
Light-at-A-Time or OLAT) reveal them as demonstrated in our supplemental video.
Perhaps surprisingly, even when our smoothness constraints are disabled, the geom-
etry estimated by NeRFactor is still significantly less noisy than the original NeRF
geometry (compare [A] with [B] of Figure 2-7 and see [I] of Table 2.1) because the
re-rendering loss encourages smoother geometry. See Section 2.6.4 for more details.
Method               | I. Normals | II. Albedo              | III. View Synthesis     | IV. FV Relighting (point) | V. FV Relighting (image)
                     | Angle (°)↓ | PSNR↑   SSIM↑   LPIPS↓  | PSNR↑   SSIM↑   LPIPS↓  | PSNR↑   SSIM↑   LPIPS↓    | PSNR↑   SSIM↑   LPIPS↓
SIRFS                | -          | 26.0204 0.9420  0.0719  | -       -       -       | -       -       -         | -       -       -
Oxholm & Nishino†    | 32.0104    | 26.3248 0.9448  0.0870  | 29.8093 0.9275  0.0810  | 20.9979 0.8407  0.1610    | 22.2783 0.8762  0.1364
NeRFactor            | 22.1327    | 28.7099 0.9533  0.0621  | 32.5362 0.9461  0.0457  | 23.6206 0.8647  0.1264    | 26.6275 0.9026  0.0917
  using microfacet   | 22.1804    | 29.1608 0.9571  0.0567  | 32.4409 0.9457  0.0458  | 23.7885 0.8642  0.1256    | 26.5970 0.9011  0.0925
  w/o geom. pretrain.| 25.5302    | 27.7936 0.9480  0.0677  | 32.3835 0.9449  0.0491  | 23.1689 0.8585  0.1384    | 25.8185 0.8966  0.1027
  w/o smoothness     | 26.2229    | 27.7389 0.9179  0.0853  | 32.7156 0.9450  0.0405  | 23.0119 0.8455  0.1283    | 26.0416 0.8887  0.0920
  using NeRF's shape | 32.0634    | 27.8183 0.9419  0.0689  | 30.7022 0.9210  0.0614  | 22.0181 0.8237  0.1470    | 24.8908 0.8651  0.1154

Reported numbers are the arithmetic means of all four synthetic scenes (hotdog, ficus, lego, and drums) over eight uniformly sampled novel views. The top three performing techniques for each metric are highlighted in red, orange, and yellow, respectively. For Tasks IV and V, we relight the scenes with 16 novel lighting conditions: eight OLAT conditions plus the eight light probes included in Blender. We apply color correction and tonemapping to the albedo estimation before computing the errors (see Section 2.5.3 for details). †Oxholm and Nishino [2014] require the ground-truth illumination, which we provide, and this baseline represents a significantly enhanced version (see Section 2.6.2).
Table 2.1: Quantitative evaluation of NeRFactor. Although ablating the smoothness constraints (“w/o smoothness”) achieves
good view synthesis quality under the original illumination, the noisy estimates lead to poor relighting performance. NeRFactor
achieves the top overall performance across most metrics, although for some metrics, it is outperformed by the microfacet
variant (“using microfacet”), which tends to either converge to the local optimum of maximum roughness everywhere or produce
non-spatially-smooth BRDFs (see the supplemental video). We are unable to present normal, view synthesis, or relighting
errors for SIRFS [Barron and Malik, 2014] as it does not support non-orthographic cameras or “world-space” geometry (though
Figure 2-14 shows that the geometry recovered by SIRFS is inaccurate).
the center of the cymbal, and the metal rims on the sides of the drums. For ficus,
NeRFactor is able to recover the complex leaf geometry of the potted plant. The
average light visibility (ambient occlusion) maps also correctly portray the average
exposure of each point in the scene to the lights. Albedo is recovered cleanly, with
barely any shadowing or shading detail inaccurately attributed to albedo variation;
note how the shading on the drums is absent in the albedo prediction. Moreover,
the predicted light probes correctly reflect the locations of the primary light sources
and the blue sky (blue pixels in [I]). In all three scenes, the predicted BRDFs are
spatially-varying and correctly reflect that different parts of the scene have different
materials, as indicated by different BRDF latent codes in (E).
Instead of modeling illumination with a more sophisticated representation such as spherical harmonics, we opt for a straightforward one: a latitude-longitude map whose pixels are HDR intensities. Because lighting is effectively
convolved by a low-pass filter when reflected by a moderately diffuse BRDF [Ra-
mamoorthi and Hanrahan, 2001], and our objects are not shiny or mirror-like, we do
not expect to recover illumination at a resolution higher than 16 × 32. As shown in
Figure 2-8 (I), NeRFactor estimates a light probe that correctly captures the bright
light source on the far left as well as the blue sky. Similarly, in Figure 2-8 (II), the
dominant light source location is also correctly estimated (the bright white blob on
the left).
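One practical detail of a latitude-longitude parameterization is that each pixel subtends a different solid angle (rows near the poles cover less of the sphere). A sketch of the per-pixel solid angles, assuming pixel centers at row midpoints:

```python
import numpy as np

def latlong_solid_angles(h=16, w=32):
    """Per-pixel solid angles of an equirectangular (lat-long) light
    probe.  Rows near the poles subtend less solid angle (sin(theta)
    weighting); the angles sum to approximately 4*pi over the sphere."""
    theta = (np.arange(h) + 0.5) / h * np.pi        # polar angle per row
    d_theta, d_phi = np.pi / h, 2.0 * np.pi / w
    return np.tile((np.sin(theta) * d_theta * d_phi)[:, None], (1, w))
```

These weights correspond to the Δ𝜔i factors in the discretized rendering equation.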
[Figure 2-8 layout: rows show the ground truth and predictions for three scenes; columns show (A) Rendering, (B) Surface Normals, (C) Light Visibility, (D) Albedo & Illum., and (E) BRDF z.]
Figure 2-8: Joint estimation of shape, reflectance, and lighting by NeRFactor. Here
we visualize factorization produced by NeRFactor (bottom) alongside the ground
truth (top) on three scenes. Although our recovered surface normals, visibility, and
albedo sometimes omit some fine-grained detail, they still closely resemble the ground
truth. Although the illuminations recovered by NeRFactor are oversmoothed (due
to the effective low-pass filtering induced by observing illumination only after it has
been convolved by the object’s BRDFs) and incorrect on the bottom half of the hemi-
sphere (since objects are only ever observed from the top hemisphere), the dominant
light sources and occluders are localized near their ground-truth locations in the
light probes. Note that we are unable to compare against ground-truth BRDFs as
they are defined using Blender’s shader node trees, while our recovered BRDFs are
parameterized by our learned model.
over upper halves of the light probes.
As shown in Figure 2-9 (II), NeRFactor synthesizes correct hard shadows cast
by the hot dogs under the three test OLAT conditions. NeRFactor also produces
realistic renderings of the ficus under the OLAT illuminations (I), especially when
the ficus is back-lit by the point light in (D). Note that the ground truth in (D)
appears brighter than NeRFactor’s results, because NeRFactor models only direct
illumination, whereas the ground-truth image was rendered with global illumination.
When we relight the objects with two new light probes, realistic soft shadows are
synthesized on the plate of hotdog (II).
In ficus, the specularities on the vase correctly reflect the primary light sources
in both test probes. The ficus leaves also exhibit realistic specular highlights
close to the ground truth in (F). In drums (III), the cymbals are correctly estimated
to be specular and exhibit realistic reflection, though different from the ground-truth
anisotropic reflection (D). This is as expected because all MERL BRDFs are isotropic
[Matusik et al., 2003]. Despite being unable to explain these anisotropic reflections,
NeRFactor correctly leaves them out of the albedo rather than interpreting them as albedo paint, since doing so would violate the albedo smoothness constraint and contradict
those reflections’ view dependency. In lego, realistic hard shadows are synthesized
by NeRFactor for the OLAT test conditions (IV).
Relighting Real Scenes We apply NeRFactor to the two real scenes, vasedeck
and pinecone, captured by Mildenhall et al. [2020]. These captures are particularly
suitable for NeRFactor: There are around 100 multi-view images of each scene lit
by an unknown environment lighting. As in NeRF, we run COLMAP SFM [Schön-
berger and Frahm, 2016] to obtain the camera intrinsics and extrinsics for each view.
We then train a vanilla NeRF to obtain an initial shape estimate, which we distill
into NeRFactor and jointly optimize together with reflectance and illumination. As
Figure 2-10 (I) shows, the appearance is factorized into illumination (not pictured)
and 3D fields of surface normals, light visibility, albedo, and spatially-varying BRDF
latent codes that together explain the observed views. With such factorization, we
[Figure 2-9 layout: rows show the ground truth and predictions for four scenes; columns show (A) View Synthesis & Original Illum., (B) OLAT 1, (C) OLAT 2, (D) OLAT 3, (E) “Courtyard,” and (F) “Sunrise.”]
Figure 2-9: Free-viewpoint relighting by NeRFactor. Here we relight the object using
three OLAT illuminations and two real-world illuminations (light probes captured in
the real world). The renderings produced by our model qualitatively resemble the
ground truth and accurately exhibit challenging effects such as specularities and cast
shadows (both hard and soft).
relight the scenes by replacing the estimated illumination with novel arbitrary light
probes (Figure 2-10 [II]). Because our factorization is fully 3D, all the intermediate
buffers can be rendered from any viewpoint, and the relighting results shown are also
from novel viewpoints. Note that the scenes are bounded to avoid faraway geometry
blocking light from certain directions and casting shadows during relighting.
[Figure 2-10 layout: I. Factorizing Appearance, with panels (A) An Input View, (B) Reconstruction, (C) Albedo, (D) BRDF z, (E) Normals, (F) Visibility (mean); II. Relighting, with panels (A) View Synthesis, (B) “Interior,” (C) “Courtyard,” (D) “Studio,” (E) “Sunrise,” (F) “Sunset.”]
Figure 2-10: NeRFactor’s results on real-world captures. (I) Given posed multi-view
images of a real-world object lit by an unknown illumination condition (A), NeR-
Factor factorizes the scene appearance into 3D fields of albedo (C), spatially-varying
BRDF latent codes (D), surface normals (E), and light visibility for all incoming
light directions, visualized here as ambient occlusion (F). Note how the estimated
albedo of the flowers is shading-free. (II) With this factorization, one can synthesize
novel views of the scene relit by any arbitrary lighting. Even on these challenging
real-world scenes, NeRFactor is able to synthesize realistic specularities and shadows
across various lighting conditions.
2.5.5 Material Editing
Since NeRFactor factorizes diffuse albedo and specular BRDF from appearance, one
can edit the albedo, non-diffuse BRDF, or both and re-render the edited object under
an arbitrary lighting condition from any viewpoint. In this subsection, we override
the estimated 𝑧BRDF with the learned latent code of pearl-paint in the MERL dataset, and the estimated albedo with colors linearly interpolated from the turbo colormap,
spatially varying based on the surface points’ 𝑥-coordinates. As Figure 2-11 (left)
demonstrates, with the factorization by NeRFactor, we are able to realistically re-
light the original estimated materials with the two challenging OLAT conditions.
Furthermore, the edited materials are also relit with realistic specular highlights and
hard shadows by the same test OLAT conditions (Figure 2-11 [right]).
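The albedo edit amounts to mapping each surface point's 𝑥-coordinate through a colormap; a sketch with linear interpolation through a list of control colors standing in for the turbo colormap (names and the small epsilon guard are illustrative):

```python
import numpy as np

def albedo_from_x(xyz, cmap_colors):
    """Map each surface point's x-coordinate to an albedo by linear
    interpolation through a list of colormap control colors (a stand-in
    for the turbo colormap used in the thesis)."""
    x = xyz[:, 0]
    t = (x - x.min()) / (x.max() - x.min() + 1e-8)   # normalize to [0, 1]
    idx = t * (len(cmap_colors) - 1)                 # fractional index
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(cmap_colors) - 1)
    frac = (idx - lo)[:, None]
    c = np.asarray(cmap_colors, dtype=np.float64)
    return (1 - frac) * c[lo] + frac * c[hi]
```

The edited albedo then simply replaces the estimated one in the renderer, with geometry and illumination untouched.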
2.6 Discussion
In this section, we first show how NeRV compares with the baseline methods (none of
which models indirect illumination in contrast to NeRV) in free-viewpoint relighting
(Section 2.6.1). Next, we compare NeRFactor against several competitors in the tasks
of appearance factorization and free-viewpoint relighting (Section 2.6.2). We then
perform ablation studies to evaluate the importance of each major model component
of NeRV (Section 2.6.3) and NeRFactor (Section 2.6.4), and then study whether
NeRFactor predicts albedo consistently for the same object but lit by different lighting
conditions (Section 2.6.5).
We evaluate two versions of NeRV: NeRV with Neural Visibility Fields (“NeRV-
NVF”) and NeRV with Test-time Tracing (“NeRV-Trace”). Both methods use the
same training procedure as described above and differ only in how evaluation is per-
formed: NeRV-NVF uses the same visibility approximations used during training at
test time, while NeRV-Trace uses brute-force tracing to estimate visibility to point
light sources to render sharper shadows at test time. We compare against the follow-
ing baselines.
NLT Neural Light Transport (NLT; Section 3.4) requires an input proxy geometry
(which we provide by running marching cubes [Lorensen and Cline, 1987] on NeRFs
[Mildenhall et al., 2020] trained from images of each scene rendered with fixed light-
ing) and trains a convolutional network defined in an object’s texture atlas space
to perform simultaneous relighting and view synthesis. Although our method just
requires images with known but unconstrained lighting conditions for training, NLT
requires multi-view images captured OLAT, where each viewpoint is rendered mul-
tiple times, once per light source. See the supplemental material (Chapter A) for
qualitative comparisons.
NeRF+LE & NeRF+Env NeRF plus Learned Embedding (“NeRF+LE”) and
NeRF plus Fixed Environment Embedding (“NeRF+Env”) represent appearance vari-
ation due to changing lighting using latent variables. Both augment the original NeRF
model with an additional input of a 64-dimensional latent code corresponding to the
scene lighting condition. These approaches are similar to NeRF in the Wild [Martin-
Brualla et al., 2021], which also uses a latent code to describe appearance variation
due to variable lighting. NeRF+LE uses a PointNet [Qi et al., 2017a] encoder to
embed the position and color of each light, and NeRF+Env simply uses the flattened
light probe as the latent code.
Neural Reflectance Fields Neural Reflectance Fields [Bi et al., 2020a] uses a neural volumetric representation similar to NeRV's, with the critical difference that
brute-force raymarching is used to compute visibilities. This approach is therefore
unable to consider illumination from sources other than a single point light during
training. At test time when time and memory constraints are less restrictive, it
computes visibilities to all light sources.
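Brute-force visibility computation of this kind amounts to accumulating transmittance through the volume's density samples along a shadow ray; schematically (the names and discretization are illustrative):

```python
import numpy as np

def traced_visibility(sigmas, deltas):
    """Brute-force light visibility along a shadow ray: the
    transmittance exp(-sum_i sigma_i * delta_i) accumulated through
    density samples sigma_i with step sizes delta_i."""
    return np.exp(-np.sum(sigmas * deltas))
```

An empty ray (zero density everywhere) yields a visibility of 1; dense occluders drive it toward 0. Evaluating this per light direction is what makes brute-force tracing expensive during training.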
We train each method (other than NLT) on nine datasets. Each consists of 150
images of a synthetic scene (hotdog, lego, or armadillo) illuminated by one of these
three lighting conditions: I) “Point” containing a single white point light randomly
sampled on a hemisphere above the scene for each frame, representing a laboratory
setup similar to that of Bi et al. [2020a], II) “Colorful+Point” consisting of a randomly
sampled point light and a set of eight colorful point lights whose locations and colors
are fixed across all images in the dataset (this represents a challenging scenario with
multiple strong light sources that cast shadows and tint the scene), and III) “Ambi-
ent+Point” comprising a randomly sampled point light and a dim gray environment
map (this represents a challenging scenario where scene points are illuminated from
all directions). We separately train each method on each of these nine datasets and
measure performance on the corresponding scene’s test set, which consists of 150
images of the scene under novel lighting conditions (containing either one or eight
point sources) not seen during training, rendered from novel viewpoints not observed
during training.
NeRV outperforms all baselines in experiments that correspond to challenging
complex lighting conditions and matches the performance of prior work in experiments
with simple lighting. As visualized in Figure 2-12, the method of Bi et al. [2020a]
performs comparably to NeRV in the case it is designed for: images illuminated by a
single point source. However, their model’s performance degrades when it is trained on
datasets that have complex lighting conditions (Colorful+Point and Ambient+Point
experiments in Table 2.2), as its forward model is unable to simulate light from more
than a single source during training.
(a) Ground Truth (b) Bi et al. (c) NeRV (Ours) (d) Bi et al. (e) NeRV (Ours)
Point Point Ambient+Point Ambient+Point
Figure 2-12: NeRV vs. Bi et al. [2020a]. Both Bi et al. [2020a] (b) and NeRV (c)
recover high-quality relightable models when trained on images illuminated by a single
point source. However, for more complex lighting such as “Ambient+Point,” Bi et al.
[2020a] (d) fails as its brute-force visibility computation is unable to simulate the
surrounding ambient lighting during training. Their model minimizes training loss
by making the scene transparent and is thus unable to render convincing images for
the “single point light” (Row 1) or “colorful set of point lights” (Row 2) conditions.
Because NeRV (e) correctly simulates light transport, its renderings more closely
resemble the ground truth (a).
As visualized in Figure 2-13, NeRV thoroughly outperforms both latent code base-
lines as they are unable to generalize to lighting conditions that are unlike those seen
during training. Our method generally matches or outperforms the NLT baseline,
which requires a controlled laboratory lighting setup and substantially more inputs
than all other methods (the multi-view OLAT dataset we use to train NLT contains
hotdog
Train. Illum. Single Point Colorful+Point Ambient+Point OLAT
PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM
NLT − − − − − − 23.57 0.851
NeRF+LE 19.96 0.868 17.88 0.758 20.72 0.869 − −
NeRF+Env 19.94 0.863 19.17 0.824 20.56 0.864 − −
Bi et al. [2020a] 23.74 0.862 22.09 0.799 20.94 0.754 − −
NeRV-NVF 23.93 0.860 24.37 0.885 25.14 0.892 − −
NeRV-Trace 23.76 0.863 24.24 0.886 25.06 0.892 − −
lego
Train. Illum. Single Point Colorful+Point Ambient+Point OLAT
PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM
NLT − − − − − − 24.10 0.936
NeRF+LE 21.42 0.874 21.74 0.890 20.33 0.860 − −
NeRF+Env 21.13 0.855 20.27 0.878 20.24 0.852 − −
Bi et al. [2020a] 22.89 0.897 22.83 0.890 18.10 0.783 − −
NeRV-NVF 22.78 0.866 23.82 0.899 23.32 0.894 − −
NeRV-Trace 23.16 0.883 24.18 0.925 23.79 0.923 − −
armadillo
Train. Illum. Single Point Colorful+Point Ambient+Point OLAT
PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIM
NLT − − − − − − 21.62 0.900
NeRF+LE 20.35 0.881 18.76 0.863 17.35 0.859 − −
NeRF+Env 19.60 0.874 17.89 0.863 17.28 0.851 − −
Bi et al. [2020a] 22.35 0.894 21.06 0.892 19.93 0.842 − −
NeRV-NVF 21.14 0.882 22.80 0.910 22.80 0.897 − −
NeRV-Trace 22.14 0.897 23.02 0.921 22.81 0.895 − −
Table 2.2: Quantitative evaluation of NeRV. For every scene, we train each method on
three datasets that contain images of the scene under different illumination conditions
and compare the metrics of all variants on the same testing dataset. Please refer to
Section 2.6.1 for details.
8× as many images as our other datasets, and the original NLT paper [Zhang et al.,
2021b] uses 150 OLAT images per viewpoint).
(a) Ground Truth (b) NeRF + LE (c) NeRF + Env. (d) NeRV (Ours)
Ambient+Point Ambient+Point Ambient+Point
Figure 2-13: NeRV vs. latent code models. Modeling appearance changes due to
lighting with a latent code does not generalize to lighting conditions unseen during
training. Here we train the two latent code baselines (b, c) and NeRV (d) on the
Ambient+Point dataset. The latent code models are unable to produce convincing
renderings at test time, while NeRV trained on the same data renders high-quality
images.
In this subsection, we compare NeRFactor with both classic and deep learning-based
state of the art in the tasks of appearance factorization and free-viewpoint relighting.
For quantitative evaluations, we use Peak Signal-to-Noise Ratio (PSNR), Structural
Similarity Index Measure (SSIM) [Wang et al., 2004], and Learned Perceptual Image
Patch Similarity (LPIPS) [Zhang et al., 2018a] as our error metrics.
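For reference, PSNR is computed directly from the mean squared error; a minimal sketch (SSIM and LPIPS require their respective reference implementations and are not reproduced here):

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a rendering and a reference,
    both arrays with values in [0, max_val]."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```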
SIRFS We compare the factorization by NeRFactor with that of the classic Shape,
Illumination, and Reflectance From Shading (SIRFS) method [Barron and Malik,
2014], both qualitatively and quantitatively. SIRFS is a single-image method that
decomposes appearance into surface normals, albedo, and shading (not shadowing)
in the input view. In contrast, NeRFactor is a multi-view approach that estimates
these properties plus BRDFs and visibility (hence, shadows) in the full 3D space
alongside the unknown illumination. In other words, NeRFactor gets to observe
many more views than SIRFS, which observes only one view. Under this setup, NeR-
Factor outperforms SIRFS quantitatively as shown by Table 2.1. Figure 2-14 shows
that although SIRFS achieves reasonable albedo estimation, it produces inaccurate
surface normals likely due to its inability to incorporate multiple views or to reason
about shape in “world space.” In addition, SIRFS is unable to render the scene from
arbitrary viewpoints or synthesize shadows during relighting.
(A) SIRFS (B) NeRFactor (ours) (C) Ground Truth (A) SIRFS (B) NeRFactor (ours) (C) Ground Truth
Figure 2-14: NeRFactor vs. SIRFS. Here we compare NeRFactor against SIRFS [Bar-
ron and Malik, 2014] that also recovers normals, albedo, and shading (not shadow)
given only one illumination condition. Although the albedo estimates produced by
SIRFS are reasonable, the surface normals are highly inaccurate (likely due to SIRFS’s
inability to use multiple images to inform shape estimation).
BRDF bases [Nishino, 2009, Nishino and Lombardi, 2011, Lombardi and Nishino,
2012]. Also note that this baseline has the advantage of receiving the ground-truth
illumination as input, whereas NeRFactor has to estimate illumination together with
shape and reflectance.
As shown in Figure 2-15 (I), even though this improved version of Oxholm and
Nishino [2014] has access to the ground-truth illumination, it struggles to remove
shadow residuals from the albedo estimation because of its inability to model visibil-
ity (hotdog and lego). As expected, these residuals in albedo negatively affect the
relighting results in Figure 2-15 (II) (e.g., the red shade on the hotdog plate). More-
over, because the BRDF estimated by this baseline is not spatially-varying, BRDFs of
the hot dog buns and the ficus leaves are incorrectly estimated to be as specular as the
plate and vase, respectively. Finally, this baseline is unable to synthesize non-local
light transport effects such as shadows (hotdog and lego), in contrast to NeRFactor
that correctly produces realistic hard cast shadows under the OLAT conditions.
Philip et al. [2019] The recent work of Philip et al. [2019] presents a technique to
relight large-scale scenes, and specifically focuses on synthesizing realistic shadows.
The input to their system is similar to ours: multi-view images of a scene lit by
an unknown lighting condition. However, their technique only supports synthesizing
images illuminated by a single primary light source, such as the Sun (in contrast to
NeRFactor, which supports any arbitrary light probe). As such, we compare it with
NeRFactor only on the task of point light relighting.
As Figure 2-16 demonstrates, NeRFactor qualitatively outperforms this baseline
and synthesizes hard shadows that better resemble the ground truth. The “yellow fog”
in the background of their results (Figure 2-16 [A]) is likely due to poor geometry
reconstruction by their method. Because their network is trained on outdoor scenes
(not images with backgrounds), we additionally compute error metrics after masking
out the yellow fog with the ground-truth object masks (“Philip et al. [2019] + Masks”)
for a more generous comparison. As the table in Figure 2-16 shows, NeRFactor
outperforms Philip et al. [2019] + Masks in terms of both PSNR and SSIM.
[Figure 2-15 layout: I. Albedo Estimation and II. Relighting (another view), each showing (A) Oxholm & Nishino [2014]†, (B) NeRFactor (ours), and (C) Ground Truth.]
Figure 2-15: NeRFactor vs. Oxholm and Nishino [2014] (enhanced). (I) Their method
is unable to remove shadow residuals from albedo for hotdog and lego likely due to
its inability to model visibility or shadows, although it produces reasonable albedo
estimation for ficus wherein shading (instead of shadowing) predominates. In con-
trast, NeRFactor produces albedo maps with little to no shading. (II) As expected,
the baseline’s relighting results are negatively affected by the shadow residuals in
albedo (e.g., the red shade on the plate of hotdog). Furthermore, because their ap-
proach does not support spatially-varying BRDFs, the hot dog buns and ficus leaves
are mistakenly estimated to be as specular as the plate and the vase, respectively.
NeRFactor, on the other hand, correctly estimates different materials for different
parts of the scenes. Note also how NeRFactor is able to synthesize hard shadows
in hotdog and lego, while the baseline does not model visibility or shadows. †See
Section 2.6.2 for how we significantly enhanced the approach of Oxholm and Nishino
[2014]; in addition, we provide the baseline with the ground-truth illumination, since
unlike NeRFactor, it does not estimate the lighting condition.
Philip et al. [2019] + Masks achieves a lower (better) LPIPS score because it
renders new viewpoints by reprojecting observed images using estimated proxy ge-
ometry, as is typical of Image-Based Rendering (IBR) algorithms. Thus, it retains the
high-frequency details present in the input images, resulting in a lower LPIPS score.
However, as a physically-based (re-)rendering approach that operates fully in the 3D
space, NeRFactor synthesizes hard shadows that better match the ground truth and
supports relighting with arbitrary light probes such as “Studio,” which has four major
light sources in Figure 2-10.
be important for downstream graphics tasks.
(a) NeRV (Ours) with Indirect (b) NeRV (Ours) Direct Only
Figure 2-17: Indirect illumination in NeRV. NeRV’s ability to simulate indirect illu-
mination produces realistic details such as the additional brightness in the bulldozer’s
cab due to interreflections.
(a) NeRV (Ours) with Analytic Normals (b) NeRV (Ours) with MLP-Predicted Normals
Figure 2-18: NeRV with analytic vs. MLP-predicted normals. While obtaining surface
normals analytically (a) or as an output of the shape MLP (b) produces similar
renderings, analytic normals are much closer to the true surface normals.
Learned vs. Microfacet BRDFs Instead of using an MLP to parametrize the
BRDF and pretraining it on an external BRDF dataset to learn data-driven priors,
one can adopt an analytic BRDF model such as the microfacet model of Walter et al.
[2007] and ask an MLP to predict spatially-varying roughness for the microfacet
BRDF. As Table 2.1 shows, this model variant achieves good performance across all
tasks but overall underperforms NeRFactor. Note that to improve this variant, we
removed the smoothness constraint on the predicted roughness because even a tiny
smoothness weight still drove the optimization to the local optimum of predicting
maximum roughness everywhere (this local optimum is a “safe” solution that renders
everything more diffuse to satisfy the ℓ2 reconstruction loss). As such, this model
variant sometimes produces noisy renderings, visible in the supplemental video, due
to its non-smooth BRDFs.
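To make the microfacet alternative concrete, the specular lobe of such a model can be evaluated pointwise from a roughness value. Below is a minimal NumPy sketch of an isotropic GGX-style microfacet BRDF with a Schlick Fresnel term and a Smith-Schlick geometry term (a common variant of the Walter et al. [2007] family); it illustrates the model class, not NeRFactor's actual implementation:

```python
import numpy as np

def ggx_brdf(n, l, v, roughness, f0=0.04):
    """Specular microfacet BRDF (GGX/Trowbridge-Reitz) for unit vectors
    n (normal), l (light), v (view) and a scalar roughness in (0, 1]."""
    h = (l + v) / np.linalg.norm(l + v)                    # half vector
    n_l, n_v, n_h, v_h = n @ l, n @ v, n @ h, v @ h
    a2 = roughness ** 4                                    # alpha = roughness^2 remap
    d = a2 / (np.pi * ((n_h ** 2) * (a2 - 1) + 1) ** 2)    # normal distribution
    k = (roughness + 1) ** 2 / 8                           # Smith-Schlick geometry
    g = (n_l / (n_l * (1 - k) + k)) * (n_v / (n_v * (1 - k) + k))
    f = f0 + (1 - f0) * (1 - v_h) ** 5                     # Schlick Fresnel
    return d * g * f / (4 * n_l * n_v)

# Rougher surfaces spread the specular lobe: the mirror-direction peak drops
# as roughness grows, which is why "maximum roughness everywhere" renders
# everything more diffuse.
n = np.array([0.0, 0.0, 1.0])
l = v = np.array([0.0, 0.0, 1.0])     # normal incidence, mirror direction
print(ggx_brdf(n, l, v, 0.1) > ggx_brdf(n, l, v, 0.9))  # True
```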
With vs. Without Geometry Pretraining As shown in Figure 2-6 and discussed
in Section 2.4, we pretrain the normal and visibility MLPs to just reproduce the NeRF
values given 𝑥surf before plugging them into the joint optimization (where they are
then finetuned together with the rest of the pipeline), to prevent the albedo MLP
from mistakenly attempting to explain away the shadows. Alternatively, one can train
these two geometry MLPs from scratch together with the pipeline. As Table 2.1
shows, this variant indeed predicts worse albedo with shading residuals (Figure 2-19
[C]) and overall underperforms NeRFactor.
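The pretraining stage amounts to distilling NeRF's geometry into the MLPs with a simple L2 reconstruction objective before joint finetuning. The toy sketch below uses a one-layer linear map and synthetic data as stand-ins for NeRFactor's actual MLPs and geometry buffers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: surface points x_surf and NeRF's normals queried at those points.
x_surf = rng.normal(size=(256, 3))
nerf_normals = x_surf / np.linalg.norm(x_surf, axis=1, keepdims=True)

# A one-layer stand-in for the normal MLP, pretrained by gradient descent
# on an L2 loss to reproduce NeRF's values -- the distillation objective.
W = rng.normal(scale=0.1, size=(3, 3))
for _ in range(500):
    grad = 2 * x_surf.T @ (x_surf @ W - nerf_normals) / len(x_surf)
    W -= 0.1 * grad

pretrain_loss = float(np.mean((x_surf @ W - nerf_normals) ** 2))
# Joint optimization would start from this W (and similarly for visibility),
# so the albedo MLP never has to explain shading or shadows from scratch.
print(pretrain_loss < 0.1)  # True
```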
[Figure 2-19 panel labels. Rows include I. surface normals and IV. illumination.
Columns: (A) NeRFactor using NeRF's shape, (B) NeRFactor w/o smoothness, (C)
NeRFactor w/o geometry pretraining, (D) NeRFactor using microfacet, (E) NeRFactor
(ours), (F) ground truth.]
Figure 2-19: Qualitative ablation studies of NeRFactor. (A) One can fix the geometry
to that of NeRF and estimate only the reflectance and illumination by ablating the
normal and visibility MLPs of NeRFactor, but the NeRF geometry is too noisy (I) to
be used for relighting (see the supplemental video). (B) Ablating the smoothness reg-
ularization leads to noisy geometry and albedo (I and II). (C) If we train the normal
and visibility MLPs from scratch during the joint optimization (i.e., no pretraining),
the recovered albedo may mistakenly attempt to explain shading and shadows (III).
(D) If we replace the learned BRDF with an MLP predicting the roughness parameter
of a microfacet BRDF, the predicted reflectance either falls into the local optimum
of maximum roughness everywhere or becomes non-smooth spatially (not pictured
here; see the supplemental video). (E) NeRFactor is able to recover plausible nor-
mals, albedo, and illumination without any direct supervision on any factor. The
illuminations recovered by NeRFactor, though oversmoothed, correctly capture the
location of the Sun. See Section 2.5.3 for the color correction and tone mapping
applied to albedo.
Estimating the Shape vs. Using NeRF’s Shape If we ablate the normal and
visibility MLPs entirely, this variant is essentially using NeRF’s normals and visibility
without improving upon them (hence “using NeRF’s shape”). As Table 2.1 and the
supplemental video show, even though the estimated reflectance is smooth (encour-
aged by the smoothness priors from the full model), the noisy NeRF normals and
visibility produce artifacts in the final rendering.
In this experiment, we study how different illumination conditions affect the albedo
estimation by NeRFactor. More specifically, we probe how consistent the estimated
albedo predictions are across different input illumination conditions. To this end, we
light the ficus scene with four drastically different lighting conditions as shown in
Figure 2-20, and then estimate the albedo with NeRFactor from these four sets of
multi-view images.
As Figure 2-20 shows, NeRFactor’s predictions are similar across the four input
illuminations, with pairwise PSNR ≥ 34.7 dB. Note that the performance on Case
D is worse (e.g., the specularity residuals on the vase) than on Case C, even though
both cases seem to have the Sun as the primary light source. The reason is that
Case D had the Sun pixels properly measured by Stumpfel et al. [2004], whereas Case
C is an internet light probe that clipped the Sun pixels. Therefore, Case D has a
much higher-frequency lighting condition than Case C, making it a harder case for
NeRFactor to correctly factorize the appearance.
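For reference, the pairwise PSNR reported here is the standard peak signal-to-noise ratio between two albedo maps; a small sketch, assuming albedo values in [0, 1]:

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

a = np.full((4, 4, 3), 0.5)
b = a + 0.01                 # a small uniform offset: MSE = 1e-4 -> 40 dB
print(round(psnr(a, b), 1))  # 40.0
```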
2.7 Conclusion
world Bidirectional Reflectance Distribution Functions (BRDFs). We demonstrate
that NeRFactor achieves high-quality geometry sufficient for relighting and view syn-
thesis, produces convincing albedo as well as spatially-varying BRDFs, and generates
lighting estimations that correctly reflect the presence or absence of dominant light
sources. With NeRFactor’s factorization, we can relight the object with point lights
or light probe images, render images from arbitrary viewpoints, and even edit the ob-
ject’s albedo and BRDF. We believe this work makes important progress towards the
goal of recovering fully-featured 3D graphics assets from casually-captured photos.
Although we demonstrate that NeRFactor outperforms baseline methods and vari-
ants with different design choices, there are a few important limitations. First, to keep
light visibility computation tractable, we limit the resolution of the light probe images
to 16 × 32, a resolution that may be insufficient for generating very hard shadows
and recovering very high-frequency BRDFs. As such, when the object is lit by a very
high-frequency illumination such as the one in Figure 2-20 (Case D) where the sun
pixels are fully High-Dynamic-Range (HDR), there might be specularity or shadow
residuals in the albedo estimation such as those on the vase.
Second, for fast rendering, NeRFactor considers only single-bounce direct illumi-
nation, so NeRFactor does not properly account for indirect illumination effects. This
is in contrast to NeRV that also models one-bounce indirect illumination in addition
to direct illumination. It is an interesting future direction to combine NeRV and
NeRFactor into a model that handles images taken under one unknown lighting
condition as input while also modeling indirect illumination.
Finally, NeRFactor initializes its geometry estimation with NeRF in contrast to
NeRV that optimizes the geometry from scratch. While NeRFactor is able to fix
errors made by NeRF (and NeRV) up to a certain degree, it can fail if NeRF estimates
particularly poor geometry in a manner that happens to not affect view synthesis. We
observe this in the real scenes, which contain faraway incorrect “floating” geometry
that is not visible from the input cameras but casts shadows on the objects of interest.
It would be highly desirable to have a model that optimizes geometry from scratch
like NeRV while achieving geometry as high-quality as NeRFactor's.
Chapter 3
In this chapter, we model appearance at a middle level of abstraction using the light
transport (LT) function, without further decomposing it into the underlying shape
and reflectance. Specifically, we address the problem of interpolating the LT function
in both light and view directions from sparse samples of the LT function. By doing so,
one is able to synthesize the appearance from a novel viewpoint under any arbitrary
lighting. We start with an introduction of LT acquisition (Section 3.1) and then
review the related work in Section 3.2.
Next, we present Light Stage Super-Resolution (LSSR) [Sun et al., 2020]
that is capable of interpolating the LT function smoothly and stably in the light
direction, thereby enabling continuous, high-frequency relighting from a fixed view-
point (Section 3.3). To also support view synthesis, we further devise Neural Light
Transport (NLT) [Zhang et al., 2021b] that interpolates the LT function in both
view and light directions, thereby supporting simultaneous relighting and view syn-
thesis (Section 3.4).
In Section 3.5, we describe our experiments that evaluate how well LSSR and NLT
perform relighting and/or view synthesis and how they compare with the existing
solutions to the two tasks. We also perform additional analyses, in Section 3.6,
to study the importance of each major component of the LSSR and NLT models,
analyze above which frequency band LSSR is needed to achieve high-frequency
relighting, test whether LSSR and NLT are applicable to smaller light stages with
fewer lights, and finally stress-test NLT as the quality of the input geometry degrades.
3.1 Introduction
The light transport (LT) of a scene models how light interacts with objects in the
scene to produce an observed image. The process by which geometry and material
properties of the scene interact with global illumination to result in an image is a
complicated but well-understood consequence of physics [Pharr et al., 2016]. Much
progress in computer graphics has been through the development of more expres-
sive and efficient mappings from scene models (geometry, materials, and lighting) to
images. In contrast, inverting this process is ill-posed and therefore more difficult:
Acquiring the LT in a scene from images requires untangling the myriad intercon-
nected effects of occlusion, shading, shadowing, interreflections, scattering, etc.
Solving this task of inferring aspects of LT from images is an active research
area, and even partial solutions have significant practical uses such as phototourism
[Snavely et al., 2006], telepresence [Orts-Escolano et al., 2016], storytelling [Kelly
et al., 2019], and special effects [Debevec, 2012]. A less obvious, but equally important
application of inferring LT from images consists of generating groundtruth data for
machine learning tasks: Many works rely on high-quality renderings of relit subjects
under arbitrary lighting conditions and from multiple viewpoints to perform relighting
[Meka et al., 2019, Sun et al., 2019], view synthesis [Pandey et al., 2019], re-enacting
[Kim et al., 2018], and alpha matting [Sengupta et al., 2020].
Previous work has shown that it is possible to construct a light stage [Debevec
et al., 2000], plenoptic camera [Levoy and Hanrahan, 1996], or gantry [Murray-
Coleman and Smith, 1990] that directly captures a subset of the LT function and
thereby enables the image-based rendering thereof. The simplest light stage com-
prises just one camera but multiple LED lights distributed (roughly) evenly over the
hemisphere. By programmatically activating and deactivating the LED lights, the
light stage acquires samples of the LT function for this particular viewing direction,
which we refer to as a One-Light-at-A-Time (OLAT) image set. Because light is
additive, this OLAT scan serves as a set of bases: by linearly combining the basis
images, one can relight the subject according to any desired light probe [Debevec
et al., 2000].
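The additivity of light makes this relighting step a plain weighted sum; a minimal NumPy sketch (the array shapes and the probe-to-weight mapping are illustrative placeholders):

```python
import numpy as np

def relight(olat_images, probe_weights):
    """Image-based relighting from an OLAT scan. olat_images has shape
    (n_lights, H, W, 3); probe_weights (n_lights,) holds the target light
    probe's intensity toward each stage light. Because light is additive,
    the relit image is a weighted sum of the one-light basis images."""
    return np.tensordot(probe_weights, olat_images, axes=1)

rng = np.random.default_rng(0)
olat = rng.random((8, 2, 2, 3))            # toy 8-light OLAT scan
full_on = relight(olat, np.ones(8))        # "all lights on" probe
# Superposition: the all-on image equals the sum of single-light images.
print(np.allclose(full_on, olat.sum(axis=0)))  # True
```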
These techniques are widely used in film productions and within the research
community. However, these systems provide only a sparse sampling of the LT
function, limited by the number of cameras and lights, and thus cannot produce
photorealistic renderings outside the supported camera or light locations.
Specifically, in order to achieve photorealistic relighting under all possible lighting
conditions, one needs to place the lights close enough on the stage dome such that
shadows and specularities in the captured images of adjacent lights “move” by less
than one pixel. Yet, practical constraints (such as cost and the difficulties in powering
and synchronizing many lights) hinder the construction of light stages with high light
densities. Even if such a high-density light stage could be built, the time for acquiring
an OLAT set grows linearly w.r.t. the number of lights, making human subjects (who
must stay stationary during the capture) difficult to capture. For these reasons, even
the most sophisticated light stages in existence today comprise only a few hundred
(e.g., 330 in our case) lights that are spaced many degrees apart.
This means that the LT function is undersampled w.r.t. the angular sampling of
the lights, and that the images rendered by conventional approaches will likely
contain ghosting. Rendering an image using a "virtual light" that lies in between
the real lights by computing a weighted average of the adjacent OLAT images does
not produce soft shadows or streaking specularities, but instead yields a superposition
of multiple sharp shadows and specular dots (see Figure 3-1 [b]). These artifacts
are particularly problematic for human faces, which exhibit complex reflectance prop-
erties (such as specularities, scattering, etc.) [Weyrich et al., 2006] and are likely to
fall into the uncanny valley [Seyama and Nagayama, 2007].
(a) I1 and I2, captured for adjacent lights 𝜔1 and 𝜔2 on the light stage. (b) Linearly
blended image, (I1 + I2)/2. (c) LSSR rendering I((𝜔1 + 𝜔2)/2).
Figure 3-1: LSSR overview. (a) Although light stages are a powerful tool for capturing
and subsequently relighting human subjects, their rendering suffers from adjacent
lights on the stage being separated by some distance. (b) Producing an image for a
"virtual" light lying between the stage's physical lights with conventional image
blending techniques results in ghosting in shadowed and specular regions, as seen here
on the subject’s eyes and cheek. (c) By training a deep neural network to regress
from a light direction to an image, our model is able to accurately render the subject
for an arbitrary virtual light direction – as the light moves, highlights and shadows
move smoothly rather than incorrectly blend together, thereby enabling realistic high-
frequency relighting effects.
Given an OLAT scan with finitely many images and the direction of a desired virtual
light, LSSR predicts a high-resolution RGB image that appears to have been lit by
the target light, even though
that light is not present on the light stage (see Figure 3-1 [c]). Our approach can
additionally enable the construction of simpler light stages with fewer lights, thereby
reducing cost and increasing the frame rate at which subjects can be scanned. LSSR
also produces denser renderings for Machine Learning (ML) applications that require
light stage data for training¹ such as portrait relighting [Sun et al., 2019] and shadow
removal [Zhang et al., 2020].
LSSR needs to work with the inherent aliasing and regularity of the light stage
data. We address this by combining the power of deep neural networks with the ef-
ficiency and generality of conventional linear interpolation methods. Specifically, we
use an active set of the closest lights within our network (Section 3.3.1) and develop a
novel alias-free pooling approach to combine their network activations (Section 3.3.2)
using a weighting operator guaranteed to be smooth when lights enter or exit the ac-
¹Once trained, such ML systems usually take as input single images, without
requiring a light stage at test time.
tive set. Our network allows us to super-resolve an OLAT scan of a human face: We
can repeatedly query our trained model with thousands of light directions and treat
the resulting set of synthesized images as though they were acquired by a physically-
unconstrained light stage with an unbounded sampling density. As we will demon-
strate, these super-resolved virtual OLAT scans allow us to produce photorealistic
rendering of human faces with arbitrarily high-frequency illumination contents.
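One way to obtain blend weights that vary smoothly as lights enter or exit an active set is to subtract the boundary light's angular similarity before normalizing, so a light's weight reaches exactly zero at the moment it leaves the set. The sketch below illustrates this general idea, not LSSR's exact pooling operator:

```python
import numpy as np

def pooling_weights(query_dir, light_dirs, k=4):
    """Blend weights over the k stage lights nearest to the query direction.
    Subtracting the (k+1)-th light's cosine makes a light's weight hit
    exactly zero at the moment it leaves the active set, so the pooled
    activations vary smoothly as the query direction moves."""
    cos = light_dirs @ query_dir              # unit vectors assumed
    order = np.argsort(-cos)                  # lights sorted near-to-far
    active = order[:k]
    boundary = cos[order[k]] if len(cos) > k else -1.0
    w = np.maximum(cos[active] - boundary, 0.0)
    return active, w / w.sum()

dirs = np.eye(3)                              # 3 toy lights on the axes
idx, w = pooling_weights(np.array([0.0, 0.0, 1.0]), dirs, k=2)
print(idx[0], w.round(3))                     # 2 [1. 0.]
```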
Besides lights, cameras also dictate how densely one can sample the LT function
with a light stage. For reasons similar to those above, a state-of-the-art light stage
comprises only around 50–100 cameras, covering a limited number of viewing direc-
tions. This problem of “absent cameras” is arguably more concerning than that of
absent lights: With only the physical lights, renderings can still be produced, albeit
with artifacts like ghosting shadows, while with only the physical cameras, one is simply
unable to render the subject from a “virtual camera.” Indeed, traditional Image-Based
Rendering (IBR) approaches are usually designed for fixed viewpoints and are unable
to synthesize unseen (novel) views under a desired illumination.
[Figure 3-2 diagram: the 6D light transport function f(x, 𝜔i, 𝜔o), parameterized
by the surface UV location x (2 DOFs), the incident light direction 𝜔i (2 DOFs),
and the viewing direction 𝜔o (2 DOFs). Panels: (A) the 6D light transport function
being learned; (B) capture setup: multiple views, one light at a time; (C)
simultaneous relighting and view synthesis; (D) HDRI relighting.]
Figure 3-2: NLT overview. (A) NLT learns to interpolate the 6D light transport
function of a surface as a function of the UV coordinate (2 DOFs), incident light
direction (2 DOFs), and viewing direction (2 DOFs). (B) The subject is imaged from
multiple viewpoints when lit by different directional lights; a geometry proxy is also
captured using active sensors. (C) Querying the learned function at different light
and/or viewing directions enables simultaneous relighting and view synthesis of this
subject. (D) The relit renderings that NLT produces can be combined according to
HDRI maps to perform image-based relighting.
but then we embed a neural network within the parameterization provided by that
classical model, construct the inputs and outputs of the model in ways that leverage
domain knowledge of classical graphics techniques, and train that network to model
all aspects of LT—including those not captured by a classical model. By leveraging
a classical model this way, NLT is able to learn an accurate model of the complicated
LT function for a subject from a small training dataset of sparse observations.
A key novelty of NLT is that our learned model is embedded within the texture
atlas space of an existing geometric model of the subject, which provides a novel
framework for simultaneous relighting and view interpolation. We express the 6D LT
function (Figure 3-2) at each location on the surface of our geometric model as simply
the output of a neural network, which works well (as neural networks are smooth
and universal function approximators [Hornik, 1991]) and obviates the need for a
complicated parameterization of spatially-varying reflectance. We evaluate NLT on
joint relighting and view synthesis using sparse image observations of scanned human
subjects within a light stage, and show state-of-the-art results as well as compelling
practical applications.
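Abstractly, the learned model is a function rgb = f(u, v, 𝜔i, 𝜔o) that is queried per surface location. The toy stand-in below (a tiny random-weight MLP, untrained) shows only this input/output interface; NLT's actual model is a convolutional network operating on texture-space buffers:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(6, 32))   # inputs: (u, v) + light dir (2) + view dir (2)
W2 = rng.normal(scale=0.5, size=(32, 3))   # outputs: RGB

def light_transport(uv, light_dir, view_dir):
    """Query the 6D light transport function at one surface location:
    UV coordinate (2 DOFs), light direction (2 DOFs, e.g., theta/phi),
    and viewing direction (2 DOFs). Returns an RGB value in (0, 1)."""
    x = np.concatenate([uv, light_dir, view_dir])       # the 6D query
    return 1.0 / (1.0 + np.exp(-np.tanh(x @ W1) @ W2))  # tiny MLP, sigmoid output

rgb = light_transport(np.array([0.3, 0.7]),   # where on the surface
                      np.array([0.1, 1.2]),   # incident light direction
                      np.array([0.4, 0.9]))   # viewing direction
print(rgb.shape)  # (3,)
```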
• an end-to-end, semi-parametric method for learning to interpolate the 6D light
transport function per-subject from real data using convolutional neural net-
works (Section 3.4.2),
• a unified framework for simultaneous relighting and view synthesis by embed-
ding networks into a parameterized texture atlas and leveraging as input a set
of OLAT images (Section 3.4.4), and
• a set of augmented texture-space inputs and a residual learning scheme on
top of a physically accurate diffuse base, which together allow the network to
easily learn non-diffuse, higher-order light transport effects including specular
highlights, subsurface scattering, and global illumination (Section 3.4.1 and
Section 3.4.3).
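The residual scheme in the last bullet can be summarized as predicting only the difference from a physically accurate diffuse base, so the network spends its capacity on non-diffuse effects. A schematic sketch, with a simple Lambertian base and a placeholder residual network standing in for NLT's:

```python
import numpy as np

def render(albedo, normals, light_dir, residual_net):
    """Texture-space rendering as "diffuse base + learned residual": the
    Lambertian term carries most of the energy (in NLT, cast shadows are
    part of the physically accurate base), and the network adds only
    non-diffuse effects such as specularities and scattering."""
    n_dot_l = np.clip(normals @ light_dir, 0.0, None)[..., None]
    diffuse_base = albedo * n_dot_l
    return np.clip(diffuse_base + residual_net(diffuse_base), 0.0, 1.0)

albedo = np.full((4, 4, 3), 0.8)
normals = np.dstack([np.zeros((4, 4, 2)), np.ones((4, 4, 1))])  # all facing +z
zero_residual = lambda base: np.zeros_like(base)                # untrained stand-in
img = render(albedo, normals, np.array([0.0, 0.0, 1.0]), zero_residual)
print(img.shape, round(float(img[0, 0, 0]), 2))  # (4, 4, 3) 0.8
```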
NLT allows for photorealistic free-viewpoint rendering under controllable lighting
conditions, which not only is a key aspect in compelling user experiences in mixed
reality and special effects, but can be applied to a variety of machine learning tasks
that rely on photorealistic ground-truth data.
The angular undersampling from the light stage relates to much work over the past
two decades on frequency analyses of light transport [Ramamoorthi and Hanrahan,
2001, Sato et al., 2003, Durand et al., 2005] and analyses of sampling rates in Image-
Based Rendering (IBR) [Adelson and Bergen, 1991] for the related problem of view
synthesis [Mildenhall et al., 2019]. This problem also bears some similarities to multi-
image super-resolution [Milanfar, 2010] and angular super-resolution in light fields
[Kalantari et al., 2016, Cheng et al., 2019], where aliased observations are combined
to produce interpolated results. In LSSR, we leverage priors and deep learning to go
beyond these sampling limits, upsampling or super-resolving the sparse lights on the
light stage to achieve continuous, high-frequency relighting.
Recently, many approaches for acquiring a sparse light transport matrix have been
developed, including methods based on compressive sensing [Peers et al., 2009, Sen
and Darabi, 2009], kernel Nyström [Wang et al., 2009], optical computing [O’Toole
and Kutulakos, 2010], and neural networks [Ren et al., 2013, 2015, Kang et al., 2018].
However, these methods are not designed for the light stage and are largely orthogonal
to our approach. They seek to acquire the transport matrix for a fixed light sampling
resolution with a sparse set of patterns, while we seek to take this initial sampling
resolution and upsample it to a much higher resolution (which indeed enables con-
tinuous, high-frequency relighting). Most recently, Xu et al. [2018b] proposed a deep
learning approach for image-based relighting from only five light directions, but the
approach cannot reproduce accurate shadows. While we do use many more lights, we
achieve significantly higher-quality results with accurate shadows.
The general approach of using light stages for image-based relighting stands in
contrast to more model-based approaches. Traditionally, instead of super-resolving a
light stage scan, one could use that scan as input to a photometric stereo algorithm
[Woodham, 1980], and attempt to recover the normal and the albedo maps of the
subject. More advanced techniques were developed to produce a parametric model
of the geometry and reflectance for even highly specular objects [Tunwattanapong
et al., 2013]. There are also works that focus on recovering a parametric face model
from a single image [Sengupta et al., 2018], constructing a volumetric model for view
synthesis [Lombardi et al., 2018], or a neural representation of a scene [Tewari et al.,
2020b]. However, the complicated reflectance and geometry of human subjects are
difficult to even parameterize analytically, let alone recover. Although recent progress
may enable accurate captures of human faces using parametric models, there are ad-
ditional difficulties in capturing a complete portrait due to the complexity of hair,
eyes, ears, etc. Indeed, this complexity has motivated the use of image-based relight-
ing via light stages in the visual effects industry for many years [Tunwattanapong
et al., 2011, Debevec, 2012].
interpolate sparsely sampled observations. However, these algorithms interpolate the
reflectance function independently on each pixel and do not consider local information
in the neighboring pixels. Thus, their results are smooth and consistent in the light
domain, but might not be consistent in the image domain. Fuchs et al. [2007] treat the
problem as a light super-resolution problem, similar to our work. They use heuristics
to decompose the captured images into diffuse and specular layers, and apply optical
flow and level-set algorithms to interpolate highlights and light visibility, respectively.
This approach works well on highly reflective objects, but as we will demonstrate, it
usually fails on human skin, which contains high-frequency bumps and cannot be well
modeled using only the diffuse and specular terms.
In recent years, light stages have also been demonstrated to be invaluable tools
for generating training data for use in machine learning tasks [Meka et al., 2019,
Guo et al., 2019, Sun et al., 2019, Nestmeyer et al., 2019]. This enables user-facing
effects that do not require acquiring a complete light stage scan of the subject, such
as portrait relighting from a single image [Sun et al., 2019, Inc., 2017] or Virtual
Reality (VR) experiences [Guo et al., 2019]. These learning-based applications suffer
from the same undersampling issue as do conventional uses of light stage data. For
example, Sun et al. [2019] observe artifacts when relighting with light probes that
contain high-frequency content. We believe our method can provide better training
data and significantly improve many of these methods in the future.
Unlike LSSR, NLT additionally upsamples the light stage in the number of cam-
eras. In other words, NLT addresses the problem of recovering a model of light
transport from a sparse set of images of some subject, and then predicting novel im-
ages of that subject from unseen views under novel illuminations. This is a broad
problem statement that relates to and subsumes many tasks in graphics and vision.
We now categorize the existing approaches according to the type of input they take.
The most sparse sampling is just a single image, from which one could attempt to infer
a model (geometry, reflectance, and illumination) of the physical world that resulted
in that image [Barrow and Tenenbaum, 1978], usually via hand-crafted [Barron and
Malik, 2014, Li et al., 2020a] or learned priors [Saxena et al., 2008, Eigen et al., 2014,
Sengupta et al., 2018, Li et al., 2018, 2020c, Kanamori and Endo, 2018, Kim et al.,
2018, Gardner et al., 2019, LeGendre et al., 2019, Alldieck et al., 2019, Wiles et al.,
2020, Zhang et al., 2020]. Though practical, the quality gap between what can be
accomplished by single-image techniques and what has been demonstrated by multi-
image techniques is significant. Indeed, none of these methods shows complex light
transport effects such as specular highlights or subsurface scattering [Kanamori and
Endo, 2018, Kim et al., 2018]. Moreover, these methods are usually limited to a single
task, such as relighting [Kanamori and Endo, 2018, Kim et al., 2018, Sengupta et al.,
2018] or view synthesis [Alldieck et al., 2019, Wiles et al., 2020, Li et al., 2020a],
and some support only a limited range of viewpoint change [Kim et al., 2018, Tewari
et al., 2020a].
sparse sets of input images, usually by training neural networks to synthesize some
intermediate geometric representation that is then projected into the desired image
[Zhou et al., 2018, Sitzmann et al., 2019a,b, Srinivasan et al., 2019, Flynn et al.,
2019, Mildenhall et al., 2019, 2020, Thies et al., 2020]. Some techniques even entirely
replace the rendering process with a learned “neural” renderer [Thies et al., 2019,
Martin-Brualla et al., 2018, Pandey et al., 2019, Lombardi et al., 2019, 2018, Tewari
et al., 2020b]. Though effective, these methods generally do not attempt to explicitly
model light transport and hence do not enable relighting—though they are often ca-
pable of preserving view-dependent effects for the fixed illumination condition, under
which the input images were acquired [Thies et al., 2019, Mildenhall et al., 2020].
Additionally, neural rendering often breaks “backwards compatibility” with existing
graphics systems, while our approach infers images directly in texture space that can
be re-sampled by conventional graphics software (e.g., Unity, Blender, etc.) to syn-
thesize novel viewpoints. Recently, Chen et al. [2020] proposed to learn relightable
view synthesis from dense views (200 vs. 55 in this work) under image-based lighting;
because it uses spherical harmonics as the lighting representation, that work is unable
to produce the hard shadows cast by a directional light, unlike this work.
Similar to the multi-view task is the task of photometric stereo [Woodham, 1980, Basri
et al., 2007] (as cameras function analogously to illuminants in some contexts [Sen
et al., 2005]): repeatedly imaging a subject with a fixed camera but under different
illuminations and then recovering the surface normals. However, most photometric
stereo solutions assume Lambertian reflectance and do not support relighting with
non-diffuse light transport. More recently, Ren et al. [2015], Meka et al. [2019], Sun
et al. [2019], and Sun et al. [2020] show that neural networks can be applied to relight a
scene captured under multiple lighting conditions from a fixed viewpoint. Nestmeyer
et al. [2020] decompose an image into shaded albedo (hence no cast shadows) and
residuals, unlike this work, which models cast shadows as part of a physically accurate
diffuse base.
None of these works supports view synthesis. Xu et al. [2019] perform free-viewpoint
relighting, but unlike our approach, they require running the model of Xu et al.
[2018b] as a second stage.
Garg et al. [2006] utilize the symmetry of illuminations and view directions to collect
sparse samples of an 8D reflectance field, and reconstruct a complete field using a
low-rank assumption. Perhaps the most effective approach for addressing sparsity in
light transport estimation is to circumvent this problem entirely and densely sample
whatever is needed to produce the desired renderings. The landmark work of Debevec
et al. [2000] uses a light stage to acquire the full reflectance field of a subject by
capturing a One-Light-at-A-Time (OLAT) scan of that subject, which can be used
to relight the subject by linear combination according to some High-Dynamic-Range
Imaging (HDRI) light probe. Despite its excellent results, this approach lacks an
explicit geometric model, so rendering is limited to a fixed set of viewpoints. This limitation has been partially addressed by Ma et al. [2007], who focus on facial capture, and more recently by the system of Guo et al. [2019], which builds a full volumetric relightable model using two spherical gradient illumination conditions [Fyffe, 2009]. This system
supports relighting and view synthesis but assumes predefined BRDFs and therefore
cannot synthesize more complex light transport effects present in real images.
Zickler et al. [2006] also pose the problem of appearance synthesis as that of
high-dimensional interpolation, but they use radial basis functions on smaller-scale
data. Our work follows the convention of the nascent field of “neural rendering”
[Thies et al., 2019, Lombardi et al., 2019, 2018, Sitzmann et al., 2019a, Tewari et al.,
2020b, Mildenhall et al., 2020], in which a separate neural network is trained for each
subject to be rendered, and all images of that subject are treated as “training data.”
These approaches have shown great promise in terms of their rendering fidelity, but
they require per-subject training and are unable to generalize across subjects yet.
Unlike prior work that focuses on a specific task (e.g., relighting or view synthesis),
our texture-space formulation allows for simultaneous light and view interpolation.
Furthermore, our model is a valuable training data generator for many works that
rely on high-quality renderings of subjects under arbitrary lighting conditions and
from multiple viewpoints, such as [Meka et al., 2019, Sun et al., 2019, Pandey et al.,
2019, Kim et al., 2018, Sengupta et al., 2020].
Our work attempts to find an effective and tractable compromise between these
two extremes, in which the power of deep neural networks is combined with the
efficiency and generality of nearest neighbor approaches. This is accomplished by a
linear blending approach that, like barycentric blending, ensures the output rendering
is a smooth function of the input, where the blending is performed on the activations of
a neural network’s encoding of our input images instead of on the raw pixel intensities
of the input images.
The complete network structure of LSSR is shown in Figure 3-3. Given a query
light direction 𝜔, we identify the 𝑘 captured images in the OLAT scan whose cor-
responding light directions are nearby the query light direction, which we call the
active set A(𝜔). These OLAT images {I𝑖 }𝑘𝑖=1 and their corresponding light directions
{𝜔𝑖 }𝑘𝑖=1 are then each independently processed in parallel by the encoder Φ𝑒 (·) of our
CNN (or equivalently, they are processed as a single “batch”), thereby producing a
multi-scale set of internal network activations that describe all 𝑘 images. After that,
the set of 𝑘 activations at each layer of the network are pooled into a single set of
activations at each layer, using weighted averaging where the weights are a function
of the query light and each input light W(𝜔, 𝜔𝑖 ). This weighted average is designed
to remove the aliasing introduced by nearest neighbor sampling for the active set
construction.
Together with the query light direction 𝜔, these pooled feature maps are then
fed into the decoder Φ𝑑 (·) by means of skip links from each level of the encoder,
thereby producing the final predicted image I (𝜔). Formally, our final image synthesis
procedure is:

$$\mathbf{I}(\omega) = \Phi_d\Bigg( \sum_{i \in \mathcal{A}(\omega)} W(\omega, \omega_i)\, \Phi_e(\mathbf{I}_i, \omega_i),\ \omega \Bigg). \qquad (3.2)$$
This hybrid approach of nearest neighbor selection and neural network processing
allows us to learn a single neural network that produces high-quality results and
generalizes well across query light directions and across subjects in our OLAT dataset.
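As a concrete illustration, the hybrid procedure of Equation 3.2 can be sketched in NumPy with placeholder encoder, decoder, and weighting functions (the real model uses the CNN of Figure 3-3 and the alias-free weights of Section 3.3.2; all names here are illustrative, not the actual implementation):

```python
import numpy as np

def synthesize(query_dir, olat_dirs, olat_images, encode, decode, weight_fn, k=8):
    """Sketch of Equation 3.2: encode each active-set image, blend the
    activations with query-dependent weights, and decode the result."""
    # Active set A(w): the k OLAT lights nearest the query direction.
    dots = olat_dirs @ query_dir
    active = np.argsort(-dots)[:k]
    # Encode each neighbor independently (a single "batch" in practice).
    feats = np.stack([encode(olat_images[i], olat_dirs[i]) for i in active])
    # Alias-free pooling: weighted average of the k activation sets.
    w = np.array([weight_fn(query_dir, olat_dirs[i]) for i in active])
    w = w / w.sum()
    pooled = np.tensordot(w, feats, axes=1)  # sum_i W(w, w_i) * Phi_e(I_i, w_i)
    return decode(pooled, query_dir)

# Toy usage: identity encoder/decoder, three orthogonal "lights", 1-pixel "images".
dirs = np.eye(3)
imgs = np.arange(3, dtype=float).reshape(3, 1)
weight = lambda q, d: max(float(q @ d), 0.0) + 1e-6
out = synthesize(np.array([1.0, 0.0, 0.0]), dirs, imgs,
                 encode=lambda im, d: im, decode=lambda f, d: f,
                 weight_fn=weight, k=2)
```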
Our approach for the active set construction is explained in Section 3.3.1, our
alias-free pooling is explained in Section 3.3.2, the network architecture is described
Figure 3-3: Visualization of the LSSR architecture. The encoder Φ𝑒 (·) takes as input a
concatenation of nearby eight OLAT images in the active set and their light directions,
which are processed by a series of convolutional layers. The resulting activations
of these eight images at each level are then combined using our alias-free pooling
(described in Section 3.3.2) and skip-connected to the decoder. The decoder Φ𝑑 (·)
takes as input the query light direction 𝜔, processes it with fully connected layers,
then upsamples it (along with the skip-connected encoder activations), and finally
decodes the image using a series of transposed convolutional layers. Whether or not
a (transposed) convolutional layer alters resolution is indicated by whether its edge
spans two spatial scales.
in Section 3.3.3, and our progressive training procedure is discussed in Section 3.3.4.
Figure 3-4: Construction of the active sets in LSSR. The LEDs on a light stage form a regular (hexagonal here) pattern, giving highly regular light-to-light distances (a). At test time, however, a novel light direction may not lie on this hexagonal grid, making irregular its distances to the neighbors (c). We therefore sample a random subset of the nearest neighbors as the active set during training (b), which forces the network to reason about irregular distances from the query light to its neighbors.
The LEDs on a light stage are arranged in a regular hexagonal pattern (Figure 3-4 [a]). For example, the six nearest neighbors of every point on a
hexagon are guaranteed to have exactly the same distance to that point. In contrast,
at test time, we need to be able to produce rendering for query light directions that
correspond to arbitrary points on the sphere, and those points will likely possess irreg-
ular distances to their neighboring lights (Figure 3-4 [c]). This presents a significant
distribution mismatch between our training and test data. As such, we would expect
poor generalization at test time if we were to naïvely train on highly regular sets of
nearest neighbors.
To address this issue, we adopt a different approach of sampling neighbors for our
active set at training time. For each training iteration, we first identify a larger set of
𝑚 nearest neighbors around the query light (which, in this case, is just one of the real
lights on the stage), and among them randomly select only 𝑘 < 𝑚 neighbors to use
in the active set (in practice, we use 𝑚 = 16 and 𝑘 = 8). As shown in Figure 3-4 (b),
this results in irregular neighbor sampling patterns during training, which simulates
our test-time scenario where the query light is at a variety of locations relative to the
real input lights.
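A minimal sketch of this sampling strategy, using 𝑚 = 16 and 𝑘 = 8 as in the text (light directions are assumed to be unit vectors, so a larger dot product means a nearer light):

```python
import numpy as np

def sample_active_set(query_dir, stage_dirs, m=16, k=8, rng=None):
    """Training-time active-set selection: take the m nearest stage lights to
    the query light (the query light itself is among them), then keep a random
    k of them to break the hexagonal regularity of the light stage."""
    rng = np.random.default_rng() if rng is None else rng
    dots = stage_dirs @ query_dir           # larger dot product = closer light
    candidates = np.argsort(-dots)[:m]      # m nearest neighbors
    return rng.choice(candidates, size=k, replace=False)
```

At test time, the same routine with `m = k` reduces to plain nearest-neighbor selection.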
This strategy is similar in spirit to dropout, which randomly removes network activations during training to prevent overfitting. Here we instead randomly remove input images, which
also has the effect of preventing the model from overfitting to the hexagonal pattern
of the light stage by forcing it to operate on more varied inputs. Notice that the
query light itself is included in the candidate set, to reflect the fact that at test
time, the virtual query light may be right next to a physical light. As we will show
in Section 3.6.1 and in the supplementary video, this active set selection approach
results in a learned model whose synthesized shadows move more smoothly and at a
more regular rate than is achieved with a naïve nearest neighbor sampling approach.
A critical component in LSSR is the skip links from each level of the encoder to
its corresponding level of the decoder. This model component is responsible for
producing network activations from the eight images in our active set and reducing
them to one set of activations to be decoded into the target image. This requires
a pooling operator that is permutation-invariant since the images in our active set
may correspond to arbitrary light directions and be presented in any order. Standard
permutation-invariant pooling operators, such as average or max pooling, are not
sufficient for our case because they do not suppress “aliasing” as discussed below.
As the query light direction moves across the sphere, its neighboring images will
enter and leave the active set of LSSR, which will cause the network activations
within our encoder to change abruptly (see Figure 3-5). If we use simple average
or max pooling, the activations in our decoder will also vary abruptly, resulting in
flickering artifacts or temporal instability in our output as the light direction varies.
The root cause of this problem is that our active set is an aliased observation of the
input images. Analogously, a point-sampled signal (e.g., an image) should be prefiltered (e.g., with a Gaussian blur) to suppress aliasing artifacts.
Because average or max pooling allows this aliasing to persist, we introduce an
alias-free pooling technique to address this issue. We use a weighted average as our
pooling operator where the weight of each item in our active set is a continuous
function of the query light direction and is guaranteed to be 0 when this neighbor
Figure 3-5: LSSR’s alias-free pooling. As the query light moves, neighbors leave and enter the active set, introducing aliasing that results in jarring temporal artifacts. To address this, we use an alias-free pooling technique where the network activations are averaged with weights varying smoothly and becoming exactly zero when lights enter or leave the set.
enters or leaves the active set. We define our weighting function between the query
light direction 𝜔 and each OLAT light direction 𝜔𝑖 as follows:
$$\widetilde{W}(\omega, \omega_i) = \max\Big(0,\ e^{s(\omega \cdot \omega_i - 1)} - \min_{j \in \mathcal{A}(\omega)} e^{s(\omega \cdot \omega_j - 1)}\Big), \qquad (3.3)$$

$$W(\omega, \omega_i) = \frac{\widetilde{W}(\omega, \omega_i)}{\sum_j \widetilde{W}(\omega, \omega_j)}, \qquad (3.4)$$
where 𝑠 is a learnable parameter that adjusts the decay of the weight with respect
to the distance, and each 𝜔 is a normalized vector in the 3D space. During training,
parameter 𝑠 will be automatically adjusted to balance between using just the nearest
neighbor (𝑠 = +∞) and an unweighted average of all neighbors (𝑠 = 0).
Our weighting function is an offset spherical Gaussian, similar to the normalized
Gaussian distance between the query light’s Cartesian coordinates and those of the
other lights in our active set, where we have subtracted the raw weight for the most
distant light in the active set (and clipped the resulting weights at 0). This adaptive
truncation is necessary because the lights on the light stage may be spaced irregularly
(due to holes for cameras or other reasons), which means that a fixed truncation may
be too aggressive in setting weights to 0 in regions where lights are sampled less
frequently. We instead leverage the fact that when a light exits the active set, a new
light will enter it at exactly the same time with exactly the same distance to the query
light. This allows us to truncate our Gaussian weights using the maximum distance
in the active set, which ensures that lights have a weight of exactly 0 as they leave or
enter the active set. This results in rendering that evolves smoothly as we move the
query light direction.
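Equations 3.3 and 3.4 can be implemented in a few lines of NumPy (here 𝑠 is fixed rather than learned, and all directions are assumed to be unit vectors):

```python
import numpy as np

def alias_free_weights(query_dir, active_dirs, s=20.0):
    """Offset spherical-Gaussian weights (Eqs. 3.3-3.4): the raw weight of the
    most distant active-set light is subtracted, so weights hit exactly zero
    as a light leaves (or enters) the active set."""
    raw = np.exp(s * (active_dirs @ query_dir - 1.0))  # spherical Gaussian (Eq. 3.3)
    w = np.maximum(0.0, raw - raw.min())               # adaptive truncation
    return w / w.sum()                                 # normalization (Eq. 3.4)
```

Note that this sketch assumes at least two distinct neighbor distances; when all raw weights are equal, the normalizer vanishes and the real implementation must guard against division by zero.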
The remaining components of LSSR consist of the conventional building blocks used
in constructing CNNs, as seen in Figure 3-3. The encoder of our network consists
of 3 × 3 convolutional blocks (with a stride of 2 to halve the resolution), each of
which is followed by group normalization [Wu and He, 2018] and a PReLU [He et al.,
2015]. The number of hidden units of each layer begins at 32 and doubles after
each layer, but is clipped at 512. The input to our encoder is a set of eight RGB
input images corresponding to the nearby light directions in our active set, to each
of which we concatenate the 𝑥𝑦𝑧-coordinates of the target light (tiled to every pixel),
giving us eight 6-channel input images. These images are processed along the “batch”
dimension of our network and therefore treated identically at each level of the encoder.
These eight images are then pooled down to a single “image” (i.e., a single batch) of
activations using our alias-free pooling (Section 3.3.2), which is then concatenated to
the internal activations of the decoder.
The decoder begins with a series of fully-connected (a.k.a. dense) blocks that take
as input the query light direction 𝜔, each of which is followed by instance normal-
ization [Ulyanov et al., 2016] and a PReLU. These activations are then upsampled to
4 × 4. Each layer of the decoder consists of a 3 × 3 transposed convolutional block
(with a stride of 2 to double resolution), again followed by group normalization and
a PReLU. The input to each layer’s convolutional block is a concatenation of the
upsampled activations from the previous decoder level, with the pooled activations
skip-connected from the encoder at the same spatial resolution. The final activation
function is a sigmoid function that outputs pixel values ∈ [0, 1]. Because our network
is fully convolutional [Long et al., 2015], it can be evaluated on images of an arbitrary
resolution, with the GPU memory being the sole limiting factor. In practice, we train
on 512 × 512 images for the sake of speed, but test on 1024 × 1024 images to maximize
the image quality.
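The resulting channel progression can be computed as follows (the exact number of encoder layers is not specified in this description and is treated as a parameter here):

```python
def encoder_widths(num_layers, base=32, cap=512):
    """Hidden-unit count per encoder layer: start at 32, double after each
    layer, clip at 512 (Section 3.3.3)."""
    return [min(base * 2 ** i, cap) for i in range(num_layers)]

# For example, a 7-layer encoder would use:
# encoder_widths(7) -> [32, 64, 128, 256, 512, 512, 512]
```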
3.3.4 Loss Functions & Training Strategy
We supervise the training of LSSR using an ℓ1 loss on pixel intensities. Formally, our
loss function is:
$$\mathcal{L}_d = \sum_i \big\| \mathbf{M} \odot \big( \mathbf{I}_i - \mathbf{I}(\omega_i) \big) \big\|_1, \qquad (3.5)$$
where I𝑖 is the ground-truth image under light 𝑖, and I (𝜔𝑖 ) is our prediction. When
computing the loss over the image, we use a precomputed binary mask M to exclude
the background pixels.
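For a single image, the masked ℓ1 loss of Equation 3.5 amounts to the following (a NumPy sketch; the actual implementation operates on TensorFlow tensors):

```python
import numpy as np

def masked_l1(pred, truth, mask):
    """Per-image term of Eq. 3.5: l1 distance on foreground pixels only,
    where `mask` is the precomputed binary foreground mask M."""
    return np.abs(mask * (truth - pred)).sum()
```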
During training, we construct each training instance by randomly selecting a sub-
ject in our training dataset and then one OLAT light direction 𝑖. The image cor-
responding to that light I𝑖 will be used as the ground-truth image that our model
will attempt to reconstruct, and the query light direction is the light corresponding
to that image 𝜔𝑖 . We then identify a set of eight neighboring image-light pairs to
include in our active set using the selection procedure described in Section 3.3.1. Our
only data augmentation is a randomly-positioned 512 × 512 crop in each batch.
Progressive training has been found effective for accelerating and stabilizing the
training of Generative Adversarial Networks (GANs) for high-resolution image syn-
thesis [Karras et al., 2018]. Although our model is not a GAN (but a convolutional
encoder-decoder architecture with skip connections), we found it to also benefit from
progressive training. We first inject the downsampled image input directly into a
coarse layer of our encoder and supervise training by imposing a reconstruction loss
at a coarse layer of our decoder, resulting in a shallower model that is easier to train.
As training proceeds, we add additional convolutional layers to the encoder and de-
coder, thereby gradually increasing the resolution of our model until we arrive at
the complete network and the full image resolution. In total, we train our network
for 200,000 iterations, using eight NVIDIA V100 GPUs, which takes approximately
ten hours. Please see the detailed training procedure in the supplementary material
(Chapter B).
Our model is implemented in TensorFlow [Abadi et al., 2016] and trained using
Adam [Kingma and Ba, 2015] with a batch size of 1 (the batch dimension of our
tensors is used to represent the eight images in our active set), a learning rate of
10−3 , and the default hyperparameter settings (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−7 ).
Figure 3-6: Gap in photorealism that NLT attempts to close. Even when high-
quality geometry and albedo can be captured (e.g., by Guo et al. [2019]), photoreal-
istic rendering remains challenging because any geometric inaccuracy will show up as
visual artifacts (e.g., black rims or holes in the hair), and manually creating spatially-
varying, photorealistic materials is onerous, if possible at all. NLT aims to close this
gap by learning directly from real images the residuals that account for geometric
inaccuracies and non-diffuse LT, such as global illumination.
The method relies on recent advances in computer vision that have enabled ac-
curate 3D reconstructions of human subjects, such as the technique of Collet et al.
[2015], which takes as input several images of a subject and produces as output a
mesh of that subject and a UV texture map describing its albedo. At first glance,
this appears to address the entirety of our problem: Given a textured mesh, we can
perform simultaneous view synthesis and relighting by simply re-rendering that mesh
from some arbitrary camera and under an arbitrary illumination. However, this sim-
plistic model of reflectance and illumination only permits equally simplistic relighting
and view synthesis, assuming Lambertian reflectance:

$$\tilde{L}_o(\mathbf{x}, \omega_o) = \rho(\mathbf{x})\, L_i(\mathbf{x}, \omega_i)\, \big(\omega_i \cdot \mathbf{n}(\mathbf{x})\big). \qquad (3.6)$$

Here $\tilde{L}_o(\mathbf{x}, \omega_o)$ is the diffuse rendering of a point $\mathbf{x}$ with a surface normal $\mathbf{n}(\mathbf{x})$ and albedo $\rho(\mathbf{x})$, lit by a directional light $\omega_i$ with an incoming intensity $L_i(\mathbf{x}, \omega_i)$ and viewed from $\omega_o$. This reflectance model is only sufficient for describing matte surfaces
and direct illumination. More recent methods (such as the Relightables [Guo et al.,
2019]) also make strong assumptions about materials by modeling reflectance with a
cosine lobe model.
The shortcomings of these methods are obvious when compared to a more expres-
sive rendering approach, such as the rendering equation [Kajiya, 1986], which makes
far fewer simplifying assumptions:
$$L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_s(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, \big(\omega_i \cdot \mathbf{n}(\mathbf{x})\big)\, \mathrm{d}\omega_i. \qquad (3.7)$$
In comparison, the Lambertian model of Equation 3.6 considers only a single directional light instead of integrating over the hemisphere of all incident directions Ω,
it approximates an object’s Bidirectional Reflectance Distribution Function (BRDF)
𝑓𝑠 (·) as a single scalar, and it ignores emitted radiance 𝐿𝑒 (·) (in addition to scattering
and transmittance, which this rendering equation does not model either). The goal
of our learning-based model is to close the gap between $L_o(\mathbf{x}, \omega_o)$ and $\tilde{L}_o(\mathbf{x}, \omega_o)$.
Though not perfect for relighting, the geometry and texture atlas provided by
Guo et al. [2019] offers us a mapping from each image of a subject onto a canonical
texture atlas that is shared across all views of that subject. This motivates the high-
level approach of our model: We use this information to map the input images of
the subject from the camera space (XY pixel coordinates) to the texture space (UV
texture atlas coordinates), then use a semi-parametric neural network embedded in
this texture space to fuse multiple observations and synthesize an RGB texture atlas
for the desired relit and/or novel-view image. This is then resampled back into the
camera space of the desired viewpoint, thereby giving us an output rendering of the
subject under the desired illumination and viewpoint.
In Section 3.5.1 and Section 3.4.1, we describe our data acquisition setup and the
input data to our framework. In Section 3.4.2, we detail the texture-space, two-path
neural network architecture at the core of our model, which consists of: I) “observation
paths” that take as input a set of observed RGB images that have been warped into
the texture space and produce a set of intermediate neural activations, and II) a
“query path” that uses these activations to synthesize a texture-space rendering of
the subject according to some desired light and/or viewing direction.
The texture-space inputs encode a rudimentary geometric understanding of the
scene and correspond to the arguments of the 6D LT function (i.e., UV location on
the 3D surface x, incident light direction 𝜔𝑖 , and viewing direction 𝜔𝑜 ). By using
a skip-link between the query path’s diffuse base image and its output as described
in Section 3.4.3, our model is encouraged to learn a residual between the provided
Lambertian rendering with geometric artifacts and the real-world appearance, which
not only guarantees the physical correctness of the diffuse LT, but also directs the
network’s attention towards learning higher-order, non-diffuse LT effects. In Sec-
tion 3.4.5, we explain how our model is trained end-to-end to minimize a photometric
loss and a perceptual loss in the camera space. Our model is visualized in Figure 3-7.
In order to perform light and view interpolation, we use as input to our model a
set of OLAT images, the subject’s diffuse base, and the dot products of the surface
normals with the desired or observed viewing directions or light directions (a.k.a.
“cosine maps”), all in the UV space. This augmented input allows our learned model
to leverage insights provided by classic graphics models, as the dot products between
the normals and the viewing or lighting directions are the standard primitives in
parametric reflectance models (Equation 3.6, Equation 3.7, etc.).
Figure 3-7: NLT Model. Our network consists of two paths. The “observation paths”
take as input 𝐾 nearby observations (as texture-space residual maps) sampled around
the target light and viewing directions, and encode them into multiscale features that
are pooled to remove the dependence on their order and number. These pooled
features are then concatenated to the feature activations of the “query path,” which
takes as input the desired light and viewing directions (in the form of cosine maps)
as well as the physically accurate diffuse base (also in the texture space). This path
predicts a residual map that is added to the diffuse base to produce the texture
rendering. With the (differentiable) UV wrapping predefined by the geometry proxy,
we then resample the texture-space rendering back into the camera space where the
prediction is compared against the ground-truth image. Because the entire network
is embedded in the texture space of a subject, the same model can be trained to
perform relighting, view synthesis, or both simultaneously, depending on the input
and supervision.
our approach, such as the embedding of our model in UV space (which removes
the dependency on viewpoints and implicitly provides aligned correspondence across
multiple views) and our use of a residual learning scheme (to encourage training
to focus on higher-order LT effects). Li et al. [2019] also successfully employ deep
learning in the texture space and regress Precomputed Radiance Transfer (PRT)
coefficients for deformable objects, but they learn only predefined diffuse and glossy
light transport from synthetic rendering.
We use three types of buffers in our model, as described below.
Cosine Map Assuming directional light sources, we calculate the cosine map of
a light as the dot product between the light’s direction 𝜔 and the surface’s normal
vector n(x). For each view and light (both observed and queried), we compute two
cosine maps: a view cosine map n(x) · 𝜔𝑜 and a light cosine map n(x) · 𝜔𝑖 . Crucially,
these maps are masked by visibility computed via ray casting from each camera onto
the geometry proxy, such that the light cosines also provide rough understanding of
cast shadows (texels with zero visibility; see Figure 3-7), leaving the network an easier
task of adding, e.g., global illumination effects to these hard shadows.
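A sketch of this cosine-map computation, assuming the texture-space normal and visibility buffers have already been rasterized (clamping back-facing texels to zero is an assumption of this sketch, not stated in the text):

```python
import numpy as np

def cosine_map(normals, direction, visibility):
    """Per-texel dot product n(x) . omega, masked by ray-cast visibility so
    occluded texels (hard cast shadows) are zeroed.
    normals:    (H, W, 3) unit normals in UV space
    direction:  (3,) unit light or view direction
    visibility: (H, W) binary visibility buffer"""
    cos = np.einsum('hwc,c->hw', normals, direction)
    return np.clip(cos, 0.0, None) * visibility
```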
Diffuse Base The diffuse base is obtained by summing up all OLAT images for
each view or equivalently, illuminating the subject from all directions simultaneously
(because light is additive). These multiple views are then averaged together in the
texture space, which mitigates the view-dependent effects and produces a texture
map that resembles albedo. Note that multiplying the diffuse base by a light cosine
map produces the diffuse rendering (with hard cast shadows) for that light, $\tilde{L}_o(\mathbf{x}, \omega_i)$.
The construction of this diffuse base is visualized in the bottom middle of Figure 3-7.
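The diffuse-base construction and its use for diffuse rendering can be sketched as follows (array shapes are illustrative; inputs are assumed to be OLAT images already warped into the shared UV space):

```python
import numpy as np

def diffuse_base(olat_textures):
    """Sum all OLAT images per view (light is additive, giving a 'fully lit'
    texture), then average across views to suppress view-dependent effects.
    olat_textures: (views, lights, H, W, 3) in UV space."""
    fully_lit = olat_textures.sum(axis=1)   # per-view fully lit texture
    return fully_lit.mean(axis=0)           # albedo-like texture map

def diffuse_render(base, light_cosine_map):
    """Diffuse rendering for one light: the base texture modulated by that
    light's visibility-masked cosine map (hard cast shadows included)."""
    return base * light_cosine_map[..., None]
```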
Residual Map We compute the difference between each observed OLAT image
and the aforementioned diffuse base, thereby capturing the “non-diffuse and non-
local” residual content of each input image. These residual maps are available only
for the sparsely captured OLAT from fixed viewpoints. To synthesize a novel view
for any desired lighting condition, our network uses a semi-parametric approach that
interpolates previously seen observations and their residual maps, generating the final
rendering.
Our semi-parametric approach is shown in Figure 3-7: The network takes as input
multiple UV buffers in two distinct branches, namely a “query path” and “observation
paths.” The query path takes as input a set of texture maps that can be generated
from the captured geometry, i.e., view/light cosine maps and a diffuse base. The
observation paths represent the semi-parametric nature of our framework and have
access to non-diffuse residuals of the captured OLAT images. The two branches are
merged in an end-to-end fashion to synthesize an unseen lighting condition from any
desired viewpoint.
To synthesize a new image of the subject under a desired lighting and viewpoint,
we have access to potentially all the OLAT images from multiple viewpoints. The
goal of the observation paths is to combine these images and extract meaningful fea-
tures that are passed to the query path to perform the final rendering. However,
using all these observations as input is not practical during training due to memory
and computational limits. Therefore, for a desired novel view and light condition,
we randomly select only 𝐾 = 1 or 3 OLAT images from the “neighborhood” as ob-
servations (the precise meaning of “neighborhood” will be clarified in Section 3.4.5).
The random sampling prevents the network from “cheating” by memorizing fixed
neighbors-to-query mappings and encourages it to learn that for a given query, differ-
ent observation selections should lead to the same prediction (also observed by Sun
et al. [2020]).
These observed images (in the form of UV-space residual maps as shown in Fig-
ure 3-7) are then fed in parallel (i.e., processed as a “batch”) into the observation
paths of our network, which can alternatively be thought of as 𝐾 distinct networks
that all share the same weights. The resulting set of 𝐾 network activations are then
averaged across the set of images by taking their arithmetic mean (in practice, we observe no improvement when we replace the uniform weights with the barycentric coordinates of the query w.r.t. its 𝐾 = 3 observations), thereby becoming invariant to their cardinality and order, and are then passed to the query path.
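This pooling is just an arithmetic mean over the 𝐾 axis, which makes the fused features invariant to the order (and robust to the number) of observations:

```python
import numpy as np

def pool_observations(activations):
    """Fuse the K observation-path activations into one set of features by
    averaging over the K axis; the result is permutation-invariant.
    activations: (K, H, W, C) feature maps from the shared-weight paths."""
    return activations.mean(axis=0)
```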
While the goal of the observation paths is to process input images and glean
reflectance information from them, the goal of the query path is to take as input
information that encodes the non-diffuse residuals of nearby lights/views and then
predict radiance values of the queried light and view positions at each UV location.
We therefore concatenate the aggregated activations from the observation paths to the
self-produced activations of the query path using a set of cross-path skip-connections.
The query path then decodes a texture-space rendering of the subject under the
desired light and viewing directions, which is then resampled to the perspective of
the desired viewpoint using conventional, differentiable UV wrapping.
Our proposed architecture has several advantages over a single-path network that
would take as input all the available observations, which would be prohibitively ex-
pensive in terms of memory and computation. Because our observation paths do not
depend on a fixed order or number of images, during training, we can simply select a
dynamic subset of whatever observations that are best suited to the desired lighting
and viewpoint. This ability is useful because the lights and cameras in our dataset
are sampled at different rates—lights are around 4× denser than cameras. The supe-
riority of this dual-path design is demonstrated by both qualitative and quantitative
experiments in Section 3.6.1.
When synthesizing the output texture-space image in the query path of our net-
work, we do not predict the final image directly. Instead, we have a residual skip-
link [He et al., 2016] from the input diffuse base to the output of our network.
Formally, we train our deep neural network to synthesize a residual ∆𝐿 that is
then added to our diffuse base $\tilde{L}_o(\mathbf{x}, \omega_o)$ to produce our final predicted rendering $L_o(\mathbf{x}, \omega_o) = \Delta L + \tilde{L}_o(\mathbf{x}, \omega_o)$. This approach of adding a physically-based diffuse rendering allows our network to focus on learning higher-order, non-diffuse, non-local
light transport effects (specularities, scattering, etc.) instead of having to “re-learn”
the fundamentals of image formation (basic colors, rough locations and shapes of cast
shadows, etc.). Because these residuals are the unconstrained output of a network,
this model is able to describe any output image: Positive residuals can be added to
represent specularities, and negative residuals can be added to represent shading or
shadowing. This residual approach causes our model to be implicitly regularized to-
wards a simplified but physically-plausible diffuse model – the network can “fall back”
to the diffuse base rendering by simply emitting zeros.
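The residual skip-link amounts to a single addition, which makes the diffuse fall-back behavior explicit (a schematic; in the real model ΔL is the network's texture-space output):

```python
import numpy as np

def predict_rendering(diffuse_base_render, residual):
    """Residual skip-link: the signed network output is added to the
    physically based diffuse render. Positive residuals add specularities;
    negative residuals darken; all-zero residuals fall back to the base."""
    return diffuse_base_render + residual
```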
We demonstrate that our method is capable of modeling complicated lighting ef-
fects including specular highlights (BRDFs), subsurface scattering (BSSRDFs), and
diffuse interreflection (global illumination), in the context of relighting a toy dragon
scene. We consider a 3D model with perfect geometry and known material properties
and render it in a virtual scene similar to a light stage setup using Cycles (Blender’s
built-in, physically-based renderer). We produce a diffuse render of the scene as a
baseline, and then re-render it using both our model and Blender with three light-
ing effects: specular highlights, subsurface scattering, and diffuse interreflections, to
demonstrate that NLT is capable of modeling those effects. The results are shown in
Figure 3-8 and Figure 3-9.
Specular Highlights In Blender, we mix a glossy shader into the dragon’s diffuse
shader and re-render the scene, resulting in a render with highlights. We then train
our model to infer the residuals for relighting. In Figure 3-8 (center), we show the NLT
renderings under two novel light directions (unseen during training) alongside with
the ground-truth renderings. The residual image predicted by our model correctly
models the specular highlights, and our rendering closely resembles the ground truth.
Subsurface Scattering Our model can capture lighting effects that cannot be
captured by a BRDF, such as subsurface scattering. We mix a subsurface scattering
shader into the dragon’s diffuse shader, and then train our model to learn these effects
in relighting. As shown in Figure 3-8 (right), the NLT results are almost identical to
the ground truth.
Figure 3-8: Modeling non-diffuse BSSRDFs as residuals for relighting in NLT. A dif-
fuse base (left) captures all diffuse LT (e.g., hard shadows) under a novel point light.
By learning a residual on top of this base rendering, NLT can reproduce non-diffuse
LT (here, specularities and subsurface scattering) from the actual scene appearance.
When predicting specularities (center), NLT emits exclusively positive residuals (neg-
ative part hence not shown) to add bright highlights to the diffuse base. When
predicting scattering (right), the additive residuals represent additional illumination
provided by nearby subsurface light transport.
Figure 3-9: Modeling global illumination as residuals for relighting in NLT. The diffuse
bases are the same as in Figure 3-8. In addition to intrinsic material properties, NLT
can also learn to express global illumination (e.g., diffuse interreflection) as residuals.
Here we add a diffuse green wall to the right of the scene (left). Under Novel Light 1
(right top), the wall provides additional green indirect illumination, so the residuals
are green and mostly positive. Notably, the residuals are not necessarily all positive:
Under Novel Light 2 (right bottom), the residuals are mostly negative and high in
blue and red, effectively casting “negative purple” indirect illumination that results
in a greenish tinge.
Diffuse Interreflection To demonstrate global illumination, in Figure 3-9 we place
a matte green wall into the scene, and we see that NLT is able to accurately predict
the non-local light transport of a green glow cast by the wall onto the dragon.
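The residual scheme underlying all three experiments can be sketched in a few lines. This is an illustrative sketch, not the thesis implementation; `diffuse_base` and `predicted_residual` are hypothetical placeholders for the rasterized diffuse rendering and the network output.

```python
import numpy as np

def compose_relit_image(diffuse_base, predicted_residual):
    """Compose the final rendering as diffuse base plus a signed residual.

    The residual may be negative (e.g., the "negative purple" indirect
    illumination above), so we add first and only then clip to the valid
    intensity range.
    """
    return np.clip(diffuse_base + predicted_residual, 0.0, 1.0)

# Toy example: a highlight (+0.7) and a darkening residual (-0.2) on a gray base.
base = np.full((1, 2, 3), 0.5)
residual = np.array([[[0.7, 0.7, 0.7], [-0.2, 0.0, -0.2]]])
out = compose_relit_image(base, residual)
```

Clipping only after the addition is what allows negative residuals to darken the diffuse base rather than being discarded.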
Embedded in the texture space, NLT is a unified framework that can perform re-
lighting, view synthesis, or both simultaneously. The architecture described in Sec-
tion 3.4.2 takes as input the cosine maps that encode the light and viewing directions,
as well as a set of observed residual maps from nearby lights and/or views (neighbor
selection scheme in Section 3.4.5). Since there is no model design specific to relighting
or view synthesis, the model is agnostic to which task it is solving other than interpo-
lating the 6D LT function. Therefore, by varying both lights and views in the training
data, the model can be trained to render the subject under any desired illumination
from any camera position (i.e., simultaneous relighting and view synthesis). We
demonstrate this capability in Section 3.5.8 and the supplemental video.
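The cosine maps mentioned above can be sketched as a per-texel dot product between the surface normal and a unit light (or view) direction. This is a minimal illustration under assumed conventions; the function name and shapes are hypothetical.

```python
import numpy as np

def cosine_map(normal_map, direction):
    """Per-texel cosine between unit surface normals and one unit direction.

    normal_map: (H, W, 3) unit normals stored in the UV texture atlas.
    direction:  (3,) unit light or view direction.
    Texels facing the light/camera head-on get values near 1.
    """
    return np.einsum('hwc,c->hw', normal_map, direction)

normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0                       # all normals point along +z
cm = cosine_map(normals, np.array([0.0, 0.0, 1.0]))
```

One such map is computed per query light and per query view, which is how the 6D light transport function's directional arguments enter the network as images.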
Both paths of our architecture are modifications of the U-Net [Ronneberger et al.,
2015], where our query path is a complete encoder-decoder architecture (with skip-
connections) that decodes the final predicted image, while our observation path is
just an encoder. Following standard conventions, each scale of the network consists
of two convolutional layers (except at the very start and end), where downsampling
and upsampling are performed using strided (possibly transposed) convolutions, and
the channel number of the feature maps is doubled after each downsampling and
halved after each upsampling. Detailed descriptions of the architectures of these
two networks are provided in Table 3.1. No normalization is used. Note that the
activations from the observation paths are appended to the query path before its
internal skip connections, meaning that observation activations are effectively skip-
connected to the decoder of the query network.
Observation Path

ID    Operator                Output Shape
O1    conv(16, 1 × 1, 1)      𝐻 × 𝑊 × 16
O2    conv(16, 3 × 3, 2)      𝐻/2 × 𝑊/2 × 16
O3    conv(16, 3 × 3, 1)      𝐻/2 × 𝑊/2 × 16
O4    conv(32, 3 × 3, 2)      𝐻/4 × 𝑊/4 × 32
O5    conv(32, 3 × 3, 1)      𝐻/4 × 𝑊/4 × 32
...   ...                     ...
O14   conv(1024, 3 × 3, 2)    𝐻/128 × 𝑊/128 × 1024
O15   conv(1024, 3 × 3, 1)    𝐻/128 × 𝑊/128 × 1024
O16   conv(2048, 3 × 3, 2)    𝐻/256 × 𝑊/256 × 2048
O17   conv(2048, 3 × 3, 1)    𝐻/256 × 𝑊/256 × 2048

Query Path

ID    Operator                Output Shape
Q1    conv(16, 1 × 1, 1)      𝐻 × 𝑊 × 16
Q2    append(mean(O1))        𝐻 × 𝑊 × 32
Q3    conv(16, 3 × 3, 2)      𝐻/2 × 𝑊/2 × 16
Q4    conv(16, 3 × 3, 1)      𝐻/2 × 𝑊/2 × 16
Q5    append(mean(O3))        𝐻/2 × 𝑊/2 × 32
Q6    conv(32, 3 × 3, 2)      𝐻/4 × 𝑊/4 × 32
Q7    conv(32, 3 × 3, 1)      𝐻/4 × 𝑊/4 × 32
Q8    append(mean(O5))        𝐻/4 × 𝑊/4 × 64
...   ...                     ...
Q44   append(Q8)              𝐻/4 × 𝑊/4 × 80
Q45   convT(8, 3 × 3, 2)      𝐻/2 × 𝑊/2 × 8
Q46   convT(8, 3 × 3, 1)      𝐻/2 × 𝑊/2 × 8
Q47   append(Q5)              𝐻/2 × 𝑊/2 × 40
Q48   convT(4, 3 × 3, 2)      𝐻 × 𝑊 × 4
Q49   convT(4, 3 × 3, 1)      𝐻 × 𝑊 × 4
Q50   append(Q2)              𝐻 × 𝑊 × 36
Q51   convT(3, 1 × 1, 1)      𝐻 × 𝑊 × 3

conv(𝑑, 𝑤 × ℎ, 𝑠) denotes a two-dimensional convolutional layer (a.k.a. conv2D) with 𝑑 output channels, a filter size of (𝑤 × ℎ), and a stride of 𝑠, and is always followed by a leaky ReLU [Maas et al., 2013] activation function. convT is the transpose of conv and is also followed by a leaky ReLU.

Table 3.1: Neural network architecture of NLT. The append(mean(O·)) layers reflect skip connections from the activations of our observation path, and the append(Q·) layers are U-Net-like skip-links within the query path.
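The shape bookkeeping in Table 3.1 can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch, not the actual TensorFlow implementation; "same" padding is assumed so that a stride-𝑠 conv divides the spatial resolution exactly by 𝑠.

```python
def conv(shape, out_ch, stride):
    """Spatial-shape arithmetic for a stride-s conv with 'same' padding."""
    h, w, _ = shape
    return (h // stride, w // stride, out_ch)

def convT(shape, out_ch, stride):
    """Transposed conv: multiplies the spatial resolution by the stride."""
    h, w, _ = shape
    return (h * stride, w * stride, out_ch)

def append(shape, extra_ch):
    """Channel-wise concatenation of observation-path activations."""
    h, w, c = shape
    return (h, w, c + extra_ch)

H, W = 1024, 1024
q = (H, W, 3)            # query-path input
q = conv(q, 16, 1)       # Q1: H x W x 16
q = append(q, 16)        # Q2: mean(O1) appended -> H x W x 32
q = conv(q, 16, 2)       # Q3: H/2 x W/2 x 16
q = conv(q, 16, 1)       # Q4: H/2 x W/2 x 16
q = append(q, 16)        # Q5: mean(O3) appended -> H/2 x W/2 x 32
q = conv(q, 32, 2)       # Q6: H/4 x W/4 x 32
```

Tracing the remaining rows the same way reproduces the doubling-on-downsample, halving-on-upsample channel pattern described above.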
We trained our model to minimize losses in the image space between the predicted
image 𝐿𝑜 (x𝑖 , 𝜔𝑜 ) and the ground-truth captured image. To this end, we first resample
the UV-space prediction back to the camera space, and then compute the total loss
as a combination of a robust photometric loss [Barron, 2019] and a perceptual loss
(LPIPS) [Zhang et al., 2018a]. We use the loss function of Barron [2019] with 𝛼 = 1
(a.k.a. pseudo-Huber loss) applied to a CDF9/7 wavelet decomposition [Cohen et al.,
1992] in the YUV color space:
\ell_I = \sum_i \sqrt{\left(\frac{\mathrm{CDF}\big(\mathrm{YUV}\big(L_o(\mathbf{x}_i, \omega_o) - L_o^*(\mathbf{x}_i, \omega_o)\big)\big)}{c}\right)^2 + 1} \; - \; 1. \qquad (3.8)
The total loss is the sum of the two losses, ℓ = ℓ𝐼 + ℓ𝑃 . Empirically, we found that using the same weight for both losses achieved the best results.
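The per-element robust loss in Eq. (3.8) at 𝛼 = 1 reduces to the pseudo-Huber form, which can be sketched as below. The YUV transform and the CDF9/7 wavelet decomposition are omitted here, and the scale `c` is a hypothetical value for illustration only.

```python
import numpy as np

def pseudo_huber(residual, c=0.01):
    """General robust loss of Barron [2019] at alpha = 1 (pseudo-Huber).

    Behaves like 0.5 * (x / c)^2 for |x| << c and like |x| / c for
    |x| >> c, so outlier pixels do not dominate the photometric loss.
    """
    return np.sqrt((residual / c) ** 2 + 1.0) - 1.0

losses = pseudo_huber(np.array([0.0, 0.01, 1.0]), c=0.01)
```

A zero residual yields exactly zero loss, and the loss grows roughly linearly (rather than quadratically) for residuals much larger than `c`.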
We trained our model by minimizing ℓ using Adam [Kingma and Ba, 2015] with
a learning rate of 2.5 × 10−4 , a batch size of 1, and the following optimizer hyperpa-
rameters: 𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−7 . Our model is implemented in TensorFlow
[Abadi et al., 2016] and trained on a single NVIDIA Tesla P100, which takes less than
12 hours for the real scenes and much less for synthetic scenes.
Resolutions For relighting and view synthesis, our texture-space images have a
resolution of 1024 × 1024, and the camera-space images have a resolution of 1536 ×
1128. For simultaneous relighting and view synthesis, the resolutions used are 512 ×
512 in the UV space and 1024 × 752 in the camera space.
³See Section 3.3 for how we addressed a similar issue using our active sets and alias-free pooling in LSSR [Sun et al., 2020].
3.5 Results
In this section, we first introduce our data capture process using a light stage (Sec-
tion 3.5.1) and discuss the evaluation metrics used in our experiments (Section 3.5.2).
We then show LSSR’s capabilities of continuous directional relighting (Section 3.5.3)
and high-frequency image-based relighting (Section 3.5.4), and its application of light-
ing softness control (Section 3.5.5). Section 3.5.6 compares LSSR against its base-
lines, which do not utilize any 3D geometry either. Finally, in Section 3.5.7 and
Section 3.5.8, we present how NLT enables fixed-viewpoint relighting (which is al-
ready supported by LSSR) and additionally free-viewpoint relighting, given the additional input of a geometry proxy.
3.5.1 Data Acquisition
LSSR uses the One-Light-at-A-Time (OLAT) portrait dataset from Sun et al. [2019],
which contains 22 subjects with multiple facial expressions captured using a light
stage with a seven-camera system. The light stage consists of 302 LEDs uniformly
distributed on a spherical dome, and capturing a subject takes roughly 6 s. Each cap-
ture produces an OLAT scan of a specific facial expression per camera, which consists
of 302 images, and we treat the OLAT scans from different cameras as independent
OLAT scans, since we are not considering viewpoint change in LSSR.
Following previous works [Meka et al., 2019, Sun et al., 2019], we ask the subject to
stay still during the acquisition phase, which lasts about 6 s for a full OLAT sequence.
Since it is nearly impossible for the performer to stay perfectly still, we align all the
images using the optical flow technique of Meka et al. [2019]: We capture “all-lights-
on” images throughout the scan that are used as “tracking frames,” and compute 2D
flow fields between each tracking frame and a reference tracking frame taken from
the middle of the sequence. These flow fields are then interpolated from the tracking
frames to the rest of the images to produce a complete alignment. As such, for a
given camera, the captured images in each OLAT scan are aligned and only differ in
lighting directions.
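The temporal interpolation of the flow fields from the tracking frames to the in-between OLAT frames can be sketched as simple linear blending. This is an illustrative simplification under assumed conventions; the actual scheme of Meka et al. [2019] may differ.

```python
import numpy as np

def interpolate_flow(flow_a, flow_b, t):
    """Linearly interpolate two tracking-frame flow fields in time.

    flow_a, flow_b: (H, W, 2) 2D flow fields at the tracking frames
    bracketing an OLAT frame; t in [0, 1] is that frame's relative
    position in time between the two tracking frames.
    """
    return (1.0 - t) * flow_a + t * flow_b

flow_a = np.zeros((4, 4, 2))             # no motion at the first tracking frame
flow_b = np.full((4, 4, 2), 2.0)         # 2-pixel drift at the second
mid_flow = interpolate_flow(flow_a, flow_b, 0.5)
```

Warping each OLAT image by its interpolated flow field brings the whole scan into the reference tracking frame's alignment.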
For LSSR, we manually select four OLAT scans with a mixture of subjects and
views as our validation set, and select another 16 scans with good coverage of genders
and skin tones as training data. Our 16 training scans cover only five of the seven
cameras, as the remaining two are covered by the validation data. We train the
LSSR network using all lights from our OLAT data in a canonical, global lighting
coordinate frame, which allows us to train a single network for all viewpoints in our
training data. We train a single model for all subjects in our training dataset, which
we found matches the performance of training an individual network for each subject.
For NLT, we additionally acquired a base mesh for use as the geometry proxy that
NLT requires. Following the approach of Guo et al. [2019], we use 32 high-resolution
active IR cameras and 16 custom dot illuminator projectors to construct a high-
quality parameterized base mesh of each subject fully automatically. These data are
critical to our approach, as the estimated geometry from this system provides the substrate within which our learned model is embedded, in the form of a texture
atlas. However, this captured 3D model is far from perfect due to approximations
in the mesh model (that cannot accurately model fine structures such as hair) and
hand-crafted priors in the reflectance estimation (that relies on a cosine-lobe Bidirectional Reflectance Distribution Function [BRDF] model). This is demonstrated in Figure 3-6.

Figure 3-10: Sample images used for training NLT. These multi-view, One-Light-at-A-Time (OLAT) images (two cameras under three lights shown) are sparse samples of the 6D light transport function that NLT interpolates. A proxy of the underlying geometry is also required by NLT, but it can be as rough as 500 vertices (see Section 3.6.4).

Our model overcomes these issues and enables photorealistic renderings,
as demonstrated in Section 3.5.7 and Section 3.5.8. Additionally, we demonstrate in
Section 3.6.4 that our neural rendering approach is robust to geometric degradation
and can work with geometry proxies of as few as 500 vertices.
We collect a dataset of 70 human subjects with fixed poses, each of which provides
around 18,000 frames under 331 lighting conditions and 55 viewpoints (before filtering
out glare-polluted and overly dark frames, as aforementioned). We randomly hold out
six lighting conditions and two viewpoints for validation. The subjects are selected to
maximize diversity in terms of clothing, skin color, and age. By training our model
to reproduce held-out images from these light stage scans, we are able to learn a
general LT function that can be used to produce rendering for arbitrary viewpoints
and illuminations. Because our scans do not share the same UV parameterization,
we train a separate model for each subject.
3.5.2 Evaluation Metrics
Empirically evaluating our models presents a significant challenge: Both LSSR and
NLT attempt to super-resolve an undersampled scan from a light stage, which means
that the only ground truth available for benchmarking is also undersampled in both
light and view directions. In other words, the goals are to accurately synthesize images
for virtual lights and/or cameras in between the physical lights and/or cameras on
the stage, but we do not have ground-truth images that correspond to those virtual
lights and/or cameras. For this reason, qualitative results (figures and videos) are
preferred, and we encourage the reader to view them.
For the quantitative results to be presented, we use held-out real images lit by
physical lights from physical cameras on our light stage as a validation set. When
evaluating on one of these validation images, LSSR does not use the active-set selec-
tion technique of Section 3.3.1, but instead just samples the 𝑘 = 8 nearest neighbors
(excluding the validation image itself from the input). Holding out the validation
image from the input is critical, as otherwise the model could simply reproduce the
input image as an error-free output. This held-out validation approach is not ideal, as
all such evaluations will follow the same regular sampling pattern of our light stage.
This evaluation task is therefore more biased than the real task of predicting images
away from the sampling pattern of the light stage.
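The neighbor selection used at validation time can be sketched as follows. This is an illustrative sketch with hypothetical names; LSSR's actual active-set machinery (Section 3.3.1) is more involved.

```python
import numpy as np

def nearest_light_neighbors(light_dirs, query_idx, k=8):
    """Indices of the k stage lights closest to a held-out light, excluding it.

    Excluding the held-out light itself is critical: otherwise the model
    could simply reproduce that input image as an error-free output.
    """
    similarity = light_dirs @ light_dirs[query_idx]   # cosine similarity
    order = np.argsort(-similarity)                   # most similar first
    return [int(i) for i in order if i != query_idx][:k]

rng = np.random.default_rng(0)
dirs = rng.normal(size=(20, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit "lights"
nbrs = nearest_light_neighbors(dirs, query_idx=5, k=8)
```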
Among the metrics we report, PSNR measures pixel-wise differences, SSIM focuses on structures in the images, and (E-)LPIPS captures perceptual differences.
Again, images and videos may be more informative about the quality achieved.
3.5.3 Continuous Directional Relighting
Traditional image-based relighting methods produce accurate results when the target
lighting is concentrated around the physical lights of the stage, but may introduce
ghosting artifacts or inaccurate shadows when no physical light is nearby. In Figure 3-
11, we interpolate between two physical lights on the stage. As shown in Figure 3-11
(b, c), linear blending or Xu et al. [2018b] with adaptive sampling fails to produce
realistic results and always contains multiple superposed shadows or highlights. The
shadows produced by Meka et al. [2019] are sharp, but are not moving smoothly
when the light moves. In contrast, LSSR is able to produce sharp and realistic
images for arbitrary light directions: Highlights and cast shadows move smoothly
as we change the light direction, and our results have comparable sharpness to the
(non-interpolated) ground-truth images that are available.
Figure 3-11: Interpolation by LSSR between two physical lights. Here LSSR produces the interpolated images corresponding to “virtual” lights between two real lights on the light stage. (a) LSSR produces images where sharp shadows and accurate highlights move realistically. (b, c) Linear blending and Xu et al. [2018b] with adaptive sampling result in ghosting artifacts and duplicated highlights. (d) The results from Meka et al. [2019] contain blurry highlights and shadows with unrealistic motion.
3.5.4 High-Frequency Image-Based Relighting
OLAT scans captured by a light stage can be linearly blended to reproduce images
that appear to have been captured under some environmental lighting. The pixel
values of a light probe are usually distributed to the nearest or neighboring lights on
the light stage for blending. This traditional approach causes ghosting artifacts in
shadows and specularities, due to the finite sampling of light directions on the light
stage. Although this ghosting is hardly noticeable when the lighting is low-frequency,
it can be significant when the lighting contains high-frequency contents, such as the
sun in the sky. These ghosting artifacts can be ameliorated by using LSSR. Given
a light probe, our algorithm predicts an image corresponding to the light direction
of each pixel in the light probe. By taking a linear combination of all such images
(weighted by their pixel values and solid angles), we are able to produce rendering
that matches the sampling resolution of the light probe. As shown in Figure 3-12,
this approach produces images with sharp shadows and minimal ghosting when given
a high-frequency light probe, whereas linear blending does not. In this example, we
use 256 × 128 light probes, corresponding to a super-resolved light stage with 32,768
lights. Please see our video for more image-based relighting results.
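The weighted combination described above can be sketched as below. This is an illustrative sketch with hypothetical names and shapes; normalization conventions for light-probe weights vary in practice.

```python
import numpy as np

def relight_with_probe(olat_images, probe_rgb, solid_angles):
    """Image-based relighting as a weighted sum of (super-resolved) OLAT images.

    olat_images:  (N, H, W, 3), one rendering per light-probe pixel direction.
    probe_rgb:    (N, 3) RGB radiance of each light-probe pixel.
    solid_angles: (N,) solid angle subtended by each probe pixel.
    """
    weights = probe_rgb * solid_angles[:, None]            # (N, 3)
    return np.einsum('nhwc,nc->hwc', olat_images, weights)

# Two "lights": a red one and a green one, each lighting a white patch.
olat = np.ones((2, 1, 1, 3))
probe = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
relit = relight_with_probe(olat, probe, np.array([0.5, 0.5]))
```

With a 256 × 128 probe, the sum runs over 32,768 super-resolved OLAT images, one per probe pixel.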
3.5.5 Lighting Softness Control

LSSR’s ability to render images under arbitrary light directions also allows us to
control the softness of the shadow. Given a light direction, we can densely synthesize
images corresponding to the light directions around it and average those images to
produce rendering with realistic soft shadows (the sampling radius of these lights
determines the softness of the resulting shadow). As shown in Figure 3-13, LSSR is
able to synthesize realistic shadows with controllable softness, which is not possible
using traditional linear blending methods.
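The averaging procedure can be sketched as below; the sampling distribution and names are hypothetical, and LSSR's `render_fn` stands in for a call to the trained network.

```python
import numpy as np

def soften_shadows(render_fn, center_dir, radius, n_samples=16, seed=0):
    """Average renderings from light directions jittered around center_dir.

    render_fn:  maps a unit light direction to an (H, W, 3) image.
    radius:     angular spread of the jitter; larger values give softer
                shadows, radius 0 recovers the hard-shadow rendering.
    """
    rng = np.random.default_rng(seed)
    images = []
    for _ in range(n_samples):
        d = center_dir + radius * rng.normal(size=3)   # jitter, then renormalize
        images.append(render_fn(d / np.linalg.norm(d)))
    return np.mean(images, axis=0)

# Dummy renderer whose output depends only on the light's z component.
render_fn = lambda d: np.full((2, 2, 3), d[2])
hard = soften_shadows(render_fn, np.array([0.0, 0.0, 1.0]), radius=0.0)
```

This effectively simulates an area light of controllable extent from a model trained only on point lights.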
(a) With Super-Resolution by LSSR (b) Without Super-Resolution
Figure 3-12: High-frequency image-based relighting by LSSR. (a) Our model is able
to produce accurate relighting results under high-frequency environment lighting, by
super-resolving the light stage before performing image-based relighting [Debevec
et al., 2000]. (b) Using the light stage data as-is results in ghosting.
3.5.6 Geometry-Free Relighting
We compare our LSSR results against the existing relighting approaches. The linear
blending baseline in Table 3.2 produces competitive results, despite being very simple:
just linearly blending the input images according to our alias-free weights. Because
linear blending directly interpolates aligned pixel values, it is often able to retain
accurate high-frequency details in the flat region, and this strategy works well in
minimizing the error metrics. However, linear blending produces significant ghosting
artifacts in shadows and highlights, as shown in Figure 3-14. Although these errors
are easy to detect visually, they appear hard to measure empirically.
Panels: (a) Ours (full image); (b) Ground truth; (c) Ours; (d) Linear blending; (e) Fuchs et al. [2007]; (f) Photometric stereo; (g) Xu et al. [2018b] w/ optimal sample; (h) Xu et al. [2018b] w/ adaptive sample; (i) Meka et al. [2019].

Figure 3-14: Relighting by LSSR and the baselines. Here we present a qualitative comparison between our method and other light interpolation algorithms. Traditional methods (linear blending, Fuchs et al. [2007], photometric stereo) retain details but suffer from ghosting artifacts in the shadowed regions. Rendering by Xu et al. [2018b] and Meka et al. [2019] exhibits noticeable oversmoothing and brightness changes. Our method retains details and synthesizes shadows that resemble the ground truth.

Comparisons Against Fuchs et al. [2007] We evaluate against the layer-based technique of Fuchs et al. [2007] by decomposing an OLAT scan into diffuse, specular, and visibility layers, and interpolating the illumination individually for each layer. Although the method works well on specular objects as shown in the original paper, it performs less well on OLATs of human subjects, as shown in Table 3.2. This appears to be due to the complex specularities on human skin not being tracked accurately by the optical flow algorithm of Fuchs et al. [2007]. Additionally, the interpolation of the visibility layer sometimes contains artifacts, which results in cast shadows being predicted incorrectly. That being said, the algorithm results in fewer ghosting artifacts than the linear blending algorithm, as shown in Figure 3-14 and as reflected by the E-LPIPS metric.

Comparisons Against Photometric Stereo Using the layer decomposition produced by Fuchs et al. [2007], we additionally perform photometric stereo on the OLAT data by simple linear regression to estimate a per-pixel albedo image and normal map. Using this normal map and albedo image, we then use the Lambertian reflectance to render a new diffuse image corresponding to the query light direction, which we add to the specular layer of Fuchs et al. [2007] to produce the final rendering. As shown in Table 3.2, this approach underperforms that of Fuchs et al. [2007], likely due to the reflectance of human faces being non-Lambertian. Additionally, the scattering effect of human hair is poorly modeled in terms of a per-pixel albedo and normal vector. These limiting assumptions result in overly sharpened and incorrect shadow predictions, as shown in Figure 3-14.

In contrast to this photometric stereo approach and the layer-based approach of Fuchs et al. [2007], LSSR does not attempt to factorize the human subject into a predefined reflectance model wherein interpolation can be explicitly performed. Our model is instead trained to identify a latent vector space of network activations in which naive linear interpolation results in accurate non-linearly interpolated images, producing more accurate rendering.
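The Lambertian re-rendering step in the photometric-stereo baseline can be sketched as below. This is an illustrative sketch, not the evaluation code; the names and shapes are hypothetical.

```python
import numpy as np

def lambertian_render(albedo, normals, light_dir):
    """Diffuse re-rendering from per-pixel photometric-stereo estimates.

    albedo:    (H, W, 3) per-pixel albedo image.
    normals:   (H, W, 3) unit surface normals.
    light_dir: (3,) unit direction toward the query light.
    Clamping n·l at zero keeps back-facing texels unlit.
    """
    shading = np.clip(np.einsum('hwc,c->hw', normals, light_dir), 0.0, None)
    return albedo * shading[..., None]

albedo = np.full((1, 1, 3), 0.8)
normals = np.zeros((1, 1, 3))
normals[..., 2] = 1.0
lit = lambertian_render(albedo, normals, np.array([0.0, 0.0, 1.0]))
unlit = lambertian_render(albedo, normals, np.array([0.0, 0.0, -1.0]))
```

The rigid `albedo × max(0, n·l)` form is exactly the limiting assumption the text identifies: it has no way to express specularities, scattering in hair, or cast shadows.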
Comparisons Against Xu et al. [2018b] The technique of Xu et al. [2018b]
(retrained on our training data) represents another possible candidate for addressing
our problem. This technique, though, does not natively solve our problem: To find the
optimal lighting directions for relighting, it requires as input all 302 high-resolution
images in each OLAT scan in the first step, which significantly exceeds the memory
constraints of modern GPUs. To address this, we first jointly train their Sample-Net
and Relight-Net on our images (downsampled by 4× due to memory constraints) to
identify the eight optimal directions from the 302 directions.
Using those eight optimal directions, we then retrain Relight-Net using the full-
resolution images from our training data, as prescribed by Xu et al. [2018b]. Ta-
ble 3.2 shows that this approach works poorly on our task. This may be because
this technique is built around eight fixed input images and is naturally disadvantaged
compared with our approach that is able to use any of the 302 light stage images as
input. We therefore also evaluate a variant of Xu et al. [2018b], where we use the
same active set selection as used by LSSR to train their Relight-Net. By using our
active set (Section 3.3.1), this enhanced baseline is able to better reason about local
information, which improves the performance as shown in Table 3.2. However, this
baseline still results in flickering artifacts when rendering with moving lights, because
unlike LSSR, it is sensitive to the aliasing induced when images leave and enter the
active set.
3.5.7 Fixed-Viewpoint Relighting
Although NLT can interpolate the LT function in both light and view directions,
here we demonstrate that NLT achieves relighting results similar to those of LSSR
if we query NLT only at novel light directions 𝜔𝑖 . Because NLT requires a geometry
proxy as additional input (to support free-viewpoint relighting; see Section 3.5.8), for
fixed-viewpoint relighting, image-based methods that use no 3D geometry, such as
LSSR, suffice and are more convenient.
First, we quantitatively evaluate our model against the state-of-the-art relighting
solutions and ablations of our model, and report our results in Table 3.3 in terms
of PSNR, SSIM [Wang et al., 2004], and LPIPS [Zhang et al., 2018a]. We see that
NLT outperforms all baselines and ablations, although simple baselines such as diffuse
rendering and barycentric blending also obtain high scores. This appears to be due to
these metrics under-emphasizing high-frequency details and high-order light transport
effects. These results are more easily interpreted using the visualization in Figure 3-
15, where we see that the renderings produced by our approach more closely resemble
the ground truth than those of other models. In particular, our method synthesizes
shadows, specular highlights, and self-occlusions with higher precision when compared
against simple barycentric blending, as well as state-of-the-art neural rendering algorithms
such as Nalbach et al. [2017] and Xu et al. [2018b]. Our approach also produces more
realistic results than the geometric 3D capture pipeline of Guo et al. [2019]. See the
supplemental video for more examples.
Zoom-ins A, B, and C each compare Ground Truth with NLT (ours). Columns: 1. Ground Truth; 2. NLT (ours); 3. Nearest Light; 4. Barycentric Blending; 5. Deep Shading [2017]; 6. Xu et al. [2018]; 7. Relightables [2019].
Figure 3-15: NLT relighting with a directional light. Here we visualize the perfor-
mance of NLT for the task of relighting using directional lights. We show represen-
tative examples of full-body subjects with zoom-ins focusing on cast shadows (A, C)
and facial specular highlights (B). Note how NLT is able to outperform all the other
approaches with sharper and ghosting-free results that are drastically different from
the nearest neighbors.
HDRI Relighting NLT can also be used to relight subjects with arbitrary HDRI environment maps. To do this, we synthesize 331 directional OLAT images that cover the
whole light stage dome. These images are then converted to light stage weights by
approximating each light with a Gaussian around its center, and we produce the
HDRI relighting results by simply using a linear combination of the rendered OLAT
images [Debevec et al., 2000]. As shown in Figure 3-16, we are able to reproduce
view-dependent effects as well as specular highlights with high fidelity, and generate
compelling composites of the subjects in virtual scenes. See the supplemental video
for more examples.
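The conversion from an HDRI map to per-light weights can be sketched as below. This is an illustrative sketch under assumed conventions (Gaussian falloff in angular distance, per-pixel normalization); the function and parameter names are hypothetical.

```python
import numpy as np

def light_weights_from_hdri(env_dirs, env_rgb, env_solid_angles,
                            light_dirs, sigma=0.1):
    """Distribute HDRI energy onto light-stage lights with Gaussian kernels.

    env_dirs:        (P, 3) unit direction of each environment-map pixel.
    env_rgb:         (P, 3) RGB radiance of each pixel.
    env_solid_angles:(P,)  solid angle of each pixel.
    light_dirs:      (L, 3) unit direction of each stage light.
    Returns (L, 3) RGB weights for linearly combining the OLAT renders.
    """
    cos = np.clip(env_dirs @ light_dirs.T, -1.0, 1.0)   # (P, L)
    ang = np.arccos(cos)                                # angular distance
    kernel = np.exp(-0.5 * (ang / sigma) ** 2)          # Gaussian around each light
    kernel /= kernel.sum(axis=1, keepdims=True)         # conserve each pixel's energy
    return kernel.T @ (env_rgb * env_solid_angles[:, None])

# One env pixel sitting exactly on the first of two lights.
w = light_weights_from_hdri(np.array([[0.0, 0.0, 1.0]]),
                            np.array([[1.0, 1.0, 1.0]]),
                            np.array([1.0]),
                            np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]),
                            sigma=0.05)
```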
Figure 3-16: HDRI relighting by NLT. Because NLT can relight a subject with any
directional light, it can be used to render OLAT “bases” that can then be linearly
combined to relight the scene for a given HDRI map (shown as insets) [Debevec et al.,
2000]. The relit subjects exhibit realistic specularities and shadows.
3.5.8 Free-Viewpoint Relighting
So far, we have shown how LSSR and NLT both support relighting from the original
viewpoint. Now we focus on simultaneous relighting and view synthesis, by querying
our NLT model also at novel viewing directions.
A quantitative analysis is presented in Table 3.4, where we see that our approach
outperforms the baselines and is comparable with Thies et al. [2019], which (unlike our
technique) only performs view synthesis and does not enable relighting. A qualitative
analysis is visualized in Figure 3-17. We see that the inferred residuals produced by
NLT are able to account for the non-diffuse, non-local light transport and mitigate the
majority of artifacts in the diffuse base caused by geometric inaccuracy. We see that
renderings from NLT exhibit accurate specularities and sharper details, especially
when compared with other machine learning methods, thereby demonstrating that
our model is able to capture view-dependent effects. See the supplementary video for
more examples.
Simultaneous Relighting & View Synthesis In Figure 3-18, we show the unique
ability of our model to synthesize illumination and viewpoints simultaneously with
an unprecedented quality for human capture. Note that our model’s ability to natu-
rally handle this simultaneous task is a direct consequence of embedding our neural
network within the UV space of the subject. All that is required to enable simultane-
ous relighting and view interpolation is interleaving the training data for both tasks
and training a single instance of our network (more details in Section 3.4.4). Fig-
ure 3-18 shows that our method accurately models shadows and global illumination,
while correctly capturing high-frequency details such as specular highlights. See the
152
A
1. Diffuse Base 2. Pred. Residuals 3. NLT (ours) 4. Ground Truth 5. DNR [2019] 6. Deep Shading 7. Relightables
(+ only) [2017] (UV) [2019]
Figure 3-17: View synthesis by NLT. NLT is able to handle view-dependent specu-
larities (eyes, nose tips, cheeks), high-frequency geometry variation (Subjects B’s and
D’s hair), and global illumination (Subjects A, B, and C’s shirts). We see a sub-
stantial improvement over the state-of-the-art view synthesis method of Thies et al.
[2019] (Column 5), which tends to produce blurry results (the missing specularities
in Subject B’s eyes), and over the recent geometric approach of Guo et al. [2019]
(Column 7), which lacks non-Lambertian material effects.
supplementary video for more examples.
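The interleaving described above can be sketched in a few lines. The sample tags and the simple shuffling scheme below are illustrative assumptions, not the exact pipeline of Section 3.4.4.

```python
import random

def interleave_tasks(relight_samples, view_samples, seed=0):
    """Merge relighting and view-synthesis training samples into one
    shuffled stream, so that a single network instance is trained on
    both tasks. Samples are tagged so a training loop could still log
    per-task metrics."""
    merged = ([("relight", s) for s in relight_samples] +
              [("view", s) for s in view_samples])
    random.Random(seed).shuffle(merged)
    return merged

# Toy usage with sample IDs standing in for (input, target) pairs.
batches = interleave_tasks(["r0", "r1", "r2"], ["v0", "v1"])
```

Because both tasks share one UV-space network, no architectural change is needed; only the training stream mixes the two kinds of supervision.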
The recent work of Mildenhall et al. [2020], Neural Radiance Fields (NeRF),
achieves impressive view synthesis given approximately 100 views of an object. Here
we qualitatively compare NLT against NeRF with 10 levels of positional encoding
for the location and 4 for the viewing direction. NeRF does not require any proxy
geometry, but in this particular setting, it has to work with a limited number of
views (around 55), which are insufficient to capture the full volume. As Figure 3-
19 (left) shows, NLT synthesizes more realistic facial and eye specularity as well as
higher-frequency hair details.
3.6 Discussion
In this section, we present the ablation studies that demonstrate the importance of
each major model component in LSSR and NLT (Section 3.6.1). We then attempt to
answer the interesting question in Section 3.6.2: Above which frequency band does
one need LSSR to achieve high-quality image-based relighting? Section 3.6.3 then
addresses a related question: whether LSSR and NLT can super-resolve a smaller light
stage. Finally, we explore how NLT’s performance degrades as the quality of the
input geometry proxy deteriorates (Section 3.6.4).
Figure 3-18: Simultaneous relighting and view synthesis by NLT. NLT is able to
perform simultaneous relighting and view synthesis, and produces accurate renderings
(including view- and light-dependent effects) for unobserved viewpoints and light
directions. Along the 𝑥-axis we vary illumination, and along the 𝑦-axis we vary the
view. This functionality is enabled by our decision to embed our neural network
architecture within the texture atlas of a subject.
(Left, view synthesis: NeRF, NLT (ours), Ground Truth. Right, simultaneous relighting and view synthesis: NeRF+Light, NLT (ours), Ground Truth.)
Figure 3-19: Comparing NLT against NeRF and NeRF+Light. In view synthesis
(left), NeRF struggles to synthesize realistic facial specularities, high-frequency hair
details, and specularity in the eyes (red boxes in A & B). In simultaneous relighting
and view synthesis (right), the NeRF+Light extension fails to synthesize facial details
(red boxes in C & D) and hard shadows (yellow boxes in C).
3.6.1 Ablation Studies
We first evaluate some ablated versions of LSSR, with results shown in Table 3.2. We
then present comparisons between the full NLT model and its ablated versions, to
demonstrate the contribution of each major model component.
LSSR With Naïve Neighbors In this ablation, we use the 𝑘 = 8 nearest neigh-
bors as our active set during training. This variant leads to a match in the sampling
pattern between our training and validation data, thereby achieving better numeri-
cal performance (Table 3.2). This apparent performance improvement is misleading,
since the validation set has the same regular light layout as the training set, but the
test set presents an irregular sampling pattern (see Section 3.5.2). As such, this variant
suffers from significant overfitting during our real test-time scenario, where the query
light does not fall on the regular hexagonal grid of the light stage. In Figure 3-20,
we visualize the output of this variant and LSSR as a function of the query light
direction. We see that LSSR is able to synthesize a cast shadow that is a smooth
linear function of the query light angle (after accounting for foreshortening, etc.). The
variant, however, fails to synthesize this linearly-varying shadow, due to the aliasing
and overfitting problems described earlier. See the supplemental video for additional
visualization.
Figure 3-20: Continuous directional relighting by LSSR. (a) We show an LSSR rendering for some virtual light with a horizontal angle of 𝜃, and highlight one image strip that includes a horizontal cast shadow. (b) We repeatedly query our model with 𝜃 values that should induce a linear horizontal translation of the shadow boundary in the image plane. By stacking these image strips, we see this linear trend emerge (highlighted in red). (c, d) We do the same for the ablated models without our active set selection procedure or alias-free pooling, and observe that the resulting shadow boundary does not vary smoothly or linearly.
LSSR With Average Pooling In this ablation, we replace the alias-free pooling of our model with simple average pooling. As shown in Table 3.2, ablating this component hurts the performance quantitatively and, more importantly, causes flickering in the real test-time scenario where we smoothly vary our light source (the quantitative evaluation cannot reflect this; the validation approach is not ideal, as all such evaluations will follow the same regular sampling pattern of our light stage, and this evaluation task is therefore more biased than the real task of predicting images away from the sampling pattern of the light stage). Because average pooling assigns non-zero weights to images as they enter and exit the active set, renderings from this model variant contain significant temporal instability. See the supplemental video for examples.
NLT Without Observation Paths Instead of our two-path query and observation network (Section 3.4.2), we can just train the query path of our network without any observation. As shown in Figure 3-21, this ablation struggles to synthesize details for each possible view and lighting condition, and produces oversmoothed results.
NLT Without Residual Learning Instead of using our residual learning ap-
proach (Section 3.4.3), we can allow our network to directly predict the output image.
(Columns, left to right: Ground Truth; NLT (ours); NLT w/o Residuals; NLT w/o Observations; NLT w/o LPIPS [2018].)
Figure 3-21: NLT and its ablated variants for relighting. Removing different compo-
nents of NLT reduces rendering quality: No direct access to the diffuse base makes it
more challenging for the network to learn hard shadows, having no observation path
deprives the network of information from nearby views or lights, and removing the
perceptual loss of Zhang et al. [2018a] blurs the shadow boundary.
As shown in Figure 3-21, not using the diffuse base at all reduces the quality of the
rendered image, likely because the network is then forced to waste its capacity on
inferring shadows and albedo.
The middle ground between no diffuse base and our full method is using the diffuse
bases only as network input, but not for the skip link. Comparing the “Deep Shading”
rows and “NLT w/o obs.” rows of Table 3.3 and Table 3.4 reveals the importance
of the skip connection to diffuse bases: In both relighting and view synthesis, NLT
without observations (which has the skip link) outperforms Deep Shading (which uses
the diffuse bases only as network input) in LPIPS. Our proposed residual learning
scheme allows our model to focus on learning higher-order light transport effects,
which results in more realistic renderings.
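The residual scheme amounts to a skip connection from the diffuse base to the output. A minimal sketch follows, assuming clipping to [0, 1] as the final nonlinearity (the clipping is an assumption of this sketch):

```python
import numpy as np

def render_with_residuals(diffuse_base, predicted_residuals):
    """Residual learning: the network's output is added to a physically
    based diffuse rendering via a skip link, so model capacity is spent
    on higher-order effects (specularities, scattering, global
    illumination) rather than on re-deriving albedo and shadows.
    Clipping to [0, 1] is an assumption of this sketch."""
    return np.clip(diffuse_base + predicted_residuals, 0.0, 1.0)

base = np.full((4, 4, 3), 0.5)   # flat gray diffuse base
residuals = np.zeros((4, 4, 3))
residuals[1, 1] = 0.4            # a bright specular residual at one texel
out = render_with_residuals(base, residuals)
```

In the full model the residuals come from the query path of the network; here they are a toy array purely to make the skip-connection structure explicit.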
NLT Without Perceptual Loss We find that adding a perceptual loss as pro-
posed by Zhang et al. [2018a] helps the network produce higher-frequency details
(such as the hard shadow boundary in Figure 3-21). Quantitative evaluations ver-
ify this observation: Full NLT with the perceptual loss achieves the best perceptual
scores in both tasks of relighting and view synthesis.
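The training objective can be sketched as a reconstruction term plus a weighted perceptual term. Both `perceptual_fn` and the 0.05 weight below are placeholders standing in for the learned LPIPS distance of Zhang et al. [2018a], not the thesis's actual configuration.

```python
import numpy as np

def combined_loss(pred, gt, perceptual_fn, perceptual_weight=0.05):
    """Photometric L1 loss plus a weighted perceptual term. Here
    `perceptual_fn` stands in for a learned LPIPS-style distance, and
    the 0.05 weight is a placeholder, not the thesis's setting."""
    l1 = float(np.mean(np.abs(pred - gt)))
    return l1 + perceptual_weight * float(perceptual_fn(pred, gt))

# A toy perceptual distance (mean difference of 2x-downsampled images),
# used only to make the sketch runnable without a pretrained network.
def toy_perceptual(a, b):
    return float(np.mean(np.abs(a[::2, ::2] - b[::2, ::2])))

pred = np.zeros((4, 4))
gt = np.full((4, 4), 0.2)
loss = combined_loss(pred, gt, toy_perceptual)
```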
3.6.2 Image-Based Relighting Under Varying Light Frequency
We now analyze the image quality gain achieved by LSSR w.r.t. the light frequency.
Specifically, we evaluate for which environmental lighting, and at what frequency, LSSR is necessary for accurate rendering, and conversely how it performs under
low-frequency lighting where previous solutions are adequate. For this purpose, we
use one OLAT scan and render it under 380 high-quality indoor and outdoor lighting probes (downloaded from hdrihaven.com) using both LSSR and Linear Blending.
We then measure the image quality gain from our model by computing DSSIM be-
tween our rendering and that by Linear Blending. We measure the frequency of the
environmental lighting by decomposing it into spherical harmonics (up to degree 50)
and finding the degree below which 90% of the energy can be recovered.
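Assuming the spherical-harmonic coefficients have already been computed (entry 𝑙 holding the 2𝑙 + 1 coefficients of degree 𝑙), the 90%-energy degree can be found as follows; the toy spectrum is illustrative.

```python
import numpy as np

def lighting_frequency(sh_coeffs_per_degree, energy_fraction=0.9):
    """Entry l of `sh_coeffs_per_degree` holds the (2l + 1) spherical
    harmonic coefficients of degree l. Returns the smallest degree L
    such that degrees 0..L capture at least `energy_fraction` of the
    total energy. A band limit of degree L corresponds to (L + 1)^2
    basis functions, e.g. degree 20 -> 441, which is on the order of
    the n = 302 lights on the stage."""
    energies = np.array([float(np.sum(np.square(c)))
                         for c in sh_coeffs_per_degree])
    cumulative = np.cumsum(energies) / energies.sum()
    return int(np.argmax(cumulative >= energy_fraction))

# Toy spectrum whose energy is concentrated at low degrees.
coeffs = [np.ones(2 * l + 1) * 0.5**l for l in range(6)]
deg = lighting_frequency(coeffs)
```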
As shown in Figure 3-22, the benefit of using our model becomes more significant
as the frequency of the lighting increases. For low-frequency lighting (up to degree
15 spherical harmonics), our model produces almost identical results compared with
the traditional Linear Blending method. This is a desired property, showing that our
method reduces gracefully to Linear Blending for low-frequency lighting, and thus
produces high-quality results for both low- and high-frequency lighting. As the fre-
quency of the lighting becomes higher, LSSR’s rendering contains sharper and more
accurate shadows without ghosting artifacts. Note that there is some variation among
the environment maps as expected; even a very high-frequency environment could co-
incidentally have its brightest lights aligned with one of the lights on the stage, leading
to low errors in Linear Blending and comparable results to our method. Neverthe-
less, the trend is clear in Figure 3-22 with many high-frequency probes requiring our
algorithm for lighting super-resolution.
According to the plot in Figure 3-22, we conclude that LSSR is necessary when the
lighting frequency is equal to or greater than about 20 (that is, more than 21² = 441
basis functions). This number is on the same order as the number of lights on the
stage (𝑛 = 302). Therefore, our frequency analysis is consistent with intuition:
If the lighting cannot be recovered using the limited light bases on the light stage,
Figure 3-22: Quality gain by LSSR w.r.t. lighting frequency. In the top plot, each
blue dot represents a light probe. We render a portrait under this lighting using both
linear blending and LSSR, and measure the image differences using SSIM to evaluate
the quality gain achieved by our algorithm. The improvement becomes more apparent when the lighting contains more high-frequency content. In the bottom figure, we
compare the rendered images using LSSR and linear blending under lighting with
different frequencies. Our model produces similar results to linear blending when
the lighting variation is low-frequency (left two columns). As the lighting becomes
higher-frequency, LSSR produces better rendering with fewer artifacts and sharper
shadows (right two columns).
then LSSR is required to generate denser bases to accurately render the shadow and
highlights.
video for the qualitative comparison.
Figure 3-24: LSSR vs. linear blending: relighting with sparser lights. As we decrease the number of available lights from 𝑛 = 302 to 𝑛 = 100, the quality of LSSR's rendered shadow degrades slowly. Linear blending, in contrast, is unable to produce an accurate rendering even with access to all lights.
train a relighting model, we observe ghosting shadows in our relit rendering (yellow
arrow), similar to those produced by barycentric blending.
3.6.4 Degrading the Input Geometry Proxy
Here we analyze how our model performs with respect to different factors. We show
that as the geometry degrades, our neural rendering approach consistently outper-
forms traditional reprojection-based methods, which heavily rely on the geometry
quality. In relighting, we show that our model performs reasonably when the number
of illuminants is reduced, demonstrating the potential applicability of NLT to smaller
light stages.
Because NLT leverages a geometry proxy to generate a texture parameterization,
we study its robustness against geometry degradation in the context of view synthesis.
We decimate our mesh progressively from the original 100,000 vertices down to only
500 vertices (bottom left of Figure 3-26). At each mesh resolution, we train one
NLT model with 𝐾 = 3 nearby views and evaluate it on the held-out views. With
the geometry proxy, one can also reproject nearby observed views to the query view,
followed by different types of blending [Buehler et al., 2001, Eisemann et al., 2008].
We compare NLT against Eisemann et al. [2008] at each decimation level.
Figure 3-26: Performance of NLT w.r.t. quality of the geometry proxy. As we decimate the geometry proxy from 100,000 vertices down to only 500 vertices, NLT
remains performant in terms of LPIPS (lower is better; bands indicate 95% confi-
dence intervals), while Floating Textures, a reprojection-based method, suffers from
the low quality of the geometry proxy, producing missing pixels (e.g., in the hair) and
misplaced high-frequency patterns (e.g., shadow boundaries), as highlighted by the
yellow arrows. Both NLT and Floating Textures use the same three nearby views.
As Figure 3-26 shows, even at the extreme decimation level of 500 vertices, NLT
produces reasonable rendering with no missing pixels, because it has learned to hallu-
cinate pixels that are non-visible from any of the nearby views. In contrast, Floating
Textures [Eisemann et al., 2008] leaves missing pixels unfilled (e.g., in the hair) due
to reprojection errors stemming from the rough geometry proxy. As the geometry
proxy gets more accurate, Floating Textures improves but still struggles to render
high-frequency patterns correctly (such as the shadow boundary beside the nose,
highlighted by a yellow arrow), even at the original mesh resolution. In comparison,
the high-frequency patterns in the NLT rendering match the ground truth. Quanti-
tatively, NLT also outperforms Floating Textures in terms of LPIPS (lower is better)
across all mesh resolutions.
3.7 Conclusion
The light stage is a crucial tool for enabling the image-based relighting of human
subjects in novel environments, but as we have demonstrated, light stage scans are
undersampled w.r.t. the angle of incident light, which means that synthesizing virtual
lights by simply combining images would result in ghosting in shadows and specu-
lar highlights. We have presented a learning-based solution, Light Stage Super-
Resolution (LSSR) [Sun et al., 2020], for super-resolving light stage scans, thereby
allowing us to create a “virtual” light stage with a much higher angular lighting res-
olution and therefore render accurate shadows and highlights under high-frequency
lighting.
Our network works by embedding input images from the light stage into a learned
space where network activations can then be averaged, and then decoding those acti-
vations according to some query light direction to reconstruct an image. In construct-
ing LSSR, we have identified two critical issues: an overly regular sampling pattern
in the light stage training data and aliasing introduced when pooling activations of a
set of nearest neighbors. These issues are addressed through our use of dropout-like
supersampling of neighbors in our active set and our alias-free pooling technique. By
combining ideas from conventional linear interpolation with the expressive power of
deep neural networks, LSSR is able to produce renderings where shadows and highlights move smoothly as a function of the light direction.
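A simplified sketch of alias-free pooling follows: blend weights decay smoothly to zero at the boundary of the active set, so a neighbor's contribution vanishes before it drops out of the set. The cosine falloff is an illustrative choice, not necessarily LSSR's exact kernel.

```python
import numpy as np

def alias_free_weights(query_dir, neighbor_dirs):
    """Blend weights over the active set of neighboring lights. Each
    weight falls smoothly to zero as a neighbor approaches the edge of
    the active set, so pooled network activations (and hence
    renderings) vary continuously with the query light direction."""
    # Angular distance between the query and each active light.
    dists = np.arccos(np.clip(neighbor_dirs @ query_dir, -1.0, 1.0))
    d_max = dists.max()  # boundary of the active set
    # Cosine falloff: 1 at the query direction, exactly 0 at the boundary.
    raw = 0.5 * (1.0 + np.cos(np.pi * dists / d_max))
    return raw / raw.sum()

query = np.array([0.0, 0.0, 1.0])
angles = [0.1, 0.3, np.pi]  # angular offsets of three active lights
neighbors = np.array([[np.sin(a), 0.0, np.cos(a)] for a in angles])
w = alias_free_weights(query, neighbors)
```

In the full model these weights average encoder activations, which are then decoded into the output image; with plain averaging instead, a neighbor would enter or leave the set with non-zero weight, producing the temporal flicker described above.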
This work is by no means the final word for the task of light stage super-resolution
or image-based relighting. Approaches similar to ours could be applied to other
general light transport acquisition problems, to other physical scanning setups, or to
other kinds of objects besides human subjects. Although our network can work on
inputs with different image resolutions, GPU memory has been a major bottleneck in applying our approach to images with much higher resolutions, such as 4K. A much more memory-efficient approach to light stage super-resolution will be needed for production-level use in the visual effects industry.
Although we exclusively pursue the One-Light-at-A-Time (OLAT) scanning ap-
proach with our light stage, alternative patterns where multiple lights are active
simultaneously could be explored, which may enable a sparser light stage design. De-
spite the undersampling of the light stage being self-evident in our visualizations, it
may be interesting to develop a formal theory of this undersampling w.r.t. materials
and camera resolution, so as to understand what degree of undersampling can be
tolerated in the limit. We have made a first step in this direction with the graph in
Figure 3-22. We believe that LSSR represents an exciting direction for future research
and has the potential to further cut the cost for reproducing accurate high-frequency
relighting effects.
What remains unaddressed by LSSR is viewpoint change, i.e., the task of view synthesis. We therefore proposed Neural Light Transport (NLT) [Zhang et al., 2021b], a semi-parametric deep learning framework that supports both viewpoint and light interpolation, and thus allows for simultaneous relighting and view synthesis of full-body scans of human subjects.
Our approach is enabled by prior work [Guo et al., 2019] that provides a method
for recovering geometric models and texture atlases, and uses as input OLAT images
captured by a light stage. Our model works by embedding a deep neural network into
the UV texture space provided by a mesh and texture atlas, and then training that
model to synthesize texture-space RGB images corresponding to observed light and
viewing directions. Our model consists of a dual-path neural network architecture for
aggregating information from observed images and synthesizing new images, which
is further enhanced through the use of augmented texture-space inputs that leverage
insights from conventional graphics techniques and a residual learning scheme that
allows training to focus on higher-order light transport effects such as highlights, scat-
tering, and global illumination. Multiple comparisons and experiments demonstrate
clear improvement over previous specialized relighting or view synthesis solutions,
and our approach additionally enables simultaneous relighting and view synthesis.
Our method has occasional failure modes as shown in Figure 3-27, where complex
light transport effects, such as the ones on the glittery chain, are hard to synthesize,
and the final renderings lack high-frequency details.
Figure 3-27: A failure case of NLT's view synthesis. NLT may fail to synthesize views of complicated light transport effects such as those on the glittery chain. (Left: Ground Truth; right: NLT (ours).)
Similar to recent neural rendering approaches [Lombardi et al., 2018, 2019, Thies
et al., 2019, Sitzmann et al., 2019a, Mildenhall et al., 2020], NLT must be trained
individually per scene, and generalizing to unseen scenes is an important future step
for the field. In addition, neural rendering of dynamic scenes is desirable, especially
in the case of human subjects. Using a fixed texture atlas may directly enable our
method to work for dynamic performers.
Additionally, the fixed 1024×1024 resolution of our texture-space model limits our
model’s ability to synthesize higher-frequency content, especially when the camera
zooms very close to the subject, or when an image patch is allocated too few texels
(see the hair artifact in Figure 3-17 [D]). This could be solved by training on higher-
resolution images, but this would increase memory requirements and likely require
significant engineering effort.
Chapter 4
reconstructs 3D shapes from a single-view RGB image, and demonstrate how GenRe
is capable of reconstructing shapes from novel class categories unseen during training.
We also perform additional analyses, in Section 4.7, to study whether any object detector
emerges naturally from training ShapeHD’s network for shape reconstruction, how the
naturalness loss adds structural details to the ShapeHD output, when ShapeHD tends
to fail, how the input viewpoint affects GenRe’s generalization power, and finally
whether GenRe is able to reconstruct non-rigid shapes and simple shape primitives
when trained only on cars, chairs, and airplanes.
4.1 Introduction
In this chapter, we aim to push the limits of 3D shape completion from a single depth
image and of 3D shape reconstruction from a single color image. Specifically, our
goals are to develop models that achieve high-quality reconstruction with structural
details (ShapeHD [Wu et al., 2018]) and generalize beyond the training shape classes
to unseen shape categories (GenRe [Zhang et al., 2018b]). Towards these goals, we
built Pix3D [Sun et al., 2018b], a real-world dataset of images and the 3D shapes of the objects in them, with pixel-level alignment.
Recently, researchers have made impressive progress on these tasks [Choy
et al., 2016, Tulsiani et al., 2017, Dai et al., 2017], making use of gigantic 3D datasets
[Chang et al., 2015, Xiang et al., 2014, 2016]. Many of these methods tackle the ill-
posed nature of the problem by using deep convolutional networks to regress possible
3D shapes. Leveraging the power of deep networks, their systems learn to avoid
producing implausible shapes (Figure 4-1 [b]). However, from Figure 4-1 (c) we
see that there is still ambiguity that a supervised network fails to model: From
just one view (Figure 4-1 [a]), there exist multiple natural shapes that explain the
observation equally well. In other words, there is no deterministic ground truth for
each observation. Through pure supervised learning, the network tends to generate
blurry “mean shapes” that minimize the expected loss, precisely because of this ambiguity.
To tackle this issue and enable higher-quality reconstruction with structural de-
(a) Observation (b) Unnatural Shapes
Figure 4-1: Two levels of ambiguity in single-view 3D shape perception. For each
2D observation (a), there exist many possible 3D shapes that explain this observa-
tion equally well (b, c), but only a small fraction of them correspond to real, daily
shapes (c). Methods that exploit deep networks for recognition reduce, to a certain
extent, ambiguity on this level. By using an adversarially learned naturalness model,
ShapeHD aims to model ambiguity on the next level: Even among the realistic shapes,
there are still multiple shapes explaining the observation well (c).
2015]. Because the problem is well-known to be ill-posed—there exist many 3D expla-
nations for any 2D visual observation—modern systems have explored looping in var-
ious structures into this learning process. For example, Wu et al. [2017] use intrinsic
images or 2.5D sketches [Marr, 1982] as an intermediate representation, and concatenate two learned mappings for shape reconstruction: 𝑓2D→3D = 𝑓2.5D→3D ∘ 𝑓2D→2.5D .
These methods, however, ignore the fact that mapping a 2D image or a 2.5D
sketch to a 3D shape involves complex but deterministic geometric projections (see
Section 1.1.4). Simply using a neural network to approximate these projections,
instead of modeling this mapping explicitly, leads to inference models that are overparametrized (and hence prone to overfitting to the training classes). It also misses valuable inductive biases that can be wired in through such projections. Both of these
factors contribute to poor generalization to unseen classes.
In contrast to these artificial systems, humans can imagine, from just a single
image, the full 3D shape of a novel object it has never seen before. Vision researchers
have long argued that the key to this ability may be a sophisticated hierarchy of
representations, extending from images through surfaces to volumetric shape, which
process different aspects of shape in different representational formats [Marr, 1982].
In the remainder of this chapter, we explore how these ideas can be integrated into
single-image 3D shape reconstruction to enable generalization to novel shape classes
unseen during training.
To this end, we propose to disentangle geometric projections from shape recon-
struction to better generalize to unseen shape categories. Building upon the MarrNet
framework [Wu et al., 2017], we further decompose 𝑓2.5D→3D into a deterministic ge-
ometric projection 𝑝 from 2.5D to a partial 3D model and a learnable completion 𝑐
of the 3D model. A straightforward version of this idea would be to perform shape
completion in the 3D voxel grid: 𝑓2.5D→3D = 𝑐3D→3D ∘ 𝑝2.5D→3D . However, shape com-
pletion in 3D is challenging, as the manifold of plausible shapes is sparser in 3D than
in 2D, and empirically this fails to reconstruct shapes well.
Instead we perform completion based on spherical maps. Spherical maps are
surface representations defined on the UV coordinates of a unit sphere, where the
value at each coordinate is calculated as the minimal distance travelled from this point
to the 3D object surface along the sphere’s radius. Such a representation combines
appealing features of 2D and 3D: Spherical maps are a form of 2D images, on which
neural inpainting models work well; but they have a semantics that allows them to
be projected into 3D to recover full shape geometry. They essentially allow us to
complete non-visible object surfaces from visible ones, as a further intermediate step
to full 3D reconstruction. We now have 𝑓2.5D→3D = 𝑝S→3D ∘ 𝑐S→S ∘ 𝑝2.5D→S , where S
stands for spherical maps.
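The projection 𝑝2.5D→S can be illustrated by marching inward along each radius of the unit sphere over an occupancy grid and recording the distance to the first hit. The grid extent, sampling resolutions, and marching scheme below are illustrative assumptions, not GenRe's actual implementation.

```python
import numpy as np

def spherical_map(voxels, n_theta=16, n_phi=32, n_steps=64):
    """For each direction on the unit sphere, march inward along the
    radius from the sphere toward the center and record the distance
    traveled before hitting the first occupied voxel. Unhit rays are
    assigned the full radius. The grid is assumed to span [-1, 1]^3."""
    res = voxels.shape[0]
    smap = np.ones((n_theta, n_phi))  # 1 = traveled the whole radius
    thetas = (np.arange(n_theta) + 0.5) * np.pi / n_theta
    phis = (np.arange(n_phi) + 0.5) * 2.0 * np.pi / n_phi
    for i, t in enumerate(thetas):
        for j, p in enumerate(phis):
            d = np.array([np.sin(t) * np.cos(p),
                          np.sin(t) * np.sin(p),
                          np.cos(t)])
            for s in range(n_steps):        # march from radius 1 to 0
                r = 1.0 - s / (n_steps - 1)
                idx = np.clip(((d * r + 1.0) / 2.0 * res).astype(int),
                              0, res - 1)
                if voxels[idx[0], idx[1], idx[2]]:
                    smap[i, j] = 1.0 - r    # distance traveled
                    break
    return smap

# Toy shape: a solid ball of radius 0.5 centered in a [-1, 1]^3 grid.
res = 32
centers = (np.arange(res) + 0.5) / res * 2.0 - 1.0
X, Y, Z = np.meshgrid(centers, centers, centers, indexing="ij")
ball = X**2 + Y**2 + Z**2 <= 0.25
m = spherical_map(ball, n_theta=8, n_phi=8)
```

Inverting the map (𝑝S→3D) recovers a surface point at radius 1 − smap[i, j] along each direction, which is what allows an inpainted spherical map to be projected back into a full 3D shape.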
Shape completion is an essential task in geometry processing and has wide applica-
tions. Traditional methods have attempted to complete shapes with local surface
primitives, or to formulate it as an optimization problem [Nealen et al., 2006, Sorkine
and Cohen-Or, 2004], e.g., Poisson surface reconstruction solves an indicator function
on a voxel grid via the Poisson equation [Kazhdan and Hoppe, 2013, Kazhdan et al.,
2006]. Recently, there have also been a growing number of papers on exploiting shape
structures and regularities [Mitra et al., 2006, Thrun and Wegbreit, 2005] and papers
on leveraging strong database priors [Sung et al., 2015, Li et al., 2015, Brock et al.,
2016]. These methods, however, often require the database to contain exact parts of
the shape, and thus have limited generalization power.
With the advances in large-scale shape repositories like ShapeNet [Chang et al.,
2015], researchers began to develop fully data-driven methods, some building upon
deep convolutional networks. To name a few, Voxlets [Firman et al., 2016] employs
random forests for predicting unknown voxel neighborhoods. Wu et al. [2015] use
a deep belief network to obtain a generative model for a given shape database, and
Thanh Nguyen et al. [2016] extend the method for mesh repairing.
Probably the most related paper to ShapeHD is 3D-EPN [Dai et al., 2017], which
achieves impressive results on 3D shape completion from partial depth scans by leveraging 3D convolutional networks and non-parametric patch-based shape synthesis meth-
ods. ShapeHD has advantages over 3D-EPN in two aspects. First, with a naturalness
ods. ShapeHD has advantages over 3D-EPN in two aspects. First, with a naturalness
loss, ShapeHD can choose among multiple hypotheses that explain the observation,
therefore reconstructing a high-quality 3D shape with fine details; in contrast, the
output from 3D-EPN without non-parametric shape synthesis is often blurry. Sec-
ond, our completion takes a single feed-forward pass without any postprocessing, and
is thus much faster (<100 ms) than 3D-EPN.
The problem of recovering the object shape from a single image is challenging, as it
requires both powerful recognition systems and prior shape knowledge. As an early
attempt, Huang et al. [2015] propose to borrow shape parts from existing Computer-
Aided Design (CAD) models. With the development of large-scale shape repositories
like ShapeNet [Chang et al., 2015] and methods like deep convolutional networks,
researchers have built more scalable and efficient models in recent years [Kar et al.,
2015, Choy et al., 2016, Girdhar et al., 2016, Rezende et al., 2016, Tatarchenko et al.,
2016, Wu et al., 2016, Yan et al., 2016, Häne et al., 2017, Novotny et al., 2017,
Tulsiani et al., 2017, Wu et al., 2017]. While most of these approaches encode objects
in voxels from vision, there have also been attempts to reconstruct objects in point
clouds [Fan et al., 2017, Groueix et al., 2018] or octrees [Riegler et al., 2017a,b,
Tatarchenko et al., 2017]. The shape priors learned in these approaches, however, are
in general only applicable to their training classes, with very limited generalization
power for reconstructing shapes from unseen categories. In contrast, GenRe exploits
2.5D sketches and spherical representations for better generalization to objects outside
training classes.
4.2.3 2.5D Sketch Recovery
A related direction is to estimate 2.5D sketches (e.g., depth and surface normal maps)
from an RGB image. The origin of intrinsic image estimation dates back to the early
years of computer vision [Barrow and Tenenbaum, 1978]. Over the years, researchers
have explored recovering 2.5D sketches from texture, shading, or color images [Horn
and Brooks, 1989, Zhang et al., 1999, Weiss, 2001, Tappen et al., 2003, Bell et al., 2014,
Barron and Malik, 2014]. With the development of depth sensors [Izadi et al., 2011]
and larger-scale RGB-D datasets [Silberman et al., 2012, Song et al., 2017, McCormac
et al., 2017], there have also been papers on estimating depth [Eigen and Fergus, 2015,
Chen et al., 2016], surface normals [Wang et al., 2015, Bansal and Russell, 2016],
and other intrinsic images [Janner et al., 2017, Shi et al., 2017] with deep networks.
Inspired by MarrNet [Wu et al., 2017], we reconstruct 3D shapes via modeling 2.5D
sketches but incorporating a naturalness loss for much higher quality, and focus on
reconstructing shapes from novel shape categories unseen during training.
Although adversarial modeling of the 3D shape space may resolve the ambiguity
discussed earlier, its training can be challenging [Dai et al., 2017]. For this reason,
when Gwak et al. [2017] explored adversarial networks for single-image
3D reconstruction, they opted to use GANs to model 2D projections instead of 3D
shapes. This weakly supervised setting, however, hampers their reconstructions. In
ShapeHD, we develop our naturalness loss by adversarial modeling of the 3D shape
space, outperforming the state of the art significantly.
Spherical projections have been shown effective in 3D shape retrieval [Esteves et al.,
2018], classification [Cao et al., 2017], and finding possible rotational as well as reflec-
tive symmetries [Kazhdan et al., 2004, 2002]. Recent papers [Cohen et al., 2018, 2017]
have studied differentiable, spherical convolution on spherical projections, aiming to
preserve rotational equivariance within a neural network. These designs, however,
perform convolution in the spectral domain with limited frequency bands, causing
aliasing and loss of high-frequency information. In particular, convolution in the
spectral domain is not suitable for shape reconstruction where the quality highly de-
pends on the high-frequency components. In addition, the ringing effects caused by
aliasing would introduce undesired artifacts.
In computer vision, many attempts have been made to tackle the problem of
few-shot recognition. We refer readers to the review article by Xian et al. [2017] for a
comprehensive list. A number of earlier papers have explored sharing features across
categories to recognize new objects from a few examples [Bart and Ullman, 2005,
Torralba et al., 2007, Farhadi et al., 2009, Lampert et al., 2009]. More recently, many
researchers have begun to study zero- or few-shot recognition with deep networks
[Antol et al., 2014, Akata et al., 2016, Wang and Hebert, 2016, Hariharan and Girshick, 2017, Wang et al., 2017]. In particular, Peng et al. [2015] explored the idea of
learning to recognize novel 3D models via domain adaptation.
While these proposed methods are for recognizing and categorizing images or
shapes, in GenRe we explore reconstructing the 3D shape of an object from unseen
classes. This problem has received little attention in the past, possibly due to its
considerable difficulty. A few imaging systems have attempted to recover 3D shapes
from single shots by making use of special cameras [Proesmans et al., 1996, Sagawa
et al., 2011]. In contrast, we study 3D reconstruction from a single RGB image.
Very recently, researchers have begun to look at the generalization power of 3D re-
construction algorithms [Rock et al., 2015, Jayaraman et al., 2018, Funk and Liu,
2017, Shin et al., 2018]. Here we present a novel approach that makes use of spherical
representations for better generalization.
For decades, researchers have been building datasets of 3D objects, either as a reposi-
tory of 3D CAD models [Bogo et al., 2014, Shilane et al., 2004, Bronstein et al., 2008]
or as images of 3D shapes with pose annotations [Leibe and Schiele, 2003, Savarese
and Fei-Fei, 2007]. Both directions have witnessed the rapid development of web-scale
databases: ShapeNet [Chang et al., 2015] was proposed as a large repository of more
than 50k models covering 55 categories, and Xiang et al. [2014] built Pascal 3D+ and
ObjectNet3D [Xiang et al., 2016], two large-scale datasets with alignment between
2D images and the 3D shapes inside. While these datasets have helped in advancing
the field of 3D shape modeling, they have their respective limitations: Datasets like
ShapeNet or Elastic2D3D [Lahner et al., 2016] do not have real images, and recent 3D
reconstruction challenges using ShapeNet have to be exclusively on synthetic images
[Yi et al., 2017]; Pascal 3D+ and ObjectNet3D have only rough alignment between
images and shapes, because objects in the images are matched to a pre-defined set of
CAD models, not their actual shapes. This has limited their usage as a benchmark
for 3D shape reconstruction [Tulsiani et al., 2017].
With depth sensors like Kinect [Izadi et al., 2011, Janoch et al., 2011], the commu-
nity has built various RGB-D or depth-only datasets of objects and scenes. We refer
readers to the review article from Firman [Firman, 2016] for a comprehensive list.
Among those, many object datasets are designed for benchmarking robot manipula-
tion [Calli et al., 2015, Hodan et al., 2017, Lai et al., 2011, Singh et al., 2014]. These
datasets often contain a relatively small set of hand-held objects in front of clean back-
grounds. Tanks and Temples [Knapitsch et al., 2017] is an exciting new benchmark
with 14 scenes, designed for high-quality, large-scale, multi-view 3D reconstruction.
In comparison, our dataset, Pix3D [Sun et al., 2018b], focuses on reconstructing a 3D
object from a single image, and contains many more real-world objects and images.
Probably the dataset closest to Pix3D is the large collection of object scans from
Choi et al. [2016], which contains a rich and diverse set of shapes, each with an RGB-
D video. Their dataset, however, is not ideal for single-image 3D shape modeling for
two reasons. First, the object of interest may be truncated throughout the video;
this is especially the case for large objects like sofas. Second, their dataset does
not explore the various contexts that an object may appear in, as each shape is only
associated with one scan. In Pix3D, we address both problems by leveraging powerful
web search engines and crowdsourcing.
Another closely related benchmark is IKEA [Lim et al., 2013], which provides
accurate alignment between images of IKEA objects and 3D CAD models. This
dataset is therefore particularly suitable for fine pose estimation. However, it contains
only 759 images and 90 shapes, relatively small for shape modeling.¹ In contrast,
Pix3D contains 10,069 images (13.3×) and 395 shapes (4.4×) with greater variation.
In this section, we briefly present how ShapeHD [Wu et al., 2018] achieves high-quality
single-image 3D shape reconstruction with fine details by incorporating adversarially
learned priors into MarrNet [Wu et al., 2017].
ShapeHD consists of three components: a 2.5D sketch estimator and a 3D shape
estimator that predicts a 3D shape from an RGB image via 2.5D sketches (Figure 4-3
¹ Only 90 of the 219 shapes in the IKEA dataset have associated images.
[I, II], inspired by MarrNet), and a deep naturalness model that penalizes the shape
estimator if the predicted shape is unnatural (Figure 4-3 [III]). Models trained with a
supervised reconstruction loss alone often generate blurry mean shapes. Our learned
naturalness model helps in avoiding this issue.
Figure 4-3: ShapeHD model. For single-view shape reconstruction, ShapeHD com-
prises three components: (I) a 2.5D sketch estimator that predicts depth, surface
normal, and silhouette images from a single image, (II) a 3D shape completion mod-
ule that regresses 3D shapes from silhouette-masked depth and surface normal images,
and (III) an adversarially pretrained convolutional network that serves as the natu-
ralness loss function. While finetuning the 3D shape completion network, we use two
losses: a supervised loss on the output shape and a naturalness loss offered by the
pretrained discriminator.
2.5D Sketch Estimation Network Our 2.5D sketch estimator has an encoder-
decoder structure that predicts the object’s depth, surface normals, and silhouette
from an RGB image (Figure 4-3 [I]). We use a ResNet-18 [He et al., 2016] to encode
a 256 × 256 image into 512 feature maps of size 8 × 8. The decoder consists of four
transposed convolutional layers with a kernel size of 5×5, and a stride and padding of
2. The predicted depth and surface normal images are then masked by the predicted
silhouette and used as the input to our shape completion network.
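As a quick check of the decoder geometry, the standard transposed-convolution size formula shows how kernel 5, stride 2, and padding 2 scale the feature maps. Whether the actual layers use `output_padding` is not stated in the text; it is assumed below to make each layer double the resolution exactly:

```python
def tconv_out(size, kernel, stride, padding, output_padding=0):
    """Per-dimension output spatial size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 8
for _ in range(4):  # four decoder layers: kernel 5, stride 2, padding 2
    size = tconv_out(size, kernel=5, stride=2, padding=2, output_padding=1)
print(size)  # 128: with output_padding=1, each layer exactly doubles 8x8
```

Without `output_padding`, the same configuration yields 15 instead of 16 after the first layer, which is why frameworks expose that extra argument.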
3D Shape Completion Network Our shape completion network encodes the silhouette-masked, four-channel 256 × 256 image (one channel for depth and three for surface normals) into a 200-D latent
vector. The vector then goes through a decoder of five transposed convolutional and
ReLU layers to generate a 128×128×128 voxelized shape. Binary cross-entropy losses
between predicted and target voxels are used as the supervised loss ℓvoxel .
Shape Naturalness Network For the naturalness loss, we adversarially pretrain a 3D discriminator using the Wasserstein GAN objective with gradient penalty:

$$\ell_{\text{WGAN}} = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\!\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right], \tag{4.1}$$
where 𝐷 is the discriminator, 𝑃𝑔 and 𝑃𝑟 are distributions of generated shapes and real
shapes, respectively. The last term is the gradient penalty from Gulrajani et al. [2017].
During training, the discriminator attempts to minimize the overall loss ℓWGAN , while
the generator attempts to maximize the loss via the first term in Equation 4.1. We can therefore define our naturalness loss as $\ell_{\text{natural}} = -\mathbb{E}_{\tilde{x} \sim P_c}[D(\tilde{x})]$, where $P_c$ is the distribution of shapes reconstructed by our completion network.
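Equation 4.1 can be sanity-checked on a toy critic. For a linear critic $D(x) = w \cdot x$, the gradient with respect to any input is exactly $w$, so the penalty term reduces to $(\|w\|_2 - 1)^2$. The NumPy sketch below is illustrative only; the real critic is a 3D convolutional network trained with automatic differentiation:

```python
import numpy as np

def wgan_gp_loss(w, fake, real, lam=10.0):
    """Equation 4.1 for the toy linear critic D(x) = w . x, whose gradient
    w.r.t. any input is w, making the gradient penalty exactly
    (||w||_2 - 1)^2 with no autograd needed."""
    d_fake = (fake @ w).mean()   # expectation over generated shapes
    d_real = (real @ w).mean()   # expectation over real shapes
    penalty = (np.linalg.norm(w) - 1.0) ** 2
    return d_fake - d_real + lam * penalty
```

With a unit-norm $w$ the penalty vanishes, and the loss is just the mean critic gap between generated and real samples.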
We train our network in two stages. We first pretrain the three components of our
model separately. The shape completion network is then fine-tuned with both voxel
and naturalness losses.
Our 2.5D sketch estimation network and 3D completion network are trained with
images rendered with ShapeNet [Chang et al., 2015] objects (see Section 4.6.1 and
Section 4.6.5 for details). We train the 2.5D sketch estimator using an ℓ2 loss and
Stochastic Gradient Descent (SGD) with a learning rate of 0.001 for 120 epochs.
We only use the supervised loss ℓvoxel for training the 3D estimator at this stage,
again with SGD, a learning rate of 0.1, and a momentum of 0.9 for 80 epochs. The
naturalness network is trained in an adversarial manner, where we use Adam [Kingma
and Ba, 2015] with a learning rate of 0.001 and a batch size of 4 for 80 epochs. We
set 𝜆 = 10 as suggested by Gulrajani et al. [2017].
We then fine-tune our completion network with both voxel loss and naturalness
losses as ℓ = ℓvoxel + 𝛼ℓnatural . We compare the scale of gradients from the losses and
train our completion network with 𝛼 = 2.75 × 10−11 using SGD for 80 epochs. Our
model is robust to these parameters; they are only for ensuring gradients of various
losses are of the same magnitude.
4.4 Method: Generalizing to Unseen Classes
Single-image reconstruction algorithms learn a parametric function 𝑓2D→3D that maps
a 2D image to a 3D shape. We tackle the problem of generalization to novel shape
classes unseen during training, by regularizing 𝑓2D→3D . The key regularization that we
impose is to factorize 𝑓2D→3D into geometric projections and learnable reconstruction
modules.
Our Generalizable Reconstruction (GenRe) model [Zhang et al., 2018b] consists of
three learnable modules, connected by geometric projections as shown in Figure 4-4.
The first module is a single-view depth estimator 𝑓2D→2.5D (Figure 4-4 [a]), which takes
a color image as input and estimates its depth map. As the depth map can be
interpreted as the visible surface of the object, the reconstruction problem becomes
predicting the object’s complete surface given this partial estimate.
Figure 4-4: GenRe model. Our model for generalizable single-image 3D reconstruc-
tion (GenRe) has three components: (a) a depth estimator that predicts depth in
the original view from a single RGB image, (b) a spherical inpainting network that
inpaints a partial, single-view spherical map, and (c) a voxel refinement network that
integrates two backprojected 3D shapes (from the inpainted spherical map and from
depth) to produce the final output.
We use a voxel refinement module (Figure 4-4 [c]) to tackle this problem. It takes two 3D shapes as input, one
projected from the inpainted spherical map and the other from the estimated depth
map, and outputs a final 3D shape.
The first component of our network predicts a depth map from an image with a clean
background. Using depth as an intermediate representation facilitates the reconstruc-
tion process by distilling essential geometric information from the input image [Wu
et al., 2017].
Further, depth estimation is a class-agnostic task: Shapes from different classes
often share common geometric structure, despite distinct visual appearances. Take
beds and cabinets as examples. Although they are of different anatomy in general,
both have perpendicular planes and hence similar patches in their depth images. We
demonstrate this both qualitatively and quantitatively in Section 4.6.6.
With spherical maps, we cast the problem of 3D surface completion into 2D spherical
map inpainting. Empirically we observe that networks trained to inpaint spherical
maps generalize well to new shape classes (Figure 4-5). Also, compared with voxels,
spherical maps are more efficient to process, as 3D surfaces are sparse in nature;
quantitatively, as we demonstrate in Section 4.6.7 and Section 4.6.8, using spherical
maps results in better performance.
As spherical maps are signals on the unit sphere, it is tempting to use network
architectures based on spherical convolution [Cohen et al., 2018]. They are however
not suitable for our task of shape reconstruction. This is because spherical convolu-
tion is conducted in the spectral domain. Every conversion to and from the spectral
domain requires capping the maximum frequency, causing extra aliasing and infor-
mation loss. For tasks such as recognition, the information loss may be negligible
compared with the advantage of rotational invariance offered by spherical convolu-
Figure 4-5: GenRe’s spherical inpainting module generalizing to new classes. Trained
on chairs, cars, and planes, the module completes the partially visible leg of the table
(red boxes) and the unseen cabinet bottom (purple boxes) from partial spherical maps
projected from ground-truth depth.
tion. But for reconstruction, the loss leads to blurred output with only low-frequency
components. We empirically find that standard convolution works much better than
spherical convolution under our setup.
Single-View Depth Estimator Our depth estimator is an encoder-decoder network, with a ResNet-18 [He et al., 2016] encoder encoding a 256 × 256 RGB image into 512 feature maps of size 1 × 1. The decoder is a
mirrored version of the encoder, replacing all convolution layers with transposed con-
volution layers. In addition, we adopt the U-Net structure [Ronneberger et al., 2015]
and feed the intermediate outputs of each block of the encoder to the corresponding
block of the decoder. The decoder outputs the depth map in the original view at the
resolution of 256 × 256. We use an ℓ2 loss between predicted and target images.
Spherical Map Inpainting Network The spherical map inpainting network has
a similar architecture as the single-view depth estimator. To reduce the gap between
standard and spherical convolutions, we use periodic padding to both inputs and
training targets in the longitude dimension, making the network aware of the periodic
nature of spherical maps.
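Such periodic padding can be sketched with NumPy's `wrap` mode along the longitude axis. The text does not specify how the latitude boundary is handled, so the `edge` padding there is our assumption:

```python
import numpy as np

def pad_spherical(smap, pad):
    """Wrap-around padding along longitude (axis 1), where a spherical map
    is periodic; replicate padding along latitude (axis 0), where it is
    not (the latitude choice is an assumption, not from the thesis)."""
    out = np.pad(smap, ((0, 0), (pad, pad)), mode="wrap")
    return np.pad(out, ((pad, pad), (0, 0)), mode="edge")
```

After this padding, an ordinary 2D convolution sees the left and right edges of the map as neighbors, mimicking the sphere's wrap-around.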
Voxel Refinement Network Our voxel refinement network takes as input voxels
projected from the estimated, original-view depth and from the inpainted spherical
map, and recovers the final shape in voxel space. Specifically, the encoder takes as
input a two-channel 128 × 128 × 128 voxel grid (one channel for coarse shape estimation and the
other for surface estimation), and outputs a 320-D latent vector. In decoding, each
layer takes an extra input directly from the corresponding level of the encoder.
The resulting model is fully differentiable.
As Figure 4-6 shows, existing datasets have limitations for the task of modeling
a 3D object from a single image. ShapeNet [Chang et al., 2015] is a large dataset
of 3D models, but does not come with real images; Pascal 3D+ [Xiang et al.,
2014] and ObjectNet3D [Xiang et al., 2016] have real images, but the image-shape
alignment is rough because the 3D models do not match the objects in images; IKEA
[Lim et al., 2013] has high-quality image-3D alignment, but it only contains 90 3D
models and 759 images.
We desire a dataset that has all three merits—a large-scale dataset of real images
and ground-truth shapes with precise 2D-3D alignment. Our dataset, named Pix3D,
Figure 4-6: Pix3D vs. existing datasets. We present Pix3D, a new large-scale dataset
of diverse image-shape pairs. Each 3D shape in Pix3D is associated with a rich and
diverse set of images, each with an accurate 3D pose annotation to ensure precise
2D-3D alignment. In comparison, existing datasets have limitations: 3D models may
not match the objects in images, pose annotations may be imprecise, or the dataset
size may be relatively small.
has 395 3D shapes of nine object categories [Sun et al., 2018b]. Each shape is asso-
ciated with a set of real images, capturing the exact object in diverse environments.
Further, the 10,069 image-shape pairs have precise 3D annotations, giving pixel-level
alignment between shapes and their silhouettes in the images.
the reconstruction results. A well-designed metric should reflect the visual quality
of the reconstructions. In this paper, we calibrate commonly used metrics, including
Intersection over Union (IoU), Chamfer Distance (CD), and Earth Mover’s Distance
(EMD), on how well they capture human perception of shape similarity. Based on
this, we benchmark state-of-the-art algorithms for 3D object modeling on Pix3D to
demonstrate their strengths and weaknesses.
Figure 4-7 summarizes how we build Pix3D. We collect images from search engines
and shapes from 3D repositories; we also take pictures and scan shapes ourselves.
Finally, we use labeled keypoints on both images and 3D shapes to align them.
Figure 4-7: Building Pix3D. We build Pix3D in two steps. First, we collect image-
shape pairs by crawling web images of IKEA furniture as well as scanning objects
and taking pictures ourselves. Second, we align the shapes with their 2D silhouettes
by minimizing the distances between the annotated 2D keypoints and the projections
of the corresponding 3D keypoints, using Efficient PnP and the Levenberg-Marquardt algorithm.
We obtain the raw image-shape pairs in two ways. One is to crawl images of IKEA
furniture from the web and align them with CAD models provided in the IKEA
dataset [Lim et al., 2013]. The other is to directly scan 3D shapes and take pictures.
Extending IKEA The IKEA dataset [Lim et al., 2013] contains 219 high-quality
3D models of IKEA furniture, but has only 759 images for 90 shapes. Therefore, we
choose to keep the 3D shapes from IKEA dataset, but expand the set of 2D images
using online image search engines and crowdsourcing.
For each 3D shape, we first search for its corresponding 2D images through Google,
Bing, and Baidu, using its IKEA model name as the keyword. We obtain 104,220
images for the 219 shapes. We then use Amazon Mechanical Turk (AMT) to remove irrelevant ones. For each
image, we ask three AMT workers to label whether this image matches the 3D shape
or not. For images whose three responses differ, we ask three additional workers and
decide whether to keep them based on majority voting. We end up with 14,600 images
for the 219 IKEA shapes.
To align a 3D model with its projection in a 2D image, we need to solve for its 3D
pose (translation and rotation) and the camera parameters used to capture the image.
We use a keypoint-based method inspired by Lim et al. [2013]. Denote the key-
² https://structure.io
³ https://occipital.com
points’ 2D coordinates as x2D = {𝑥1 , 𝑥2 , · · · , 𝑥𝑛 } and their corresponding 3D coor-
dinates as X3D = {𝑋1 , 𝑋2 , · · · , 𝑋𝑛 }. We solve for camera parameters and 3D poses
that minimize the reprojection error of the keypoints. Specifically, we want to find
the projection matrix 𝑃 that minimizes:
$$\mathcal{L}(P;\, \mathbf{X}_{3D}, \mathbf{x}_{2D}) = \sum_i \left\| \mathrm{Proj}_P(X_i) - x_i \right\|_2^2, \tag{4.2}$$
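Evaluating the objective in Equation 4.2 for a candidate 3 × 4 projection matrix $P$ is straightforward; the solver (Efficient PnP followed by Levenberg-Marquardt) then minimizes this quantity over $P$. A NumPy sketch:

```python
import numpy as np

def reprojection_error(P, X3d, x2d):
    """Equation 4.2: sum of squared distances between projected 3D
    keypoints and annotated 2D keypoints.  P: (3, 4) projection matrix;
    X3d: (n, 3) 3D keypoints; x2d: (n, 2) 2D annotations."""
    Xh = np.hstack([X3d, np.ones((len(X3d), 1))])  # homogeneous coordinates
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]              # perspective divide
    return float(np.sum(np.linalg.norm(proj - x2d, axis=1) ** 2))
```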
Implementation Details For each 3D shape, we manually label its 3D keypoints.
The number of keypoints ranges from 8 to 24. For each image, we ask three AMT
workers to label if each keypoint is visible on the image, and if so, where it is. We
only consider visible keypoints during the optimization.
The 2D keypoint annotations are noisy, which severely hurts the performance of
the optimization algorithm. We try two methods to increase its robustness. The first
is to use RANdom SAmple Consensus (RANSAC). The second is to use only a subset
of 2D keypoint annotations. For each image, denote 𝐶 = {𝑐1 , 𝑐2 , 𝑐3 } as its three sets
of human annotations. We then enumerate the seven nonempty subsets 𝐶𝑘 ⊆ 𝐶; for
each keypoint, we compute the median of its 2D coordinates in 𝐶𝑘 . We apply our
optimization algorithm on every subset 𝐶𝑘 and keep the output with the minimum
projection error. After that, we let three AMT workers choose, for each image, which
of the two methods offers better alignment, or neither performs well. At the same
time, we also collect attributes (i.e., truncation, occlusion) for each image. Finally, we
finetune the annotations ourselves using the Graphical User Interface (GUI) offered
in ObjectNet3D [Xiang et al., 2016]. Altogether there are 395 3D shapes and 10,069
images. Sample 2D-3D pairs are shown in Figure 4-8.
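The subset-enumeration step above can be sketched as follows. Here `solve_pose` is a hypothetical stand-in for the Efficient PnP + Levenberg-Marquardt solver, assumed to return a pose and its reprojection error:

```python
import numpy as np
from itertools import combinations

def best_alignment(annotations, solve_pose):
    """Enumerate every nonempty subset of the annotation sets (7 subsets
    for 3 workers), aggregate each subset with a per-keypoint median,
    solve for the pose, and keep the lowest reprojection error.
    `solve_pose` is a hypothetical interface returning (pose, error)."""
    best = None
    for k in range(1, len(annotations) + 1):
        for subset in combinations(annotations, k):
            keypoints = np.median(np.stack(subset), axis=0)  # (n_kp, 2)
            pose, err = solve_pose(keypoints)
            if best is None or err < best[1]:
                best = (pose, err)
    return best
```

The per-keypoint median makes each candidate robust to a single outlier annotator, while the subset search discards annotators who are systematically off.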
4.6 Results
In this section, we analyze our real-world dataset—Pix3D, evaluate which shape er-
ror metric matches human perception the best, and finally describe our experiments
evaluating ShapeHD and GenRe. Specifically, Section 4.6.4 and Section 4.6.5 show
how ShapeHD is capable of high-fidelity 3D shape completion from a single-view
depth map and 3D shape reconstruction from a single-view RGB image, respectively.
Section 4.6.6, Section 4.6.7, and Section 4.6.8 demonstrate how GenRe is able to ac-
curately estimate depth for novel shape categories unseen during training, reconstruct
novel objects from the training classes, and reconstruct objects from novel test classes
unseen during training, respectively.
Figure 4-8: Sample images and shapes in Pix3D. From left to right: 3D shapes, 2D
images, and 2D-3D alignment. Rows 1–2 show some chairs we scanned, Rows 3–4
show a few IKEA objects, and Rows 5–6 show some objects of other categories we
scanned.
4.6.1 Data
Here we describe our synthetic data for training and the real data for testing.
Synthetic Data We render each of the ShapeNet Core55 [Chang et al., 2015] ob-
jects in 20 random, fully unconstrained views. For each view, we randomly set the
azimuth and elevation angles of the camera, but the camera’s up vector is fixed to be
the world +𝑦 axis, and the camera always looks at the object center. The focal length
is fixed at 50 mm on a 35 mm film. In ShapeHD, to boost the realism of the rendered
RGB images, we put three different types of backgrounds behind the object during
rendering. One third of the images are rendered in a clean white background; one
third are rendered in High-Dynamic-Range (HDR) backgrounds with illumination
channels that produce realistic lighting; we render the remaining one third images
onto backgrounds randomly sampled from the SUN database [Xiao et al., 2010]. We
use Mitsuba [Jakob, 2010], a physically-based rendering engine, for all our renderings.
We use 90% of the data for training and 10% for testing.
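The camera placement described above (random azimuth and elevation, up vector fixed to world +𝑦, camera looking at the object center) can be sketched as follows. The radius, function names, and axis conventions are our assumptions, and the degenerate case of a camera directly above or below the object is ignored:

```python
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """World-to-camera rotation for a camera at cam_pos looking at target,
    with the up vector fixed to world +y."""
    forward = (target - cam_pos) / np.linalg.norm(target - cam_pos)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)               # degenerate if forward || up
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward])  # rows: camera x, y, z axes

def random_view(radius=2.0, seed=None):
    """Sample a random azimuth/elevation view at a fixed distance."""
    rng = np.random.default_rng(seed)
    az = rng.uniform(0.0, 2 * np.pi)
    el = rng.uniform(-np.pi / 2, np.pi / 2)
    pos = radius * np.array([np.cos(el) * np.cos(az),
                             np.sin(el),
                             np.cos(el) * np.sin(az)])
    return pos, look_at(pos)
```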
For the generalization experiments for GenRe, we train our models on the three
largest ShapeNet classes (car, chair, and airplane), and test them on the next 10
largest classes: bench, vessel, rifle, sofa, table, phone, cabinet, speaker, lamp,
and display. Besides ShapeNet renderings, we also test GenRe on non-rigid shapes
such as humans and horses [Bronstein et al., 2008] (Section 4.7.5) and on highly
regular shape primitives (Section 4.7.6).
Real Data We also test our models, trained only on synthetic data, on real images
from PASCAL 3D+ [Xiang et al., 2014] and Pix3D [Sun et al., 2018b].
We now present some statistics of Pix3D and contrast it with its predecessors.
Figure 4-9 shows the category distributions of 2D images and 3D shapes in Pix3D.
Our dataset covers a large variety of shapes, each of which has a large number of in-
the-wild images. Chairs constitute a significant portion of Pix3D, because they are common,
highly diverse, and well studied in the recent literature [Dosovitskiy et al., 2017, Tulsiani
et al., 2017, Gwak et al., 2017].
Figure 4-9: Image and shape distributions across categories of Pix3D. Each shape
in Pix3D is associated with multiple images providing various contexts, in which the
shape is likely to appear.
with images. For example, there are only 15 unoccluded and untruncated images of
sofas in IKEA, while Pix3D has 1,092.
4.6.2 Metrics
surface meshes and create the densely sampled point clouds. Finally, we randomly
sample 1,024 points from each point cloud and normalize them into a unit cube for
distance calculation.
The CD between 𝑆1 , 𝑆2 ⊆ R3 is defined as:
$$\mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2. \tag{4.3}$$
For each point in either cloud, CD finds its nearest neighbor in the other cloud and
averages these distances within each cloud. CD has been used in shape retrieval challenges [Yi et al., 2017].
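For clouds of the sizes used here (1,024 points), Equation 4.3 can be evaluated directly from the pairwise distance matrix; a NumPy sketch:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Equation 4.3: for every point, the distance to its nearest neighbor
    in the other cloud, averaged within each cloud and summed over both
    directions.  s1: (n1, 3); s2: (n2, 3)."""
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)  # (n1, n2)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```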
For EMD, we follow the definition in Fan et al. [2017]. The EMD between $S_1, S_2 \subseteq \mathbb{R}^3$ of equal size ($|S_1| = |S_2|$) is

$$\mathrm{EMD}(S_1, S_2) = \min_{\varphi:\, S_1 \to S_2} \frac{1}{|S_1|} \sum_{x \in S_1} \|x - \varphi(x)\|_2, \tag{4.4}$$

where $\varphi$ is a bijection.
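Because $\varphi$ is a bijection between equal-size sets, Equation 4.4 is an assignment problem. The brute-force sketch below is only practical for tiny sets; real implementations use an assignment solver or an approximation:

```python
import numpy as np
from itertools import permutations

def emd(s1, s2):
    """Equation 4.4 by exhaustive search over bijections phi: S1 -> S2.
    O(n!) -- for tiny clouds only; practical implementations solve (or
    approximate) the underlying assignment problem instead."""
    best = np.inf
    for perm in permutations(range(len(s1))):
        cost = np.mean(np.linalg.norm(s1 - s2[list(perm)], axis=1))
        best = min(best, cost)
    return float(best)
```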
Which Metric Is the Best? We then conduct two user studies to compare these
metrics and benchmark how they capture human perception.
We run three shape reconstruction algorithms (3D-R2N2 [Choy et al., 2016], DRC
[Tulsiani et al., 2017], and 3D-VAE-GAN [Wu et al., 2016]) on 200 randomly selected
images of chairs. We then, for each image and every pair of its three reconstructions,
ask three AMT workers to choose the one that looks the closest to the object in
the image. We also compute how each pair of objects rank in each metric. Finally,
we calculate the Spearman’s rank correlation coefficients between different metrics
(i.e., IoU, EMD, CD, and human perception). Table 4.2 suggests that EMD and CD
correlate better with human ratings.
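Spearman's rank correlation coefficient used for this calibration is just the Pearson correlation of the rank vectors; a minimal sketch (ignoring ties, which proper implementations handle with average ranks):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation: the Pearson correlation of the rank
    vectors.  Ties would need average ranks; omitted in this sketch."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```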
4.6.3 Baselines
Voxels Voxels are arguably the most common representation for 3D shapes in the
deep learning era due to their amenability to 3D convolution. For this representation,
we consider the 3D Recurrent Reconstruction Neural Network (3D-R2N2) [Choy et al.,
2016], Differentiable Ray Consistency (DRC) [Tulsiani et al., 2017], MarrNet [Wu
et al., 2017], and Octree Generating Network (OGN) [Tatarchenko et al., 2017] as
baselines. Our model uses 128³ voxels of [0, 1] occupancy. All these baselines and our
ShapeHD take a single image as input, without requiring any object mask.
In ShapeHD, we compare with a state-of-the-art shape completion method: 3D-
Encoder Predictor Network (3D-EPN) [Dai et al., 2017]. To ensure a fair comparison,
we convert depth maps to partial surfaces registered in a canonical global coordinate
defined by ShapeNet Core55 [Chang et al., 2015], which is required by 3D-EPN.
While the original 3D-EPN paper generates their partial observations by rendering
and fusing multi-view depth maps, our method takes a single-view depth map as
input and is solving a more challenging problem.
Mesh & Point Clouds Considering the cubic complexity of the voxel representa-
tion, recent papers have explored meshes [Groueix et al., 2018, Yao et al., 2018] and
point clouds [Fan et al., 2017] in the context of neural networks. In this work, we
consider AtlasNet [Groueix et al., 2018] and Point Set Generation Network (PSGN)
[Fan et al., 2017] as baselines. Like GenRe, both PSGN and AtlasNet require object
silhouettes as input in addition to the single RGB image.
Spherical Maps As introduced in Section 4.1, one can also represent 3D shapes as
spherical maps. We include two baselines with spherical maps: first, a one-step base-
line that predicts final spherical maps directly from RGB images (“GenRe-1step”);
second, a two-step baseline that first predicts single-view spherical maps from RGB
images and then inpaints them (“GenRe-2step”). Both baselines use the aforemen-
tioned “U-ResNet” image-to-image network architecture.
To provide justification for using spherical maps, we provide a baseline (“3D Com-
pletion”) that directly performs 3D shape completion in voxel space. This baseline
first predicts depth from an input image and then projects the depth map into the
voxel space. A completion module takes the projected voxels as input and predicts
the final result.
To provide a performance upper bound for our spherical inpainting and voxel
refinement networks (Figure 4-4 [b, c]), we also include the results when our model
has access to ground-truth depth in the original view (“GenRe-Oracle”) and to ground-
truth full spherical maps (“GenRe-SphOracle”).
For 3D shape completion from a single depth image, we only use the last two modules
of ShapeHD: the 3D shape estimator and deep naturalness network.
Qualitative Results In Figure 4-10 and Figure 4-11, we show 3D shapes predicted
by ShapeHD from single-view depth images. While common encoder-decoder structures usually generate mean shapes with few details, ShapeHD predicts shapes
with large variance and fine details. In addition, even when there is strong occlusion
in the depth image, our model can predict a high-quality, plausible 3D shape that
looks good perceptually, and infer parts not present in the input images.
We now show results of ShapeHD on real depth scans. We capture six depth
maps of different chairs using a Structure sensor⁴ and use the captured depth maps
to evaluate our model. All the corresponding normal maps used as input are estimated
from depth measurements. Figure 4-12 shows that ShapeHD completes 3D shapes
well given a single-view depth map. ShapeHD is more flexible than 3D-EPN, as we
do not need any camera intrinsics or extrinsics to register depth maps. In our case,
none of these parameters is known, so 3D-EPN cannot be applied.
⁴ http://structure.io
Figure 4-10: 3D shape completion from single-view depth by ShapeHD. From left to
right: input depth maps, shapes reconstructed by ShapeHD in the canonical view
and a novel view, and ground-truth shapes in the canonical view. Assisted by the
adversarially learned naturalness losses, ShapeHD recovers highly accurate 3D shapes
with fine details. Sometimes the reconstructed shape deviates from the ground truth
but can be viewed as another plausible explanation of the input (e.g., the airplane on
the left, third row).
[Figure columns: Input, 3D-EPN (2 views), ShapeHD w/o 𝐿natural (2 views), ShapeHD (2 views), Ground Truth.]
Figure 4-11: 3D shape completion by ShapeHD. Our results contain more details than
3D-EPN. We observe that the adversarially trained naturalness losses help fix errors,
add details (e.g., the plane wings in Row 3, car seats in Row 6, and chair arms in
Row 8), and smooth planar surfaces (e.g., the sofa back in Row 7).
Figure 4-12: 3D shape completion by ShapeHD using real depth data. ShapeHD is
able to reconstruct the shape well from just a single view. From left to right: input
depth, two views of our results, and color images of the objects.
Ablation When using the naturalness loss, the network is penalized for generating
mean shapes that are unreasonable but minimize the supervised loss. In Figure 4-11,
we show reconstructed shapes from our ShapeHD with and without naturalness loss
(i.e., before fine-tuning with 𝐿natural ), together with ground truth shapes and shapes
predicted by 3D-EPN [Dai et al., 2017]. Our results contain finer details compared
with those from 3D-EPN. Also, the performance of ShapeHD improves greatly with
the naturalness loss, which leads to more reasonable and complete shapes.
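The fine-tuning objective described above can be sketched as a supervised voxel loss plus the adversarially learned naturalness term; the trade-off weight and the scalar discriminator interface here are illustrative assumptions, not the thesis's exact values:

```python
import numpy as np

def voxel_bce(pred, gt, eps=1e-7):
    """Supervised occupancy loss: mean binary cross-entropy over voxels."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p))

def shapehd_objective(pred, gt, discriminator_score, weight=0.1):
    """Fine-tuning objective sketch: voxel loss plus naturalness term.

    `discriminator_score` stands in for the naturalness network's
    probability that `pred` is a realistic shape; `weight` is an
    assumed trade-off, not the thesis's reported value.
    """
    l_natural = -np.log(max(discriminator_score, 1e-7))
    return voxel_bce(pred, gt) + weight * l_natural
```

Because the naturalness term rewards shapes the discriminator deems realistic, minimizing it discourages the blurry mean shapes that minimize the supervised loss alone.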
                         IoU (32³)                    CD
Methods                  chair  car   plane  Avg     chair  car   plane  Avg
3D-EPN                   .147   .274  .155   .181    .227   .200  .125   .192
ShapeHD w/o 𝐿natural     .466   .698  .488   .529    .112   .083  .071   .093
ShapeHD                  .488   .698  .452   .529    .096   .078  .068   .084

Table 4.3: Average shape completion errors of ShapeHD on ShapeNet. Our model
outperforms the state of the art by a large margin. The learned naturalness losses
consistently improve the CD between our results and the ground truth.
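For reference, the IoU entries in such tables are computed between binarized occupancy grids; a minimal sketch, where the 0.5 binarization threshold is our assumption rather than the thesis's exact protocol:

```python
import numpy as np

def voxel_iou(pred, gt, thresh=0.5):
    """Intersection over union between two occupancy grids.

    `pred` holds [0, 1] occupancy probabilities; `gt` is binary.
    The binarization threshold is an illustrative assumption.
    """
    p = pred >= thresh
    g = gt >= 0.5
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else 1.0
```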
Methods     bench  boat   cabin  car    chair  disp   lamp   phone  plane  rifle  sofa   speak  table  Avg
DRC (3D)    .122   .131   .127   .077   .128   .128   .168   .102   .166   .107   .106   .138   .138   .126
AtlasNet†   .123   .130   .169   .107   .141   .162   .171   .138   .105   .096   .131   .172   .161   .139
ShapeHD     .121   .103   .126   .066   .125   .124   .157   .084   .073   .053   .102   .141   .124   .108

Methods     bench  boat   cabin  disp   lamp   phone  rifle  sofa   speak  table  Avg
DRC (3D)    .175   .161   .189   .278   .225   .268   .153   .149   .203   .221   .202
AtlasNet†   .155   .114   .202   .244   .261   .263   .121   .126   .206   .262   .195
ShapeHD     .166   .129   .182   .252   .235   .229   .232   .133   .193   .199   .195
Results on Real Data We then evaluate ShapeHD on two real datasets, PASCAL
3D+ [Xiang et al., 2014] and Pix3D [Sun et al., 2018b]. Here we train our model on
our synthetic ShapeNet renderings and use the trained models released by the authors
as baselines. All methods take ground-truth 3D shapes as supervision during training.
As shown in Figure 4-15 and Figure 4-16, ShapeHD works well, inferring a reasonable
shape even in the presence of strong self-occlusion. In particular, in Figure 4-15, we
compare our reconstructions against the best-performing alternatives (DRC on chair
and airplane, and AtlasNet on car). In addition to preserving details, our model
captures the shape variations of the objects, while the competitors produce similar
reconstructions across instances.
[Figure 4-15: for each input image, reconstructions by the best-performing alternative and by ShapeHD.]
Quantitatively, Table 4.4 and Table 4.5 suggest that ShapeHD performs signifi-
cantly better than the other methods in almost all metrics. The only exception is the
(a) Input (b) AtlasNet (c) DRC (3D) (d) ShapeHD (e) GT
Figure 4-16: 3D shape reconstruction by ShapeHD on Pix3D. For each input im-
age, we show reconstructions by AtlasNet, DRC, and our ShapeHD alongside with
the ground truth. ShapeHD reconstructs complete 3D shapes with fine details that
resemble the ground truth.
CD on PASCAL 3D+ cars, for which OGN performs the best. However, as PASCAL
3D+ has only around 10 CAD models for each object category as the ground-truth
3D shapes, the ground-truth labels and scores can be inaccurate, failing to reflect
human perception [Tulsiani et al., 2017].
4.6.6 Estimating Depth for Novel Shape Classes
Figure 4-17: GenRe’s depth estimator generalizing to novel shape classes. Left: Our
single-view depth estimator, trained on car, chair, and airplane, generalizes to
novel classes: bus, train, and table. Right: As the novel test class gets increas-
ingly dissimilar to the training classes (left to right), depth prediction does not show
statistically significant degradation (𝑝 > 0.05).
We present results on generalizing to novel objects from the training classes. All
models are trained on car, chair, and airplane, and tested on unseen objects from
the same three categories.
As shown in Table 4.6 (Seen), GenRe is the best-performing viewer-centered
model. It also outperforms most of the object-centered models except AtlasNet.
GenRe’s performance is impressive given that object-centered models tend to perform much better on objects from seen classes [Shin et al., 2018]. This is because
object-centered models, by exploiting the concept of canonical views, actually solve
an easier problem. The performance drop from the object-centered DRC to the
viewer-centered DRC supports this empirically. However, for objects from unseen
classes, the concept of canonical views is no longer well-defined. As we will see in
Section 4.6.8, this hurts the generalization power of the object-centered methods.
                               Seen         Unseen
Models                                bch   vsl   rfl   sfa   tbl   phn   cbn   spk   lmp   dsp   Avg
Object-   DRC                  .072   .112  .100  .104  .108  .133  .199  .168  .164  .145  .188  .142
Centered  AtlasNet             .059   .102  .092  .088  .098  .130  .146  .149  .158  .131  .173  .127
Viewer-   DRC                  .092   .120  .109  .121  .107  .129  .132  .142  .141  .131  .156  .129
Centered  MarrNet              .070   .107  .094  .125  .090  .122  .117  .125  .123  .144  .149  .120
          Shin et al. [2018]   .065   .092  .092  .102  .085  .105  .110  .119  .117  .142  .142  .111
          3D Completion        .076   .102  .099  .121  .095  .109  .122  .131  .126  .138  .141  .118
          GenRe-1step          .063   .104  .093  .114  .084  .108  .121  .128  .124  .126  .151  .115
          GenRe-2step          .061   .098  .094  .117  .084  .102  .115  .125  .125  .118  .118  .110
          GenRe (ours)         .064   .089  .092  .112  .082  .096  .107  .116  .115  .124  .130  .106
          GenRe-Oracle         .045   .050  .048  .031  .059  .057  .054  .076  .077  .060  .060  .057
          GenRe-SphOracle      .034   .032  .030  .021  .044  .038  .037  .044  .045  .031  .040  .036
Table 4.6: 3D shape reconstruction by GenRe on training and novel classes. The
novel classes are ordered from the most to the least similar to the training classes.
Our model is viewer-centered by design but achieves performance on par with the
object-centered state of the art (AtlasNet) in reconstructing the seen classes. As for
generalization to novel classes, our model outperforms the state of the art across 9
out of the 10 classes in terms of CD.
We show how GenRe generalizes to novel shape classes unseen during training.
Synthetic Rendering We use the 10 largest ShapeNet classes other than chair,
car, and airplane as our test set. Table 4.6 (Unseen) shows that our model consis-
tently outperforms the state of the art, except for rifle, in which AtlasNet performs
the best. Qualitatively, GenRe produces reconstructions that are much more consis-
tent with input images, as shown in Figure 4-18. In particular, on unseen classes,
our results still attain good consistency with the input images, while the competitors
either lack structural details present in the input (e.g., 5) or retrieve shapes from the
training classes (e.g., 4, 6, 7, 8, 9).
Figure 4-18: GenRe’s reconstruction within and beyond training classes. Each row
from left to right: the input image, two views from the best-performing baseline for
each testing object (1–4, 6–9: AtlasNet; 5, 10: Shin et al. [2018]), two views of our
GenRe predictions, and the ground truth. All models are trained on the same dataset
of cars, chairs, and airplanes.
Comparing our model with its variants, we find that the two-step approaches
(GenRe-2step and GenRe) outperform the one-step approach across all novel cate-
gories. This empirically supports the advantage of our two-step modeling strategy
that disentangles geometric projections from shape reconstruction.
Real Images We further compare how GenRe, AtlasNet, and Shin et al. [2018]
perform on real images from Pix3D. Here, all models are trained on ShapeNet car,
chair, and airplane, and tested on real images of bed, bookcase, desk, sofa, table,
and wardrobe.
Quantitatively, Table 4.7 shows that GenRe outperforms the two competitors
across all novel classes except bed, for which Shin et al. [2018] perform the best. For
chair, one of the training classes, the object-centered AtlasNet leverages the canon-
ical view and outperforms the two viewer-centered approaches. Qualitatively, our
reconstruction preserves the details present in the input (e.g., the hollow structures
in the second row of Figure 4-19).
Because neither depth maps nor spherical maps provide information inside the
shapes, our model predicts only surface voxels, which are not guaranteed to be watertight.
Consequently, IoU cannot be used as an evaluation metric. We hence evaluate the
reconstruction quality using CD. For models that output voxels, including DRC and
our GenRe model, we sweep voxel thresholds from 0.3 to 0.7 with a step size of 0.05 for
isosurfaces, compute CD with 1,024 points sampled from all isosurfaces, and report
the best average CD for each object class.

Methods     AtlasNet   Shin et al. [2018]   GenRe
chair       .080       .089                 .093
bed         .114       .106                 .113
bookcase    .140       .109                 .101
desk        .126       .121                 .109
sofa        .095       .088                 .083
table       .134       .124                 .116
wardrobe    .121       .116                 .109

Table 4.7: CD of 3D shape reconstruction on real Pix3D images.
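The threshold-sweep protocol above can be sketched as follows; for brevity, surface points are taken as occupied-voxel centers rather than from a true marching-cubes isosurface, and the brute-force Chamfer distance is only practical for small point sets:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets of shape (N, 3), (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def best_cd_over_thresholds(pred_vox, gt_pts, n_pts=1024, seed=0):
    """Sweep binarization thresholds 0.3..0.7 (step 0.05), sample points
    from each thresholded surface, and keep the best CD, mirroring the
    evaluation protocol in the text. Using voxel centers in place of a
    marching-cubes isosurface is a simplification.
    """
    rng = np.random.default_rng(seed)
    best = np.inf
    for t in np.arange(0.3, 0.7 + 1e-9, 0.05):
        idx = np.argwhere(pred_vox >= t).astype(np.float64)
        if len(idx) == 0:
            continue
        pts = idx[rng.integers(0, len(idx), n_pts)] / pred_vox.shape[0]
        best = min(best, chamfer(pts, gt_pts))
    return best
```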
Shin et al. [2018] report that object-centered supervision produces better recon-
structions for objects from the training classes, whereas viewer-centered supervision
has an advantage in generalizing to novel classes. Therefore, for DRC and AtlasNet, we
train each network with both types of supervision. Note that AtlasNet, when trained
with viewer-centered supervision, tends to produce unstable predictions that render
CD meaningless. Hence, we only present CD for the object-centered AtlasNet.
4.7 Discussion
For ShapeHD, we visualize what the network is learning (Section 4.7.1), analyze
the effects of the naturalness loss over time (Section 4.7.2), and discuss common
failure modes (Section 4.7.3). For GenRe, we study how the input viewpoint affects
the model’s ability to generalize to unseen shape classes (Section 4.7.4), whether a model
trained on rigid shapes is able to generalize to non-rigid ones (Section 4.7.5), and
finally whether the model can reconstruct simple, regular shapes well (Section 4.7.6).
Figure 4-20: Visualization of how ShapeHD attends to details in the depth maps.
Row 1: car wheel detectors; Row 2: chair back and leg detectors. The left responds
to the strided pattern in particular. Row 3: chair arm and leg detectors; Row 4:
airplane engine and curved surface detectors. The right responds to a specific pattern
across different classes.
4.7.2 Training With the Naturalness Loss Over Time
We study the effect of the naturalness loss over time. In Figure 4-21, we plot the
loss of the completion network with respect to the fine-tuning epoch. We observe that
the voxel loss goes down slowly but consistently. If we visualize the reconstructed
examples at different timestamps, we clearly see that details are being added to the
shape. These fine details occupy a small region in the voxel grid, so training with the
supervised loss alone is unlikely to recover them. In contrast, with the adversarially
learned perceptual loss, ShapeHD recovers the details successfully.
Figure 4-21: How ShapeHD improves over time with the naturalness loss. The pre-
dicted shape becomes increasingly realistic as details are being added.
We present the failure cases of ShapeHD in Figure 4-22. We observe that our model
has these common failure modes: It sometimes gets confused by deformable object parts
(e.g., wheels on the top left); it may miss uncommon object parts (top right, the ring
above the wheels); it has difficulty in recovering very thin structure (bottom right),
and may generate other patterns instead (bottom left). While the voxel representation
makes it possible to incorporate the naturalness loss, intuitively, it also encourages
the network to focus on thicker shape parts, as they carry more weight in the loss
function.
Figure 4-22: Common failure modes of ShapeHD. Top left: The model sometimes gets
confused by deformable object parts (e.g., wheels). Top right: The model might miss
uncommon object parts (the ring above the wheels). Bottom row: The model has
difficulty in recovering very thin structure and may generate other structure patterns
instead.
The generic viewpoint assumption states that the observer is not in a special position
relative to the object [Freeman, 1994]. This makes us wonder if the “accidentalness”
of the viewpoint affects the quality of GenRe’s reconstruction.
As a quantitative analysis, we test our model trained on ShapeNet chair, car, and
airplane on 100 randomly sampled ShapeNet tables, each rendered in 200 different
views sampled uniformly on a sphere. We then compute, for each of the 200 views,
the median CD of the 100 reconstructions. Finally, in Figure 4-23, we visualize these
median CDs as a heatmap over an elevation-azimuth view grid. As the heatmap
shows, our model makes better predictions when the input view is generic than when
it is accidental, consistent with the intuition.
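The per-view aggregation behind this heatmap can be sketched as follows; the bin counts and the assumed elevation range of [-π/2, π/2] are illustrative, not the thesis's exact settings:

```python
import numpy as np

def viewpoint_error_heatmap(errors, elevs, azims, n_el=10, n_az=20):
    """Bin per-(shape, view) CD errors into an elevation-azimuth grid
    and take the median per cell, as in the Figure 4-23 analysis.

    `errors`, `elevs` (radians, in [-pi/2, pi/2]), and `azims` (radians,
    in [0, 2*pi)) are flat arrays of equal length.
    """
    heat = np.full((n_el, n_az), np.nan)
    ei = np.clip(((elevs + np.pi / 2) / np.pi * n_el).astype(int), 0, n_el - 1)
    ai = np.clip((azims / (2 * np.pi) * n_az).astype(int), 0, n_az - 1)
    for r in range(n_el):
        for c in range(n_az):
            m = (ei == r) & (ai == c)
            if m.any():
                heat[r, c] = np.median(errors[m])
    return heat
```

The median (rather than the mean) keeps a few hard shapes from dominating any single view cell.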
[Figure 4-23 heatmap: median CD over an elevation-azimuth grid of input views; errors peak around accidental views (≈ .157) and drop for generic views (≈ .076).]
Figure 4-23: Reconstruction errors of GenRe across different input viewpoints. The
vertical (horizontal) axis represents elevation (azimuth). Accidental views (dark blue
box) lead to large errors, while generic views (green box) result in smaller errors. Er-
rors are computed for 100 tables; these particular tables are for visualization purposes
only.
[Figure 4-24: GenRe reconstructions of non-rigid shapes (input, our result, ground truth), using generic shape priors learned from rigid objects (ShapeNet car, chair, and airplane).]
chair, and airplane during training, and we assume our model has access to the
ground-truth single-view depth (i.e., GenRe-Oracle).
As Figure 4-25 shows, although our model hallucinates the unseen parts of these
shape primitives, it fails to exploit global shape symmetry to produce correct predic-
tions. This is not surprising given that our network design does not explicitly model
such regularity. A possible future direction is to incorporate priors that facilitate
learning high-level concepts such as symmetry.
4.8 Conclusion
We have presented Pix3D [Sun et al., 2018b], a large-scale dataset of well-aligned 2D
images and 3D shapes, and also explored how three commonly used metrics corre-
spond to human perception through behavioral studies. With this high-quality test set
and informative error metrics, we then continued to develop two models for single-
image 3D shape reconstruction: ShapeHD [Wu et al., 2018] aimed at high-fidelity
reconstruction with structural details and Generalizable Reconstruction (GenRe)
[Zhang et al., 2018b] generalizing to novel shape classes unseen during training.
For ShapeHD, we proposed to use learned shape priors to overcome the 2D-
3D ambiguity and to learn from the multiple hypotheses that explain a single-view
observation. Our model achieves state-of-the-art results with structural details on
3D shape completion and reconstruction. We hope our results will inspire further
research in 3D shape modeling, in particular on explaining the ambiguity behind
partial observations.
For GenRe, we have studied the problem of generalizable single-image 3D re-
construction. We exploit various image and shape representations including 2.5D
sketches, spherical maps, and voxels. We have presented a novel viewer-centered
model that integrates these representations for generalizable, high-quality 3D shape
reconstruction. The experiments demonstrate that GenRe achieves state-of-the-art
performance on shape reconstruction for both seen and unseen classes. We hope our
system will inspire future research along this challenging but rewarding direction.
Chapter 5
the light source—the Earth (Section 5.3). Then, EarthGAN’s task is to generate the
most likely Earth image that, as the light source, gave rise to the Moon observation
at that particular timestamp. The fact that we rely on only the average color of the
Moon is crucial, as it potentially allows everyday users to apply our method to their
casual “backyard capture.”
In Section 5.4, we start by describing the characteristics and multi-modal nature of
our data. We then test EarthGAN at the actual timestamps for which we have ground-
truth Earth images, to gauge the quality of the Earth image generated by EarthGAN
given an observation timestamp and an average Moon color. Since EarthGAN requires
a dataset of Earth images, one could alternatively run simple nearest neighbor-based
algorithms to retrieve the “best” Earth image given the same input. These non-
parametric methods have the advantage of simplicity and interpretability.
As such, we also evaluate two simple models from this category against EarthGAN
and demonstrate their limitations, the most important of which is their inability to
“hallucinate” novel contents unseen in their entirety during training.
Next, we “super-resolve” the Earth rotation in time, by querying EarthGAN at
a finer time resolution of 5 min. Through this experiment, we demonstrate that
EarthGAN has learned from the data to synthesize photorealistic, continuous Earth
rotation despite having seen only snapshots of the Earth that are 1h+ apart during
training. We then frame the modeling of atmospheric conditions such as clouds and
other nuances as multi-modal generation commonly seen in Generative Adversarial
Networks (GANs), and demonstrate how EarthGAN controls these nuances with a
random vector. Finally, we perform analyses on whether we can rely on just the times-
tamp without observing the Moon at all, whether the choice of the GAN backbone
changes the results, and how other time encoding schemes affect the generation.
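As one concrete possibility for the time encoding just mentioned, a periodic sine/cosine scheme keeps the representation continuous across day and year boundaries; the periods and function name below are illustrative assumptions, not the thesis's scheme:

```python
import numpy as np

def encode_timestamp(t_seconds, periods=(86400.0, 365.25 * 86400.0)):
    """Encode a timestamp as sine/cosine pairs at daily and yearly periods.

    A hedged sketch: the chosen periods are illustrative. The encoding is
    continuous across midnight and year boundaries, unlike raw integers.
    """
    feats = []
    for p in periods:
        phase = 2 * np.pi * (t_seconds % p) / p
        feats += [np.sin(phase), np.cos(phase)]
    return np.array(feats)
```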
5.1 Introduction
Leaving the source, light travels to and interacts with the object, resulting in the
object’s appearance that we observe. More formally, appearance (filtered signal)
can be thought of as a convolution of the object’s reflectance (filter) over lighting
(signal) [Ramamoorthi and Hanrahan, 2004]. As such, a signal processing formulation
of lighting recovery often involves explicit modeling of the filter, i.e., the object’s
reflectance [Ramamoorthi and Hanrahan, 2001]. Such approaches, however, may
not be viable for cases where we do not have an accurate model for the object’s
reflectance, or where physically-based rendering is challenging (e.g., due to additional
nuances not exhaustively modeled in the rendering process). The Moon-Earth case
that this chapter studies is exactly one such case.
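This signal-processing view can be made precise in the spherical-harmonic domain. Following Ramamoorthi and Hanrahan, for a convex Lambertian object the irradiance coefficients are, up to normalization convention, products of the lighting and transfer coefficients:

```latex
E_{lm} = \Lambda_l \, A_l \, L_{lm},
\qquad
\Lambda_l = \sqrt{\frac{4\pi}{2l + 1}},
```

where $L_{lm}$ are the spherical-harmonic coefficients of the lighting, $A_l$ those of the clamped-cosine (Lambertian) transfer function, and $E_{lm}$ those of the resulting irradiance. Lighting recovery then amounts to a deconvolution, $L_{lm} = E_{lm} / (\Lambda_l A_l)$, which is ill-conditioned wherever $A_l$ is small; this is one way to see why such explicit formulations struggle without an accurate reflectance model.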
In this chapter, we study the problem of lighting recovery from the appearance
of the illuminated object. Specifically, we investigate a special case where the Earth
serves as the light source that we aim to recover, and the Moon is the illuminated
object that we observe. This is a simplification of the actual light transport: In
reality, light from the Sun travels to the Earth, and the Earth reflects the light, in a
spatially-varying manner, to the Moon. Because part of the Moon is also lit directly
by the Sun at a much higher light intensity, only the dark side of the Moon carries
signals about the (indirect) illumination from the Earth. In our simplified setup, the
Sun is removed, and the Earth is made emissive to directly provide illumination to
the Moon.
As previously alluded to, this Moon-Earth setup makes it challenging to solve the
problem using a physically-based approach like the high-level abstraction presented
in Chapter 2. There are mainly three reasons. Firstly, although Bidirectional Re-
flectance Distribution Function (BRDF) approximations such as Hapke [1981] are
available for the Moon surface, it is still hard to obtain an accurate reflectance model
on top of the complex Moon surface geometry. Secondly, the image formation process
involves various factors that affect the result but are hard to model or even unknown,
such as atmospheric turbulence. Finally, a physically-based model will not be able to
recover high-frequency lighting when given just a single-pixel observation of the illu-
minated object. Furthermore, whether physically-based modeling is necessary in this
case is debatable: While physically-based modeling is general and can be applied to
recover any “rare” lighting, the lighting in our case, (emissive) images of the Earth, is
highly specialized and possesses many strong regularities: The Earth is always mostly
round, and consists of blue pixels for the ocean, yellow or green pixels for the conti-
nents, and white pixels for the clouds. Furthermore, there are abundant high-quality
Earth images, taken by a spacecraft camera, available online.
Because our dataset contains pairs of Moon observations and their correspond-
ing Earth images, we can alternatively train an image-to-image Convolutional Neural
Network (CNN) in a supervised manner. However, this supervised alternative is unable to properly handle the one-to-many mappings in our case: There are multiple
possible Earth appearances, e.g., with different cloud patterns, that can give rise to
the observed Moon appearance. Imposing a supervised loss on the CNN would lead
to blurry “mean” images that satisfy the ℓ2 loss. Therefore, a GAN-based approach is
appropriate for our problem, as it has been shown capable of high-resolution, multi-modal
generation [Wang et al., 2018, Karras et al., 2019, Park et al., 2019b].
We compare EarthGAN against two nearest neighbor baselines and demonstrate
EarthGAN’s superiority in synthesizing novel pixels (cf. the baselines
retrieving only seen snapshots). Qualitative and quantitative experiments justify our
design choices including the utilization of the (low-resolution) Moon observation, the
GAN architecture, and our timestamp encoding scheme.
Our work is related to several areas in computer vision and graphics. In this section,
we organize the relevant literature into three categories: Non-Line-of-Sight (NLOS)
imaging, lighting recovery, and Generative Adversarial Networks (GANs).
Because our goal is to image the Earth from the Earth, Non-Line-of-Sight (NLOS)
imaging is a relevant topic. That said, our approach can barely be called an “imaging”
system, given its reliance mostly on a database (strong priors) and only a single-pixel
observation of the Moon (weak observation).
Broadly, NLOS approaches can be categorized into active and passive methods.
Active methods rely on energy-emitting sensors such as a Time-of-Flight (TOF) cam-
era, whereas passive approaches rely on just energy-absorbing devices such as a con-
ventional RGB camera. In the active category, Heide et al. [2014], Shin et al. [2015,
2016], Laurenzis et al. [2016] use a laser to shine at a point visible to both the observable
and hidden scenes and measure the time that the light takes to return [Pandharkar
et al., 2011, Shin et al., 2016]. By measuring the time of flight and intensity, one can
infer depth, shape, and reflectance of the hidden objects [Shin et al., 2014]. However,
TOF cameras have the limitations of requiring a specialized hardware setup, being unable to introduce additional light in uncontrollable cases like our Earth-Moon scene,
and being vulnerable to ambient lighting.
In the passive category, Bouman et al. [2017] turn corners into cameras and re-
cover a video of the hidden scene from the computer-observable intensity change near
the corners. Other works have also considered turning naturally existing structures
into cameras. For instance, Cohen [1982], Torralba and Freeman [2012] have used
naturally occurring pinholes (such as windows) and pinspecks for NLOS imaging.
In addition, Nishino and Nayar [2006] have extracted environment lighting from the
specular reflections off human eyes. Also related is the work of Wu et al. [2012] that
visualizes small, imperceptible color changes in videos.
The most relevant is probably the work by Hasinoff et al. [2011] who used occlusion
geometry to improve the conditioning of the diffuse light transport inversion, turning
the Moon into a diffuse reflector to make an image of the Earth. Freeman [2020]
spoke about several attempts to photograph the Earth using the Moon as a camera
and the computational imaging projects resulting from those attempts.
In contrast to these methods that are mostly physically-based, our EarthGAN
approach is data-driven and operates without modeling the actual image forma-
tion process. EarthGAN exploits the fact that there are strong regularities in our
lighting—the Earth images—and learns data-driven priors of what an Earth image
should look like. With such strong priors, it is able to recover an Earth image given
just a single-pixel observation of the Moon (and the corresponding timestamp).
varying lighting by predicting a separate light probe for each pixel of the input image
[Shelhamer et al., 2015, Garon et al., 2019, Li et al., 2020c]. Karsch et al. [2014]
recover 3D area lights by detecting visible light sources and retrieving non-visible
lights from a labeled dataset.
These spatially-varying lighting estimation approaches, though, do not ensure
that the estimated lighting is spatially coherent in the 3D space. To handle this
problem, Song and Funkhouser [2019] first obtain 3D understanding of the scene,
project observed pixels to the target light probe according to the query location, and
finally inpaint the missing regions of the light probe using a neural network. Although
this method ensures spatial coherence of the estimated lighting for observed lights,
the unseen light sources that are inpainted may not be spatially consistent. The
recent work by Gardner et al. [2019] assumes a fixed number of light sources and
regresses those lights’ colors, intensities, and positions using a neural network. This
work ensures that the estimated lighting is spatially consistent. Another recent work
is Lighthouse [Srinivasan et al., 2020], which takes as input perspective and spherical
panorama images, and outputs spatially-coherent and -varying lighting.
As we see, many prior works in this domain concern the spatially-varying and
-coherent nature of lighting. These properties, however, are of little relevance under
our setup, where the lighting to estimate lies on a monitor-like plane far from the
illuminated object. That said, the machine learning approaches that some of these
works take resemble our data-driven recovery at a high level.
The influential work of Kingma and Welling [2014] presented Variational Autoen-
coders (VAEs) as a method that learns to encode images into low-dimensional latent
codes that can get reconstructed back to the input images. Similarly, Goodfellow
et al. [2014] proposed GANs where the generator learns to synthesize images that are
indistinguishable from the real images, while the discriminator aims to improve its
ability to tell a generated image from a real one. As such, both models are capable
of synthesizing images when fed with random vectors.
Yet the user often wants to generate images based on some input rather than
randomly. Known as “conditional image synthesis,” the task is to generate photore-
alistic images that satisfy the given condition. The simplest form of a condition is
probably a class label, e.g., cat, while image conditions such as a segmentation map
are also of interest. Researchers have proposed class-conditional models that can
synthesize images satisfying the input class labels [Mirza and Osindero, 2014, Odena
et al., 2017, Brock et al., 2018, Mescheder et al., 2018, Miyato and Koyama, 2018].
Text-conditional models have also been proposed to, e.g., generate an image based on
a caption [Reed et al., 2016, Zhang et al., 2017, Hong et al., 2018, Xu et al., 2018a].
When both the input and output are images, researchers have devised image-to-image
models such as Karacan et al. [2016], Liu et al. [2017], Isola et al. [2017], Zhu et al.
[2017a,b], Huang et al. [2018], Karacan et al. [2019].
To enable image-to-image models to generate high-resolution output, Wang et al.
[2018] developed pix2pixHD that produces 2048 × 2048 images with a multi-scale
generator and discriminator. Observing that normalization layers often “wash away”
semantic information when the condition maps are passed as input through the net-
work layers, Park et al. [2019b] proposed SPADE that uses the input condition map
to modulate the activations after the normalization, achieving high-quality image gen-
eration given a condition image such as a segmentation map. In this work, we build
upon SPADE and make it an “imaging” network that takes as input a weak observa-
tion of the Moon (as well as the corresponding timestamp) and makes an image of
the Earth that is likely responsible for the observation, relying on strong priors about
the Earth's regularities learned from our data.
5.3 Method
During training, the input to Generative Adversarial Networks for the Earth (Earth-
GANs) is a collection of Earth images associated with their timestamps and average
colors of the corresponding Moon images. The generator of the EarthGAN then learns
a mapping from the timestamp and average Moon color to a full Earth image, while
the discriminator aims to differentiate the generated image from the real Earth im-
age given the timestamp and the average Moon color. Because the input timestamp
and average color do not fully dictate the Earth’s appearance (e.g., different cloud
configurations overlaid on a “clean Earth” may lead to the same Moon observation),
we additionally condition our generation on a randomness vector 𝑧 given by encoding
the ground-truth image during training, as in SPADE [Park et al., 2019b].
At test time, we randomly sample a 𝑧 from its target distribution, to which the
image encoder has learned to stay close during training, and ask the generator to
produce a full-resolution Earth image given 𝑧, the test timestamp, and the average
Moon color. With a trained EarthGAN, one can explore the regularities of the Earth
images learned from the image collection: By querying EarthGAN at actual times-
tamps for which we have ground-truth Earth images captured, we evaluate how well
EarthGAN recovers highly regular lighting from a single-pixel observation; by query-
ing the model at time intervals finer than the capture granularity, we probe whether
EarthGAN has learned the underlying evolution of lighting as a function of time or
has just memorized some snapshots seen during training; by varying 𝑧 while keeping
the other conditions constant, we investigate what aspects of the Earth images are
explained by randomness rather than the timestamp and average color.
Here we describe how to collect a dataset of Earth images, our scene setup for data
generation, and the rendering specifics.
Earth Photographs The Earth images are crawled from the data release by NASA
[2015], taken by the Earth Polychromatic Imaging Camera (EPIC) on the Deep Space
Climate Observatory (DSCOVR) spacecraft. The DSCOVR spacecraft was launched
to the Earth-Sun Lagrange-1 (L-1) point in February of 2015, with the mission of
monitoring space weather events such as geomagnetic storms. EPIC is the on-board
camera capturing 2048 × 2048 images of the entire Sun-lit face of the Earth through
a Cassegrain telescope with a 30 cm aperture.
EPIC takes images every 60 min to 100 min, so there are usually around ten images
available per day. There are two types of images that EPIC takes: the “natural color”
images that are created using bands lying within the human visual range and the
“enhanced color” images that are additionally processed to enhance land features. We
use the natural color images as they simulate what a conventional camera would have
produced [NASA, 2015]. For rendering, we use these images as the Earth plane texture
at their original resolution of 2048 × 2048, while we downsample them to 256 × 256 for
training the EarthGAN. Each photograph is associated with a timestamp specifying
on which date and at what time it was captured (e.g., 2016-01-08 00:55:16). We
use all images from 2016 to 2020 (inclusive) as our training data and all 2021 images
up to May 31 as the test data.
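As a concrete illustration of this split, a minimal sketch (the helper name is ours; the timestamp format matches the example above, and the May 31 cutoff follows the protocol just described):

```python
from datetime import datetime

def split_by_year(timestamps):
    """Split EPIC capture timestamps into train (2016-2020, inclusive)
    and test (2021 up to May 31), mirroring the protocol above."""
    train, test = [], []
    cutoff = datetime(2021, 5, 31, 23, 59, 59)
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if 2016 <= t.year <= 2020:
            train.append(ts)
        elif t.year == 2021 and t <= cutoff:
            test.append(ts)
    return train, test
```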
Scene Setup We set up a Blender1 scene where we use a Lambertian sphere with
albedo by NASA [2019] to approximate the Moon, and an emissive plane (like a
computer monitor) to display the Earth images. Although one could alternatively use
the actual Moon topographic map, such as the one released by NASA [2019] or USGS
[2020], as a displacement map on top of the sphere geometry, and a similar sphere-like
geometry onto which the Earth texture is mapped, the gain from this “physically correct”
setup should be negligible in our data-driven approach that operates at a high level
of abstraction, without explicitly modeling the geometry. Similarly for reflectance,
it is reasonable to just use a Lambertian approximation (instead of the Bidirectional
Reflectance Distribution Function [BRDF] by Hapke [1981]) for the Moon and make
the Earth plane a textured emitter (rather than a non-emitter that reflects light from
the Sun to Moon in a spatially-varying manner).
Despite the shape and reflectance approximations, we respect the Moon-Earth
distance by placing the two objects at a distance in scale w.r.t. their sizes. The
radius of the Moon sphere is set to the actual Moon radius, and the Earth plane is
sized such that the texture, when mapped onto the plane, has the Earth radius be
roughly correct as viewed from the Moon. The Earth plane is perpendicular to the
¹ https://www.blender.org/
Earth-Moon line. The camera sits at the world origin and looks at the Moon center.
Figure 5-1 shows the actual Sun-Earth-Moon system, our simplified version thereof,
and a screenshot of the scene setup in Blender.
Figure 5-1: Our simplification of the Sun-Earth-Moon system. We simplify the actual light transport from the Sun to the Earth and then to (the dark side of) the Moon as light coming directly to the Moon from a “glowing Earth.” Therefore, the emissive Earth plane is the lighting we aim to recover, and the Moon sphere is the Earth-lit object that we observe. (Panels: the actual system, with the bright and dark sides of the Earth and the Moon labeled [photo © Philipp Salzgeber], simplified into our setup with the “glowing” Earth as the lighting and the Moon as the observed object.)
has additional yellow (or blue) tint compared with the mean Moon image. In other
words, when we take averages of such Moon images, the average colors do inform us
of some characteristics of the Earth illuminations.
Figure 5-2: How the Moon responds differently to distinct Earth illuminations. When the African continent (or mostly ocean) illuminates the Moon in Example 1 (or Example 2), the Moon appearance has an additional yellow (or blue) tint compared with the average Moon image. This proves the existence of signals that we can use to estimate the Earth appearance by observing the Moon. (Panels, for Examples 1 and 2: the Earth illumination, and the Moon response minus the mean Moon image.)
on the same date of year, and color distance is simply the ℓ2 distance between two
RGB tuples. Figure 5-3 provides pictorial descriptions of these two baselines.
Figure 5-3: Illustration of the nearest neighbor baselines for EarthGAN. Right: Given the query timestamp, we generate the pool of NN candidates (training images captured at around the same time on the same day of previous years). Left: NN-time (blue box) finds the NN based only on timestamps. NN-obs (green box) additionally computes the mean colors of the Moon observations and then returns the NN candidate with the closest Moon mean to the observed mean.
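The two baselines can be sketched as follows; the candidate-tuple layout and function names are illustrative assumptions, not the thesis implementation, and the pool is assumed pre-filtered as in the figure:

```python
import math

# Each candidate: (seconds_into_day, mean_moon_rgb, image_id). The pool is
# assumed pre-filtered to training images captured around the same time of
# day on the same calendar date of previous years.
def nn_time(query_sec, pool):
    """NN-time: nearest neighbor by time of day only."""
    return min(pool, key=lambda c: abs(c[0] - query_sec))

def nn_obs(query_rgb, pool):
    """NN-obs: nearest neighbor by l2 distance between mean Moon colors."""
    return min(pool, key=lambda c: math.dist(c[1], query_rgb))
```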
Because the NN baselines have to “snap” the query timestamp to one of the train-
ing timestamps, which are almost 2 h apart (Section 5.3.1), they produce discrete
snapshots instead of the desired continuous Earth rotation when we query them at
a continuous series of timestamps. In other words, they are unable to interpolate
between two timestamps or synthesize novel pixels that are not seen during training.
As such, we desire a generative model that learns a continuous function of the
Earth appearance w.r.t. the timestamp and Moon observation, such that the synthe-
sized Earth appearance evolves in a photorealistic and smooth way when we query the
model between two adjacent capture timestamps. This generative model must also be
conditional: Instead of generating random samples as an unconditional GAN does, it
needs to condition its output Earth image on the timestamp and Moon observation.
Intuitively, the model is trained to synthesize the Earth lighting that has given rise
to the appearance of the Moon at that particular timestamp.
Our EarthGAN model is one such model, based on SPADE [Park et al., 2019b],
that models the Earth appearance as a smooth function of the timestamp 𝑡, the
one-pixel observation (mean color) of the Moon 𝑜, and a randomness vector 𝑧. The
generator 𝐺 takes as input the condition c = (𝑡, 𝑜, 𝑧) and generates an Earth image
𝐺(c). The discriminator then learns to differentiate a real pair (c, x) from a generated
fake one (c, 𝐺(c)). Formally, EarthGAN models the conditional distribution of the
Earth appearance given the conditions via the following minimax game [Goodfellow
et al., 2014, Wang et al., 2018]:
$$\min_{G}\,\max_{D}\;\; \mathbb{E}_{(\mathbf{c},\mathbf{x})\sim p_{\mathrm{data}}(\mathbf{c},\mathbf{x})}\bigl[\log D(\mathbf{c},\mathbf{x})\bigr] \;+\; \mathbb{E}_{\mathbf{c}\sim p_{\mathrm{data}}(\mathbf{c})}\bigl[\log\bigl(1 - D(\mathbf{c}, G(\mathbf{c}))\bigr)\bigr], \tag{5.1}$$
where 𝑝data (·)’s are the data distributions, effectively our collection of condition-image
pairs. We defer the implementation and loss details to the end of this section.
Although timestamp 𝑡 seemingly fully dictates what the Earth looks like (e.g.,
which pixels belong to America and which to the Pacific Ocean) according to as-
tronomy, atmospheric conditions such as clouds add randomness to the actual Earth
appearance. Intuitively, the additional observation 𝑜 helps disambiguate certain cases,
e.g., whether the Earth appears more gray (likely due to overlaid clouds) or more
blue (if the atmospheric conditions allow us to see the ocean directly). However, 𝑜 does
not solve the problem entirely: Different cloud patterns may still give the same 𝑜 at
𝑡, and furthermore, there may be other nuances that EarthGAN is not yet modeling.
This sounds like the familiar problem of multi-modal generation often encountered
in GANs: Given the same condition (e.g., a segmentation map), the model needs the
capability of generating multiple plausible images (e.g., different RGB images that
all satisfy the segmentation map).
To this end, we prepend an image encoder to our GAN model, following how multi-modal generation is achieved in SPADE [Park et al., 2019b], to regress (parameters
of) a multivariate Gaussian distribution from the ground-truth Earth image during
training. Intuitively, the image encoder and 𝐺 form a Variational Autoencoder (VAE)
[Kingma and Welling, 2014], with the encoder trying to capture the “style” of the
image. With this design, we can sample a randomness vector 𝑧 from the learned latent
space and have it capture factors and nuances (other than 𝑜 or 𝑡) that might affect
the Earth appearance. At test time, we sample 𝑧 from the prior distribution—a zero-
mean, unit-variance multivariate Gaussian, which was used in the Kullback–Leibler
divergence during training. When we explore how 𝑡 and 𝑜 affect the generation, we
sample just one 𝑧 and use that throughout. On the other hand, when we explore
how 𝑧 affects the generation, we keep 𝑡 and 𝑜 constant, and vary only 𝑧. Figure 5-4
visualizes our EarthGAN model.
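The training-time versus test-time sampling of 𝑧 can be sketched with the standard VAE reparameterization; the function name and the default latent dimension are our own assumptions for illustration:

```python
import math
import random

def sample_z(mu=None, log_var=None, dim=16):
    """Reparameterized sampling of the randomness vector z.
    Training: z = mu + sigma * eps, with (mu, log_var) regressed by the
              image encoder from the ground-truth Earth image.
    Testing:  z ~ N(0, I), the prior used in the KL-divergence term."""
    if mu is None:  # test time: sample from the standard-Gaussian prior
        return [random.gauss(0.0, 1.0) for _ in range(dim)]
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```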
[Figure 5-4 diagram: Image Encoder → Concat. → SPADE Generator → Multi-Scale Discriminator (real or fake), with a KL-divergence term tying the encoder output to 𝒩(µ = 0, Σ = I).]
Figure 5-4: EarthGAN model. Given the Moon observation, we compute the mean as
our observation 𝑜. We also encode the string timestamp into a 3-vector 𝑡, respecting
the time and date semantics. We repeat the concatenation of 𝑜 and 𝑡 across the
spatial dimensions, producing a condition “map” at the same resolution as the Earth
image. The Earth image is encoded into (parameters of) a multivariate Gaussian
distribution encouraged to stay close to the zero-mean, unit-variance Gaussian. The
SPADE generator aims to generate an Earth image given the condition map and a
random sample from the multivariate Gaussian. The multi-scale discriminator then
tries to tell whether the pair of the condition map and Earth image is real. We use
the same loss function as in SPADE.
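The condition “map” described in the caption, i.e., the concatenation of 𝑜 and 𝑡 repeated across the spatial dimensions, can be sketched as follows (the function name is ours):

```python
def make_condition_map(t, o, height, width):
    """Tile the concatenated condition (t, o) across the spatial
    dimensions, yielding a (height, width, 6) 'map' at the same
    resolution as the Earth image."""
    pixel = list(t) + list(o)  # (m', d', s', R, G, B)
    return [[list(pixel) for _ in range(width)] for _ in range(height)]
```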
One might consider using a string encoder from the Natural Language Processing (NLP)
literature, but this would destroy the semantics that we understand well: the peri-
odicities in dates and time (e.g., time is periodic with a cycle of 24 h), their bounds
(e.g., July 31 → August 1 instead of July 32), etc. In addition, training an additional
string encoder would add another layer of obscurity to the already inexplicable GAN
dynamics.
As such, we opt for a simple encoding scheme that not only preserves the seman-
tics (which facilitates learning of the continuous Earth rotation, as we will show in
Section 5.4.3), but also is compatible with the input format required by SPADE-like
models. Specifically, we normalize month 𝑚 ∈ {1, 2, . . . , 12} as 𝑚′ = (𝑚 − 1)/11, day
𝑑 ∈ {1, 2, . . . , 31} as 𝑑′ = (𝑑 − 1)/30, and second in the day 𝑠 ∈ {1, 2, . . . , 86400} as
𝑠′ = (𝑠 − 1)/86399. Then, each element of the encoded timestamp 𝑡 = (𝑚′, 𝑑′, 𝑠′) is in [0, 1];
equivalently, 𝑡 falls inside a unit cube just like the RGB observation 𝑜.
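This encoding scheme translates directly into code; a minimal sketch, assuming the timestamp string format shown in Section 5.3.1 (the function name is ours):

```python
from datetime import datetime

def encode_timestamp(ts):
    """Encode an EPIC timestamp string into t = (m', d', s') in [0, 1]^3,
    deliberately dropping the year so that timestamps differing only in
    year map to the same t."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    m = (dt.month - 1) / 11.0                           # month in [0, 1]
    d = (dt.day - 1) / 30.0                             # day in [0, 1]
    sec = dt.hour * 3600 + dt.minute * 60 + dt.second   # 0 .. 86399
    s = sec / 86399.0                                   # second in [0, 1]
    return (m, d, s)
```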
We leave out the year information on purpose, so two timestamps that are dif-
ferent only in year will be mapped to the same 𝑡. This timestamp encoding scheme
is beneficial because it ensures that EarthGAN observes drastically different Earth
appearance for similar timestamps, as shown in Figure 5-5. Consequently, EarthGAN
is motivated to explain these appearance variations using the randomness vector 𝑧.
Additionally, the year bit is too sparse when mapped to the real axis (i.e., 2015 → 0,
2016 → 1/6, . . . , 2021 → 1).
Training Testing
Figure 5-5: Different Earth appearances at similar timestamps. These four times-
tamps are close in time of day and on the same date of year (January 1), so the
continental and oceanic patterns remain roughly the same across these timestamps.
However, the final Earth appearances are drastically different because of the non-
stationary cloud patterns. These appearance variations motivate EarthGAN to con-
trol the clouds and other nuances with 𝑧. This figure also demonstrates that nearest
neighbors may not reconstruct the test image well due to varying cloud patterns.
Implementation Details & Losses We follow the SPADE design by Park et al.
[2019b] in both network architectures and loss functions. Specifically, the generator is
a series of SPADE residual blocks, at each of which the condition maps are injected.
The discriminator is a multi-resolution convolutional network based on PatchGAN
[Isola et al., 2017, Wang et al., 2018]. Same as SPADE, EarthGAN uses the loss
function from pix2pixHD [Wang et al., 2018] except that the least-squares loss term
[Mao et al., 2017] is replaced with the hinge loss [Miyato et al., 2018, Zhang et al.,
2019]. We train EarthGAN on one NVIDIA TITAN RTX for around eight hours.
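For concreteness, the hinge loss that replaces the least-squares term can be sketched on scalar logits; in practice it is applied per patch and per scale in the multi-scale PatchGAN discriminator, and the function names here are ours:

```python
def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss, averaged over a batch of scalar logits:
    real logits are pushed above +1, fake logits below -1."""
    real = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real + fake

def g_hinge_loss(fake_logits):
    """Generator hinge loss: push the discriminator's fake logits up."""
    return -sum(fake_logits) / len(fake_logits)
```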
5.4 Results
In this section, we first explain how we generate test conditions to evaluate Earth-
GAN’s performance. We then perform qualitative and quantitative evaluations of
how well EarthGAN recovers the Earth lighting in 2021, for which we have ground-
truth Earth photos captured by EPIC. Next, we probe EarthGAN’s limit by testing
it on timestamps at a finer granularity (5 min apart) than the actual timestamps
(almost 120 min apart), answering the question whether EarthGAN learns any un-
derlying regularities in these Earth images. We also compare EarthGAN’s synthesis
against that of the Nearest Neighbor (NN) baselines, demonstrating how EarthGAN
outperforms the NN baselines significantly when the query timestamp is far from
the nearby timestamps, or when the ground-truth image contains novel pixels unseen
during training. We demonstrate how EarthGAN learns to model clouds and other
nuances as a function of the randomness vector 𝑧. Finally, we perform an ablation
study where we demonstrate the importance of each major design choice.
Generation of our test data is similar to the training data generation described in
Section 5.3.1. There are three test sets that are designed to reveal I) whether the
lighting recovery is high-quality, II) whether EarthGAN learns the underlying data
regularities, and III) what aspect of the Earth appearance EarthGAN learns to control
with 𝑧, respectively.
For I), we generate our test set with all 2021 timestamps together with their
corresponding Moon observations (recall that EarthGAN is trained on the 2016–2020
data). For these test points, we have the ground-truth Earth images captured by
EPIC, so computing quantitative errors is straightforward. For II), we generate our
test set by uniformly generating timestamps separated by 5 min within the entire
day. For convenience, the two date dimensions (month and day) are fixed to May
31 (the last 2021 date that we consider), the mean color is also fixed to that of the
same day, and 𝑧 is set to all 0’s. We do not have ground truth for these timestamps
since they do not correspond to the actual capture timestamps, but we expect a
good model to produce smooth evolution of the Earth appearance given the Earth’s
underlying rotation. For III), we generate our test set by randomly sampling 𝑧 from
the standard Gaussian while keeping timestamp 𝑡 and observation 𝑜 constant. Again,
these constant values are taken from May 31 for convenience.
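The Test Set II query timestamps described above can be generated in a few lines; the helper name is ours:

```python
def five_minute_timestamps(date="2021-05-31"):
    """Generate Test Set II query timestamps: one every 5 min across the
    entire day, with the date fixed (here, May 31, 2021)."""
    out = []
    for minutes in range(0, 24 * 60, 5):
        h, m = divmod(minutes, 60)
        out.append(f"{date} {h:02d}:{m:02d}:00")
    return out
```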
In terms of evaluation, we use three metrics: Peak Signal-to-Noise Ratio (PSNR),
structural similarity (SSIM) [Wang et al., 2004], and the Learned Perceptual Image
Patch Similarity (LPIPS) [Zhang et al., 2018a]. Because per-pixel error metrics such
as PSNR fail to capture perceptual quality of the synthesis, we highly recommend
the reader to view the qualitative results in the figures and the supplemental video.
This is also observed in Chapter 3; see Section 3.5.2 for more discussion.
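Of the three metrics, PSNR is simple enough to sketch directly (SSIM and LPIPS require their reference implementations, e.g., scikit-image and the lpips package); here images are assumed flattened to lists of pixel values in [0, max_val]:

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two same-sized images
    given as flat lists of pixel values in [0, max_val]. Higher is
    better; identical images give infinity."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```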
We query the trained EarthGAN with the actual 2021 timestamps for which we
have ground-truth Earth photos (Test Set I in Section 5.4.1). As Figure 5-6 shows,
EarthGAN is able to generate photorealistic Earth images given the timestamps and
Moon observations. The generated continental and oceanic patterns resemble those
of the ground truth, relying mostly on the time part of the conditions. The cloud
patterns are controlled by additionally the Moon observation and the randomness
vector 𝑧. Although these cloud patterns do not replicate the ground-truth patterns,
they are perceptually photorealistic.
Although Figure 5-6 seems to suggest the two NN baselines, NN-time that re-
trieves the NN using just timestamps and NN-obs that additionally uses mean Moon
observations, are on par with EarthGAN, the good performance of the NN baselines
relies heavily on EPIC’s regular sampling pattern: It takes photos at a mostly fixed
time interval, making it easy to find neighbors that are captured roughly at the same
time of day given the query timestamp. As such, the NN baselines are able to capture
correctly the continental and oceanic patterns, by simply retrieving neighbors that
are also captured around the same time of day.
This sampling regularity breaks entirely when we start querying EarthGAN at
arbitrary timestamps as in Section 5.4.3, where we ask EarthGAN to synthesize the
Earth appearance at an interval of 5 min. As shown in Section 5.4.3, these NN-based
methods will “snap” the query timestamp to a timestamp from EPIC’s sampling
pattern for several consecutive frames and suddenly switch to the next timestamp,
producing temporally unstable snapshots of the Earth appearance.
[Figure 5-6: panels (I), (II), and (III).]
Quantitatively, Table 5.1 also suggests that NN-time is the best performing model
because of the issue discussed above. We encountered exactly the same problem in
Chapter 3, where the regular pattern of the light stage lights makes the baseline
approaches look performant quantitatively since we can compute errors only on the
physical lights that do fall onto this regular pattern, but the baseline methods fail at
test time when the query light no longer falls onto the regular pattern. Please see
Section 3.5.2 for an extensive discussion. Again, we strongly encourage the reader
to view the video comparison between EarthGAN and NN-time when we query both
models with novel timestamps that do not fall onto EPIC’s sampling pattern.
We test the trained EarthGAN on Test Set II as specified in Section 5.4.1: novel times-
tamps at a finer granularity (5 min apart) than the actual timestamps (almost 120 min
apart). This task can be thought of as an attempt to “super-resolve” the Earth ro-
tation in time. As Figure 5-7 (top) shows, when we query the trained EarthGAN at
dense novel timestamps that are only 5 min apart, EarthGAN generates photorealis-
tic, smooth evolution of the Earth appearance that corresponds to the elapsed time
of 5 min, despite having seen only discrete snapshots separated by up to 2 h.
To clearly show these subtle but non-trivial appearance changes, we additionally
show zoom-in visualization of the same crop of the Earth images across different query
timestamps in Figure 5-8. The two boundary timestamps, for which we synthesize
Figure 5-7: Continuous Earth rotation learned by EarthGAN. Top: Despite having seen only snapshots that are almost 2 h apart, EarthGAN learns the underlying continuous Earth rotation and synthesizes smooth evolution of the Earth appearance at an interval of 5 min (our results at query timestamps 2021-01-01 13:30:00, 13:35:00, and 13:40:00). Bottom: The NN-time baseline retrieves the same training image (2019-01-01 12:44:50) for the first two query timestamps (13:30:00 and 13:35:00). In other words, the query timestamps get “snapped” to the same training timestamp. Because the training data are almost 2 h apart, NN-time then “jumps” from 12:44:50 directly to 2019-01-01 14:32:53 even though we advance the query timestamp by only 5 min, from 13:35:00 to 13:40:00.
the Earth images in Figure 5-8 (top), are 115 min apart to simulate the time inter-
val between real EPIC captures. EarthGAN is able to “interpolate” from the start
timestamp to the end timestamp, synthesizing the Earth appearance every 5 min in
between, as shown in Figure 5-8 (bottom). Close inspection of Figure 5-8 (bottom)
reveals that the synthesis results do not stay stationary and suddenly jump to the
next pattern (as done by the NN baselines), but rather evolve smoothly, with the
Australian continent moving gradually from the left of the zoom-in window to the
right. Crucially, EarthGAN has no knowledge of the Earth rotation mechanics and is
not instructed to produce smoothly-varying synthesis, but rather learns this underly-
ing Earth motion from data, by observing discrete snapshots (2 h apart) of the Earth
appearance.
We compare EarthGAN against the NN baselines qualitatively in Figure 5-7 (bot-
tom). In contrast to EarthGAN that synthesizes a smooth evolution of the Earth
appearance (Figure 5-7 [top]), NN-time “snaps” the query timestamps, 13:30:00 and
13:35:00, both to the nearest training timestamp, 12:44:50, hence mistakenly pro-
ducing the same reconstruction for different timestamps (yellow box of Figure 5-7).
When we advance the query timestamp by just another 5 min to 13:40:00, NN-time
abruptly “jumps” to the next training timestamp, 14:32:53, that is almost 1 h ahead
of the query timestamp, thereby producing the wrong continental pattern (green box
of Figure 5-7).
We ask the trained EarthGAN to synthesize multiple Earth images from randomly
sampled 𝑧’s but fixed timestamp 𝑡 and observation 𝑜. This test (using Test Set III
of Section 5.4.1) sheds light on what aspects of the Earth appearance EarthGAN
models with 𝑧 when it is not explicitly asked to. Recall that with our time encoding
scheme that discards the year information by design, EarthGAN observes multiple
possible Earth appearances for a given timestamp, as demonstrated in Figure 5-
5. This encourages EarthGAN to model appearance variance even for the same
timestamp and (similar) Moon observation.
Figure 5-9: How EarthGAN learns to model the clouds. Besides the timestamp and average Moon color, atmospheric conditions such as clouds and other nuances also affect the Earth appearance. EarthGAN learns to model appearance variations due to these factors with its randomness vector 𝑧. By sampling different 𝑧 vectors, we generate multiple possible Earth appearances for the same timestamp and Moon observation. (Panels: the ground truth and the NN-time retrieval alongside six random samples of our generation.)
5.4.5 Ablation Studies
We now study the importance of the major design choices in developing EarthGAN.
Specifically, we investigate whether we need to observe the Moon at all to make an
image of the Earth when we already have the timestamp, whether the choice of the
GAN architecture affects the generation quality, and why the current time encoding
scheme is superior to the alternatives.
Without Observing the Moon Given the clear regularities in our Earth images
and the strong dependency of the Earth appearance on the timestamp, we study
whether one needs a single-pixel observation of the Moon at all. To this end, we train
an EarthGAN that conditions the Earth appearance only on the timestamp 𝑡 and the
randomness vector 𝑧, without having access to the mean Moon color.
Although Figure 5-10 (D, E, F) suggests that not observing the Moon at all
does not degrade the visual quality of the generation in that all images in (D, E)
look photorealistic, the quantitative evaluation in Table 5.1 proves that additionally
observing the single-pixel Moon yields more accurate reconstruction across all three
error metrics.
Choice of the GAN Architecture Because StyleGAN has been proven successful
in generating high-resolution photorealistic human faces [Karras et al., 2019, 2020],
we study whether the previous results still hold if we replace our SPADE backbone
to a StyleGAN backbone. As Figure 5-10 (A, E, F) shows, EarthGAN with a Style-
GAN backbone still produces photorealistic results, but upon viewing the video of
this model variant’s generation, we notice that it fails to learn the continuous Earth
rotation as learned by EarthGAN with a SPADE backbone. It remains future work to understand why a similar GAN backbone fails to learn the continuous Earth rotation.
Figure 5-10: Ablation studies of EarthGAN’s design choices, on Test Sets I and II. (A) EarthGAN with a StyleGAN backbone fails to learn the continuous Earth rotation (not pictured here). (B, C) Other timestamp encoding alternatives may lead to inaccurate continental/oceanic patterns. (D) Not observing the Moon at all still produces photorealistic synthesis but, compared with ours (E) and the ground truth (F), hurts the reconstruction accuracy per the quantitative evaluation (Table 5.1).
in and are more suitable for a specific task (e.g., 3D surface completion) than other
representations. Similarly, the timestamp representation—how we encode timestamp
strings into numerical values—is crucial for EarthGAN’s performance because dif-
ferent encoding schemes have different semantics built in: For instance, encoding
January 1 to 0 and January 31 to 1 provides semantics about the periodicity and
boundedness of the day of month.
5.5 Conclusion
We have presented Generative Adversarial Networks for the Earth (Earth-
GANs) for recovering the Earth appearance, as the light source, from a single-pixel
observation of the Moon. Specifically, EarthGAN takes as input the timestamp and a
single-pixel observation (mean color) of the Moon, and then outputs an Earth image
that is likely responsible for the Moon observation. EarthGAN learns the strong reg-
ularities present in the Earth images (captured by a spacecraft camera) and produces
photorealistic Earth images indistinguishable from real photographs. Importantly,
EarthGAN learns the smooth evolution of the Earth appearance due to the underly-
ing Earth rotation, despite having seen only discrete snapshots.
The main idea behind EarthGAN is learning the strong priors on what an Earth image should look like from a collection of around 23,000 Earth images, their corresponding Moon renders due to the Earth as the light source, and the timestamps.
This data-driven approach potentially allows one to use as input single images taken
with a mobile phone camera. Conditioned on the timestamp and just the average
color of the Moon observation, EarthGAN recovers the Earth as the light source ac-
curately and learns to control the unexplained Earth appearance variations with a
randomness vector, with which EarthGAN learns to associate cloud patterns.
Although EarthGAN achieves promising results, it is not without limitations.
Firstly, the imaging system has been simplified such that the Earth directly serves as
the light source, emitting light from a monitor-like plane, whereas in reality, light from
the Sun hits the Earth and gets reflected, in a spatially-varying manner (e.g., land
and ocean have different reflectance properties), to the Moon. Other simplifications
include that the Moon is modeled as a Lambertian sphere in the scene without using
its actual topography and reflectance. Secondly, we compute the average Moon color
using all of the Moon pixels, while in practice, only the dark side of the Moon should
be used since the bright side is also lit by the Sun. Yet the bright side may provide
useful calibration signals for the phone camera at hand. Finally, it remains unclear
whether EarthGAN can be readily applied to real-world images.
Chapter 6
In this dissertation, we have presented the broad problem of inverse rendering and
further discussed four subtopics thereof: I) joint shape, reflectance, and lighting from appearance (Chapter 2), II) light transport function from appearance (Chapter 3), III) shape from appearance (Chapter 4), and IV) lighting from appearance (Chapter 5).
These four instances represent three levels of abstraction to tackle inverse
rendering. I) At a low level of abstraction, we have proposed methods that fully fac-
torize the object appearance into shape, reflectance, and illumination, which then get
re-rendered back to the RGB images in a physically-based (though simplified) manner [Srinivasan et al., 2021, Zhang et al., 2021c]. Though challenging, such low-level decomposition explicitly solves for every term in the rendering equation, thereby supporting further applications that mid- or high-level solutions are incapable of, such as editing and exporting of geometry or material. II) At a middle level, we have shown
how to interpolate the light transport function from sparse samples thereof to enable
relighting, view synthesis, or both tasks simultaneously [Sun et al., 2020, Zhang et al.,
2021b]. This abstraction level properly conceals the underlying complex BXDFs and
ray bounces, and suffices for high-quality relighting and view synthesis. III) At a high
level of abstraction, we have trained deep learning models to learn direct mappings
from a single image to shape (Chapter 4) [Sun et al., 2018b, Wu et al., 2018, Zhang
et al., 2018b] or lighting (Chapter 5), without modeling the other scene constituents or
the rendering process. Relying on data-driven priors learned from large-scale datasets,
these high-level methods circumvent the need for exhaustive modeling of the image
formation process and enable applications to single images.
Next, we outline some high-level challenges and future directions around these
four subtopics of this dissertation.
While our models achieved full appearance decomposition (Chapter 2), we made
many simplifying assumptions about the scene elements and the rendering process
itself. For instance, Neural Reflectance and Visibility Fields (NeRV) assumes known
lighting [Srinivasan et al., 2021], Neural Factorization of Shape and Reflectance (NeR-
Factor) assumes direct illumination only [Zhang et al., 2021c], and both NeRV and
NeRFactor consider just Bidirectional Reflectance Distribution Functions (BRDFs;
cf. the more general BXDFs) and non-emissive objects. Furthermore, both methods
follow the trend of expressing everything with function approximators such as Multi-
Layer Perceptrons (MLPs). It remains unclear what the optimal way is to export
these “neural models” into a traditional graphics pipeline. One straightforward way
is meshing the geometry and converting the neural reflectance into an analytic model
(or even producing a reflectance look-up table by repeatedly querying the trained
networks), but this approach is clearly suboptimal and may defeat the purpose of
using neural models in the first place.
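To make the look-up-table idea concrete, the following sketch tabulates a learned reflectance function over a grid of incoming and outgoing directions. The `reflectance_fn` interface (one elevation and one azimuth angle per direction, in the local shading frame) is an assumption for illustration, not NeRFactor's actual API:

```python
import numpy as np

def bake_reflectance_lut(reflectance_fn, n_theta=8, n_phi=16):
    """Tabulate a learned reflectance function into a look-up table by
    repeatedly querying it over a grid of (incoming, outgoing) directions.

    `reflectance_fn(theta_i, phi_i, theta_o, phi_o)` stands in for a query to
    the trained reflectance MLP; the angular parameterization here is an
    assumed, illustrative interface.
    """
    thetas = np.linspace(0.0, np.pi / 2, n_theta)  # elevation above the surface
    phis = np.linspace(0.0, 2 * np.pi, n_phi, endpoint=False)  # azimuth
    lut = np.empty((n_theta, n_phi, n_theta, n_phi))
    for i, t_i in enumerate(thetas):
        for j, p_i in enumerate(phis):
            for k, t_o in enumerate(thetas):
                for l, p_o in enumerate(phis):
                    lut[i, j, k, l] = reflectance_fn(t_i, p_i, t_o, p_o)
    return lut
```

Even this simple sketch exposes the trade-off noted above: the table grows with the fourth power of the angular resolution, whereas the MLP stores the same function compactly.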
In single-image 3D shape reconstruction (Chapter 4), we have mostly used voxels
as our shape representation, and the network architectures are designed or selected ac-
cordingly [Sun et al., 2018b, Wu et al., 2018, Zhang et al., 2018b]. Recently, there have
been significant advances in representing shapes using implicit functions [Sitzmann
et al., 2019a,b, Park et al., 2019a, Sitzmann et al., 2020, Mildenhall et al., 2020], as
discussed in Section 1.1.1. This representation switch may lead to a paradigm shift
in how we design the shape reconstruction networks and shape operations such as
ray casting. Even in the event of such a paradigm shift, we believe what we learned from Generalizable Reconstruction (GenRe) is transferable to future methods: for better generalizability, one should hardcode the physical processes that we understand well, such as geometric projections, instead of learning them from scratch [Zhang et al., 2018b].
For data-driven recovery of the Earth appearance from the Moon observation
(Chapter 5), we demonstrated only simulation results and acknowledge that there may
be additional practical challenges in applying Generative Adversarial Networks for
the Earth (EarthGANs) to real-world images of the Moon. For example, in practice,
useful signals lie only in the dark region of the Moon since the rest is Sun-lit with
the weak signals from the Earth overwhelmed by the Sun illumination. As such, one
might need a specific phase of the Moon to perform this data-driven “imaging.” In
addition, because we would be imaging the dark region of the Moon, it might be challenging to capture that region with a high signal-to-noise ratio. That said,
EarthGAN is still promising given that it requires only a single-pixel observation of
the Moon, thanks to the high-level data-driven approach.
As a closing remark, we have witnessed the paradigm shift of personal computing
devices, which brought us to the current era of pervasive mobile phones and laptops.
Will Extended Reality (XR) be the next mode of working, gaming, communicating,
etc.? Whatever form the next paradigm takes, we hope that this dissertation contributes to it by accelerating and democratizing 3D content capture and creation.
Appendix A
This appendix chapter contains additional implementation details for Neural Re-
flectance and Visibility Fields (NeRV) [Srinivasan et al., 2021] and additional quali-
tative results from the experiments discussed in Chapter 2.
Please view our supplementary video for a brief overview of NeRV, qualitative
results with smoothly-moving novel light and camera paths, and demonstrations of
additional graphics applications.
$$D(\mathbf{h}, \mathbf{n}, \gamma) = \frac{\rho^2}{\pi \left( (\mathbf{n} \cdot \mathbf{h})^2 \left( \rho^2 - 1 \right) + 1 \right)^2}, \tag{A.2}$$

$$F(\omega_i, \mathbf{h}) = F_0 + (1 - F_0) \left( 1 - (\omega_i \cdot \mathbf{h}) \right)^5, \tag{A.3}$$

$$G(\omega_i, \omega_o, \gamma) = \frac{(\mathbf{n} \cdot \omega_o)(\mathbf{n} \cdot \omega_i)}{\left( (\mathbf{n} \cdot \omega_o)(1 - k) + k \right) \left( (\mathbf{n} \cdot \omega_i)(1 - k) + k \right)}, \tag{A.4}$$

$$\rho = \gamma^2, \qquad \mathbf{h} = \frac{\omega_o + \omega_i}{\lVert \omega_o + \omega_i \rVert}, \qquad k = \frac{\gamma^4}{2}, \tag{A.5}$$
where a is the diffuse albedo, 𝛾 is the roughness, and n is the surface normal at 3D
point x. We use 𝐹0 = 0.04, which is the typical value of dielectric (non-conducting)
materials. Note that our definition of the BRDF includes the multiplication by the
Lambert cosine term (n · 𝜔𝑖 ) in order to simplify the equations in Chapter 2.
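For concreteness, the following sketch evaluates the three microfacet terms of Equations (A.2)–(A.5) numerically. It illustrates the formulas above only; assembling them with the diffuse term into the full BRDF (Equation A.1) is omitted here, and this is of course not the MLP-based implementation used in NeRV:

```python
import numpy as np

def microfacet_terms(n, w_i, w_o, gamma, f0=0.04):
    """Evaluate D, F, and G of Equations (A.2)-(A.4) for unit vectors n
    (surface normal), w_i (incoming light direction), and w_o (outgoing view
    direction), with roughness gamma. Helpers follow Equation (A.5)."""
    h = (w_o + w_i) / np.linalg.norm(w_o + w_i)  # half vector, Eq. (A.5)
    rho = gamma ** 2                             # Eq. (A.5)
    k = gamma ** 4 / 2.0                         # Eq. (A.5)
    n_h, n_i, n_o = float(n @ h), float(n @ w_i), float(n @ w_o)
    d = rho ** 2 / (np.pi * (n_h ** 2 * (rho ** 2 - 1.0) + 1.0) ** 2)  # (A.2)
    f = f0 + (1.0 - f0) * (1.0 - float(w_i @ h)) ** 5                  # (A.3)
    g = (n_o * n_i) / ((n_o * (1.0 - k) + k) * (n_i * (1.0 - k) + k))  # (A.4)
    return d, f, g
```

At normal incidence (n = ω_i = ω_o), the terms reduce to D = 1/(πγ⁴), F = F₀, and G = 1, which is a quick sanity check on the reconstruction.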
A.3 Limitations
Recovering a NeRV is a straightforward optimization problem: We optimize the pa-
rameters of the Multi-Layer Perceptrons (MLPs) that comprise a NeRV scene repre-
sentation to minimize the error of re-rendering the input images. NeRV currently does
not incorporate any priors into the optimization problem, so a promising direction for
future work would be to integrate priors on geometry and reflectance (such as learned
priors or simple hand-crafted priors to encourage smooth geometry or reflectance predictions) into the NeRV optimization so that a relightable 3D scene representation could be recovered from fewer viewpoints or fewer observed lighting conditions.

Figure A-1: NeRV vs. NLT (rows: Ground Truth, NeRV (Ours), NLT). Neural Light Transport (NLT) [Zhang et al., 2021b] uses a controlled laboratory lighting setup with eight times as many images as used by NeRV, and an input proxy geometry (which is recovered by training a NeRF on a set of images with fixed illumination). The artifacts seen in the shadows of NLT's renderings demonstrate the difference between recovering geometry that works well for view synthesis (as NLT does) and recovering geometry that works well for both view synthesis and relighting (as NeRV does).

Figure A-2: Additional results and baseline comparisons for NeRV (columns: training illuminations of Single Point, Colorful + Point, and Ambient + Point; rows: Ground Truth, NeRV (Ours), Bi et al., NeRF + LE, NeRF + Env). NeRV is able to render convincing images from novel viewpoints under novel lighting conditions. The method of Bi et al. [2020a] is unable to recover accurate models when trained with illumination more complex than a single point light (Columns 3–6). Methods that use latent codes to explain variation in appearance due to lighting (NeRF + LE and NeRF + Env) are unable to generalize to lighting conditions different than those seen during training.
Successfully recovering a NeRV representation relies on jointly optimizing the
geometry, reflectance, and visibility MLPs. We have noticed failure cases where the
reflectance MLP seems to converge faster than the geometry and visibility MLPs and
gets stuck in a local minimum. For example, in cases where the scene is observed under
very few illumination conditions, the reflectance MLP sometimes quickly converges
to include shadows and light tints in the recovered albedo, and is not able to recover
even after the visibility MLP catches up to correctly explain those shadows. Further
investigations into the optimization landscape and dynamics of NeRV could help shed
light on this issue.
Finally, the NeRV optimization problem trains a geometry MLP along with a
visibility MLP that is meant to approximate integrals of the geometry MLP’s output.
Although we impose a loss that encourages these two MLPs to be consistent with
each other, there is no guarantee that these two MLPs will be exactly consistent.
Investigating potential strategies to enforce such consistency may be helpful.
Appendix B
In this appendix chapter, we provide details on the network architecture and pro-
gressive training scheme of Light Stage Super-Resolution (LSSR) [Sun et al., 2020]
introduced in Chapter 3. We also provide more results and baseline comparisons.
Figure B-1: Network architecture and progressive training scheme of LSSR. The 𝛼𝑑 parameters control the progressive training and growing of the network at each scale 𝑑 by modulating the resolution at which input images are used and at which output images are compared to the ground truth.
the 𝑑’th stage of training, we use a convex combination of the auxiliary image at level
𝑑 and an upsampled version of the auxiliary image at level 𝑑 + 1 as the current model
prediction. Our loss in stage 𝑑 is imposed between that combined image and the true
image, downsampled to the native resolution of level 𝑑 of our network. This approach
ensures that the internal activation of our decoder at level 𝑑 is sufficient to enable the
reconstruction of an accurate RGB image (via the auxiliary branch), which means
that the training of stage 𝑑 results in network weights that are well-suited to initialize
the as-yet-untrained model weights on level 𝑑 − 1 of the decoder in the next stage.
At the beginning of each stage's training, the loss is thus imposed mainly on the upsampled version of the last stage's predicted image, but at the end of that stage's training the loss is imposed entirely on the current stage's predicted image. These 𝛼𝑑 factors also
modulate the input to the encoder: As indicated in Figure B-1, the input to each
level of the encoder is a weighted average of the output from the earlier level and
a downsampled version of the input images. This means that the annealing of each
𝛼𝑑 value has a similar effect on the progressive growing of the encoder as it does for
the decoder: The deeper layers of the decoder are trained first using downsampled
images, and then each finer layer of the decoder is added and blended in at each stage
of training.
Our model is trained using a single optimizer instance with four stages, each of
which corresponds to a spatial scale. For the first three stages, we train the model
in two parts: 30,000 iterations at that stage’s spatial resolution, followed by 20,000
iterations as 𝛼𝑑 is linearly interpolated from that scale to the next. At our final stage,
we train the model for 50,000 iterations. At each stage 𝑑, our model minimizes only
ℒ𝑑 . Note that this gradual annealing of each 𝛼𝑑 during each scale means that the loss
is always a continuous function of the optimization iteration, as ℒ𝑑 at the beginning
of training for stage 𝑑 equals ℒ𝑑+1 at the end of training for stage 𝑑 + 1. In total, we
train our network for 200,000 iterations.
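The schedule above can be summarized as a mapping from the global training iteration to the active stage and its blending weight 𝛼𝑑. The sketch below assumes stages are indexed 𝑑 = 3 (coarsest) down to 𝑑 = 0 (finest) and that 𝛼𝑑 anneals linearly from 1 to 0 over the 20,000 blending iterations; the indexing direction and annealing convention are our assumptions for illustration:

```python
def stage_and_alpha(it):
    """Map a global training iteration to (stage index d, blend weight alpha_d)
    under the schedule in the text: for each of the first three stages, 30,000
    iterations at that stage's resolution, then 20,000 iterations of linear
    blending toward the next finer stage; finally 50,000 iterations at the
    finest stage, for 200,000 iterations in total."""
    for d in (3, 2, 1):
        if it < 30000:
            return d, 1.0                    # train purely at this scale
        it -= 30000
        if it < 20000:
            return d, 1.0 - it / 20000.0     # linearly blend toward stage d - 1
        it -= 20000
    return 0, 1.0                            # final 50,000 iterations
```

Because the blend weight reaches exactly 0 as the next stage takes over, the loss stays a continuous function of the iteration, matching the continuity property stated above.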
Figure B-2: More comparisons between LSSR and the baselines. Each comparison shows: (a) ground truth, (b) ours, (c) linear blending, (d) Fuchs et al. [2007], (e) photometric stereo, (f) Xu et al. [2018] w/ optimal samples, (g) Xu et al. [2018] w/ adaptive samples, and (h) Meka et al. [2019].
Appendix C
Supplement: Pix3D
Here we explain in detail our evaluation protocol for single-image 3D shape reconstruc-
tion algorithms. As different voxelization methods may result in objects of different
scales in the voxel grid, for a fair comparison, we preprocess all voxels and point
clouds before calculating Intersection over Union (IoU), Chamfer Distance (CD), and
Earth Mover’s Distance (EMD).
For IoU, we first find the bounding box of the object with a threshold of 0.1,
pad the bounding box into a cube, and then use trilinear interpolation to resample
the cube to the desired resolution (32³). Some algorithms reconstruct shapes at a resolution of 128³. In this case, we first apply a 4× max pooling before trilinear
interpolation because without the max pooling, the sampling grid can be too sparse
to capture thin structures. After the resampling of both the output voxel and the
ground-truth voxel, we search for the optimal threshold that maximizes the average
IoU score over all objects, from 0.01 to 0.50 with a step size of 0.01.
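The threshold search can be sketched as follows, where `preds` are the resampled occupancy grids (values in [0, 1]) and `gts` the binary ground truths. Note that a single threshold is chosen to maximize the IoU averaged over all objects, rather than a per-object threshold:

```python
import numpy as np

def best_threshold_iou(preds, gts, thresholds=np.arange(0.01, 0.51, 0.01)):
    """Binarize each predicted occupancy grid at a candidate threshold and
    return the (threshold, mean IoU) pair whose single shared threshold
    maximizes the IoU averaged over all objects."""
    best_t, best_iou = None, -1.0
    for t in thresholds:
        ious = []
        for pred, gt in zip(preds, gts):
            binary = pred >= t
            inter = np.logical_and(binary, gt).sum()
            union = np.logical_or(binary, gt).sum()
            ious.append(inter / union if union > 0 else 1.0)
        mean_iou = float(np.mean(ious))
        if mean_iou > best_iou:
            best_t, best_iou = float(t), mean_iou
    return best_t, best_iou
```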
For CD and EMD, we first sample a point cloud from the voxelized reconstructions.
For each shape, we compute its isosurface with a threshold of 0.1 and then sample
1,024 points from the surface. All point clouds are then translated and scaled such
that the bounding box of the point cloud is centered at the origin with its longest
side being 1. We then compute CD and EMD for each pair of point clouds.
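A sketch of the normalization and of the symmetric Chamfer Distance follows; EMD additionally requires solving an optimal matching between the two point sets (e.g., with the Hungarian algorithm) and is omitted here. The CD averaging convention below is one common choice and may differ in detail from the exact one used in our evaluation:

```python
import numpy as np

def normalize_pc(pts):
    """Translate and scale a point cloud (N x 3) so that its axis-aligned
    bounding box is centered at the origin with its longest side equal to 1."""
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()  # longest bounding-box side
    return (pts - center) / scale

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N x 3) and b (M x 3):
    average nearest-neighbor distance in both directions, summed."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # N x M
    return dists.min(axis=1).mean() + dists.min(axis=0).mean()
```

The brute-force pairwise distance matrix is fine at 1,024 points per cloud; larger clouds would call for a KD-tree nearest-neighbor query instead.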
In Section 4.6.2, we compared the three metrics using two human studies. Here we compare them in yet another way: for a 3D shape, we retrieve its three nearest neighbors from Pix3D according to IoU, EMD, and CD. As Figure C-1 shows, EMD and CD perform slightly better than IoU in this task.
Figure C-1: Retrieving nearest neighbors in Pix3D using different metrics (a query shape followed by its top-3 retrievals under IoU, EMD, and CD). EMD and CD work slightly better than IoU.
C.3 Sample Data in Pix3D
We supply more sample data in Figure C-2, Figure C-3, and Figure C-4. Figure C-2
shows that each shape in Pix3D is associated with a rich set of 2D images. Figure C-3
and Figure C-4 show the diversity of 3D shapes and the quality of 2D-3D alignment
in Pix3D.
Figure C-3: Sample images and their corresponding shapes in Pix3D. From left to
right: 3D shapes, 2D images, and 2D-3D alignment. Rows 1–2 are beds, Rows 3–4
are bookshelves, Rows 5–6 are scanned chairs, Rows 7–8 are chairs whose 3D shapes
come from IKEA [Lim et al., 2013], and Rows 9–10 are desks.
Figure C-4: More sample images and their corresponding shapes in Pix3D. From left
to right: 3D shapes, 2D images, and 2D-3D alignment. Rows 1–2 are miscellaneous
objects, Rows 3–4 are sofas, Rows 5–6 are tables, Rows 7–8 are tools, and Rows 9–10
are wardrobes.
Appendix D
Supplement: Generalizable
Reconstruction (GenRe)
In this appendix chapter, we provide the details about data preparation and model
architecture for Generalizable Reconstruction (GenRe) [Zhang et al., 2018b], intro-
duced in Chapter 4.
We describe how we prepare our data for network training and testing.
Scene Setup The camera is fully specified by its azimuth and elevation angles as
its distance from the object is fixed at 2.2, its up vector is always the world +𝑦 axis,
and it always looks at the world origin, where the object center lies. The focal length
of our camera is fixed at 50 mm on a 35 mm film. The depth values are measured
from the camera center (i.e., ray depth), rather than from the image plane.
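The distinction between ray depth and plane depth amounts to a per-pixel rescaling. A minimal sketch, assuming a pinhole camera with pixel offsets (u, v) measured from the principal point in the same units as the focal length f (e.g., millimeters on the 35 mm film):

```python
import math

def plane_to_ray_depth(z, u, v, f):
    """Convert depth measured perpendicular to the image plane (z) into depth
    measured from the camera center along the ray through the pixel at offset
    (u, v) from the principal point, for focal length f. The ray direction is
    (u/f, v/f, 1), so the conversion is just its length."""
    return z * math.sqrt(u * u + v * v + f * f) / f
```

At the principal point (u = v = 0) the two conventions coincide; toward the image corners, ray depth exceeds plane depth.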
Rendering We render 20 images of random views (or 200 fixed views in the view-
point study) for each object of interest. To boost the rendering realism and diversity,
we use three types of background: the SUN backgrounds [Xiao et al., 2010], High-Dynamic-Range (HDR) environment lighting crawled from the web, and pure white
backgrounds. Specifically, for each rendering, we randomly sample a background
type and then a random instance of that type. We use Mitsuba [Jakob, 2010] for our
rendering.
Data Augmentation For network training, we augment our RGB images with
three techniques: color jittering, adding lighting noise, and color normalization. In
color jittering, we multiply the brightness, contrast, and saturation, one by one in a
random order, by a random factor uniformly sampled from [0.6, 1.4]. We then add
AlexNet-style lighting noise [Krizhevsky et al., 2012] and perform the standard color
normalization with statistics derived from the ImageNet dataset [Deng et al., 2009].
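A sketch of the jittering step follows; the specific brightness, contrast, and saturation formulas below are common conventions and may differ in detail from our actual implementation:

```python
import numpy as np

def color_jitter(img, lo=0.6, hi=1.4, rng=None):
    """Scale brightness, contrast, and saturation of `img` (float H x W x 3
    array in [0, 1]), one at a time in a random order, each by a factor drawn
    uniformly from [lo, hi]."""
    rng = np.random.default_rng() if rng is None else rng

    def brightness(x, f):
        return x * f

    def contrast(x, f):
        mean = x.mean()  # blend toward the global mean intensity
        return mean + (x - mean) * f

    def saturation(x, f):
        gray = x.mean(axis=-1, keepdims=True)  # blend toward per-pixel gray
        return gray + (x - gray) * f

    ops = [brightness, contrast, saturation]
    rng.shuffle(ops)  # apply the three adjustments in a random order
    for op in ops:
        img = np.clip(op(img, rng.uniform(lo, hi)), 0.0, 1.0)
    return img
```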
)
BasicBlock(
(conv1): Conv2d(64, 128, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(64, 128, kernel=1, stride=2)
BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(128, 256, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(128, 256, kernel=1, stride=2)
BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(256, 512, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(256, 512, kernel=1, stride=2)
BatchNorm2d(512, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(512, 512, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1)
).
The decoder is a mirrored version of the encoder, with all convolution layers
replaced by transposed convolution layers. Additionally, we adopt the U-Net structure
[Ronneberger et al., 2015] by feeding the intermediate outputs of each encoder block
to the corresponding decoder block. The decoder outputs an image of relative depth
values in the original view at the same resolution as input. Specifically, the decoder
comprises:
RevBasicBlock(
(deconv1): ConvTranspose2d(512, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(256, 256, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(512, 256, kernel=1, stride=2, out_pad=1)
BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(256, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(512, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(128, 128, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(512, 128, kernel=1, stride=2, out_pad=1)
BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(128, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(128, 128, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(256, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(256, 64, kernel=1, stride=2, out_pad=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(128, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(128, 64, kernel=1, stride=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
ConvTranspose2d(128, 64, kernel=3, stride=2, pad=1, out_pad=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
ReLU(inplace)
ConvTranspose2d(64, 1, kernel=8, stride=2, pad=3, out_pad=0).
Relative depth values provided by the predicted depth images are insufficient for conversions to spherical maps or voxels, as there are still two degrees of freedom undetermined: the minimum and maximum (or equivalently, the offset and the scale). Therefore, we have an additional branch that decodes, also from the 512 feature maps, the minimum and maximum of the depth values. Specifically, this branch decoder contains:
Using the pretrained ResNet-18 as our network initialization, we then train this
network with supervision on both the depth image (relative) and the minimum as well
as maximum values. Under this setup, our network effectively predicts the absolute
depth values of the input view, which allows us to project these depth values to the
spherical representation or voxel grid.
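Assuming the decoder's relative depth map is normalized to [0, 1] (an assumption about the output convention, for illustration), recovering absolute depth from the predicted minimum and maximum is a simple affine mapping:

```python
import numpy as np

def to_absolute_depth(rel, d_min, d_max):
    """Recover absolute depth from a relative depth map `rel` (assumed to be
    normalized to [0, 1]) and the separately predicted minimum and maximum:
    the two remaining degrees of freedom noted above."""
    return d_min + np.asarray(rel) * (d_max - d_min)
```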
This network is trained with a batch size of 4. We use Adam [Kingma and Ba,
2015] with a learning rate of 1 × 10⁻³, 𝛽1 = 0.5, and 𝛽2 = 0.9 for the optimization.
Our inpainting network shares the same architecture as the single-view depth estima-
tor. To mimic the boundary conditions of spherical maps, we use replication padding
for the vertical dimension (elevation) and periodic padding for the horizontal dimen-
sion (azimuth). The padding size is 16 for both dimensions.
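The geometry of this padding scheme can be sketched on a single-channel spherical map as follows (in the network itself, the padding is applied to feature maps before the convolutions; this sketch shows only how the borders are filled):

```python
import numpy as np

def spherical_pad(x, pad=16):
    """Pad a 2D spherical map (rows = elevation, columns = azimuth) to mimic
    its boundary conditions: replicate rows at the poles, and wrap columns
    periodically since azimuth is cyclic."""
    x = np.pad(x, ((pad, pad), (0, 0)), mode="edge")  # replication: elevation
    x = np.pad(x, ((0, 0), (pad, pad)), mode="wrap")  # periodic: azimuth
    return x
```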
This network is trained with a batch size of 4. We use Adam with a learning rate
of 1 × 10⁻⁴, 𝛽1 = 0.5, and 𝛽2 = 0.9 for the optimization.
Our voxel refinement network adopts the U-Net structure [Ronneberger et al., 2015]
and uses a sequence of 3D convolution and transposed convolution layers. The input
tensor of batch size 𝑁 has shape 𝑁 × 2 × 128 × 128 × 128, where one channel contains
voxels projected from the predicted original-view depth map, and the other contains
voxels projected from the inpainted spherical map. After fusion, the output tensor is
of shape 𝑁 × 1 × 128 × 128 × 128. Specifically, the network is structured as:
Unet(
Conv3d_block(
Conv3d(2, 20, kernel=8, stride=2, pad=3)
BatchNorm3d(20, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(20, 40, kernel=4, stride=2, pad=1)
BatchNorm3d(40, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(40, 80, kernel=4, stride=2, pad=1)
BatchNorm3d(80, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(80, 160, kernel=4, stride=2, pad=1)
BatchNorm3d(160, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(160, 320, kernel=4, stride=2, pad=1)
BatchNorm3d(320, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(320, 640, kernel=4, stride=1)
BatchNorm3d(640, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
full_conv_block(
Linear(in_features=640, out_features=640, bias=True)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(1280, 320, kernel=4, stride=1)
BatchNorm3d(320, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(640, 160, kernel=4, stride=2, pad=1)
BatchNorm3d(160, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(320, 80, kernel=4, stride=2, pad=1)
BatchNorm3d(80, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(160, 40, kernel=4, stride=2, pad=1)
BatchNorm3d(40, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(80, 20, kernel=8, stride=2, pad=3)
BatchNorm3d(20, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(40, 1, kernel=4, stride=2, pad=1)
)
).
This network is trained with a batch size of 4. We use Adam with a learning rate
of 1 × 10⁻⁵, 𝛽1 = 0.5, and 𝛽2 = 0.9 for the optimization.
Bibliography
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensor-
Flow: A System for Large-Scale Machine Learning. In USENIX Symposium on
Operating Systems Design and Implementation (OSDI), 2016. 79, 126, 138
Edward H Adelson and James R Bergen. The Plenoptic Function and the Elements
of Early Vision. Computational Models of Visual Processing, 1991. 113, 116
Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. Two-Shot SVBRDF Capture for
Stationary Materials. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
60
Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-Cue Zero-
Shot Learning With Strong Supervision. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2016. 177
Stanislaw Antol, C Lawrence Zitnick, and Devi Parikh. Zero-Shot Learning via Visual
Abstraction. In European Conference on Computer Vision (ECCV), 2014. 177
Aayush Bansal and Bryan Russell. Marr Revisited: 2D-3D Alignment via Surface
Normal Prediction. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2016. 176
Harry G Barrow, Jay M Tenenbaum, Robert C Bolles, and Helen C Wolf. Parametric
Correspondence and Chamfer Matching: Two New Techniques for Image Matching.
In International Joint Conference on Artificial Intelligence (IJCAI), 1977. 196
Ronen Basri, David Jacobs, and Ira Kemelmacher. Photometric Stereo With General,
Unknown Lighting. International Journal of Computer Vision (IJCV), 72(3):239–
257, 2007. 117
Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic Images in the Wild. ACM
Transactions on Graphics (TOG), 33(4):159, 2014. 40, 176
Sai Bi, Zexiang Xu, Pratul P Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš
Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural
Reflectance Fields for Appearance Acquisition. arXiv, 2020a. 19, 54, 58, 59, 62,
63, 73, 74, 92, 93, 94, 250, 252
Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David
Kriegman, and Ravi Ramamoorthi. Deep Reflectance Volumes: Relightable Re-
constructions From Multi-View Photometric Images. In European Conference on
Computer Vision (ECCV), 2020b. 58
Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. FAUST: Dataset
and Evaluation for 3D Mesh Registration. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2014. 178
Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing
the Latent Space of Generative Networks. In International Conference on Machine
Learning (ICML), 2018. 75
Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik
Lensch. NeRD: Neural Reflectance Decomposition From Image Collections. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 59
Katherine L Bouman, Vickie Ye, Adam B Yedidia, Frédo Durand, Gregory W Wor-
nell, Antonio Torralba, and William T Freeman. Turning Corners Into Cameras:
Principles and Methods. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 222
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary,
Dougal Maclaurin, and Skye Wanderman-Milne. JAX: Composable Transforma-
tions of Python+NumPy Programs. http://github.com/google/jax, 2018. 69
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and
Discriminative Voxel Modeling With Convolutional Neural Networks. arXiv, 2016.
174
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for
High Fidelity Natural Image Synthesis. In International Conference on Learning
Representations (ICLR), 2018. 224
Gershon Buchsbaum. A Spatial Processor Model for Object Colour Perception. Jour-
nal of the Franklin Institute, 310(1):1–26, 1980. 83
Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen.
Unstructured Lumigraph Rendering. In SIGGRAPH, 2001. 116, 163
Dan A Calian, Jean-François Lalonde, Paulo Gotardo, Tomas Simon, Iain Matthews,
and Kenny Mitchell. From Faces to Outdoor Light Probes. Computer Graphics
Forum (CGF), 37(2):51–61, 2018. 222
Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and
Aaron M Dollar. Benchmarking in Manipulation Research: Using the Yale-CMU-
Berkeley Object and Model Set. IEEE Robotics and Automation Magazine (RAM),
22(3):36–52, 2015. 179
Zhangjie Cao, Qixing Huang, and Karthik Ramani. 3D Object Classification via
Spherical Projections. In International Conference on 3D Vision (3DV), 2017. 177
Joel Carranza, Christian Theobalt, Marcus A Magnor, and Hans-Peter Seidel. Free-
Viewpoint Video of Human Actors. In SIGGRAPH, 2003. 116
Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang,
Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet:
An Information-Rich 3D Model Repository. arXiv, 2015. 170, 171, 174, 175, 178,
182, 187, 194, 198, 203
Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-Image Depth Perception
in the Wild. In Advances in Neural Information Processing Systems (NeurIPS),
2016. 176
Zhang Chen, Anpei Chen, Guli Zhang, Chengyuan Wang, Yu Ji, Kiriakos N Ku-
tulakos, and Jingyi Yu. A Neural Rendering Framework for Free-Viewpoint Re-
lighting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2020. 58, 117
Zhen Cheng, Zhiwei Xiong, Chang Chen, and Dong Liu. Light Field Super-Resolution:
A Benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition Workshops (CVPRW), 2019. 113
Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A Large Dataset
of Object Scans. arXiv, 2016. 179
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese.
3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruc-
tion. In European Conference on Computer Vision (ECCV), 2016. 170, 175, 197,
198, 204
Albert Cohen, Ingrid Daubechies, and J-C Feauveau. Biorthogonal Bases of Com-
pactly Supported Wavelets. Communications on Pure and Applied Mathematics,
45(5):485–560, 1992. 137
Daniel Cohen and Zvi Sheffer. Proximity Clouds—An Acceleration Technique for 3D
Grid Traversal. The Visual Computer, 11(1):27–38, 1994. 52
Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Convolutional Networks
for Spherical Signals. arXiv, 2017. 177
Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In
International Conference on Learning Representations (ICLR), 2018. 177, 184
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Cal-
abrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-Quality Streamable
Free-Viewpoint Video. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
127
Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape Completion Using
3D-Encoder-Predictor CNNs and Shape Synthesis. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2017. 170, 175, 177, 198, 202
Paul Debevec. Rendering Synthetic Objects Into Real Scenes: Bridging Traditional
and Image-Based Graphics With Global Illumination and High Dynamic Range
Photography. In SIGGRAPH, 1998. 77, 222
Paul Debevec. The Light Stages and Their Applications to Photoreal Digital Actors.
In SIGGRAPH Asia, 2012. 108, 114
Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin,
and Mark Sagar. Acquiring the Reflectance Field of a Human Face. In SIGGRAPH,
2000. 108, 109, 118, 145, 151
Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil Hunt. The
Triangle Processor and Normal Vector Shader: A VLSI System for High Perfor-
mance Graphics. In SIGGRAPH, 1988. 130
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A
Large-Scale Hierarchical Image Database. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2009. 137, 266
Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. Appearance-
From-Motion: Recovering Spatially Varying Surface Reflectance Under Unknown
Lighting. ACM Transactions on Graphics (TOG), 33(6):1–12, 2014. 58
Alexey Dosovitskiy and Thomas Brox. Generating Images With Perceptual Similarity
Metrics Based on Deep Networks. In Advances in Neural Information Processing
Systems (NeurIPS), 2016. 176
Frédo Durand, Nicolas Holzschuch, Cyril Soler, Eric Chan, and François X Sillion. A
Frequency Analysis of Light Transport. In SIGGRAPH, 2005. 113
David Eigen and Rob Fergus. Predicting Depth, Surface Normals and Semantic
Labels With a Common Multi-Scale Convolutional Architecture. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2015. 176
David Eigen, Christian Puhrsch, and Rob Fergus. Depth Map Prediction From a
Single Image Using a Multi-Scale Deep Network. In Advances in Neural Information
Processing Systems (NeurIPS), 2014. 116
Haoqiang Fan, Hao Su, and Leonidas Guibas. A Point Set Generation Network for
3D Object Reconstruction From a Single Image. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2017. 175, 197, 198
Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing Objects by
Their Attributes. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2009. 177
Michael Firman. RGBD Datasets: Past, Present and Future. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016.
179
Michael Firman, Oisin Mac Aodha, Simon Julier, and Gabriel J Brostow. Structured
Completion of Unobserved Voxels From a Single Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 174
John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan
Overbeck, Noah Snavely, and Richard Tucker. DeepView: View Synthesis With
Learned Gradient Descent. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2019. 117
William T Freeman. TUM AI Lecture Series – The Moon Camera (Bill Freeman).
https://www.youtube.com/watch?v=Ytkkl917paM, 2020. Accessed: 08/25/2021.
222
Martin Fuchs, Hendrik PA Lensch, Volker Blanz, and Hans-Peter Seidel. Superreso-
lution Reflectance Fields: Synthesizing Images for Intermediate Light Directions.
Computer Graphics Forum (CGF), 26(3):447–456, 2007. 115, 146, 147
Christopher Funk and Yanxi Liu. Beyond Planar Symmetry: Modeling Human Per-
ception of Reflection and Rotation Symmetries in the Wild. In IEEE/CVF Inter-
national Conference on Computer Vision (ICCV), 2017. 178
Graham Fyffe. Cosine Lobe Based Relighting From Gradient Illumination Pho-
tographs. In SIGGRAPH Posters, 2009. 118
Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. Deferred
Neural Lighting: Free-Viewpoint Relighting From Unstructured Photographs.
ACM Transactions on Graphics (TOG), 39(6):1–15, 2020. 58
Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gam-
baretto, Christian Gagné, and Jean-François Lalonde. Learning to Predict Indoor
Illumination From a Single Image. ACM Transactions on Graphics (TOG), 36(6):
1–14, 2017. 222
Gaurav Garg, Eino-Ville Talvala, Marc Levoy, and Hendrik P Lensch. Symmetric
Photography: Exploiting Data-Sparseness in Reflectance Fields. In Eurographics
Symposium on Rendering Techniques (EGSR), 2006. 118
Mathieu Garon, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, and Jean-François
Lalonde. Fast Spatially-Varying Indoor Lighting Estimation. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2019. 223
Kyle Genova, Forrester Cole, Aaron Sarna, Daniel Vlasic, William T Freeman, and
Thomas Funkhouser. Learning Shape Templates With Structured Implicit Func-
tions. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
58
Michaël Gharbi, Tzu-Mao Li, Miika Aittala, Jaakko Lehtinen, and Frédo Durand.
Sample-Based Monte Carlo Denoising Using a Kernel-Splatting Network. ACM
Transactions on Graphics (TOG), 38(4):1–12, 2019. 69
Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning
a Predictable and Generative Vector Representation for Objects. In European
Conference on Computer Vision (ECCV), 2016. 175
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets.
In Advances in Neural Information Processing Systems (NeurIPS), 2014. 171, 176,
181, 223, 230
Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The
Lumigraph. In SIGGRAPH, 1996. 116
Paul Green, Jan Kautz, and Frédo Durand. Efficient Reflectance and Visibility Ap-
proximations for Environment Map Rendering. Computer Graphics Forum (CGF),
26(3):495–502, 2007. 60
Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu
Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 175, 198
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron
Courville. Improved Training of Wasserstein GANs. In Advances in Neural Infor-
mation Processing Systems (NeurIPS), 2017. 181, 182
Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen,
Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The
Relightables: Volumetric Performance Capture of Humans With Realistic Relight-
ing. ACM Transactions on Graphics (TOG), 38(6):1–19, 2019. 115, 118, 127, 128,
129, 140, 149, 153, 165
Romain Guy and Mathias Agopian. Physically Based Rendering in Filament. https:
//google.github.io/filament/Filament.html, 2018. Accessed: 08/25/2021.
249
Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical Surface Pre-
diction for 3D Object Reconstruction. In International Conference on 3D Vision
(3DV), 2017. 175
Pat Hanrahan and Wolfgang Krueger. Reflection From Layered Surfaces Due to
Subsurface Scattering. In SIGGRAPH, 1993. 32
Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vi-
sion. Cambridge University Press, 2004. 36, 59, 116
Samuel W Hasinoff, Anat Levin, Philip R Goode, and William T Freeman. Diffuse
Reflectance Imaging With Astronomical Applications. In IEEE/CVF International
Conference on Computer Vision (ICCV), 2011. 222
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep Into
Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2015. 125
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning
for Image Recognition. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2016. 133, 180, 185, 266
Felix Heide, Lei Xiao, Wolfgang Heidrich, and Matthias B Hullin. Diffuse Mirrors:
3D Reconstruction From Diffuse Indirect Illumination Using Inexpensive Time-
of-Flight Sensors. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2014. 221
Tomáš Hodan, Pavel Haluza, Štepán Obdržálek, Jiri Matas, Manolis Lourakis, and
Xenophon Zabulis. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-
Less Objects. In IEEE/CVF Winter Conference on Applications of Computer Vi-
sion (WACV), 2017. 179
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring Se-
mantic Layout for Hierarchical Text-to-Image Synthesis. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2018. 224
Berthold K P Horn. Shape From Shading: A Method for Obtaining the Shape of a
Smooth Opaque Object From One View. Technical report, Massachusetts Institute
of Technology, 1970. 25, 57
Berthold K P Horn and Michael J Brooks. Shape From Shading. MIT Press, 1989.
176
Qixing Huang, Hai Wang, and Vladlen Koltun. Single-View Reconstruction via Joint
Analysis of Image and Shape Collections. ACM Transactions on Graphics (TOG),
34(4):87, 2015. 175
Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal Unsupervised
Image-to-Image Translation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2018. 224
Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C
Sankaranarayanan. Reflectance Capture Using Univariate Sampling of BRDFs. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2017. 60
David S Immel, Michael F Cohen, and Donald P Greenberg. A Radiosity Method for
Non-Diffuse Environments. In SIGGRAPH, 1986. 37
Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning Visual
Groups From Co-Occurrences in Space and Time. In International Conference on
Learning Representations Workshops (ICLRW), 2016. 176
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-Image Trans-
lation With Conditional Adversarial Networks. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2017. 224, 233
Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard A Newcombe,
Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew J Davi-
son, and Andrew W Fitzgibbon. KinectFusion: Real-Time 3D Reconstruction and
Interaction Using a Moving Depth Camera. In ACM Symposium on User Interface
Software and Technology (UIST), 2011. 176, 178
Varun Jain and Hao Zhang. Robust 3D Shape Correspondence in the Spectral Do-
main. In IEEE International Conference on Shape Modeling and Applications
(SMI), 2006. 196
Michael Janner, Jiajun Wu, Tejas Kulkarni, Ilker Yildirim, and Joshua B Tenenbaum.
Self-Supervised Intrinsic Image Decomposition. In Advances in Neural Information
Processing Systems (NeurIPS), 2017. 40, 176
Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate
Saenko, and Trevor Darrell. A Category-Level 3-D Object Dataset: Putting the
Kinect to Work. In IEEE/CVF International Conference on Computer Vision
Workshops (ICCVW), 2011. 178
Henrik Wann Jensen, Stephen R Marschner, Marc Levoy, and Pat Hanrahan. A
Practical Model for Subsurface Light Transport. In SIGGRAPH, 2001. 32
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time
Style Transfer and Super-Resolution. In European Conference on Computer Vision
(ECCV), 2016. 176
James T Kajiya. The Rendering Equation. In SIGGRAPH, 1986. 37, 128
James T Kajiya and Brian P Von Herzen. Ray Tracing Volume Densities. In SIGGRAPH, 1984. 62, 63
Kaizhang Kang, Zimin Chen, Jiaping Wang, Kun Zhou, and Hongzhi Wu. Effi-
cient Reflectance Capture Using an Autoencoder. ACM Transactions on Graphics
(TOG), 37(4):1–10, 2018. 114
Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-
Specific Object Reconstruction From a Single Image. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2015. 175
Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. Learning to Gen-
erate Images of Outdoor Scenes From Attributes and Semantic Layouts. arXiv,
2016. 224
Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. Manipulating
Attributes of Natural Scenes via Hallucination. ACM Transactions on Graphics
(TOG), 39(1):1–17, 2019. 224
Brian Karis and Epic Games. Real Shading in Unreal Engine 4. Proc. Physically
Based Shading Theory Practice, 4(3), 2013. 249
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of
GANs for Improved Quality, Stability, and Variation. In International Conference
on Learning Representations (ICLR), 2018. 126, 255
Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for
Generative Adversarial Networks. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2019. 221, 242
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo
Aila. Analyzing and Improving the Image Quality of StyleGAN. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 242
Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte,
Michael Sittig, and David Forsyth. Automatic Scene Inference for 3D Object Com-
positing. ACM Transactions on Graphics (TOG), 33(3):1–15, 2014. 223
Michael Kazhdan and Hugues Hoppe. Screened Poisson Surface Reconstruction. ACM
Transactions on Graphics (TOG), 32(3):29, 2013. 174
Michael Kazhdan, Bernard Chazelle, David Dobkin, Adam Finkelstein, and Thomas
Funkhouser. A Reflective Symmetry Descriptor. In European Conference on Com-
puter Vision (ECCV), 2002. 177
Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Symmetry De-
scriptors and 3D Shape Matching. In Symposium on Geometry Processing (SGP),
2004. 177
Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson Surface Reconstruc-
tion. In Symposium on Geometry Processing (SGP), 2006. 174
Sean Kelly, Samantha Cordingley, Patrick Nolan, Christoph Rhemann, Sean Fanello,
Danhang Tang, Jude Osborn, Jay Busch, Philip Davidson, Paul Debevec, et al.
AR-ia: Volumetric Opera for Mobile Augmented Reality. In SIGGRAPH Asia
XR, 2019. 108
Markus Kettunen, Erik Härkönen, and Jaakko Lehtinen. E-LPIPS: Robust Perceptual
Image Similarity via Random Transformation Ensembles. arXiv, 2019. 142
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias
Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian
Theobalt. Deep Video Portraits. ACM Transactions on Graphics (TOG), 37(4):
1–14, 2018. 108, 116, 119
Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
In International Conference on Learning Representations (ICLR), 2015. 69, 79,
126, 138, 182, 270
Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In Interna-
tional Conference on Learning Representations (ICLR), 2014. 223, 231
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples:
Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics
(TOG), 36(4):78, 2017. 179
Vladislav Kreavoy, Dan Julius, and Alla Sheffer. Model Composition From Inter-
changeable Components. In Pacific Conference on Computer Graphics and Appli-
cations (PG), 2007. 196
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification
With Deep Convolutional Neural Networks. In Advances in Neural Information
Processing Systems (NeurIPS), 2012. 266
Zorah Lahner, Emanuele Rodola, Frank R Schmidt, Michael M Bronstein, and Daniel
Cremers. Efficient Globally Optimal 2D-to-3D Deformable Shape Matching. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2016. 178
Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A Large-Scale Hierarchical
Multi-View RGB-D Object Dataset. In IEEE International Conference on Robotics
and Automation (ICRA), 2011. 179
Edwin H Land and John J McCann. Lightness and Retinex Theory. Journal of the
Optical Society of America, 61(1):1–11, 1971. 57, 83
Martin Laurenzis, Andreas Velten, and Jonathan Klein. Dual-Mode Optical Sensing:
Three-Dimensional Imaging and Seeing Around a Corner. Optical Engineering, 56
(3):031202, 2016. 221
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham,
Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang,
et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversar-
ial Network. In IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2017. 176
Chloe LeGendre, Wan-Chun Ma, Graham Fyffe, John Flynn, Laurent Charbonnel,
Jay Busch, and Paul Debevec. DeepLight: Learning Illumination for Unconstrained
Mobile Mixed Reality. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019. 116, 222
Bastian Leibe and Bernt Schiele. Analyzing Appearance and Contour Based Methods
for Object Categorization. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2003. 178
Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An Accurate O(n)
Solution to the PnP Problem. International Journal of Computer Vision (IJCV),
81(2):155, 2009. 191
Marc Levoy and Pat Hanrahan. Light Field Rendering. In SIGGRAPH, 1996. 108,
116
Thomas Lewiner, Hélio Lopes, Antônio Wilson Vieira, and Geovan Tavares. Efficient
Implementation of Marching Cubes’ Cases With Topological Guarantees. Journal
of Graphics Tools, 8(2):1–15, 2003. 186, 196
Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. Database-Assisted
Object Retrieval for Real-Time 3D Reconstruction. Computer Graphics Forum
(CGF), 34(2):435–446, 2015. 174
Yikai Li, Jiayuan Mao, Xiuming Zhang, William T Freeman, Joshua B Tenenbaum,
Noah Snavely, and Jiajun Wu. Multi-Plane Program Induction With 3D Box Priors.
In Advances in Neural Information Processing Systems (NeurIPS), 2020a. 116
Yue Li, Pablo Wiedemann, and Kenny Mitchell. Deep Precomputed Radiance Trans-
fer for Deformable Objects. Proceedings of the ACM on Computer Graphics and
Interactive Techniques (PACMCGIT), 2(1):1–16, 2019. 131
Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the Plenoptic
Function. In European Conference on Computer Vision (ECCV), 2020b. 59
Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, Kalyan Sunkavalli, and Manmohan
Chandraker. Learning to Reconstruct Shape and Spatially-Varying Reflectance
From a Single Image. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.
57, 116
Zhengqin Li, Mohammad Shafiei, Ravi Ramamoorthi, Kalyan Sunkavalli, and Man-
mohan Chandraker. Inverse Rendering for Complex Indoor Scenes: Shape,
Spatially-Varying Lighting and SVBRDF From a Single Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020c. 57, 116,
223
Joseph J Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA Objects: Fine
Pose Estimation. In IEEE/CVF International Conference on Computer Vision
(ICCV), 2013. 179, 187, 189, 190, 195, 262
Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt.
Neural Sparse Voxel Fields. In Advances in Neural Information Processing Systems
(NeurIPS), 2020. 58
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised Image-to-Image Transla-
tion Networks. In Advances in Neural Information Processing Systems (NeurIPS),
2017. 224
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep Appearance
Models for Face Rendering. ACM Transactions on Graphics (TOG), 37(4):1–13,
2018. 114, 117, 118, 166, 246
Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas
Lehrmann, and Yaser Sheikh. Neural Volumes: Learning Dynamic Renderable
Volumes From Images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
117, 118, 166, 246
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks
for Semantic Segmentation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2015. 125
William E Lorensen and Harvey E Cline. Marching Cubes: A High Resolution 3D
Surface Construction Algorithm. In SIGGRAPH, 1987. 30, 91, 186
Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and
Paul Debevec. Rapid Acquisition of Specular and Diffuse Normal Maps From Polarized Spherical Gradient Illumination. In Eurographics Symposium on Rendering
Techniques (EGSR), 2007. 118
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier Nonlinearities Improve
Neural Network Acoustic Models. In International Conference on Machine Learning
(ICML), 2013. 137
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen
Paul Smolley. Least Squares Generative Adversarial Networks. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 233
Donald W Marquardt. An Algorithm for Least-Squares Estimation of Nonlinear
Parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):
431–441, 1963. 191
David Marr. Vision. W H Freeman and Company, 1982. 31, 172
Stephen R Marschner. Inverse Rendering for Computer Graphics. Cornell University,
1998. 57
Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan
Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter
Lincoln, et al. LookinGood: Enhancing Performance Capture With Real-Time
Neural Re-Rendering. ACM Transactions on Graphics (TOG), 37(6):1–14, 2018.
117
Ricardo Martin-Brualla, Noha Radwan, Mehdi S M Sajjadi, Jonathan T Barron,
Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance
Fields for Unconstrained Photo Collections. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2021. 59, 75, 92, 250
Vincent Masselus, Pieter Peers, Philip Dutré, and Yves D Willems. Smooth Reconstruction and Compact Representation of Reflectance Functions for Image-Based Relighting. In Eurographics Symposium on Rendering Techniques (EGSR), 2004. 114
Wojciech Matusik, Hanspeter Pfister, Matt Brand, and Leonard McMillan. A Data-
Driven Reflectance Model. ACM Transactions on Graphics (TOG), 22(3):759–769,
2003. 60, 75, 81, 87
Tim Maughan. Virtual Reality: The Hype, the Problems and the
Promise. https://www.bbc.com/future/article/20160729-virtual-reality-
the-hype-the-problems-and-the-promise, 2016. Accessed: 08/04/2021. 25
Nelson Max. Optical Models for Direct Volume Rendering. IEEE Transactions on
Visualization and Computer Graphics (TVCG), 1(2):99–108, 1995. 62, 68
Abhimitra Meka, Christian Haene, Rohit Pandey, Michael Zollhöfer, Sean Fanello,
Graham Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, et al.
Deep Reflectance Fields: High-Quality Facial Reflectance Field Inference From
Color Gradient Illumination. ACM Transactions on Graphics (TOG), 38(4):1–12,
2019. 108, 115, 117, 119, 139, 143, 146, 147, 148
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which Training Meth-
ods for GANs Do Actually Converge? In International Conference on Machine
Learning (ICML), 2018. 224
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas
Geiger. Occupancy Networks: Learning 3D Reconstruction in Function Space. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019. 58
Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey,
Noah Snavely, and Ricardo Martin-Brualla. Neural Rerendering in the Wild. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019. 59
Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. arXiv,
2014. 224
Niloy J Mitra, Leonidas J Guibas, and Mark Pauly. Partial and Approximate Sym-
metry Detection for 3D Geometry. ACM Transactions on Graphics (TOG), 25(3):
560–568, 2006. 174
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral
Normalization for Generative Adversarial Networks. In International Conference
on Learning Representations (ICLR), 2018. 233
Tomas Möller and Ben Trumbore. Fast, Minimum Storage Ray-Triangle Intersection.
Journal of Graphics Tools, 2(1):21–28, 1997. 29
Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, H-P Seidel, and Tobias
Ritschel. Deep Shading: Convolutional Neural Networks for Screen Space Shading.
Computer Graphics Forum (CGF), 36(4):65–78, 2017. 130, 149
Giljoo Nam, Joo Ho Lee, Diego Gutierrez, and Min H Kim. Practical SVBRDF Ac-
quisition of 3D Objects With Unstructured Flash Photography. ACM Transactions
on Graphics (TOG), 37(6):1–12, 2018. 58
Andrew Nealen, Takeo Igarashi, Olga Sorkine, and Marc Alexa. Laplacian Mesh Opti-
mization. In ACM International Conference on Computer Graphics and Interactive
Techniques in Australasia and Southeast Asia (GRAPHITE), 2006. 174
Ren Ng, Ravi Ramamoorthi, and Pat Hanrahan. All-Frequency Shadows Using Non-
Linear Wavelet Lighting Approximation. ACM Transactions on Graphics (TOG),
22(3):376–381, 2003. 142
Jannik Boll Nielsen, Henrik Wann Jensen, and Ravi Ramamoorthi. On Optimal, Minimal BRDF Sampling for Reflectance Acquisition. ACM Transactions on Graphics (TOG), 34(6):1–11, 2015. 60
Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differen-
tiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D
Supervision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 58
Ko Nishino and Shree K Nayar. Corneal Imaging System: Environment From Eyes.
International Journal of Computer Vision (IJCV), 70(1):23–40, 2006. 222
David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3D Object Categories
by Looking Around Them. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 175
Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional Image Syn-
thesis With Auxiliary Classifier GANs. In International Conference on Machine
Learning (ICML), 2017. 224
Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying Neural Im-
plicit Surfaces and Radiance Fields for Multi-View Reconstruction. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2021. 72
Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kow-
dle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong
Dou, et al. Holoportation: Virtual 3D Teleportation in Real-Time. In ACM Sym-
posium on User Interface Software and Technology (UIST), 2016. 108
Matthew O’Toole and Kiriakos N Kutulakos. Optical Computing for Fast Light
Transport Analysis. ACM Transactions on Graphics (TOG), 29(6):1–12, 2010. 114
Geoffrey Oxholm and Ko Nishino. Multiview Shape and Reflectance From Natural
Illumination. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2014. 19, 58, 84, 96, 97, 98
Rohit Pandey, Anastasia Tkach, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor,
Ricardo Martin-Brualla, Andrea Tagliasacchi, George Papandreou, Philip David-
son, Cem Keskin, Shahram Izadi, and Sean Fanello. Volumetric Capture of Humans
With a Single RGBD Camera via Semi-Parametric Learning. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2019. 108, 117,
119
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven
Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape
Representation. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2019a. 29, 30, 58, 75, 247
Jeong Joon Park, Aleksander Holynski, and Steve Seitz. Seeing the World in a Bag
of Chips. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2020. 57
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image
Synthesis With Spatially-Adaptive Normalization. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2019b. 221, 224, 225, 230, 231,
233
Pieter Peers, Dhruv K Mahajan, Bruce Lamond, Abhijeet Ghosh, Wojciech Matusik,
Ravi Ramamoorthi, and Paul Debevec. Compressive Light Transport Sensing.
ACM Transactions on Graphics (TOG), 28(1):1–18, 2009. 113
Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning Deep Object
Detectors From 3D Models. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2015. 177
Alex Paul Pentland. A New Sense for Depth of Field. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 9(4):523–531, 1987. 42
Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically Based Rendering: From
Theory to Implementation. Morgan Kaufmann Publishers Inc., 3rd edition, 2016.
51, 52, 108
Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A Efros, and George Drettakis.
Multi-View Relighting Using a Geometry-Aware Network. ACM Transactions on
Graphics (TOG), 38(4):1–14, 2019. 19, 97, 99
Marc Proesmans, Luc Van Gool, and André Oosterlinck. One-Shot Active 3D Shape
Acquisition. In International Conference on Pattern Recognition (ICPR), 1996.
178
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning
on Point Sets for 3D Classification and Segmentation. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2017a. 29, 92
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep
Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in
Neural Information Processing Systems (NeurIPS), 2017b. 29
Gilles Rainer, Wenzel Jakob, Abhijeet Ghosh, and Tim Weyrich. Neural BTF Com-
pression and Interpolation. Computer Graphics Forum (CGF), 38(2):235–244, 2019.
114
Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and
Honglak Lee. Generative Adversarial Text to Image Synthesis. In International
Conference on Machine Learning (ICML), 2016. 224
Peiran Ren, Jiaping Wang, Minmin Gong, Stephen Lin, Xin Tong, and Baining Guo.
Global Illumination With Radiance Regression Functions. ACM Transactions on
Graphics (TOG), 32(4):1–12, 2013. 114
Peiran Ren, Yue Dong, Stephen Lin, Xin Tong, and Baining Guo. Image-Based
Relighting Using Neural Networks. ACM Transactions on Graphics (TOG), 34(4):
1–12, 2015. 114, 117
Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jader-
berg, and Nicolas Heess. Unsupervised Learning of 3D Structure From Images. In
Advances in Neural Information Processing Systems (NeurIPS), 2016. 175
Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. OctNetFusion:
Learning Depth Fusion From Data. In International Conference on 3D Vision
(3DV), 2017a. 175
Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning Deep
3D Representations at High Resolutions. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2017b. 175
Tobias Ritschel, Thorsten Grosch, Jan Kautz, and Stefan Müller. Interactive Illumi-
nation With Coherent Shadow Maps. In Eurographics Symposium on Rendering
Techniques (EGSR), 2007. 60
Tobias Ritschel, Thorsten Grosch, Min H Kim, H-P Seidel, Carsten Dachsbacher,
and Jan Kautz. Imperfect Shadow Maps for Efficient Computation of Indirect
Illumination. ACM Transactions on Graphics (TOG), 27(5):1–8, 2008. 60
Tobias Ritschel, Thomas Engelhardt, Thorsten Grosch, H-P Seidel, Jan Kautz, and
Carsten Dachsbacher. Micro-Rendering for Scalable, Parallel Final Gathering.
ACM Transactions on Graphics (TOG), 28(5):1–8, 2009. 60
Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and
Derek Hoiem. Completing 3D Object Shape From One Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 178
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Net-
works for Biomedical Image Segmentation. In International Conference on Medical
Image Computing and Computer Assisted Intervention (MICCAI), 2015. 119, 136,
186, 268, 270
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The Earth Mover’s Distance as
a Metric for Image Retrieval. International Journal of Computer Vision (IJCV),
40(2):99–121, 2000. 196
Ryusuke Sagawa, Hiroshi Kawasaki, Shota Kiyota, and Ryo Furukawa. Dense One-
Shot 3D Reconstruction by Detecting Continuous Regions With Parallel Line Pro-
jection. In IEEE/CVF International Conference on Computer Vision (ICCV),
2011. 178
Hanan Samet. Implementing Ray Tracing With Octrees and Neighbor Finding. Com-
puters & Graphics, 13(4):445–460, 1989. 52
Shen Sang and Manmohan Chandraker. Single-Shot Neural Relighting and SVBRDF
Estimation. In European Conference on Computer Vision (ECCV), 2020. 57
Imari Sato, Takahiro Okabe, Yoichi Sato, and Katsushi Ikeuchi. Appearance Sampling
for Obtaining a Set of Basis Images for Variable Illumination. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2003. 113
Yoichi Sato, Mark D Wheeler, and Katsushi Ikeuchi. Object Shape and Reflectance
Modeling From Observation. In SIGGRAPH, 1997. 57
Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D Scene Structure
From a Single Still Image. IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 31(5):824–840, 2008. 116
Carolin Schmitt, Simon Donné, Gernot Riegler, Vladlen Koltun, and Andreas Geiger.
On Joint Estimation of Pose, Geometry and SVBRDF From a Handheld Scanner.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020. 58
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-From-Motion Re-
visited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2016. 56, 81, 87
Steven M Seitz and Charles R Dyer. Photorealistic Scene Reconstruction by Voxel
Coloring. International Journal of Computer Vision (IJCV), 35(2):151–173, 1999.
59
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski.
A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2006. 42
Pradeep Sen and Soheil Darabi. Compressive Dual Photography. Computer Graphics
Forum (CGF), 28(2):609–618, 2009. 113
Pradeep Sen, Billy Chen, Gaurav Garg, Stephen R Marschner, Mark Horowitz, Marc
Levoy, and Hendrik PA Lensch. Dual Photography. ACM Transactions on Graphics
(TOG), 24(3):745–755, 2005. 37, 117
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs.
SfSNet: Learning Shape, Reflectance and Illuminance of Faces in the Wild. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 114, 116
Soumyadip Sengupta, Jinwei Gu, Kihwan Kim, Guilin Liu, David W Jacobs, and
Jan Kautz. Neural Inverse Rendering of an Indoor Scene From a Single Image. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 57, 222
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira
Kemelmacher-Shlizerman. Background Matting: The World Is Your Green Screen.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020. 108, 119
Jun’ichiro Seyama and Ruth S Nagayama. The Uncanny Valley: Effect of Realism
on the Impression of Artificial Human Faces. Presence, 16(4):337–351, 2007. 109
Evan Shelhamer, Jonathan T Barron, and Trevor Darrell. Scene Intrinsics and Depth
From a Single Image. In IEEE/CVF International Conference on Computer Vision
Workshops (ICCVW), 2015. 223
Jian Shi, Yue Dong, Hao Su, and Stella X Yu. Learning Non-Lambertian Object
Intrinsics Across ShapeNet Categories. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2017. 176
Philip Shilane, Patrick Min, Michael Kazhdan, and Thomas Funkhouser. The Prince-
ton Shape Benchmark. In IEEE International Conference on Shape Modeling and
Applications (SMI), 2004. 178
Daeyun Shin, Charless C Fowlkes, and Derek Hoiem. Pixels, Voxels, and Views:
a Study of Shape Representations for Single View 3D Object Shape Prediction.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 178, 198, 208, 209, 210, 211
Dongeek Shin, Ahmed Kirmani, Vivek K Goyal, and Jeffrey H Shapiro. Computa-
tional 3D and Reflectivity Imaging With High Photon Efficiency. In International
Conference on Image Processing (ICIP), 2014. 221
Dongeek Shin, Ahmed Kirmani, Vivek K Goyal, and Jeffrey H Shapiro. Photon-
Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detec-
tors. IEEE Transactions on Computational Imaging, 1(2):112–125, 2015. 221
Dongeek Shin, Feihu Xu, Dheera Venkatraman, Rudi Lussana, Federica Villa, Franco
Zappa, Vivek K Goyal, Franco NC Wong, and Jeffrey H Shapiro. Photon-Efficient
Computational Imaging With a Single-Photon Camera. In Computational Optical
Sensing and Imaging, pages CW5D–4. Optical Society of America, 2016. 221
Aliaksandra Shysheya, Egor Zakharov, Kara-Ali Aliev, Renat Bashirov, Egor Burkov,
Karim Iskakov, Aleksei Ivakhnenko, Yury Malkov, Igor Pasechnik, Dmitry Ulyanov,
et al. Textured Neural Avatars. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2019. 116
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor Segmen-
tation and Support Inference From RGBD Images. In European Conference on
Computer Vision (ECCV), 2012. 176
Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for
Large-Scale Image Recognition. In International Conference on Learning Repre-
sentations (ICLR), 2015. 137
Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. Big-
BIRD: A Large-Scale 3D Database of Object Instances. In IEEE International
Conference on Robotics and Automation (ICRA), 2014. 179
Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein,
and Michael Zollhöfer. DeepVoxels: Learning Persistent 3D Feature Embeddings.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019a. 117, 118, 166, 246, 247
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene Representation
Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Ad-
vances in Neural Information Processing Systems (NeurIPS), 2019b. 58, 117, 247
Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon
Wetzstein. Implicit Neural Representations With Periodic Activation Functions.
In Advances in Neural Information Processing Systems (NeurIPS), 2020. 58, 247
Peter-Pike Sloan, Jan Kautz, and John Snyder. Precomputed Radiance Transfer
for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments. In
SIGGRAPH, 2002. 60
Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo Tourism: Exploring Photo
Collections in 3D. ACM Transactions on Graphics (TOG), 25(3):835–846, 2006.
59, 108
Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B
Tenenbaum. Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and
Silhouettes With Deep Generative Networks. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2017. 198
Shuran Song and Thomas Funkhouser. Neural Illumination: Lighting Prediction for
Indoor Environments. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019. 223
Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas
Funkhouser. Semantic Scene Completion From a Single Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 176
Pratul P Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Milden-
hall, and Jonathan T Barron. NeRV: Neural Reflectance and Visibility Fields for
Relighting and View Synthesis. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2021. 26, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41,
45, 49, 53, 61, 74, 104, 245, 246, 249
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks From Over-
fitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
122
Jessi Stumpfel, Chris Tchou, Andrew Jones, Tim Hawkins, Andreas Wenger, and
Paul Debevec. Direct HDR Capture of the Sun and Sky. In AFRIGRAPH, 2004.
80, 103, 104
Tiancheng Sun, Henrik Wann Jensen, and Ravi Ramamoorthi. Connecting Measured
BRDFs to Analytic BRDFs by Data-Driven Diffuse-Specular Separation. ACM
Transactions on Graphics (TOG), 37(6):1–15, 2018a. 76
Tiancheng Sun, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham
Fyffe, Christoph Rhemann, Jay Busch, Paul Debevec, and Ravi Ramamoorthi.
Single Image Portrait Relighting. ACM Transactions on Graphics (TOG), 38(4):
1–12, 2019. 108, 110, 115, 117, 119, 139
Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul
Debevec, Yun-Ta Tsai, Jonathan T Barron, and Ravi Ramamoorthi. Light Stage
Super-Resolution: Continuous High-Frequency Relighting. ACM Transactions on
Graphics (TOG), 39(6):1–12, 2020. 26, 35, 38, 41, 45, 107, 109, 117, 119, 132, 138,
164, 245, 246, 255
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tian-
fan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and Meth-
ods for Single-Image 3D Shape Modeling. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2018b. 27, 30, 36, 46, 169, 170, 179, 188,
195, 205, 216, 245, 247, 259
Minhyuk Sung, Vladimir G Kim, Roland Angst, and Leonidas Guibas. Data-Driven
Structural Priors for Shape Completion. ACM Transactions on Graphics (TOG),
34(6):175, 2015. 174
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-View 3D Models
From Single Images With a Convolutional Network. In European Conference on
Computer Vision (ECCV), 2016. 175
Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree Generating Net-
works: Efficient Convolutional Architectures for High-Resolution 3D Outputs. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2017. 29, 175,
198
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Sei-
del, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. StyleRig: Rigging
StyleGAN for 3D Control Over Portrait Images. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2020a. 116
Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi,
Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias
Nießner, et al. State of the Art on Neural Rendering. Computer Graphics Forum
(CGF), 39(2):701–727, 2020b. 114, 117, 118
Duc Thanh Nguyen, Binh-Son Hua, Khoi Tran, Quang-Hieu Pham, and Sai-Kit Ye-
ung. A Field Model for Repairing 3D Shapes. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2016. 175
Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred Neural Rendering:
Image Synthesis Using Neural Textures. ACM Transactions on Graphics (TOG),
38(4):1–12, 2019. 117, 118, 152, 153, 166, 246
Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias
Nießner. Image-Guided Neural Object Rendering. In International Conference on
Learning Representations (ICLR), 2020. 117
Sebastian Thrun and Ben Wegbreit. Shape From Symmetry. In IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), 2005. 174
Antonio Torralba and William T Freeman. Accidental Pinhole and Pinspeck Cameras:
Revealing the Scene Outside the Picture. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2012. 222
Antonio Torralba, Kevin P Murphy, and William T Freeman. Sharing Visual Features
for Multiclass and Multiview Object Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 29(5), 2007. 177
Yun-Ta Tsai and Rohit Pandey. Portrait Light: Enhancing Portrait Lighting With
Machine Learning. https://ai.googleblog.com/2020/12/portrait-light-enhancing-portrait.html,
2020. Accessed: 08/21/2021. 26
Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-View
Supervision for Single-View Reconstruction via Differentiable Ray Consistency. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2017. 170, 175, 178, 195, 197, 198, 207
Borom Tunwattanapong, Graham Fyffe, Paul Graham, Jay Busch, Xueming Yu, Ab-
hijeet Ghosh, and Paul Debevec. Acquiring Reflectance and Shape From Continu-
ous Spherical Harmonic Illumination. ACM Transactions on Graphics (TOG), 32
(4):1–12, 2013. 114
Greg Turk and Marc Levoy. Zippered Polygon Meshes From Range Images. In
SIGGRAPH, 1994. 29
Bruce Walter, Stephen R Marschner, Hongsong Li, and Kenneth E Torrance. Micro-
facet Models for Refraction Through Rough Surfaces. In Eurographics Symposium
on Rendering Techniques (EGSR), 2007. 32, 63, 101, 249
Jiaping Wang, Yue Dong, Xin Tong, Zhouchen Lin, and Baining Guo. Kernel Nyström
Method for Light Transport. In SIGGRAPH, 2009. 114
Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and
Heng Tao Shen. Multi-Attention Network for One Shot Learning. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 177
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan
Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation With
Conditional GANs. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2018. 221, 224, 230, 233
Xiaolong Wang, David Fouhey, and Abhinav Gupta. Designing Deep Networks for
Surface Normal Estimation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2015. 176
Yu-Xiong Wang and Martial Hebert. Learning to Learn: Model Regression Networks
for Easy Small Sample Learning. In European Conference on Computer Vision
(ECCV), 2016. 177
Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality
Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on
Image Processing (TIP), 13(4):600–612, 2004. 95, 142, 149, 235
Greg Ward and Rob Shakespeare. Rendering With Radiance: The Art and Science
of Lighting Visualization. Morgan Kaufmann Publishers, 1998. 75
Xin Wei, Guojun Chen, Yue Dong, Stephen Lin, and Xin Tong. Object-Based Illu-
mination Estimation With Rendering-Aware Neural Networks. In European Con-
ference on Computer Vision (ECCV), 2020. 57
Yair Weiss. Deriving Intrinsic Images From Image Sequences. In IEEE/CVF Inter-
national Conference on Computer Vision (ICCV), 2001. 40, 176
Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner,
Chien Tu, Janet McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, et al.
Analysis of Human Faces Using a Measurement-Based Skin Reflectance Model.
ACM Transactions on Graphics (TOG), 25(3):1013–1024, 2006. 109
Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-
to-End View Synthesis From a Single Image. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2020. 116
Andrew P Witkin. Recovering Surface Shape and Orientation From Texture. Artificial
Intelligence, 17(1-3):17–45, 1981. 42
Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, and
William T Freeman. Eulerian Video Magnification for Revealing Subtle Changes
in the World. ACM Transactions on Graphics (TOG), 31(4):1–8, 2012. 222
Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenen-
baum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-
Adversarial Modeling. In Advances in Neural Information Processing Systems
(NeurIPS), 2016. 171, 175, 176, 181, 197
Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and
Joshua B Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches.
In Advances in Neural Information Processing Systems (NeurIPS), 2017. 172, 175,
176, 179, 184, 185, 198
Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T Freeman,
and Joshua B Tenenbaum. Learning 3D Shape Priors for Shape Completion and
Reconstruction. In European Conference on Computer Vision (ECCV), 2018. 27,
30, 43, 46, 169, 170, 171, 179, 216, 245, 247
Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang,
and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2015. 174
Rui Xia, Yue Dong, Pieter Peers, and Xin Tong. Recovering Shape and Spatially-
Varying Surface Reflectance Under Unknown Illumination. ACM Transactions on
Graphics (TOG), 35(6):1–12, 2016. 58
Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-Shot Learning – The Good, the
Bad and the Ugly. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2017. 177
Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh
Mottaghi, Leonidas Guibas, and Silvio Savarese. ObjectNet3D: A Large Scale
Database for 3D Object Recognition. In European Conference on Computer Vision
(ECCV), 2016. 170, 178, 187, 192, 195
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba.
SUN Database: Large-Scale Scene Recognition From Abbey to Zoo. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2010. 194, 265
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang,
and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation With Atten-
tional Generative Adversarial Networks. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2018a. 224
Zexiang Xu, Kalyan Sunkavalli, Sunil Hadap, and Ravi Ramamoorthi. Deep Image-
Based Relighting From Optimal Sparse Samples. ACM Transactions on Graphics
(TOG), 37(4):1–13, 2018b. 114, 118, 119, 143, 146, 147, 148, 149
Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi.
Deep View Synthesis From Sparse Photometric Images. ACM Transactions on
Graphics (TOG), 38(4):1–13, 2019. 117
Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective
Transformer Nets: Learning Single-View 3D Object Reconstruction Without 3D
Supervision. In Advances in Neural Information Processing Systems (NeurIPS),
2016. 175
Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba,
William T Freeman, and Joshua B Tenenbaum. 3D-Aware Scene Manipulation
via Inverse Graphics. In Advances in Neural Information Processing Systems
(NeurIPS), 2018. 198
Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri,
and Yaron Lipman. Multiview Neural Surface Reconstruction by Disentangling
Geometry and Appearance. In Advances in Neural Information Processing Systems
(NeurIPS), 2020. 58
Li Yi, Hao Su, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Benjamin Gra-
ham, Martin Engelcke, Roman Klokov, Victor Lempitsky, et al. Large-Scale 3D
Shape Reconstruction and Segmentation From ShapeNet Core55. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2017. 178, 197
Yizhou Yu, Paul Debevec, Jitendra Malik, and Tim Hawkins. Inverse Global Illu-
mination: Recovering Reflectance Models of Real Scenes From Photographs. In
SIGGRAPH, 1999. 57
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang,
and Dimitris N Metaxas. StackGAN: Text to Photo-Realistic Image Synthesis With
Stacked Generative Adversarial Networks. In IEEE/CVF International Conference
on Computer Vision (ICCV), 2017. 224
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-Attention
Generative Adversarial Networks. In International Conference on Machine Learn-
ing (ICML), 2019. 233
Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. PhySG: In-
verse Rendering With Spherical Gaussians for Physics-Based Material Editing and
Relighting. In IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2021a. 59
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The
Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2018a. 95, 137,
142, 149, 158, 235
Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-From-
Shading: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence (TPAMI), 21(8):690–706, 1999. 176
Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit
Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul De-
bevec, Jonathan T Barron, Ravi Ramamoorthi, and William T Freeman. Neural
Light Transport for Relighting and View Synthesis. ACM Transactions on Graph-
ics (TOG), 40(1):1–17, 2021b. 26, 27, 35, 36, 37, 38, 42, 45, 58, 95, 107, 111, 127,
165, 245, 246, 251
Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Free-
man, and Jonathan T Barron. NeRFactor: Neural Factorization of Shape and Re-
flectance Under an Unknown Illumination. ACM Transactions on Graphics (TOG),
2021c. 26, 31, 32, 33, 34, 35, 36, 37, 38, 41, 45, 50, 54, 104, 245, 246
Xuaner Zhang, Jonathan T Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren
Ng, and David E Jacobs. Portrait Shadow Manipulation. ACM Transactions on
Graphics (TOG), 39(4):78–1, 2020. 110, 116
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
Object Detectors Emerge in Deep Scene CNNs. In International Conference on
Learning Representations (ICLR), 2014. 212
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo
Magnification: Learning View Synthesis Using Multiplane Images. ACM Transac-
tions on Graphics (TOG), 37(4):1–12, 2018. 117
Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative
Visual Manipulation on the Natural Image Manifold. In European Conference on
Computer Vision (ECCV), 2016. 176
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-
Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017a. 224
Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver
Wang, and Eli Shechtman. Multimodal Image-to-Image Translation by Enforcing
Bi-Cycle Consistency. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2017b. 224
Todd Zickler, Ravi Ramamoorthi, Sebastian Enrique, and Peter N Belhumeur. Re-
flectance Sharing: Predicting Appearance From a Sparse Set of Images of a Known
Shape. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI),
28(8):1287–1302, 2006. 118