Shape, Reflectance, and Illumination From Appearance

by
Xiuming Zhang
B.Eng., National University of Singapore (2015)
S.M., Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2021
© Massachusetts Institute of Technology 2021. All rights reserved.

Author:
Department of Electrical Engineering and Computer Science
August 27, 2021
Certified by:
William T. Freeman
Thomas and Gerd Perkins Professor of Electrical Engineering and
Computer Science
Thesis Supervisor

Accepted by:
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Shape, Reflectance, and Illumination From Appearance
by
Xiuming Zhang

Submitted to the Department of Electrical Engineering and Computer Science
on August 27, 2021, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract
The image formation process describes how light interacts with the objects in a scene
and eventually reaches the camera, forming an image that we observe. Inverting this
process is a long-standing, ill-posed problem in computer vision, which involves es-
timating shape, material properties, and/or illumination passively from the object’s
appearance. Such “inverse rendering” capabilities enable 3D understanding of our
world (as desired in autonomous driving, robotics, etc.) and computer graphics appli-
cations such as relighting, view synthesis, and object capture (as desired in Extended
Reality [XR], etc.).
In this dissertation, we study inverse rendering by recovering three-dimensional
(3D) shape, reflectance, illumination, or everything jointly under different setups.
The input across different setups varies from single images to multi-view images lit
by multiple known lighting conditions, then to multi-view images under one unknown
illumination. Across the setups, we explore optimization-based recovery that exploits
multiple observations of the same object, learning-based reconstruction that heavily
relies on data-driven priors, and a mixture of both. Depending on the problem, we
perform inverse rendering at three different levels of abstraction: I) At a low level of
abstraction, we develop physically-based models that explicitly solve for every term
in the rendering equation, II) at a middle level, we utilize the light transport function
to abstract away intermediate light bounces and model only the final “net effect,”
and III) at a high level, we treat rendering as a black box and directly invert it
with learned data-driven priors. We also demonstrate how higher-level abstraction
leads to models that are simpler and applicable to single images but possess fewer
capabilities.
This dissertation discusses four instances of inverse rendering, gradually ascending
in the level of abstraction. In the first instance, we focus on the low-level abstraction
where we decompose appearance explicitly into shape, reflectance, and illumination.
To this end, we present a physically-based model capable of such full factorization
under one unknown illumination and another that handles one-bounce indirect illumi-
nation. In the second instance, we ascend to the middle level of abstraction, at which
we model appearance with the light transport function, demonstrating how this level
of modeling easily supports relighting with global illumination, view synthesis, and
both tasks simultaneously. Finally, at the high level of abstraction, we employ deep
learning to directly invert the rendering black box in a data-driven fashion. Specif-
ically, in the third instance, we recover 3D shapes from single images by learning
data-driven shape priors and further make our reconstruction generalizable to novel
shape classes unseen during training. Also relying on data-driven priors, the fourth
instance concerns how to recover lighting from the appearance of the illuminated
object, without explicitly modeling the image formation process.

Thesis Supervisor: William T. Freeman


Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer
Science

Acknowledgments

These five years at MIT have been truly an amazing journey: I learned “like taking
a drink from a fire hose” (former MIT President Wiesner) and made lifelong friends
with whom I can share the ups and downs in taking that drink.

First, I would like to express my heartfelt gratitude to my advisor, Bill Freeman,
for his endless support and advice. Since Day 1, Bill has been giving me full freedom
to pursue research that I am excited about. In every project, he steered me in
the right direction and provided invaluable feedback every time we chatted. Bill is a
creative scientist-cum-artist who is always trying to image X from Y, where X and Y
are crazy pairs like the Earth and the Moon, Boston and a rainbow, etc. He is also an
elegant academic noble who teaches me not only computer vision but also how to be
a better person. The question I always ask myself is “What would Bill do?” I had a
slow start in the early phase of my Ph.D., and it was Bill’s “slow down to speed up”
that kept me hanging in there. Bill’s wisdom such as making toy models will continue
to guide me throughout my career and life. I could not ask for a better advisor than
Bill. Thank you, Bill.

I would also like to thank my dissertation committee members: Antonio Torralba
and Jon Barron. Although I did not interact directly with Antonio much, his impact
reaches every corner of our office. His famous quote “Bugs are good because that
means your algorithm is not hopeless” cheered me up every time I found a bug. Jon
is one of the pioneers in the theme of this dissertation, and I started learning about
this field by reading his (inspiring) papers. I was very fortunate to have interned
with him twice at Google, and several papers constituting this dissertation originated
from those internships, so it goes without saying how much impact Jon has had on my
research. He is basically my advisor in industry. Jon is incredibly knowledgeable
about everything (in depth), and his explanations of things are always crystal clear.
I hope that, through years of learning and practice, I can gain just one inch of Jon’s
breadth and depth of knowledge. It is my honor and pleasure to have both Antonio
and Jon on my dissertation committee. Thank you both for your service.

I owe a debt of gratitude to my advisors who got me started in research during
my undergraduate time: Thomas Yeo, Mert Sabuncu, and Beth Mormino. Thomas
was my Bachelor’s thesis advisor, with whom I worked on a daily basis for
around two years. Technically strong and attentive to detail, he showed me what
top-notch research was when I did not yet know much about machine learning. Since
my graduation, he has continued to support me in many ways, from graduate
school applications to recommendation letter writing. Even though the collaboration
with Mert had been mostly online, he generously offered many helpful suggestions on
graduate research in our first (and probably only) in-person interaction back in 2016.
Without the rigorous research training from them, I would not be here writing this
dissertation today.

Besides those already mentioned, I was fortunate to have worked with many in-
telligent collaborators during my Ph.D. (in approximately chronological order): Ji-
ajun Wu*, Zhoutong Zhang*, Chengkai Zhang*, Josh Tenenbaum*, Tianfan Xue*,
Xingyuan Sun*, Charles He, Tali Dekel, Stefanie Mueller, Andrew Owens, Yikai
Li, Jiayuan Mao, Noah Snavely, Cecilia Zhang, Ren Ng, David E. Jacobs, Sergio
Orts-Escolano*, Rohit Pandey*, Christoph Rhemann*, Sean Fanello*, Yun-Ta Tsai*,
Tiancheng Sun*, Zexiang Xu*, Ravi Ramamoorthi*, Paul Debevec*, Boyang Deng*,
Pratul Srinivasan*, Matt Tancik*, Ben Mildenhall*, Steven Liu, Richard Zhang, Jun-
Yan Zhu, and Bryan Russell. This dissertation would not have been possible without
the input from the co-authors marked with an asterisk. I want to particularly thank
two labmates from this list: Jiajun and Zhoutong. As a senior student in the Lab, Jia-
jun provided valuable advice and help in bootstrapping my computer vision research;
the knowledge I gained from exploring 3D vision with Jiajun laid the foundation for
this dissertation. Zhoutong, despite being my peer, constantly amazes me with his
breadth of knowledge in vision and graphics; a “walking Visionpedia” is what I call
him. It was my privilege to have learned so much from everyone listed above.

I would not have been able to get through these challenging years without the
support from the staff in EECS and CSAIL (in no particular order): Janet Fischer,
Alicia Duarte, Kathy McCoy, Katrina LaCurts, Roger White, Sheila Sharbetian, Fern
Keniston, Rachel Gordon, Adam Conner-Simons, Garrett Wollman, Steve Ruggiero,
Jay Sekora, Tom Buehler, Jason Dorfman, Jon Proulx, Mark Pearrow, etc. Janet
and Katrina provided so much helpful advice as I navigated my way to today. I worked
with Rachel, Adam, Jason, and Tom on the MoSculp news article. They were such
a supportive and strong team that made MoSculp a hit. I would also like to thank
everyone in The Infrastructure Group, without whose solid technical skills and prompt
help I would not have been able to do any of the research presented in this dissertation.
I am thankful to everyone in the Vision Graphics Neighborhood and beyond at
MIT. I enjoyed every conversation we had in our (tiny) kitchen. We went hiking,
watched several musicals and plays, and witnessed the total solar eclipse together.
Thanks to all of you for making my MIT life colorful. To all my friends scattered
around the world, thank you, too, for the support and friendship.
I want to thank my entire family for their unwavering love and support, especially
my parents, Yanbin Sun and Chunmin Zhang, who have always been supporting my
decisions unconditionally (even though some came with great financial costs). I hope
we all agree that we have made the right calls. Being far away from home (and now
trapped by COVID), I was not able to go back home as often as I wanted during my
Ph.D.; as an only child, I wish I had done more. If only Nai-Nai and Lao-Ye were
still around with us today, they would be so proud to see their grandson graduating
with a Ph.D. Thank you, and I love you all.
Lastly, I thank my girlfriend, Hanzheng Li, for always being there for me. I owe a
lot of my success to her. Despite being a classical pianist, she knows all about nonsense
like Reviewer #2, weak reject, etc. She always manages to cheer me up when bad
things happen and to calm me down before overexcitement becomes sorrow – just the
perfect other half for me. My quality of life has risen significantly since I met her,
and I look forward to the next chapter of life together with her.

To Yanbin, Chunmin, and Hanzheng

Brief Contents

1 Introduction 25
1.1 Image Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.2 Inverting the Image Formation Process . . . . . . . . . . . . . . . . . 39
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2 Low-Level Abstraction: Physically-Based Appearance Factorization 49


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.3 Method: Multiple Known Illuminations . . . . . . . . . . . . . . . . . 61
2.4 Method: One Unknown Illumination . . . . . . . . . . . . . . . . . . 70
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3 Mid-Level Abstraction: The Light Transport Function 107


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.3 Method: Precise, High-Frequency Relighting . . . . . . . . . . . . . . 119
3.4 Method: Free-Viewpoint Relighting . . . . . . . . . . . . . . . . . . . 127
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

4 High-Level Abstraction: Data-Driven Shape Reconstruction 169
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.3 Method: Learning & Using Shape Priors . . . . . . . . . . . . . . . . 179
4.4 Method: Generalizing to Unseen Classes . . . . . . . . . . . . . . . . 183
4.5 Method: Building a Real-World Dataset . . . . . . . . . . . . . . . . 187
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

5 High-Level Abstraction: Data-Driven Lighting Recovery 217


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

6 Conclusion & Discussion 245

A Supplement: Neural Reflectance and Visibility Fields (NeRV) 249


A.1 BRDF Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . 249
A.2 Additional Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 250
A.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

B Supplement: Light Stage Super-Resolution (LSSR) 255


B.1 Progressive Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.2 Baseline Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

C Supplement: Pix3D 259


C.1 Evaluation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
C.2 Nearest Neighbors With Different Metrics . . . . . . . . . . . . . . . 260
C.3 Sample Data in Pix3D . . . . . . . . . . . . . . . . . . . . . . . . . . 261

D Supplement: Generalizable Reconstruction (GenRe) 265
D.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
D.2 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

Contents

1 Introduction 25
1.1 Image Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.1.1 Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.1.2 Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.1.3 Lighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.1.4 Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.1.5 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.2 Inverting the Image Formation Process . . . . . . . . . . . . . . . . . 39
1.2.1 Joint Estimation of Shape, Reflectance, & Illumination . . . . 40
1.2.2 Interpolating the Light Transport Function . . . . . . . . . . . 41
1.2.3 Shape Reconstruction . . . . . . . . . . . . . . . . . . . . . . . 42
1.2.4 Lighting Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2 Low-Level Abstraction: Physically-Based Appearance Factorization 49


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2.1 Inverse Rendering . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2.2 Coordinate-Based Neural Representations . . . . . . . . . . . 58
2.2.3 Precomputation in Computer Graphics . . . . . . . . . . . . . 60
2.2.4 Material Acquisition . . . . . . . . . . . . . . . . . . . . . . . 60
2.3 Method: Multiple Known Illuminations . . . . . . . . . . . . . . . . . 61
2.3.1 Neural Radiance Fields (NeRF) . . . . . . . . . . . . . . . . . 62

2.3.2 Neural Reflectance Fields . . . . . . . . . . . . . . . . . . . . 62
2.3.3 Light Transport via Neural Visibility Fields . . . . . . . . . . 63
2.3.4 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.3.5 Training & Implementation Details . . . . . . . . . . . . . . . 68
2.4 Method: One Unknown Illumination . . . . . . . . . . . . . . . . . . 70
2.4.1 Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.4.2 Reflectance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4.3 Illumination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.4.4 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.4.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 79
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.5.2 Shape Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.5.3 Joint Estimation of Shape, Reflectance, & Illumination . . . . 83
2.5.4 Free-Viewpoint Relighting . . . . . . . . . . . . . . . . . . . . 85
2.5.5 Material Editing . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.6.1 Baseline Comparisons: Multiple Known Illuminations . . . . . 91
2.6.2 Baseline Comparisons: One Unknown Illumination . . . . . . 95
2.6.3 Ablation Studies: Multiple Known Illuminations . . . . . . . . 99
2.6.4 Ablation Studies: One Unknown Illumination . . . . . . . . . 100
2.6.5 Estimation Consistency Across Different Illuminations . . . . 103
2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

3 Mid-Level Abstraction: The Light Transport Function 107


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.2.1 Single Observation . . . . . . . . . . . . . . . . . . . . . . . . 115
3.2.2 Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.2.3 Multiple Illuminants . . . . . . . . . . . . . . . . . . . . . . . 117

3.2.4 Multiple Views & Illuminants . . . . . . . . . . . . . . . . . . 118
3.3 Method: Precise, High-Frequency Relighting . . . . . . . . . . . . . . 119
3.3.1 Active Set Construction . . . . . . . . . . . . . . . . . . . . . 121
3.3.2 Alias-Free Pooling . . . . . . . . . . . . . . . . . . . . . . . . 123
3.3.3 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.4 Loss Functions & Training Strategy . . . . . . . . . . . . . . . 126
3.4 Method: Free-Viewpoint Relighting . . . . . . . . . . . . . . . . . . . 127
3.4.1 Texture-Space Inputs . . . . . . . . . . . . . . . . . . . . . . . 129
3.4.2 Query & Observation Networks . . . . . . . . . . . . . . . . . 132
3.4.3 Residual Learning of High-Order Effects . . . . . . . . . . . . 133
3.4.4 Simultaneous Relighting & View Synthesis . . . . . . . . . . . 136
3.4.5 Network Architecture, Losses, & Other Details . . . . . . . . . 136
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.5.1 Hardware Setup & Data Acquisition . . . . . . . . . . . . . . 139
3.5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 142
3.5.3 Precise Directional Relighting . . . . . . . . . . . . . . . . . . 143
3.5.4 High-Frequency Image-Based Relighting . . . . . . . . . . . . 144
3.5.5 Lighting Softness Control . . . . . . . . . . . . . . . . . . . . 144
3.5.6 Geometry-Free Relighting . . . . . . . . . . . . . . . . . . . . 146
3.5.7 Geometry-Based Relighting . . . . . . . . . . . . . . . . . . . 148
3.5.8 Changing the Viewpoint . . . . . . . . . . . . . . . . . . . . . 151
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.6.1 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.6.2 Image-Based Relighting Under Varying Light Frequency . . . 159
3.6.3 Subsampling the Light Stage . . . . . . . . . . . . . . . . . . . 161
3.6.4 Degrading the Input Geometry Proxy . . . . . . . . . . . . . . 163
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

4 High-Level Abstraction: Data-Driven Shape Reconstruction 169


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.2.1 3D Shape Completion . . . . . . . . . . . . . . . . . . . . . . 174
4.2.2 Single-Image 3D Reconstruction . . . . . . . . . . . . . . . . . 175
4.2.3 2.5D Sketch Recovery . . . . . . . . . . . . . . . . . . . . . . . 176
4.2.4 Perceptual Losses & Adversarial Learning . . . . . . . . . . . 176
4.2.5 Spherical Projections . . . . . . . . . . . . . . . . . . . . . . . 177
4.2.6 Zero- & Few-Shot Recognition . . . . . . . . . . . . . . . . . . 177
4.2.7 3D Shape Datasets . . . . . . . . . . . . . . . . . . . . . . . . 178
4.3 Method: Learning & Using Shape Priors . . . . . . . . . . . . . . . . 179
4.3.1 Shape Naturalness Network . . . . . . . . . . . . . . . . . . . 181
4.3.2 Training Paradigm . . . . . . . . . . . . . . . . . . . . . . . . 182
4.4 Method: Generalizing to Unseen Classes . . . . . . . . . . . . . . . . 183
4.4.1 Single-View Depth Estimator . . . . . . . . . . . . . . . . . . 184
4.4.2 Spherical Map Inpainting Network . . . . . . . . . . . . . . . 184
4.4.3 Voxel Refinement Network . . . . . . . . . . . . . . . . . . . . 185
4.4.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.5 Method: Building a Real-World Dataset . . . . . . . . . . . . . . . . 187
4.5.1 Collecting Image-Shape Pairs . . . . . . . . . . . . . . . . . . 189
4.5.2 Image-Shape Alignment . . . . . . . . . . . . . . . . . . . . . 190
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.6.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
4.6.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.6.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.6.4 Single-View Shape Completion . . . . . . . . . . . . . . . . . . 199
4.6.5 Single-View Shape Reconstruction . . . . . . . . . . . . . . . . 203
4.6.6 Estimating Depth for Novel Shape Classes . . . . . . . . . . . 208
4.6.7 Reconstructing Novel Objects From Training Classes . . . . . 208
4.6.8 Reconstructing Objects From Unseen Classes . . . . . . . . . 209
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.7.1 Network Visualization . . . . . . . . . . . . . . . . . . . . . . 212

4.7.2 Training With the Naturalness Loss Over Time . . . . . . . . 213
4.7.3 Failure Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.7.4 Effects of Viewpoints on Generalization . . . . . . . . . . . . . 214
4.7.5 Generalizing to Non-Rigid Shapes . . . . . . . . . . . . . . . . 214
4.7.6 Generalizing to Highly Regular Shapes . . . . . . . . . . . . . 215
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

5 High-Level Abstraction: Data-Driven Lighting Recovery 217


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
5.2.1 Non-Line-of-Sight Imaging . . . . . . . . . . . . . . . . . . . . 221
5.2.2 Lighting Recovery . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.2.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . 223
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5.3.1 Data & Simulation . . . . . . . . . . . . . . . . . . . . . . . . 225
5.3.2 Nearest Neighbor-Based Recovery . . . . . . . . . . . . . . . . 228
5.3.3 Generative Adversarial Network-Based Recovery . . . . . . . . 229
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
5.4.1 Test Data & Evaluation Metrics . . . . . . . . . . . . . . . . . 234
5.4.2 Earth Recovery Given the Moon . . . . . . . . . . . . . . . . . 235
5.4.3 Learning the Continuous Earth Rotation . . . . . . . . . . . . 237
5.4.4 Multi-Modal Generation & the Clouds . . . . . . . . . . . . . 240
5.4.5 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 242
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

6 Conclusion & Discussion 245

A Supplement: Neural Reflectance and Visibility Fields (NeRV) 249


A.1 BRDF Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . 249
A.2 Additional Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 250
A.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

B Supplement: Light Stage Super-Resolution (LSSR) 255
B.1 Progressive Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.2 Baseline Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

C Supplement: Pix3D 259


C.1 Evaluation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
C.2 Nearest Neighbors With Different Metrics . . . . . . . . . . . . . . . 260
C.3 Sample Data in Pix3D . . . . . . . . . . . . . . . . . . . . . . . . . . 261

D Supplement: Generalizable Reconstruction (GenRe) 265


D.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
D.2 Model Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
D.2.1 Single-View Depth Estimator . . . . . . . . . . . . . . . . . . 266
D.2.2 Spherical Map Inpainting Network . . . . . . . . . . . . . . . 270
D.2.3 Voxel Refinement Network . . . . . . . . . . . . . . . . . . . . 270

List of Figures

1-1 Relationships among the object, light, and camera. . . . . . . . . . . 28


1-2 Visualization of different shape representations. . . . . . . . . . . . . 29
1-3 Example BRDFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1-4 Direct vs. indirect illumination. . . . . . . . . . . . . . . . . . . . . . 34
1-5 Example shadow map and ambient occlusion map. . . . . . . . . . . 35

2-1 How NeRV reduces the computational complexity. . . . . . . . . . . 52


2-2 Example input and output of NeRV. . . . . . . . . . . . . . . . . . . 53
2-3 NeRFactor overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2-4 Example decomposition of NeRV. . . . . . . . . . . . . . . . . . . . . 61
2-5 The geometry of an indirect illumination path in NeRV. . . . . . . . 64
2-6 NeRFactor model and its example output. . . . . . . . . . . . . . . . 71
2-7 High-quality geometry recovered by NeRFactor. . . . . . . . . . . . . 82
2-8 Joint estimation of shape, reflectance, and lighting by NeRFactor. . . 86
2-9 Free-viewpoint relighting by NeRFactor. . . . . . . . . . . . . . . . . 88
2-10 NeRFactor’s results on real-world captures. . . . . . . . . . . . . . . 89
2-11 Material editing and relighting by NeRFactor. . . . . . . . . . . . . . 90
2-12 NeRV vs. Bi et al. [2020a]. . . . . . . . . . . . . . . . . . . . . . . . 93
2-13 NeRV vs. latent code models. . . . . . . . . . . . . . . . . . . . . . . 95
2-14 NeRFactor vs. SIRFS. . . . . . . . . . . . . . . . . . . . . . . . . . . 96
2-15 NeRFactor vs. Oxholm and Nishino [2014] (enhanced). . . . . . . . . 98
2-16 NeRFactor vs. Philip et al. [2019]. . . . . . . . . . . . . . . . . . . . 99
2-17 Indirect illumination in NeRV. . . . . . . . . . . . . . . . . . . . . . 100

2-18 NeRV with analytic vs. MLP-predicted normals. . . . . . . . . . . . 100
2-19 Qualitative ablation studies of NeRFactor. . . . . . . . . . . . . . . . 102
2-20 Albedo estimation of NeRFactor across different illuminations. . . . 103

3-1 LSSR overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


3-2 NLT overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3-3 Visualization of the LSSR architecture. . . . . . . . . . . . . . . . . 121
3-4 Construction of the active sets in LSSR. . . . . . . . . . . . . . . . . 122
3-5 LSSR’s alias-free pooling. . . . . . . . . . . . . . . . . . . . . . . . . 124
3-6 Gap in photorealism that NLT attempts to close. . . . . . . . . . . . 127
3-7 NLT Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3-8 Modeling non-diffuse BSSRDFs as residuals for relighting in NLT. . . 135
3-9 Modeling global illumination as residuals for relighting in NLT. . . . 135
3-10 Sample images used for training NLT. . . . . . . . . . . . . . . . . . 141
3-11 Interpolation by LSSR between two physical lights. . . . . . . . . . . 143
3-12 High-frequency image-based relighting by LSSR. . . . . . . . . . . . 145
3-13 Controlling lighting softness with LSSR. . . . . . . . . . . . . . . . . 145
3-14 Relighting by LSSR and the baselines. . . . . . . . . . . . . . . . . . 147
3-15 NLT relighting with a directional light. . . . . . . . . . . . . . . . . 150
3-16 HDRI relighting by NLT. . . . . . . . . . . . . . . . . . . . . . . . . 151
3-17 View synthesis by NLT. . . . . . . . . . . . . . . . . . . . . . . . . . 153
3-18 Simultaneous relighting and view synthesis by NLT. . . . . . . . . . 155
3-19 Comparing NLT against NeRF and NeRF+Light. . . . . . . . . . . 156
3-20 Continuous directional relighting by LSSR. . . . . . . . . . . . . . . 157
3-21 NLT and its ablated variants for relighting. . . . . . . . . . . . . . . 158
3-22 Quality gain by LSSR w.r.t. lighting frequency. . . . . . . . . . . . . 160
3-23 LSSR vs. linear blending: relighting errors w.r.t. light density. . . . . 161
3-24 LSSR vs. linear blending: relighting with sparser lights. . . . . . . . 162
3-25 NLT relighting with sparser lights. . . . . . . . . . . . . . . . . . . . 162
3-26 Performance of NLT w.r.t. quality of the geometry proxy. . . . . . . 163

3-27 A failure case of NLT’s view synthesis. . . . . . . . . . . . . . . . . . 166

4-1 Two levels of ambiguity in single-view 3D shape perception. . . . . . 171


4-2 GenRe overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4-3 ShapeHD model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4-4 GenRe model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4-5 GenRe’s spherical inpainting module generalizing to new classes. . . 185
4-6 Pix3D vs. existing datasets. . . . . . . . . . . . . . . . . . . . . . . . 188
4-7 Building Pix3D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
4-8 Sample images and shapes in Pix3D. . . . . . . . . . . . . . . . . . . 193
4-9 Image and shape distributions across categories of Pix3D. . . . . . . 195
4-10 3D shape completion from single-view depth by ShapeHD. . . . . . . 200
4-11 3D shape completion by ShapeHD. . . . . . . . . . . . . . . . . . . . 201
4-12 3D shape completion by ShapeHD using real depth data. . . . . . . 202
4-13 3D shape reconstruction by ShapeHD on ShapeNet. . . . . . . . . . 204
4-14 3D shape reconstruction by ShapeHD on novel categories. . . . . . . 204
4-15 3D shape reconstruction by ShapeHD on PASCAL 3D+. . . . . . . . 205
4-16 3D shape reconstruction by ShapeHD on Pix3D. . . . . . . . . . . . 206
4-17 GenRe’s depth estimator generalizing to novel shape classes. . . . . . 208
4-18 GenRe’s reconstruction within and beyond training classes. . . . . . 210
4-19 GenRe’s reconstruction on real images from novel classes. . . . . . . 211
4-20 Visualization of how ShapeHD attends to details in the depth maps. 212
4-21 How ShapeHD improves over time with the naturalness loss. . . . . 213
4-22 Common failure modes of ShapeHD. . . . . . . . . . . . . . . . . . . 214
4-23 Reconstruction errors of GenRe across different input viewpoints. . . 215
4-24 Single-view completion of non-rigid shapes from depth by GenRe. . . 215
4-25 Completion of highly regular shapes (primitives) by GenRe. . . . . . 215

5-1 Our simplification of the Sun-Earth-Moon system. . . . . . . . . . . 227


5-2 How the Moon responds differently to distinct Earth illuminations. . 228
5-3 Illustration of the nearest neighbor baselines for EarthGAN. . . . . . 229

5-4 EarthGAN model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5-5 Different Earth appearances at similar timestamps. . . . . . . . . . . 233
5-6 Earth recovery by EarthGAN. . . . . . . . . . . . . . . . . . . . . . 236
5-7 Continuous Earth rotation learned by EarthGAN. . . . . . . . . . . 238
5-8 Smooth evolution of the Earth appearance learned by EarthGAN. . . 239
5-9 How EarthGAN learns to model the clouds. . . . . . . . . . . . . . . 241
5-10 Ablation studies of EarthGAN’s design choices. . . . . . . . . . . . . 243

A-1 NeRV vs. NLT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251


A-2 Additional results and baseline comparisons for NeRV. . . . . . . . . 252

B-1 Network architecture and progressive training scheme of LSSR. . . . 256


B-2 More comparisons between LSSR and the baselines. . . . . . . . . . 258

C-1 Retrieving nearest neighbors in Pix3D using different metrics. . . . . 260


C-2 Diverse images associated with the same shape in Pix3D. . . . . 261
C-3 Sample images and their corresponding shapes in Pix3D. . . . . . . . 262
C-4 More sample images and their corresponding shapes in Pix3D. . . . . 263

List of Tables

2.1 Quantitative evaluation of NeRFactor. . . . . . . . . . . . . . . . . . 84


2.2 Quantitative evaluation of NeRV. . . . . . . . . . . . . . . . . . . . . 94
2.3 Quantitative ablation studies of NeRV. . . . . . . . . . . . . . . . . . 100

3.1 Neural network architecture of NLT. . . . . . . . . . . . . . . . . . . 137


3.2 Relighting errors of LSSR. . . . . . . . . . . . . . . . . . . . . . . . . 146
3.3 NLT Relighting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.4 View synthesis errors of NLT. . . . . . . . . . . . . . . . . . . . . . . 152

4.1 Dataset quality of Pix3D. . . . . . . . . . . . . . . . . . . . . . . . . 196


4.2 Correlation between different shape metrics and human judgments. . 196
4.3 Average shape completion errors of ShapeHD on ShapeNet. . . . . . 203
4.4 3D shape reconstruction by ShapeHD on PASCAL 3D+. . . . . . . . 207
4.5 3D shape reconstruction by ShapeHD on Pix3D. . . . . . . . . . . . 207
4.6 3D shape reconstruction by GenRe on training and novel classes. . . 209
4.7 Reconstruction errors of GenRe on Pix3D. . . . . . . . . . . . . . . . 211

5.1 Quantitative evaluation of EarthGAN. . . . . . . . . . . . . . . . . . 237

Chapter 1

Introduction

One way to view computer vision is to think of it as “inverse computer graphics.”
Computer graphics covers the whole procedure of building scene geometry, crafting
shaders for each part of that geometry, setting up lighting and a camera, and even-
tually rendering everything into a photorealistic image. Computer vision, on the
contrary, aims to recover all these intermediate factors from the observed image(s).
This definition of computer vision encompasses several classic problems: Shape From
Shading [Horn, 1970] and Structure From Motion [Ullman, 1979] recover geometry
from images, Intrinsic Image Decomposition [Barrow and Tenenbaum, 1978] recovers
reflectance, and Barron and Malik [2014] additionally recover illumination.
Following this definition of computer vision, we present several approaches all
aimed at recovering different intermediate factors (such as shape, reflectance, and light-
ing) from what they collectively lead to—images. These inverse problems are well-
known to be ill-posed, so different priors (data-driven or predefined) are employed
to differentiate a plausible combination of the different factors from other possible
but less likely ones. Solving these inverse problems benefits many downstream vi-
sion and graphics applications. Just to name a few, the development of Extended
Reality (XR) is hindered by the cost and difficulty in creating high-quality 3D assets
[Maughan, 2016], and the ability to automatically recover shape and material prop-
erties from just images would circumvent the heavy manual labor of object scanning,
thereby greatly accelerating and democratizing XR content creation [Inc., 2021]; an
algorithm capable of estimating facial geometry and reflectance would enable “magic”
portrait relighting features on consumer mobile phones, such as Google’s Portrait
Light [Tsai and Pandey, 2020].

Throughout this dissertation, we tackle these inverse rendering problems at three
levels of abstraction. At a low level of abstraction, we devise physically-based
methods that explicitly solve for every term in the (simplified) rendering equation: the
object’s shape, reflectance, and illumination that collectively explain the observed im-
ages. To this end, the first half of Chapter 2 studies whether one can jointly optimize
shape, reflectance, and indirect illumination from scratch given multi-view images of
an object lit by multiple arbitrary but known lighting conditions [Srinivasan et al.,
2021]. To relax the capture requirement of multiple known lighting conditions, the
second half of Chapter 2 is dedicated to achieving a similar decomposition of shape
and reflectance but under just one unknown lighting condition [Zhang et al., 2021c].
With these models, we achieve high-quality geometry estimation, free-viewpoint re-
lighting, and material editing. Both approaches model the actual image formation
process at a low level, relying more on physics than on data (which high-level ap-
proaches usually depend on). Despite being challenging, this low-level abstraction enables
applications that a mid- or high-level one would not support, such as shape or material
editing and asset export (e.g., into a traditional graphics engine).

Ascending to a middle level of abstraction, we tackle another two inverse ren-
dering problems by abstracting away the complex object-light interaction with the
light transport (LT) function. Intuitively, the LT function “summarizes” the actual
LT by directly returning the resultant radiance given some convenient descriptions of
the camera and light (the simplest being light and view directions). In the first half
of Chapter 3, we take an entirely image-based approach and focus on interpolating
the LT function in just the light direction, which enables continuous, ghosting-free
directional relighting and high-frequency image-based relighting [Sun et al., 2020]. In
the second half, we interpolate the LT function additionally in the view direction,
thereby performing simultaneous relighting and view synthesis [Zhang et al., 2021b].
In both approaches, we estimate a mid-level representation—the LT function—from
the subject’s appearance observed under various lighting conditions (also from differ-
ent views for Zhang et al. [2021b]), without further factorizing the function into the
underlying shape and reflectance. This mid-level abstraction allows our models to
easily include global illumination effects, but it does not support shape or material
editing (which the low-level abstraction permits) and requires multiple images of the
object (in contrast to the high-level abstraction that is applicable to single images).
Finally, at a high level of abstraction, we aim to directly regress the inter-
mediate factors (e.g., shape, lighting, etc.) from their resultant appearance, without
modeling the actual image formation process. This level of abstraction treats render-
ing as a black box to be inverted and usually involves training end-to-end machine
learning models on large datasets to learn data-driven priors directly on the inter-
mediate factors. Specifically, in this dissertation, we explore two instances of such
methods: 3D shape reconstruction from single images (Chapter 4) and lighting recov-
ery from the appearance of the illuminated object (Chapter 5). In the first problem of
shape reconstruction, we train neural networks to directly regress 3D shapes from sin-
gle images thereof, leveraging the data-driven shape priors learned from a large-scale
shape dataset [Sun et al., 2018b, Wu et al., 2018]. We further make such networks
generalizable to novel shape classes unseen during training, by wiring geometric pro-
jections (which we understand well and can specify exactly) as inductive bias into
our model [Zhang et al., 2018b]. In the second problem of lighting recovery, we train
a conditional generative model to learn regularities in our lighting conditions, such
that, when given the appearance of the illuminated object, the model generates a plausible
lighting condition responsible for the observation. With this high-level abstraction,
we ignore the physics of the image formation process and take data-driven approaches
that accept single-image input, leveraging the power of machine learning.

1.1 Image Formation

In this section, we briefly introduce the image formation process in nature or computer
graphics. Figure 1-1 shows a cartoon visualization of the relationships among the
object, light, and camera, an example real photo of the scene, and a computer graphics
render aiming to reproduce that real photo. We present only a simplified process that
is sufficient for what this dissertation concerns. In this simplified framework, there
are four key scene elements—shape, materials, lighting, and the camera—and the
rendering process that combines these elements into an RGB image of the scene. In
the following subsections, we elaborate on each of these four scene elements and the
rendering process.

[Figure 1-1 image: a diagram of the camera, light, object, and shadow catcher, alongside a real photo and a synthetic render with specular highlights and shadows labeled. Image and render from https://www.firemist.com/technotes/3d-how-it-works-part1.]

Figure 1-1: Relationships among the object, light, and camera. Top: Light travels
from the source to the scene, interacts with the objects therein, and reaches the
camera. Bottom (left): A real photo contains complex light transport effects such as
specular highlights and soft shadows. Bottom (right): With careful scene modeling
and physically-based rendering, one can reproduce the real photo with a synthetic
render, thanks to computer graphics.

1.1.1 Shape

Shape is arguably the most important aspect of a 3D scene because it provides a
“foundation” on which material properties are defined. As such, it is difficult, if
possible at all, to estimate other scene aspects such as material properties without
knowledge of geometry. Though important, geometry is not easy to represent since
the representation must be powerful enough to represent high-frequency structures,
descriptive enough to extract information from, and fast enough to perform operations
on. Unsurprisingly, no single shape representation is optimal for all purposes.
We visualize the popular shape representations in Figure 1-2.

(A) Mesh (B) Point Clouds (C) Voxels (D) SDF (0 level set) (E) 2.5D Maps

Figure 1-2: Visualization of different shape representations. These example images
are taken from (A) Turk and Levoy [1994], (B) PointNet++ [Qi et al., 2017b], (C)
OGN [Tatarchenko et al., 2017], (D) DeepSDF [Park et al., 2019a], and (E) GenRe
[Zhang et al., 2018b].

Mesh The computer graphics community has been using meshes as its shape repre-
sentation for decades. Briefly, a mesh is a compact representation that describes shape
as a list of vertices and faces (i.e., connectivity among the vertices). See Figure 1-2
(A) for an example. Besides being compact, it is powerful enough to represent complex
geometry of any topology (by simply adding more vertices and faces) and efficient for
computing ray intersections [Möller and Trumbore, 1997], a particularly important
feature as millions of ray-mesh intersection computations are common in ray tracing.
Despite being universal and efficient, meshes are less amenable to neural networks
than the other representations discussed below.
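
To make the ray-mesh intersection point concrete, below is a minimal sketch of the Möller-Trumbore ray-triangle test in Python/NumPy; the function and variable names are my own illustrative choices, not code from this dissertation.

    import numpy as np

    def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
        """Moller-Trumbore test: return the hit distance t along the ray, or None."""
        e1, e2 = v1 - v0, v2 - v0
        pvec = np.cross(direction, e2)
        det = np.dot(e1, pvec)
        if abs(det) < eps:                     # ray is parallel to the triangle plane
            return None
        inv_det = 1.0 / det
        tvec = origin - v0
        u = np.dot(tvec, pvec) * inv_det       # first barycentric coordinate
        if u < 0.0 or u > 1.0:
            return None
        qvec = np.cross(tvec, e1)
        v = np.dot(direction, qvec) * inv_det  # second barycentric coordinate
        if v < 0.0 or u + v > 1.0:
            return None
        t = np.dot(e2, qvec) * inv_det         # distance along the ray
        return t if t > eps else None

A ray tracer would loop this test over the faces of a mesh, usually behind an acceleration structure such as a bounding volume hierarchy.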

Point Clouds If we remove the mesh faces (or equivalently, the vertex connectivity),
we arrive at the point cloud representation, where a collection of 3D points describes
the surface geometry. See Figure 1-2 (B) for an example. A point cloud of size 𝑁 is just
an 𝑁 × 3 array of unordered 3D coordinates and can therefore be easily processed
by network architectures such as PointNet [Qi et al., 2017a]. The major drawback,
though, is the lack of an explicit surface since there is no face information. As such,
a ray that should have hit the surface would travel through the unconnected points,
and the concept of being inside or outside of the shape is undefined.
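
To make the order-invariance property concrete, here is a minimal PyTorch sketch, in the spirit of PointNet but not its actual implementation, of a shared per-point MLP followed by a symmetric max-pool; permuting the input points leaves the output unchanged.

    import torch
    import torch.nn as nn

    class TinyPointEncoder(nn.Module):
        """Shared per-point MLP + symmetric max-pooling over the point dimension."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.point_mlp = nn.Sequential(
                nn.Linear(3, 64), nn.ReLU(),
                nn.Linear(64, feat_dim), nn.ReLU(),
            )

        def forward(self, points):                    # points: (batch, n_points, 3)
            per_point = self.point_mlp(points)        # (batch, n_points, feat_dim)
            global_feat, _ = per_point.max(dim=1)     # pool over points: order-invariant
            return global_feat                        # (batch, feat_dim)

    # Shuffling the points does not change the global feature:
    pts = torch.randn(2, 1024, 3)
    enc = TinyPointEncoder()
    assert torch.equal(enc(pts), enc(pts[:, torch.randperm(1024)]))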

Voxels Another way to represent shape is with voxels: a 3D grid of occu-
pancy values. Intuitively, this “pixelated” representation looks like a LEGO® approx-
imation of the actual (smooth) shape. See Figure 1-2 (C) for an example. To convert
voxels into mesh, one can run the Marching Cubes algorithm [Lorensen and Cline,
1987]. Like pixels, voxels are friendly to Convolutional Neural Networks (CNNs), and
extending a 2D CNN to a 3D one is straightforward. As such, our shape representa-
tions used in Chapter 4 are mostly voxels [Sun et al., 2018b, Wu et al., 2018, Zhang
et al., 2018b].
The disadvantage of voxels, though, is their poor scalability and high memory
consumption. Indeed, as we see in Chapter 4, nearly all voxel-based approaches have
limited reconstruction resolution because the memory demand grows cubically with
resolution (although data structures such as octrees can alleviate this problem,
as discussed in Section 4.2). Note that voxels often waste resources since shape,
especially surface, is often sparse in the 3D space (a random point in the 3D space is
less likely to fall on the surface or inside of the shape than to land in free space).
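
As a concrete example of the voxels-to-mesh conversion mentioned above, the following sketch runs Marching Cubes on a synthetic occupancy grid using scikit-image (an assumed dependency; this is illustrative code, not the pipeline used in Chapter 4).

    import numpy as np
    from skimage import measure   # assumed dependency providing marching_cubes

    # Synthetic 64^3 occupancy grid: a solid sphere of radius 20 voxels.
    res = 64
    zz, yy, xx = np.meshgrid(*([np.arange(res)] * 3), indexing="ij")
    occupancy = (((xx - res / 2) ** 2 + (yy - res / 2) ** 2
                  + (zz - res / 2) ** 2) < 20 ** 2).astype(np.float32)

    # Extract the 0.5 iso-surface as a triangle mesh: vertex positions and faces.
    verts, faces, normals, _ = measure.marching_cubes(occupancy, level=0.5)
    print(verts.shape, faces.shape)   # (V, 3) float vertices, (F, 3) integer vertex indices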

Implicit Representations Voxels can be thought of as a 3D discrete field of
scalars. If we use a continuous field of scalars to represent shape, we will be able
to represent smooth geometry better. The Signed Distance Function (SDF) is a realiza-
tion of this idea: It is a continuous function that returns the shortest distance from
the query point to the surface, with the sign of the distance indicating whether the
query point is inside or outside of the shape. Since shape is implicitly represented by
the zero level set of the SDF, these functions are sometimes referred to as “implicit
shape representations.” See Figure 1-2 (D) for an example.
Since neural networks are universal function approximators [Hornik et al., 1989]
that are compact and tend to produce smooth output, researchers recently proposed
to parameterize these implicit representations using Multi-Layer Perceptrons (MLPs)
[Park et al., 2019a, Mildenhall et al., 2020]. Following this line of work, Chapter 2 of
this dissertation explores using MLPs to represent geometry in two ways: The first
half of the chapter maintains a volumetric representation using MLPs [Srinivasan
et al., 2021], while the second half opts for a surface representation but also using
MLPs [Zhang et al., 2021c].
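
As a minimal illustration of such an MLP-parameterized implicit representation (a generic coordinate-based sketch, not the specific architectures used in Chapter 2), an MLP can map a 3D coordinate to a signed distance, and training can regress it against sampled ground-truth distances.

    import torch
    import torch.nn as nn

    class SDFNet(nn.Module):
        """Coordinate-based MLP: 3D point -> signed distance to the surface."""
        def __init__(self, hidden=256, n_layers=4):
            super().__init__()
            dims = [3] + [hidden] * n_layers
            layers = []
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                layers += [nn.Linear(d_in, d_out), nn.ReLU()]
            layers.append(nn.Linear(hidden, 1))        # scalar signed distance
            self.mlp = nn.Sequential(*layers)

        def forward(self, xyz):                        # xyz: (..., 3)
            return self.mlp(xyz).squeeze(-1)           # (...,) signed distances

    # Fitting to (point, distance) samples; DeepSDF additionally clamps the distances.
    net, points, gt_sdf = SDFNet(), torch.randn(4096, 3), torch.rand(4096) - 0.5
    loss = (net(points) - gt_sdf).abs().mean()         # simple L1 regression

The zero level set of the trained network is then the recovered surface, extractable with, e.g., Marching Cubes on a dense grid of queried distances.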

2.5D Maps Besides 3D representations, there are also 2D representations that can
describe 3D shape. With 3D semantics such as depth or normals, these 2D images are
often referred to as “2.5D” maps or buffers [Marr, 1982]. More specifically, a depth
map has its pixel values indicating how far the camera rays travel before hitting the
objects in the scene; a normal map has its pixel values specifying the 3D orientations
of the surface points visible from this view. See Figure 1-2 (E) for an example. Unlike
3D representations, these 2.5D maps are dependent on the view: Different views of
the same scene lead to different 2.5D maps since different 3D points fall onto the
image plane.
Because these maps are essentially 2D images exploiting the sparseness of 3D surfaces,
they are amenable to image CNNs and other network architectures designed for im-
ages. We use depth maps and other custom 2.5D maps (such as spherical maps in
Section 4.4) in recovering 3D shape in Chapter 4. In addition, Chapter 2 also vi-
sualizes many geometric properties such as surface normals and light visibility using
these 2.5D representations.
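
For concreteness, the following sketch (illustrative code; it assumes the pinhole intrinsics with a centered principal point described in Section 1.1.4, and depth measured along the optical axis) backprojects a depth map into a camera-frame point cloud.

    import numpy as np

    def backproject_depth(depth, f):
        """Lift an (h, w) depth map to camera-frame 3D points.
        Assumes a pinhole camera with focal length f (in pixels) and the
        principal point at the image center; depth is the Z coordinate."""
        h, w = depth.shape
        v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        x = (u - w / 2) * depth / f        # right
        y = (v - h / 2) * depth / f        # down (image convention)
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)   # (h * w, 3)

    # A constant 2 m depth map backprojects to a fronto-parallel plane at Z = 2.
    points = backproject_depth(np.full((480, 640), 2.0), f=500.0)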

1.1.2 Materials

With the shape defined, one next specifies the material properties for the object,
possibly in a spatially-varying way. The simplest material description is reflectance,
concerning only a local surface point where the light ray lands. Because this type of
reflectance depends on only the incoming and outgoing directions (𝜔i and 𝜔o ) w.r.t.
the local surface normal 𝑛 at that point (i.e., no non-local information is required), it can
be conveniently expressed using a Bidirectional Reflectance Distribution Function
(BRDF): 𝑓 (𝜔i , 𝜔o ). Intuitively, 𝑓 (·) describes how the outgoing energy is distributed
over all possible 𝜔o ’s given every 𝜔i , as visualized in Figure 1-3. The fact that 𝜔i
and 𝜔o are often defined in the local frame with 𝑛 as the 𝑧-axis demonstrates why we
often require the shape be defined before considering materials (not to mention that
we need geometry to find the ray-surface intersection too).

[Figure 1-3 panels, each drawn with its surface normal 𝑛: Mirror, Glossy, Diffuse, General.]

Figure 1-3: Example BRDFs. A perfectly reflective BRDF reflects the incoming light
to the mirrored direction. A glossy BRDF reflects light to a lobe of directions centered
around the mirror direction. A diffuse BRDF reflects light equally to all directions.
A general BRDF reflects light into all directions non-uniformly.

With this formulation, one can describe a diffuse material using the Lambertian
BRDF. Because a perfectly Lambertian material reflects the incoming light to all
outgoing directions equally, the Lambertian BRDF simply returns the same constant
for all 𝜔o ’s given any 𝜔i . Other commonly used BRDFs include the Blinn-Phong
reflection model [Blinn, 1977] and the microfacet BRDF by Walter et al. [2007],
both of which are capable of describing glossy materials with specular highlights (like
those shown in Figure 1-1). If an object has different BRDFs at different surface
locations, Spatially-Varying BRDFs (SVBRDFs) are necessary to specify its material
properties. In Chapter 2, we use the microfacet BRDF by Walter et al. [2007] as the
main reflection model [Srinivasan et al., 2021] and as an analytic alternative to our
learned BRDF [Zhang et al., 2021c].
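
To make the 𝑓 (𝜔i , 𝜔o ) abstraction concrete, below is a minimal sketch that evaluates a Lambertian BRDF and a Blinn-Phong-style specular lobe and shades a surface point under a single light; the parameter values are my own illustrative choices, not those used in later chapters.

    import numpy as np

    def lambertian_brdf(albedo):
        """Diffuse BRDF: the same value for every (w_i, w_o) pair."""
        return albedo / np.pi                      # 1/pi keeps the BRDF energy-normalized

    def blinn_phong_brdf(w_i, w_o, n, k_s=0.5, shininess=64.0):
        """Glossy lobe peaked around the mirror direction (Blinn-Phong style)."""
        h = w_i + w_o
        h = h / np.linalg.norm(h)                  # half-vector between light and view
        return k_s * max(np.dot(n, h), 0.0) ** shininess

    def shade(w_i, w_o, n, albedo, light_radiance):
        """Outgoing radiance due to one light arriving along w_i (all unit vectors)."""
        cos_term = max(np.dot(n, w_i), 0.0)        # foreshortening term (n . w_i)
        f = lambertian_brdf(albedo) + blinn_phong_brdf(w_i, w_o, n)
        return f * light_radiance * cos_term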
Despite being easy to use, these surface reflectance models deal with only local light
transport happening right at the ray-surface contact points. Therefore, they are un-
able to express non-local light transport such as subsurface scattering (SSS) as com-
monly observed on human skin [Hanrahan and Krueger, 1993] or transmitting light
transport as observed in translucent materials. As such, researchers have developed
more general material-describing functions such as the Bidirectional Scattering
Distribution Function (BSDF) by Jensen et al. [2001]. The first half of Chapter 2
computes local radiance values with BRDFs only (i.e., no scattering or transmittance)
but then employs volume rendering to alpha composite the resultant radiance values
along a camera ray [Srinivasan et al., 2021]. On the other hand, the second half opts
for an entirely surface-based treatment: Radiance is computed locally with BRDFs
only, and that local radiance directly arrives at the camera, with no volume rendering
or attenuation along the path.
Besides what has been discussed, there are other important BXDF (“X” being
a wildcard for “R,” “S,” etc.) topics that this dissertation does not touch on. For
instance, while many BXDF models are designed to look realistic, others are carefully
crafted to be physically correct, with properties that a naturally existing material
would possess such as energy conservation and the Helmholtz reciprocity. Our learned
BRDF in Chapter 2 [Zhang et al., 2021c] falls into the former category, with no
guarantee to be physically accurate.
Another essential BXDF topic is importance sampling: the technique of sampling
Monte Carlo paths according to the BXDF to enable efficient, low-variance rendering
[Lawrence et al., 2004]. Incorporating such techniques into the BRDFs in Chapter 2
could be interesting but also challenging because the
BRDFs there are unknown and being estimated jointly [Srinivasan et al., 2021, Zhang
et al., 2021c].

1.1.3 Lighting

Shape and materials are intrinsic properties of an object. Extrinsic to the object are
lighting and the camera.
Broadly, lighting can be categorized into direct or indirect illumination. Di-
rect illumination is the light arriving at the object directly from the light source,
while indirect illumination is the light bounced to the object from another object
in the scene rather than the light source, as illustrated in Figure 1-4. Taking into
account also the indirect illumination is crucial to photorealism: Figure 1-4 shows a
comparison between a scene rendered with direct illumination only vs. with global
illumination. It is clear that simulation of many light transport phenomena, such
as the green tint cast by the right green wall (pictured only indirectly via the mir-
ror ball) onto the other wall and the diffuse ball, requires modeling of the indirect
illumination. In Chapter 2, we study inverse rendering for both setups: considering
one-bounce indirect illumination [Srinivasan et al., 2021] and considering just direct
illumination [Zhang et al., 2021c].
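
To make the direct/indirect split precise, one can write the standard reflection equation (restated here for reference, using the BRDF 𝑓 and surface normal 𝑛 from Section 1.1.2) with the incoming radiance separated into the two terms:

    L_o(x, \omega_o) = \int_{\Omega}
        f(\omega_i, \omega_o)\,
        \big( L_{\text{direct}}(x, \omega_i) + L_{\text{indirect}}(x, \omega_i) \big)\,
        (n \cdot \omega_i)\, \mathrm{d}\omega_i ,

where the integral is over the hemisphere $\Omega$ around $n$, and $L_{\text{indirect}}$ accounts for light that reaches $x$ only after bouncing off other surfaces. Dropping the indirect term corresponds to the direct-illumination-only setup, and restricting it to a single bounce corresponds to the one-bounce setup studied in the first half of Chapter 2.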

Figure 1-4: Direct vs. indirect illumination. A render made with direct illumination
only (A) misses global illumination effects present in the full render (B). Panels:
(A) Direct Illumination Only; (B) With Indirect Illumination. © Wikimedia Commons
user “Barahag” (https://en.wikipedia.org/wiki/File:Global_illumination1.png).

How do we represent lighting? The two common representations are latitude-
longitude maps or light probe images and coefficients for some predefined basis
functions such as spherical harmonics or Gaussians. The former method directly
stores the High-Dynamic-Range (HDR) values for all possible latitude-longitude com-
binations in the image grid, while the latter projects the spherical signal into a set
of basis functions and stores just the coefficients. The basis-function method clearly
has the advantage of being more compact (e.g., only 15 scalar coefficients for
five levels of spherical harmonics) than the latitude-longitude representation. However,
unless an excessive number of coefficients is used, it struggles to represent arbitrary,
high-frequency lighting that the latitude-longitude representation captures easily (e.g.,
a map with alternating pixel colors along both image dimensions). Chapter 2 uses
exclusively the latitude-longitude light probe representation [Srinivasan et al., 2021, Zhang
et al., 2021c]. Chapter 5 also uses an image representation but without the latitude-
longitude semantics to the pixel locations.
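
As a concrete example of the latitude-longitude representation (illustrative code; the exact pixel-to-direction convention differs across implementations), the sketch below converts every pixel of an H × W light probe into a unit direction and a solid-angle weight, which is what integrating incoming light over the sphere requires.

    import numpy as np

    def latlong_directions(h, w):
        """Per-pixel unit directions and solid angles of an h x w lat-long light probe."""
        theta = (np.arange(h) + 0.5) / h * np.pi           # polar angle in [0, pi]
        phi = (np.arange(w) + 0.5) / w * 2 * np.pi         # azimuth in [0, 2*pi)
        theta, phi = np.meshgrid(theta, phi, indexing="ij")
        dirs = np.stack([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)], axis=-1)          # (h, w, 3) unit vectors
        # Pixels near the poles cover less of the sphere, hence the sin(theta) factor.
        solid_angles = np.sin(theta) * (np.pi / h) * (2 * np.pi / w)
        return dirs, solid_angles

    # Sanity check: the per-pixel solid angles sum to ~4*pi, the full sphere.
    _, omega = latlong_directions(64, 128)
    assert abs(omega.sum() - 4 * np.pi) < 1e-2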
For each latitude-longitude direction of the light probe, we can compute light vis-
ibility at each scene point by casting a ray from that point to the latitude-longitude
direction and checking whether that ray gets blocked by other geometry. In a scene
without (semi-)transparent objects, every 3D point has a binary visibility to a given
light direction (blocked or not), although values ∈ [0, 1] may also arise when we “ras-
terize” the 3D visibility into visibility maps associated with different views, or the
scene representation is volumetric as in Chapter 2 [Srinivasan et al., 2021]. Given a
light direction, the visibility map (as visualized in Figure 1-5) can be thought of as
a “shadow map,” informing us which pixels in this particular view are in shadow. If
we average these per-light visibility maps over all incoming light directions, we get
the ambient occlusion map (as visualized in Figure 1-5) that encodes how “exposed”
each point is to all light directions. We use these maps extensively in Chapter 2 for
visualization.
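As a small illustration of the averaging just described, the sketch below (a hypothetical helper with assumed array shapes and uniform toy weights, not code from this dissertation) turns per-light visibility maps into an ambient occlusion map.

```python
import numpy as np

def ambient_occlusion(visibility, d_omega):
    """Average per-light visibility maps into an ambient occlusion map.

    visibility: (H_probe, W_probe, H_img, W_img) binary/soft visibility of
                each image pixel toward each light-probe direction.
    d_omega:    (H_probe, W_probe) per-direction solid angles used as weights.
    """
    weights = d_omega / d_omega.sum()
    return np.tensordot(weights, visibility, axes=([0, 1], [0, 1]))

# Toy example: 8x16 light directions, a 32x32 image, uniform weights.
vis = (np.random.rand(8, 16, 32, 32) > 0.3).astype(np.float32)
ao = ambient_occlusion(vis, np.ones((8, 16)))    # (32, 32) map in [0, 1]
```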

Figure 1-5: Example shadow map and ambient occlusion map. Per-direction visibilities can be thought of as shadow maps. Averaging visibilities over all light directions leads to the ambient occlusion map quantifying how “exposed” each point is to all lights. (Panels: Shadow Map; Ambient Occlusion.)

Many of the image features that we observe depend on the incoming light direction: When this direction varies, features such as shadows and specular highlights change even for a fixed viewpoint. Consider the apple scene in Figure 1-1.
When the light bulb moves around, we will see the shadows and specular highlights
moving accordingly. Other light-dependent effects that are more subtle include
shadow softness and specularity spread: Still in the apple scene, if the light bulb
shrinks in size, approaching a point light, the cast shadows will become harder with
the penumbra gradually disappearing, and the specular highlights will become more
concentrated. Relighting is the problem of synthesizing such light-dependent effects
under novel lighting, addressed by Chapter 3 and Chapter 2 of this dissertation [Sun
et al., 2020, Zhang et al., 2021b, Srinivasan et al., 2021, Zhang et al., 2021c].

1.1.4 Cameras

Cameras record a 2D projection of the 3D world onto the image plane. The projection
is governed by camera extrinsics and intrinsics.
Camera extrinsics describes the rigid-body transformation from the world co-
ordinate system to the camera’s local coordinate system, usually in the form of a 3D
rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and a 3D translation vector $t \in \mathbb{R}^3$. Camera extrinsics can then be expressed with a 3 × 4 matrix 𝐸 = [𝑅 | 𝑡]. Therefore, given a 3D point (in homogeneous coordinates) $x_\mathrm{w}^\mathrm{homo}$ in the world space, $x_\mathrm{c} = E x_\mathrm{w}^\mathrm{homo}$ produces the non-homogeneous coordinates of the same point in the camera's local frame.
Camera intrinsics, on the other hand, specifies how the 3D-to-2D projection is performed in the camera's local space. In this dissertation (and most computer vision projects), we assume zero skew, square pixels, and an optical center at the image center. These assumptions lead to the 3 × 3 intrinsics matrix
\[
K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix},
\]
where 𝑓 is the focal length in pixels¹, and (ℎ, 𝑤) are the image resolution. With both extrinsics and intrinsics specified, the “one-stop” projection matrix projecting $x_\mathrm{w}^\mathrm{homo}$ to its 2D homogeneous coordinates in the image space is given by $x_\mathrm{i}^\mathrm{homo} = K E x_\mathrm{w}^\mathrm{homo}$. We refer the reader to Szeliski [2010] for more on camera models and to Hartley and Zisserman [2004] for in-depth mathematics on projective geometry.
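For concreteness, the following sketch assembles 𝐸 and 𝐾 and projects a world-space point to pixel coordinates; the helper and all numbers are made up for illustration, under the assumptions stated above (zero skew, square pixels, centered optical center).

```python
import numpy as np

def project(x_world, R, t, f, h, w):
    """Project a 3D world-space point to 2D pixel coordinates (pinhole model)."""
    E = np.hstack([R, t.reshape(3, 1)])        # 3x4 extrinsics [R | t]
    K = np.array([[f, 0.0, w / 2.0],           # 3x3 intrinsics
                  [0.0, f, h / 2.0],
                  [0.0, 0.0, 1.0]])
    x_homo = np.append(x_world, 1.0)           # homogeneous world coordinates
    u = K @ E @ x_homo                         # 2D homogeneous image coordinates
    return u[:2] / u[2]                        # perspective divide

# Toy example: identity rotation, the camera 5 units from the origin, and a
# focal length already converted to pixels (see the footnote on the
# mm-to-pixel conversion).
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])
print(project(np.array([0.2, -0.1, 1.0]), R, t, f=500.0, h=480, w=640))
```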
We use these 3D-to-2D projections, their inversions (as simple as matrix inversion),
and their extensions heavily throughout this dissertation. Specifically, in Chapter 2,
we cast camera rays to the scene by inverting the aforementioned 3D-to-2D camera
projection [Srinivasan et al., 2021, Zhang et al., 2021c]. In Chapter 3, we resample
pixels from the camera space to the UV texture space and back [Zhang et al., 2021b].
Finally, we estimate the extrinsics and intrinsics parameters [Sun et al., 2018b] and
backproject 2.5D depth maps to the 3D space (and to “spherical maps”) [Zhang et al.,
2018b] in Chapter 4.
¹To convert a mm focal length to pixels, one needs to compare the image resolution (which is in pixels) with the effective sensor size (which is in mm), then compute how many pixels 1 mm translates to, and finally scale the mm focal length accordingly.

There is a surprising (and perhaps unintuitive) duality between cameras and lights as shown by the Dual Photography work of Sen et al. [2005], where they
successfully synthesize the scene appearance from the projector’s perspective and
also relight the scene as if the camera were the projector (light). Similar to the
light-dependent effects discussed above, view-dependent effects are the appearance
variations due to viewpoint changes. Unsurprisingly, specularity moves as you view
it from different viewpoints, e.g., by swaying your head left and right. Shadows,
however, are seldom view-dependent: Shadows do not move w.r.t. the rest of the
3D scene as the viewpoint varies. This is a distinction between cameras and lights
despite their similarities in other aspects. The task of view synthesis is about
synthesizing the view-dependent effects for a novel viewpoint, and we address this
task in Chapter 3 and Chapter 2 [Zhang et al., 2021b, Srinivasan et al., 2021, Zhang
et al., 2021c].

1.1.5 Rendering

We have defined the four essential scene aspects—shape, materials, lighting, and cameras—and introduced their commonly used representations. The final missing
piece of the puzzle is rendering, the process of “combining” the four elements into an
RGB image.
To figure out the appearance of a 3D point 𝑥, one solves the rendering equation [Kajiya, 1986, Immel et al., 1986], often using Monte Carlo methods. In this dissertation, where no object emits light, we simplify the full equation to:

\[
L_\mathrm{o}(x, \omega_\mathrm{o}) = \sum_{\omega_\mathrm{i}} R(x, \omega_\mathrm{i}, \omega_\mathrm{o}) \, L_\mathrm{i}(x, \omega_\mathrm{i}) \, (\omega_\mathrm{i} \cdot n) \, \Delta\omega_\mathrm{i},
\tag{1.1}
\]
where 𝐿o(𝑥, 𝜔o) is the outgoing radiance at 𝑥 as viewed from 𝜔o, 𝑅(𝑥, 𝜔i, 𝜔o) is the SVBRDF at 𝑥 with directions 𝜔i and 𝜔o, 𝐿i(𝑥, 𝜔i) is the incoming radiance (masked
by the visibility) arriving at 𝑥 along 𝜔i , 𝑛 is the surface normal at 𝑥, and ∆𝜔i is the
solid angle corresponding to the lighting sample at 𝜔i .
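Below is a minimal sketch of evaluating Equation 1.1 at a single surface point. It uses a constant Lambertian BRDF purely as a placeholder (the models in this dissertation use richer, spatially-varying BRDFs), and the toy light probe, normal, and albedo are made-up inputs.

```python
import numpy as np

def render_point(albedo, normal, probe, directions, d_omega, visibility):
    """Outgoing radiance at one point via the discretized rendering equation.

    albedo:     (3,) diffuse albedo (placeholder Lambertian BRDF: albedo / pi).
    normal:     (3,) unit surface normal n.
    probe:      (N, 3) incoming radiance L_i for each sampled light direction.
    directions: (N, 3) unit incoming directions omega_i.
    d_omega:    (N,)  solid angles Delta-omega_i.
    visibility: (N,)  light visibility in [0, 1] along each direction.
    """
    cos_term = np.clip(directions @ normal, 0.0, None)   # (omega_i . n), clamped
    brdf = albedo[None, :] / np.pi                        # constant over directions
    L_i = probe * visibility[:, None]                     # masked incoming light
    return (brdf * L_i * (cos_term * d_omega)[:, None]).sum(axis=0)

# Toy inputs: 64 random light directions on the sphere.
dirs = np.random.randn(64, 3)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
L_o = render_point(albedo=np.array([0.8, 0.6, 0.5]),
                   normal=np.array([0.0, 0.0, 1.0]),
                   probe=np.random.rand(64, 3),
                   directions=dirs,
                   d_omega=np.full(64, 4 * np.pi / 64),
                   visibility=np.ones(64))
```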
Note the recursive nature of Equation 1.1: 𝐿i (𝑥, 𝜔i ) in this iteration may equal
𝐿o (𝑥, 𝜔o ) from the previous iteration, e.g., when computing indirect illumination. In
the first half of Chapter 2, there is such recursion: 𝐿i is the sum of a light probe
pixel and the one-bounce indirect illumination from a nearby point [Srinivasan et al.,
2021], whereas in the second half, 𝐿i directly takes values from the light probe pixels
since we consider only direct illumination [Zhang et al., 2021c].

Although the rendering equation is expressive and general, one may not be able to, or may not need to, fully decompose Equation 1.1 into every term. For instance, it is
error-prone, if possible at all, to explicitly find 𝑅 from samples of 𝐿o in the setup
of Chapter 3. Moreover, it is unnecessary to solve for every term in Equation 1.1
just for relighting and view synthesis in that setup since we do not plan to edit the
materials 𝑅. In such cases, a middle level of abstraction such as the light transport
function comes in useful. Formally, we reparameterize Equation 1.1 at a higher level
of abstraction:
\[
L_\mathrm{o}(x, \omega_\mathrm{o}) = \sum_{\omega_\mathrm{i}} T(x, \omega_\mathrm{i}, \omega_\mathrm{o}) \, L'_\mathrm{i}(\omega_\mathrm{i}) \, \Delta\omega_\mathrm{i},
\tag{1.2}
\]

where 𝑇 (𝑥, 𝜔i , 𝜔o ) is the light transport function that embraces the BRDF, cosine
term, light visibility, and the recursive nature of 𝐿i , and 𝐿′i (𝜔i ) is the light intensity
from 𝜔i . Crucially, unlike 𝐿i , 𝐿′i (𝜔i ) bears no dependency on 𝑥, thereby eliminating
the recursive nature of Equation 1.1. Intuitively, 𝑇 directly returns the “net radi-
ance” at 𝑥 when lit from 𝜔i and viewed from 𝜔o , concealing the actual recursion of
intermediate light bounces.
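As a small illustration of why Equation 1.2 is convenient, the sketch below relights a fixed-view image as a weighted sum of sampled transport “slices”; the helper, array shapes, and toy lighting weights are assumptions for illustration and not the actual models of Chapter 3.

```python
import numpy as np

def relight(transport, novel_light):
    """Relight a fixed-view image under a novel lighting condition.

    transport:   (N, H, W, 3) "net radiance" of each pixel when lit from each
                 of N sampled light directions (samples of T for one view).
    novel_light: (N,) intensity of the novel lighting along those directions
                 (the Delta-omega_i factor is folded into these weights).
    """
    return np.tensordot(novel_light, transport, axes=1)   # (H, W, 3)

# Toy example: 16 light directions, a 32x32 image, relit by a two-light mix.
T = np.random.rand(16, 32, 32, 3)
light = np.zeros(16)
light[[3, 11]] = [0.7, 0.3]
image = relight(T, light)
```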

Chapter 3 demonstrates the usefulness of Equation 1.2, a level of abstraction higher than the full decomposition of Equation 1.1. Instead of solving for the com-
plex reflectance of human skin, we opt to learn to directly interpolate 𝑇 , thereby
supporting relighting [Sun et al., 2020] or simultaneous relighting and view synthesis
[Zhang et al., 2021b]. That said, a decomposition shallower than that of Equation 1.1 has
its own disadvantages: Such an approach is unable to export the underlying geome-
try or edit the materials. In contrast, the low-level abstraction that explicitly solves
for every term in Equation 1.1 (Chapter 2) further supports geometry estimation and
material editing besides free-viewpoint relighting [Srinivasan et al., 2021, Zhang et al.,
2021c].

1.2 Inverting the Image Formation Process

Although the image formation process is so well understood that we can render images indistinguishable from real photographs, inverting this process—recovering from images the scene properties that we discussed—remains highly challenging because the forward process loses a great deal of information: 3D shape gets projected onto the 2D image plane; reflectance and lighting get convolved together before being observed. In
other words, inverting the image formation process is ill-posed: There are multiple
sets of scene elements that could have caused the images that we observe.

In this section, we introduce four subproblems under the overarching theme of this dissertation: shape, reflectance, and illumination from appearance. Each sub-
problem has its own dedicated chapter, and the following subsections correspond to
the upcoming four chapters. I) Corresponding to Chapter 2 (“shape, reflectance, and
illumination from appearance”), Section 1.2.1 introduces the task of jointly estimat-
ing shape, reflectance, and illumination from multi-view images. II) Section 1.2.2
proposes the problem of interpolating the light transport function, to which Chap-
ter 3 (“light transport function from appearance”) is dedicated. III) Section 1.2.3
introduces the problem of reconstructing 3D shapes from single images, correspond-
ing to Chapter 4 (“shape from appearance”). IV) Preparing the reader for Chapter 5
(“lighting from appearance”), Section 1.2.4 defines the task of lighting recovery from
the appearance of the illuminated object.

The four subproblems also represent three different levels of abstraction for the
inverse rendering problem. At a low level of abstraction, Chapter 2 attempts to solve
for every term in our (simplified) rendering equation (Equation 1.1) by re-rendering
all the estimated elements back to RGB images, which then get compared against the
observed images for loss computation. Though challenging, this low-level approach
allows us to export the estimated shape and edit the estimated reflectance in ad-
dition to what a mid-level abstraction would support. Ascending to a middle level
of abstraction, Chapter 3 explores interpolating the light transport function (as in
Equation 1.2) given sparse samples thereof. Our models based on this mid-level ab-
straction enable relighting, view synthesis, and both tasks simultaneously while easily
including global illumination effects. Finally, at a high level of abstraction, Chapter 4
and Chapter 5 recover shape or lighting from single images, without modeling the
other scene elements or the rendering process. Relying on large datasets of shapes
or lighting patterns, these two chapters train deep learning models that directly map
the appearance observations to the underlying shape or lighting.

1.2.1 Joint Estimation of Shape, Reflectance, & Illumination

In this subsection, we introduce the problem that Chapter 2 attempts to solve: es-
timating shape, reflectance, and illumination from the object’s appearance. This
amounts to explicitly solving for every term in Equation 1.1 and then re-rendering
these estimated factors into RGB images in a physically-based manner. As such, this
low level of abstraction supports operations on the estimated factors, such as lighting
editing (i.e., relighting), reflectance editing (i.e., material change), and shape export
(e.g., into a graphics engine).
Note that the well-known problem of Intrinsic Image Decomposition (IID) [Barrow
and Tenenbaum, 1978] solves only part of this factorization problem. In terms of
shape, the IID methods recover depth or surface normal maps only for the input
view, rather than a full 3D shape [Weiss, 2001, Tappen et al., 2003, Bell et al.,
2014, Barron and Malik, 2014, Janner et al., 2017]. This makes view synthesis with
these approaches impossible. Material-wise, these IID methods mostly assume Lambertian reflectance and tend to fail on more complicated materials. Finally,
lighting recovered by the IID approaches is also in the space of the input view (e.g., a
“lighting image”), making relighting with arbitrary lighting difficult. The appearance
factorization approaches that we propose in Chapter 2 address all of these issues that
the IID methods suffer from.
In Chapter 2, we study full appearance decomposition under two setups. In the
first setup, we assume that we observe the object under multiple arbitrary but known
lighting conditions [Srinivasan et al., 2021]. Note that “arbitrary” means that the
lighting does not have to be of a certain form such as one point light in the dark.
We also model first-bounce indirect illumination in this setup. In the second setup,
we relax the requirement for input lighting: We observe the object under only one
unknown lighting condition [Zhang et al., 2021c]. This relaxation allows us to apply
our method to a user capture under a natural, unknown lighting condition, such as a capture of a car on the street.

1.2.2 Interpolating the Light Transport Function

As discussed in Section 1.1.5, the light transport function 𝑇 is a convenient abstraction of complex BXDFs and ray bounces. Having access to 𝑇 enables relighting and
view synthesis: When we query 𝑇 at novel light directions, we are relighting the
scene, without actually knowing the underlying shape or BXDF; when we query 𝑇 at
novel viewing directions, we are synthesizing novel views of the scene, again without
having to evaluate Equation 1.1. Relighting and view synthesis, as applications of
light transport function interpolation, have their own more downstream applications
in Extended Reality (XR), as already discussed.
Recall that at the low level of abstraction, we can already perform relighting, view
synthesis, and both tasks simultaneously. Why do we need this mid-level abstraction
using 𝑇 , especially given that it would not support material editing or shape export?
It is still preferable to perform relighting and view synthesis with 𝑇 because 𝑇 en-
compasses the convoluted interactions between BXDFs and illuminations (which by
themselves may be already complex), making no simplifying assumption as needed
by the low-level abstraction. Therefore, this mid-level abstraction can deliver high-
quality relighting with global illumination effects, without requiring the underlying geometry and BXDFs to be estimated or multiple bounces to be simulated. In contrast, at the low level of abstraction, even our state-of-the-art model supports only
one-bounce indirect illumination [Srinivasan et al., 2021], and most physically-based
models including Zhang et al. [2021c] consider only direct illumination.
In Chapter 3, we study the middle level of abstraction using the light transport
function 𝑇 . We first learn to interpolate 𝑇 in only the light direction, by observing
sparse samples of 𝑇 [Sun et al., 2020]. Although such interpolation supports only
relighting (i.e., not view synthesis), this approach has the advantages of being purely
image-based and requiring no 3D modeling. With the additional input of geometry
proxy, we continue exploring the interpolation of 𝑇 in both light and view directions,
thereby enabling simultaneous relighting and view synthesis [Zhang et al., 2021b].

1.2.3 Shape Reconstruction

Reconstructing 3D shapes from images is an important subproblem within inverse rendering. It has wide applications in robotics, autonomous driving, Virtual/Augmented
Reality (VR/AR), etc. To name a few example applications, a robotic system often
needs to understand the 3D shape of an object before being able to manipulate (e.g.,
grasp) it; driver-less cars need 3D understanding to avoid obstacles; an AR system
has to know the shape of a desk before allowing a user to place a virtual object on it.
Computer vision, graphics, and cognitive science researchers have been working
on shape reconstruction for decades, with a series of notable “Shape From X” works.
Shape From Shading attempts to recover the shape of a surface from the shading vari-
ation [Horn, 1975]. When multiple images lit by different light sources are available,
Photometric Stereo² performs shape from shading more robustly [Woodham, 1981].
Shape From Texture aims to recover surface geometry from an image of texture often
using the prior that texture should be roughly regular [Witkin, 1981]. Depth From
Defocus infers depth from the strong depth cue of blur [Pentland, 1987]. Multi-View
Stereo reconstructs 3D shape from multi-view images of the object [Seitz et al., 2006].
Structure From Motion aims to recover both the 3D geometry and camera poses from
a series of images [Ullman, 1979].
Made possible by recent advances in deep learning, single-image 3D shape recon-
struction concerns estimating the 3D shape from just a single generic image (i.e.,
not necessarily an image of just texture) by learning category-specific (e.g., chairs)
priors from a large-scale dataset of shapes. Chapter 4 studies two such problems.
²The word “stereo” again implies the duality between lights and cameras [Szeliski, 2010] as discussed in Section 1.1.4.

The first problem that Chapter 4 tackles is that, when trained with a supervised loss,
the reconstruction network tends to produce blurry “mean shapes” that satisfy the
ℓ2 loss but do not look realistic [Wu et al., 2018]. The second problem addressed by
Chapter 4 is the generalizability of these reconstruction networks: They work well
only on the shape categories seen during training but generalize poorly to novel shape
classes, still “retrieving” shapes from the training classes [Zhang et al., 2018b].
Operating at a high level of abstraction, all solutions proposed in Chapter 4 treat
rendering as a black box and invert it directly with deep learning models that learn
data-driven priors. These models based on the high level of abstraction rely on data
rather than physics and have the advantage of being applicable to single images (cf.
multiple images as required by the mid- and low-level abstractions).

1.2.4 Lighting Recovery

Recovering lighting from the scene or object appearance is a challenging subproblem of inverse rendering that has wide applications. The most relevant applications of
lighting recovery are arguably in AR. For instance, when an AR user wants to insert
a virtual object into their scene, it is crucial to have the target lighting recovered so that the virtual object can be lit properly by the same lighting and appear consistent with the real scene. Similarly, in future AR communication systems (de-
veloped hopefully not due to another pandemic), Alice needs to relight the teleported
Bob using Alice’s lighting, which needs recovering, for a photorealistic face-to-face
experience.
In Chapter 5 of this dissertation, we aim to recover lighting that is responsible for
the appearance of the illuminated object as captured in a single image. Specifically,
we study this problem in a special Moon-Earth setup where the Earth serves as the
light source that we aim to recover, and the Moon is the Earth-lit object that we
observe. Note that in reality, the Sun is the light source emitting light that travels
to the Earth and bounces off to the dark side of the Moon (whose bright side gets
directly lit by the Sun), and we are simplifying the setup by removing the Sun and
making the Earth emissive. At the current stage of this work in progress, we perform
all of our experiments on simulated data, and testing the model on real captured data
remains future work.
As alluded to previously, Chapter 5 continues to stay at the high level of abstrac-
tion. Specifically, we train a conditional generative model to directly “regress” the
Earth image from the Moon observation and the timestamp. Our data-driven solu-
tion circumvents the need to model the image formation process for this extreme case
and enables lighting recovery from single images.

1.3 Dissertation Structure

The overarching theme of this dissertation is recovering shape, reflectance, and illumi-
nation from appearance. We study four instances of inverse rendering: I) “shape,
reflectance, and illumination from appearance” in Chapter 2, II) “light transport func-
tion from appearance” in Chapter 3, III) “shape from appearance” in Chapter 4, and
IV) “lighting from appearance” in Chapter 5.
These four subtopics represent three levels of abstraction to tackle inverse
rendering: I) the low level of abstraction where we explicitly solve for every term—
shape, reflectance, and illumination—in the rendering equation (Equation 1.1) in a
physically-based manner, achieving full editability and exportability that a mid- or
high-level solution is incapable of, II) the middle level where we utilize the light
transport function (𝑇 in Equation 1.2) to abstract away intermediate light transport
and focus on just the final “net effect,” delivering high-quality relighting results with
global illumination effects for challenging reflectance (such as that of human skin),
and III) the high level where we treat rendering as a black box and invert it with
data-driven priors, supporting single-image input at test time.
In Chapter 1 (this chapter), we have introduced the image formation process by
explaining the four main scene elements (i.e., shape, materials, lighting, and cameras),
their representations in computer vision and graphics, and the rendering process that
“combines” these elements into images that we see. We then defined the problem of in-
verting the image formation process, where we aim to recover the scene elements from
image observations passively. Specifically, we have provided the problem statements
for the aforementioned four instances of inverse rendering.

At the low level of abstraction, Chapter 2 presents physically-based models that solve for every term in the rendering equation (Equation 1.1), recovering shape,
reflectance, and illumination from multi-view images of the object and their camera
poses. This is the right level of abstraction when we need to further edit the estimated
shape or material and export them into common graphics pipelines. The two setups
that we consider are multiple arbitrary but known lighting conditions and a single
unknown lighting condition. For the former case, we develop Neural Reflectance
and Visibility Fields (NeRV) that estimates the shape and reflectance of an object
while modeling one-bounce indirect illumination [Srinivasan et al., 2021]. Under the
latter setup, we present Neural Factorization of Shape and Reflectance (NeRFactor)
capable of jointly estimating shape, reflectance, and the unknown lighting [Zhang
et al., 2021c].

At the middle level of abstraction, Chapter 3 utilizes the light transport func-
tion, 𝑇 in Equation 1.2, to abstract away intermediate light bounces and model
directly the “net effect” radiance. Specifically, we attempt to interpolate 𝑇 from the
sparse samples thereof. This is the right abstraction level for our problem, at which we
can perform high-quality relighting and view synthesis including global illumination
effects, without having to explicitly solve for geometry and reflectance or simulate all
light bounces. We first interpolate the light transport function in just the light direc-
tion, achieving precise, high-frequency portrait relighting with a model that we call
Light Stage Super-Resolution (LSSR) [Sun et al., 2020]. With the additional input of
a geometry proxy, we then develop Neural Light Transport (NLT) that interpolates
𝑇 in both the light and view directions, enabling simultaneous relighting and view
synthesis of humans with complex geometry and reflectance [Zhang et al., 2021b].

At the high level of abstraction, Chapter 4 studies shape from appearance by treating rendering as a black box and inverting it with a data-driven machine learn-
ing approach. Specifically, we consider single-image 3D shape reconstruction, where
neural networks learn a direct mapping from a single image to the 3D shape therein
using data-driven shape priors. We first present ShapeHD, a model that achieves
high-quality reconstruction with an adversarially learned perceptual loss [Wu et al.,
2018]. Tackling the generalization problem of ShapeHD and similar learning mod-
els, we then propose Generalizable Reconstruction (GenRe) capable of generalizing
to novel shape categories unseen during training [Zhang et al., 2018b]. Finally, we
briefly discuss how Pix3D—our own real-world dataset of image-shape pairs with
pixel-level alignment—is constructed [Sun et al., 2018b] and facilitates the evaluation
of ShapeHD and GenRe.
Staying at the high level of abstraction, Chapter 5 presents the current progress
of our work on data-driven lighting recovery from appearance, where we train a con-
ditional generative model of possible lighting patterns given various appearances of
the object illuminated. We frame this problem in a special Moon-Earth setup where
the Earth, as the light source, illuminates the dark side of the Moon. Our model,
Generative Adversarial Network for the Earth (EarthGAN), aims to recover the
Earth appearance given a single-pixel Moon appearance and the corresponding times-
tamp. This is the proper level of abstraction that circumvents the need to model this
extreme image formation process and makes EarthGAN applicable to a “backyard
image” taken by a mobile phone camera.
Finally, Chapter 6 concludes the dissertation and discusses future directions.

This dissertation draws on multiple collaborative projects that I either led or participated in. Here we list all the publications involved and organize them by level
of abstraction and chapter (the asterisk indicates equal contribution):

Low Level of Abstraction (Chapter 2)

• Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural Reflectance and Visibil-
ity Fields for Relighting and View Synthesis. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2021.

• Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, and Jonathan T. Barron. NeRFactor: Neural Factorization of
Shape and Reflectance Under an Unknown Illumination. ACM Transactions
on Graphics (TOG), TBA, 2021.

Middle Level of Abstraction (Chapter 3)

• Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhe-
mann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, and Ravi Ramamoor-
thi. Light Stage Super-Resolution: Continuous High-Frequency Relighting.
ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.

• Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Ro-
hit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul
Debevec, Jonathan T. Barron, Ravi Ramamoorthi, and William T. Freeman.
Neural Light Transport for Relighting and View Synthesis. ACM Transactions
on Graphics (TOG), 40(1):1–17, 2021.

High Level of Abstraction (Chapter 4 & Chapter 5)

• Xingyuan Sun*, Jiajun Wu*, Xiuming Zhang, Zhoutong Zhang, Chengkai Zh-
ang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3D:
Dataset and Methods for Single-Image 3D Shape Modeling. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

• Jiajun Wu*, Chengkai Zhang*, Xiuming Zhang, Zhoutong Zhang, William T. Freeman, and Joshua B. Tenenbaum. Learning 3D Shape Priors for Shape
Completion and Reconstruction. In European Conference on Computer Vision
(ECCV), 2018.

• Xiuming Zhang*, Zhoutong Zhang*, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman, and Jiajun Wu. Learning to Reconstruct Shapes From Un-
seen Classes. In Advances in Neural Information Processing Systems (NeurIPS),
2018.

• Xiuming Zhang and William T. Freeman. Data-Driven Lighting Recovery: A Moon-Earth Case Study. Work in Progress, 2021.

Chapter 2

Low-Level Abstraction:
Physically-Based Appearance
Factorization

In this chapter, we model appearance at a low level of abstraction, explicitly solving for every term in the rendering equation (Equation 1.1). Specifically, we address the
problem of estimating shape, Spatially-Varying Bidirectional Reflectance Distribution
Functions (SVBRDFs), and direct or indirect illumination from multi-view images of
an object lit by a single unknown or multiple arbitrary but known lighting conditions.
With our appearance factorization, one is able to synthesize the object appearance
from a novel viewpoint under any arbitrary lighting. Crucially, our approaches explic-
itly model visibility and are therefore able to not only remove shadows from albedo
during training but also render soft and hard shadows under novel test lighting.
We start with an introduction of inverse rendering (Section 2.1) and then review
the related work in Section 2.2. Next, we present Neural Reflectance and Vis-
ibility Fields (NeRV), which is capable of jointly estimating, from scratch, shape,
SVBRDFs, and indirect illumination from multi-view images of an object lit by mul-
tiple arbitrary (but known) lighting conditions (Section 2.3) [Srinivasan et al., 2021].
To relax the capture requirement of multiple known illuminations in NeRV, we fur-
ther devise Neural Factorization of Shape and Reflectance (NeRFactor) that
factorizes the object appearance into shape, SVBRDFs, and direct illumination from
multi-view images of an object lit by just one unknown lighting condition (Section 2.4)
[Zhang et al., 2021c].
In Section 2.5, we describe our experiments that evaluate how well NeRV and
NeRFactor perform appearance decomposition (and subsequently free-viewpoint re-
lighting), and how they compare with the existing solutions to our tasks, under two
setups: multiple arbitrary but known lighting conditions (for NeRV) and one un-
known lighting condition (for NeRFactor). We also perform additional analyses, in
Section 2.6, to study the importance of each major component of the NeRV and NeR-
Factor models and analyze whether NeRFactor predicts albedo consistently for the
same object when lit by different lighting conditions.

2.1 Introduction

Recovering an object’s geometry and material properties from captured images, such
that it can be rendered from arbitrary viewpoints under novel lighting conditions,
is a longstanding problem within computer vision and graphics. In addition to its
importance for recognition and robotics, a solution to this could democratize 3D
content creation and allow anyone to use real-world objects in Extended Reality (XR)
applications, film-making, and game development. The difficulty of this problem
stems from its fundamentally underconstrained nature, and prior work has typically
addressed this either by using additional observations such as scanned geometry or
images of the object under controlled laboratory lighting conditions, or by making
restrictive assumptions such as assuming a single material for the entire object or
ignoring self-shadowing.
The vision and graphics communities have recently made substantial progress
towards the novel view synthesis portion of this goal. Neural Radiance Fields (NeRF)
has shown that it is possible to synthesize photorealistic images of scenes by training
a simple neural network to map 3D locations in the scene to a continuous field of
volume density and color [Mildenhall et al., 2020]. Volume rendering is trivially
differentiable, so the parameters of a NeRF can be optimized for a single scene by
using gradient descent to minimize the difference between renderings of the NeRF
and a set of observed images. Although NeRF produces compelling results for view
synthesis, it does not provide a solution for relighting. This is because NeRF models
just the amount of outgoing light from a location – the fact that this outgoing light
is the result of interactions between incoming light and the material properties of an
underlying surface is ignored.

At first glance, extending NeRF to enable relighting appears to require only chang-
ing the image formation model: Instead of modeling scenes as fields of density and
view-dependent color, we can model surface normals and material properties (e.g.,
the parameters of a Bidirectional Reflectance Distribution Function [BRDF]), and
simulate the transport of the scene’s light sources (which we first assume are known)
according to the rules of physically-based rendering [Pharr et al., 2016]. However,
simulating the attenuation and reflection of light by particles is fundamentally chal-
lenging in NeRF’s neural volumetric representation because content can exist any-
where within the scene, and determining the density at any location requires querying
a neural network.

Consider the naïve procedure for computing the radiance along a single camera
ray due to direct illumination, as illustrated in Figure 2-1: First, we query NeRF’s
Multi-Layer Perceptron (MLP) for the volume density at samples along the cam-
era ray to determine the amount of light reflected by particles at each location that
reaches the camera. For each location along the camera ray, we then query the MLP
for the volume density at densely sampled points between the location and every light
source to estimate the attenuation of light before it reaches that location. This proce-
dure quickly becomes prohibitively expensive if we want to model environment light
sources or global illumination, in which case scene points may be illuminated from
all directions. Prior methods for estimating relightable volumetric representations
from images have not overcome this challenge and can only simulate direct illumina-
tion from a single point light source when training. This is what we refer to as the
“computational complexity problem” of extending NeRF for relighting.

Figure 2-1: How NeRV reduces the computational complexity. Brute-force light transport simulation through NeRF's volumetric representation with naïve raymarching (left) is intractable. By approximating visibility with a neural visibility field (right) that is optimized alongside the shape MLP, we are able to make optimization with complex illumination tractable. 𝑛 is the number of samples along each ray, ℓ is the number of light sources, and 𝑑 is the number of indirect illumination directions sampled. Black dots represent evaluating a shape MLP for volume density at a position, red arrows represent evaluating the visibility MLP at a position along a direction, and the blue arrow represents evaluating the visibility MLP for the expected termination depth of a ray. Output visibility multipliers and termination depths from the visibility MLP are displayed as text. (Panels: Naïve vs. Ours; rows: Direct and One-Bounce Indirect.)

The problem of efficiently computing visibility is well explored in the graphics literature. In standard raytracing graphics pipelines, where the scene geometry is
fixed and known ahead of time, a common solution is to precompute a data structure
that can be efficiently queried to obtain the visibility between pairs of scene points or
between scene points and light sources. This can be accomplished with approaches
including octrees [Samet, 1989], distance transforms [Cohen and Sheffer, 1994], or
bounding volume hierarchies [Pharr et al., 2016]. However, these existing approaches
do not provide a solution to our task: Our geometry is unknown, and our model’s
estimate of geometry changes constantly as it is optimized. Although conventional
data structures could perhaps be used to accelerate rendering after optimization is
complete, we need to efficiently query the visibility between points during optimiza-
tion, and existing solutions are prohibitively expensive to rebuild after each training
iteration (of which there may be millions).

In the first half of this chapter, we present Neural Reflectance and Visibility
Fields (NeRV), an approach for estimating a volumetric 3D representation from
images of a scene under multiple arbitrary but known lighting conditions [Srinivasan
et al., 2021], such that novel images can be rendered from arbitrary unseen viewpoints
and under novel unobserved lighting conditions, as shown in Figure 2-2.

(a) Input images of the scene under unconstrained varying (known) lighting conditions

(b) Output renderings from novel viewpoints and lighting conditions


Figure 2-2: Example input and output of NeRV. We optimize a NeRV 3D represen-
tation from multi-view images of a scene illuminated by known but unconstrained
lighting. Our NeRV representation can be rendered from novel views under arbitrary
novel lighting conditions. Here we visualize example input data and renderings for
two scenes. The first two output rendered images for each scene are from the same
viewpoint, each illuminated by a point light at a different location, and the last image
is from a different viewpoint under a random colored illumination.

NeRV can simulate realistic environment lighting and global illumination. Our
key insight is to train an MLP to act as a lookup table into a visibility field during
rendering. Instead of estimating light or surface visibility at a given 3D position
along a given direction by densely evaluating an MLP for the volume density along
the corresponding ray (which would be prohibitively expensive), we simply query
this visibility MLP to estimate visibility and expected termination depth in any di-
rection (see Figure 2-1). This visibility MLP is optimized alongside the MLP that
represents volume density and supervised to be consistent with the volume density
samples observed during optimization. Using this neural approximation of the true
visibility field significantly eases the computational burden of estimating volume ren-
dering integrals while training. NeRV enables the recovery of a NeRF-like model
that supports relighting in addition to view synthesis. While previous solutions for
relightable NeRFs are limited to controlled settings that require the input images be
illuminated by a single point light [Bi et al., 2020a], NeRV supports training with
arbitrary environment lighting and one-bounce indirect illumination.
In the second half of this chapter, we continue to investigate whether we can
achieve what NeRV accomplishes but with just one unknown illumination, a setup
often encountered when the user wants to capture daily in-the-wild objects. To this
end, we develop Neural Factorization of Shape and Reflectance (NeRFactor),
a model capable of recovering convincing relightable representations from images of
an object captured under one unknown natural illumination condition [Zhang et al.,
2021c], as shown in Figure 2-3.

Figure 2-3: NeRFactor overview. Given a set of posed images of an object captured from multiple views under just one unknown illumination condition (left), NeRFactor is able to factorize the scene into 3D neural fields of surface normals, light visibility, albedo, and material (center), which enables applications such as free-viewpoint relighting and material editing (right). (Panels: Real-World Capture: posed multi-view images under an unknown illumination; NeRFactor: normals, visibility, albedo, BRDF; Applications: free-viewpoint relighting and material editing.)

Our key insight is that we can first optimize a NeRF [Mildenhall et al., 2020] from
the input images to initialize our model’s surface normals and light visibility, and then
jointly optimize these initial estimates along with the spatially-varying reflectance and
the lighting condition, such that these estimates, when re-rendered, match the ob-
served images. The use of NeRF to produce a high-quality geometry initialization
helps break the inherent ambiguities among shape, reflectance, and lighting, thereby
allowing us to recover a full 3D model for convincing view synthesis and relight-
ing using just a re-rendering loss, simple spatial smoothness priors for each of these
components, and a novel data-driven BRDF prior. Because NeRFactor models light
visibility explicitly and efficiently, it is capable of removing shadows from albedo esti-
mation and synthesizing realistic soft or hard shadows under arbitrary novel lighting
conditions.

Different from NeRV, NeRFactor addresses the computational complexity problem by using a “hard surface” approximation of the NeRF geometry, where we only perform
shading calculations at a single point along each ray, corresponding to the expected
termination depth along the ray. Besides the computational complexity problem,
there is also the “noisy geometry problem” in extending NeRF for relighting: The
geometry estimated by NeRF contains extraneous high-frequency content that, while
unnoticeable in view synthesis results, introduces high-frequency artifacts into the
surface normals and light visibility computed from NeRF’s geometry. This issue per-
sists in many NeRF-based models including NeRV. NeRFactor addresses this noisy
geometry problem by representing the surface normal and light visibility at any 3D
location on this surface as continuous functions parameterized by MLPs, and by encouraging these functions to produce values that are spatially smooth and stay close to
those derived from the pretrained NeRF.

Thus, NeRFactor decomposes the observed images into estimated environment lighting and a 3D surface representation of the object including surface normals, light
visibility, albedo, and spatially-varying BRDFs. This enables us to render novel views
of the object under arbitrary novel environment lighting. In summary, NeRFactor
makes the following technical contributions:

• a method for factorizing images of an object under an unknown lighting condition into shape, reflectance, and direct illumination, thereby supporting free-
viewpoint relighting with shadows and material editing,
• a strategy to distill the NeRF-estimated volume density into surface geometry
(with normals and visibility) to use as an initialization when improving the
geometry and recovering reflectance, and
• novel data-driven BRDF priors based on training a latent code model on real
measured BRDFs.

Input & Output The input to NeRV is a set of multi-view images of an object
illuminated under multiple arbitrary but known lighting conditions, while NeRFactor
requires only one unknown lighting condition. Both methods require the camera poses
of these images, which can be obtained with an off-the-shelf Structure From Motion
(SFM) package, such as COLMAP [Schönberger and Frahm, 2016]. Both methods
jointly estimate a plausible collection of surface normals, light visibility, albedo, and
spatially-varying BRDFs, which together explain the observed views. NeRFactor
additionally estimates the environment lighting. We then use the recovered geome-
try and reflectance to synthesize images of the object from novel viewpoints under
arbitrary lighting. Modeling visibility explicitly, both methods are able to remove
shadows from albedo and synthesize soft or hard shadows under arbitrary lighting.

Assumptions NeRFactor considers objects to be composed of hard surfaces with a single intersection point per ray, so volumetric light transport effects such as scattering, transparency, and translucency are not modeled. NeRV, however, utilizes this “hard surface” assumption only sparingly, to speed up the modeling of one-bounce
indirect illumination. In contrast, NeRFactor models only direct illumination since
doing so simplifies computation, and under unknown lighting, we expect most of the
usable signals to be from direct illumination. Finally, our reflectance models con-
sider materials with achromatic specular reflectance (dielectrics), so we do not model
metallic materials (though one can easily extend our models to handle them by additionally predicting a specular color for each surface point).

2.2 Related Work

NeRV and NeRFactor both tackle the problem of inverse rendering, whose literature is
reviewed in Section 2.2.1. We also review the coordinate-based neural object or scene
representation, which is fundamental to both works, in Section 2.2.2. Section 2.2.3
surveys precomputation in computer graphics, which motivates the fast “visibility
lookup” in both NeRV and NeRFactor. Finally, because our models can be applied
to perform object capture for downstream graphics applications, we also review prior
art on material capture in Section 2.2.4.

2.2.1 Inverse Rendering

Intrinsic image decomposition aims to attribute what aspects of an image are due to
material, lighting, or geometric variation [Horn, 1970, Land and McCann, 1971, Horn,
1974, Barrow and Tenenbaum, 1978]. The more general problem that additionally
involves non-Lambertian reflectance, global illumination, etc. is often referred to as
inverse rendering [Sato et al., 1997, Marschner, 1998, Yu et al., 1999, Ramamoorthi
and Hanrahan, 2001]. In other words, the goal of inverse rendering is to factorize the
appearance of an object in observed images into the underlying geometry, material
properties, and lighting conditions. It is a longstanding problem in computer vision
and graphics, the difficulty of which (a consequence of its underconstrained nature) is
typically addressed using one of the following strategies: I) learning priors on shape,
illumination, and reflectance, II) assuming known geometry, or III) using multiple
input images of the scene under one or multiple lighting conditions.
Most recent single-image inverse rendering methods [Barron and Malik, 2014, Li
et al., 2018, Yu and Smith, 2019, Sengupta et al., 2019, Li et al., 2020c, Wei et al.,
2020, Sang and Chandraker, 2020] belong to the first category and use large datasets
of images with labeled geometry and materials to train machine learning models to
predict these properties. Most prior works in inverse rendering that recover full 3D
models for graphics applications [Weinmann and Klein, 2015] fall under the second
category and use 3D geometry obtained from active scanning [Park et al., 2020,
Schmitt et al., 2020, Zhang et al., 2021b], proxy models [Dong et al., 2014, Chen
et al., 2020, Gao et al., 2020], silhouette masks [Oxholm and Nishino, 2014, Xia et al.,
2016], or multi-view stereo [Nam et al., 2018] as a starting point before recovering
reflectance and refining geometry.
Both NeRV and NeRFactor belong to the third category: We only require as input
posed images of an object under one unknown or multiple known lighting conditions.
The most relevant prior works are Deep Reflectance Volumes (DRV) that estimates
voxel geometry and BRDF parameters [Bi et al., 2020b], and the follow-up work
Neural Reflectance Fields that replaces DRV’s voxel grid with a continuous volume
represented by a Multi-Layer Perceptron (MLP) [Bi et al., 2020a]. NeRV extends
Neural Reflectance Fields (which requires that scenes be illuminated by only a single point light at a time, due to its brute-force visibility computation strategy visualized in Figure 2-1, and which models only direct illumination) to work with arbitrary lighting and global illumination.

2.2.2 Coordinate-Based Neural Representations

We build upon a recent trend within the computer vision and graphics communities
that replaces traditional shape representations such as polygon meshes or discretized
voxel grids with MLPs that represent geometry as parametric functions. These MLPs
are optimized to approximate continuous 3D geometry by mapping 3D coordinates
to properties of an object or scene (such as volume density, occupancy, or signed dis-
tance) at that location. This strategy has been explored for the tasks of representing
shape [Genova et al., 2019, Mescheder et al., 2019, Park et al., 2019a, Deng et al.,
2020, Sitzmann et al., 2020, Tancik et al., 2020] and scenes under fixed lighting for
view synthesis [Niemeyer et al., 2019, Sitzmann et al., 2019b, Mildenhall et al., 2020,
Liu et al., 2020, Yariv et al., 2020].
As one such coordinate-based representation, Neural Radiance Fields (NeRF) has
been particularly successful for optimizing volumetric geometry and appearance from
observed images for the purpose of rendering photorealistic novel views [Mildenhall
et al., 2020]. It can be thought of as a modern neural reformulation of the clas-
sic problem of scene reconstruction: given multiple images of a scene, inferring the
underlying geometry and appearance that best explain those images. While classic
approaches have largely relied on discrete representations such as textured meshes
[Hartley and Zisserman, 2004, Snavely et al., 2006] and voxel grids [Seitz and Dyer,
1999], NeRF has demonstrated that a continuous volumetric function, parameterized
as an MLP, is able to represent complex scenes and render photorealistic novel views.
NeRF works well for view synthesis, but it does not enable relighting because it has
no mechanism to disentangle the outgoing radiance of a surface into an incoming
radiance and an underlying surface material.

One technique that has been used for extending NeRF to support relighting is
conditioning the MLP’s output appearance on a latent code that encodes a per-image
lighting, as in NeRF in the Wild [Martin-Brualla et al., 2021] (and previously with
discretized scene representations [Meshry et al., 2019, Li et al., 2020b]). Although
this strategy can effectively explain the appearance variation of training images, it
cannot be used to render the same scene under new lighting conditions not observed
during training (Figure 2-13), since it does not utilize the physics of light transport.

Very recently, several physically-based approaches extend NeRF's neural representation to enable relighting [Bi et al., 2020a, Boss et al., 2021, Zhang et al., 2021a].
NeRV and NeRFactor differ from Bi et al. [2020a] in that we do not require images
be captured under multiple known One-Light-at-A-Time (OLAT) lighting conditions:
NeRV handles arbitrary, non-OLAT environment lighting, and NeRFactor deals with
one unknown arbitrary lighting condition. The methods of Boss et al. [2021] and
Zhang et al. [2021a] can work with the same casual capture setup as in NeRFactor
(i.e., one arbitrary unknown lighting condition), but both crucially do not consider
light visibility and are thus unable to simulate lighting occlusion or shadowing effects.
NeRV and NeRFactor use the estimated geometry to model accurate high-frequency
shadowing and lighting occlusion.

NeRV uses the same volumetric shape representation as NeRF. On the other hand,
NeRFactor continues with the coordinate-based neural representation, but shows that
starting with the NeRF volume and then optimizing a surface representation enables
us to recover a fully-factorized and high-quality 3D model using just images captured
under one unknown illumination. Crucially, using a neural volumetric representation
to estimate the initial geometry enables us to recover factored models for objects that
have proven to be challenging for traditional geometry estimation methods.

2.2.3 Precomputation in Computer Graphics

NeRV is inspired by a long line of work in graphics that explores precomputation [Sloan et al., 2002, Ritschel et al., 2007] and approximation [Bunnell, 2004, Green
et al., 2007, Ritschel et al., 2008, 2009] strategies to efficiently compute global illu-
mination in physically-based rendering. Our neural visibility fields can be thought
of as a neural analogue to visibility precomputation techniques and is specifically de-
signed for use in our neural inverse rendering setting where geometry is dynamically
changing during optimization.

2.2.4 Material Acquisition

A large body of work within the computer graphics community has focused on the
specific subproblem of material acquisition, where the goal is to estimate BRDF
properties from images of materials with known (typically planar) geometry. These
methods have traditionally utilized a signal processing reconstruction strategy, and
used complex controlled camera and lighting setups to adequately sample the BRDF
[Foo, 2015, Matusik et al., 2003, Nielsen et al., 2015]. More recent methods have en-
abled material acquisition from more casual smartphone setups [Aittala et al., 2015,
Hui et al., 2017]. However, this line of work generally requires the geometry be simple
and fully known, while we focus on a more general problem where our only observa-
tions are images of an object with complex shape and spatially-varying reflectance
(plus the environment lighting for NeRV).

2.3 Method: Multiple Known Illuminations

We extend Neural Radiance Fields (NeRF) [Mildenhall et al., 2020] to include the
simulation of light transport, which allows NeRFs to be rendered under arbitrary
novel illumination conditions. Instead of modeling a scene as a continuous 3D field of
particles that absorb and emit light as in NeRF, we represent a scene as a 3D field of
oriented particles that absorb and reflect the light emitted by external light sources
(Section 2.3.2). Naïvely simulating light transport through this model is inefficient
and unable to scale to simulate realistic lighting conditions or global illumination. We
remedy this by introducing a neural visibility field representation (optimized alongside
NeRF’s volumetric representation) that allows us to efficiently query the point-to-light
and point-to-point visibilities needed to simulate light transport (Section 2.3.3). The
resulting Neural Reflectance and Visibility Fields (NeRV) [Srinivasan et al., 2021] are
visualized in Figure 2-4.

Figure 2-4: Example decomposition of NeRV. Given any continuous 3D location as input, such as the point at the cyan “x” (a), NeRV outputs the volume density and the visibility (b) to a spherical environment map surrounding the scene, which is multiplied by the direct illumination (c) at that point and added to the estimated indirect illumination (d) at that point to determine the full incident illumination. This is then multiplied by the predicted BRDF (e) and integrated over all incoming directions to determine the outgoing radiance at that point. In the bottom row, we visualize these outputs for the full rendered image: surface normals (f) and BRDF parameters for diffuse albedo (g) as well as specular roughness (h). We can use the predicted visibilities to compute the fraction of the total illumination that is actually incident at any location, visualized as a shadow map (i). We also show the same rendered viewpoint if it were lit by only direct (j) and indirect illumination (k). (Panels: (a) Our Rendered Image (Novel View and Lighting); (b) Light Visibility; (c) Incident Direct Illumination; (d) Incident Indirect Illumination; (e) BRDF; (f) Normals; (g) Albedo; (h) Roughness; (i) Shadow Map; (j) Direct; (k) Indirect.)

2.3.1 Neural Radiance Fields (NeRF)

NeRF represents a scene as a continuous function, parameterized by a “radiance” Multi-Layer Perceptron (MLP) whose input is a 3D position and viewing direction,
and whose output is the volume density 𝜎 and radiance 𝐿𝑒 (RGB color) emitted by
particles at that location along that viewing direction. NeRF uses standard emission-
absorption volume rendering [Kajiya and Herzen, 1984] to compute the observed
radiance 𝐿(c, 𝜔𝑜 ) (the rendered pixel color) at camera location c along direction 𝜔𝑜
as the integral of the product of three quantities at any point x(𝑡) = c − 𝑡𝜔𝑜 along
the ray: the visibility 𝑉 (x(𝑡), c), which indicates the fraction of emitted light from
position x(𝑡) that reaches the camera at c, the density 𝜎(x(𝑡)), and the emitted
radiance 𝐿𝑒 (x(𝑡), 𝜔𝑜 ) along the viewing direction 𝜔𝑜 :
\[
L(\mathbf{c}, \omega_o) = \int_0^\infty V(\mathbf{x}(t), \mathbf{c}) \, \sigma(\mathbf{x}(t)) \, L_e(\mathbf{x}(t), \omega_o) \, dt,
\tag{2.1}
\]
\[
V(\mathbf{x}(t), \mathbf{c}) = \exp\left(-\int_0^t \sigma(\mathbf{x}(s)) \, ds\right).
\tag{2.2}
\]

A NeRF is recovered from observed input images of a scene by sampling a batch of observed pixels, sampling the corresponding camera rays of those pixels at strat-
ified random points to approximate the above integral using numerical quadrature
[Max, 1995], and optimizing the weights of the radiance MLP via gradient descent to
minimize the error between the estimated and observed pixel colors.
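For concreteness, here is a minimal sketch of that quadrature using the standard alpha-compositing discretization of Equations 2.1 and 2.2; the helper and the toy densities and colors are illustrative, not NeRF's actual implementation.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Numerical quadrature of the volume rendering integral along one ray.

    sigmas: (N,)   volume densities at stratified samples along the ray.
    colors: (N, 3) emitted radiance L_e at those samples (toward the camera).
    deltas: (N,)   distances between adjacent samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                        # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]]) # visibility V
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                 # rendered pixel color

# Toy example: 64 samples along a ray passing through a fuzzy blob.
t = np.linspace(2.0, 6.0, 64)
sigmas = 5.0 * np.exp(-((t - 4.0) ** 2) / 0.1)
colors = np.tile(np.array([0.9, 0.4, 0.2]), (64, 1))
pixel = composite(sigmas, colors, np.full(64, t[1] - t[0]))
```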

2.3.2 Neural Reflectance Fields

A NeRF representation does not separate the effect of incident light from the material
properties of surfaces. This means that NeRF is only able to render views of a
scene under the fixed lighting condition presented in the input images – a NeRF
cannot be relit. Modifying NeRF to enable relighting is straightforward, as initially
demonstrated by the Neural Reflectance Fields work of Bi et al. [2020a]. Instead
of representing a scene as a field of particles that emit light, it is represented as a
field of particles that reflect incoming light. With this, given an arbitrary lighting

condition, we can simulate the transport of light through the volume as it is reflected
by particles until it reaches the camera with a standard volume rendering integral
[Kajiya and Herzen, 1984]:
\[
L(\mathbf{c}, \omega_o) = \int_0^{\infty} V(\mathbf{x}(t), \mathbf{c}) \, \sigma(\mathbf{x}(t)) \, L_r(\mathbf{x}(t), \omega_o) \, dt \,, \tag{2.3}
\]
\[
L_r(\mathbf{x}, \omega_o) = \int_{\mathcal{S}} L_i(\mathbf{x}, \omega_i) \, R(\mathbf{x}, \omega_i, \omega_o) \, d\omega_i \,, \tag{2.4}
\]

where the view-dependent emission term 𝐿𝑒 (x, 𝜔𝑜 ) in Equation 2.1 is replaced with
an integral, over the sphere 𝒮 of incoming directions, of the product of the incoming
radiance 𝐿𝑖 from any direction and a reflectance function 𝑅 (often called a phase
function in volume rendering) that describes how much light arriving from direction
𝜔𝑖 is reflected towards direction 𝜔𝑜 .
We follow Bi et al. [2020a] and use the standard microfacet Bidirectional Re-
flectance Distribution Function (BRDF) described by Walter et al. [2007] as the re-
flectance function, so 𝑅 at any 3D location is parameterized by a diffuse RGB albedo,
a scalar specular roughness, and a surface normal. We replace NeRF’s radiance MLP
with two MLPs: a “shape MLP” that outputs volume density 𝜎 and a “reflectance
MLP” that outputs BRDF parameters (3D diffuse albedo a and scalar roughness 𝛾)
for any input 3D point: MLP𝜃 : x → 𝜎, MLP𝜓 : x → (a, 𝛾).
Instead of parameterizing the 3D surface normal n as a normalized output of
the shape MLP, as in Bi et al. [2020a], we compute n as the negative normalized
gradient vector of the shape MLP’s output 𝜎 w.r.t. x, computed using automatic
differentiation. We further discuss this choice in Section 2.6.3.
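
A hedged JAX sketch of this normal computation, assuming a hypothetical scalar-valued wrapper sigma_fn around the shape MLP:

    import jax
    import jax.numpy as jnp

    def analytic_normal(sigma_fn, x):
        """Normal as the negative, normalized gradient of volume density w.r.t. x.

        sigma_fn: maps a (3,) position to a scalar density (wraps the shape MLP).
        x:        (3,) query position.
        """
        grad_sigma = jax.grad(sigma_fn)(x)                # d sigma / d x via autodiff
        return -grad_sigma / (jnp.linalg.norm(grad_sigma) + 1e-8)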

2.3.3 Light Transport via Neural Visibility Fields

Although modifying NeRF to enable relighting is straightforward, estimating the vol-


ume rendering integral for general lighting scenarios is computationally challenging
with a continuous volumetric representation such as NeRF. Figure 2-1 visualizes the
scaling properties that make simulating volumetric light transport particularly dif-

[Figure 2-5 labels: camera center; light source; expected termination depth $t'$ along the camera ray; estimated termination depth $t''$ along the indirect bounce ray.]

Figure 2-5: The geometry of an indirect illumination path in NeRV. The light ray
departs its source, hits 𝑥′ first, gets reflected to 𝑥, and eventually reaches the camera.

ficult. Even if we only consider direct illumination from light sources to a scene
point, a brute-force solution is already challenging for more than a single point light
source as it requires repeatedly querying the shape MLP for volume density along
paths from each scene point to each light source. Moreover, general scenes can be
illuminated by light arriving from all directions, and addressing this is imperative
to recovering relightable representations in unconstrained scenarios. Simulating even
simple global illumination in a brute-force manner is intractable: Rendering a single
ray in our scenes under one-bounce indirect illumination with brute-force sampling
would require a petaflop of computation, and we need to render roughly a billion rays
over the course of training.

We ameliorate this issue by replacing several brute-force volume density integrals


with learned approximations. We introduce a “visibility MLP” that emits an approxi-
mation of the environment lighting visibility at any input location along any input di-
rection and an approximation of the expected termination depth of the corresponding
ray: $\mathrm{MLP}_{\varphi}: (\mathbf{x}, \omega) \rightarrow (\tilde{V}_{\varphi}, \tilde{D}_{\varphi})$. When rendering, we use these MLP-approximated

quantities in place of their actual values:


\[
V(\mathbf{x}, \omega) = \exp\left( -\int_0^{\infty} \sigma(\mathbf{x} + s\omega) \, ds \right) , \tag{2.5}
\]
\[
D(\mathbf{x}, \omega) = \int_0^{\infty} \exp\left( -\int_0^{t} \sigma(\mathbf{x} + s\omega) \, ds \right) t \, \sigma(\mathbf{x} + t\omega) \, dt . \tag{2.6}
\]

In Section 2.3.5 we place losses on the visibility MLP outputs $(\tilde{V}_{\varphi}, \tilde{D}_{\varphi})$ to encourage
them to resemble the $(V, D)$ corresponding to the current state of the shape MLP.
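
For reference, a minimal JAX sketch of the brute-force quantities the visibility MLP is trained to approximate (discretizations of Equations 2.5 and 2.6); the function and argument names are illustrative, and densities are assumed to be pre-sampled at a fixed spacing:

    import jax.numpy as jnp

    def visibility_and_depth(sigmas, ts, delta):
        """Brute-force targets the visibility MLP approximates (Eqs. 2.5-2.6).

        sigmas: (S,) densities sampled along the ray x + t * omega
        ts:     (S,) sample distances t along the ray
        delta:  scalar spacing between samples
        """
        visibility = jnp.exp(-jnp.sum(sigmas) * delta)                   # Eq. 2.5
        # Transmittance up to (but excluding) each sample.
        trans = jnp.exp(-jnp.cumsum(
            jnp.concatenate([jnp.zeros(1), sigmas[:-1]]) * delta))
        weights = trans * (1.0 - jnp.exp(-sigmas * delta))
        expected_depth = jnp.sum(weights * ts)                           # Eq. 2.6
        return visibility, expected_depth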

Below, we provide a detailed walkthrough of how our Neural Visibility Field ap-
proximations simplify the volume rendering integral computation. Figure 2-5 is pro-
vided for reference. We first decompose the reflected radiance 𝐿𝑟 (x, 𝜔𝑜 ) into its direct
and indirect illumination components. Let us define 𝐿𝑒 (x, 𝜔𝑖 ) as radiance due to a
light source arriving at point x from direction 𝜔𝑖 . As defined in Equation 2.3, 𝐿(x, 𝜔𝑖 )
is the estimated incoming radiance at location x from direction 𝜔𝑖 . This means the
incident illumination 𝐿𝑖 decomposes into 𝐿𝑒 + 𝐿 (direct plus indirect light). The
shading calculation for 𝐿𝑟 then becomes:
\[
\begin{aligned}
L_r(\mathbf{x}, \omega_o) &= \int_{\mathcal{S}} \big( L_e(\mathbf{x}, \omega_i) + L(\mathbf{x}, -\omega_i) \big) R(\mathbf{x}, \omega_i, \omega_o) \, d\omega_i \\
&= \underbrace{\int_{\mathcal{S}} L_e(\mathbf{x}, \omega_i) R(\mathbf{x}, \omega_i, \omega_o) \, d\omega_i}_{\text{component due to direct lighting}}
 + \underbrace{\int_{\mathcal{S}} L(\mathbf{x}, -\omega_i) R(\mathbf{x}, \omega_i, \omega_o) \, d\omega_i}_{\text{component due to indirect lighting}} .
\end{aligned} \tag{2.7}
\]

To calculate incident direct lighting 𝐿𝑒 we must account for the attenuation of the
(known) environment map 𝐸 due to the volume density along the incident illumina-
tion ray 𝜔𝑖 :

𝐿𝑒 (x, 𝜔𝑖 ) = 𝑉 (x, 𝜔𝑖 )𝐸(x, −𝜔𝑖 ) . (2.8)

Instead of evaluating 𝑉 as another line integral through the volume, we use the
visibility MLP’s approximation 𝑉˜𝜑 . With this, our full calculation for the direct
lighting component of camera ray radiance 𝐿(c, 𝜔𝑜 ) simplifies to:
\[
\int_0^{\infty} V\big(\mathbf{x}(t), \mathbf{c}\big) \, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_{\varphi}\big(\mathbf{x}(t), \omega_i\big) \, E\big(\mathbf{x}(t), -\omega_i\big) \, R\big(\mathbf{x}(t), \omega_i, \omega_o\big) \, d\omega_i \, dt . \tag{2.9}
\]

By approximating the integrals along rays from each point on the camera ray toward
each environment direction when computing the color of a pixel due to direct lighting,
we have reduced the complexity of rendering with direct lighting from quadratic in
the number of samples per ray to linear.

Next, we focus on the more difficult task of accelerating the computation of ren-
dering with indirect lighting, for which a brute force approach would scale cubically
with the number of samples per ray. We make two approximations to reduce this
intractable computation. Our first approximation is to replace the outermost integral
(the accumulated radiance reflected towards the camera at each point along the ray)
with a single point evaluation by treating the volume as a hard surface located at the
expected termination depth 𝑡′ = 𝐷(c, −𝜔𝑜 ). Note that we do not use the visibility
MLP’s approximation of 𝑡′ here, since we are already sampling 𝜎(x) along the camera
ray. This reduces the indirect contribution of 𝐿(c, 𝜔𝑜 ) to a spherical integral at a
single point x(𝑡′ ):
\[
\int_{\mathcal{S}} L\big(\mathbf{x}(t'), -\omega_i\big) \, R\big(\mathbf{x}(t'), \omega_i, \omega_o\big) \, d\omega_i . \tag{2.10}
\]

To simplify the recursive evaluation of 𝐿 inside this integral, we limit the indirect
contribution to a single bounce, and use the hard surface approximation a second
time to replace the integral along a ray for each incoming direction:
\[
L\big(\mathbf{x}(t'), -\omega_i\big) \approx \int_{\mathcal{S}} L_e\big(\mathbf{x}'(t''), \omega_i'\big) \, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big) \, d\omega_i' \,, \tag{2.11}
\]

where $t'' = \tilde{D}_{\varphi}(\mathbf{x}(t'), \omega_i)$ is the expected intersection depth along the ray $\mathbf{x}'(t'') = \mathbf{x}(t') + t''\omega_i$ as approximated by the visibility MLP. Thus the expression for the
component of camera ray radiance 𝐿(c, 𝜔𝑜 ) due to indirect lighting is:
\[
\iint_{\mathcal{S}} L_e\big(\mathbf{x}'(t''), \omega_i'\big) \, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big) \, d\omega_i' \; R\big(\mathbf{x}(t'), \omega_i, \omega_o\big) \, d\omega_i \,, \tag{2.12}
\]

and fully expanding the direct radiance 𝐿𝑒 (x′ (𝑡′′ ), 𝜔𝑖′ ) incident at each secondary in-
tersection point gives us:
\[
\iint_{\mathcal{S}} \tilde{V}_{\varphi}\big(\mathbf{x}'(t''), \omega_i'\big) \, E\big(\mathbf{x}'(t''), -\omega_i'\big) \, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big) \, d\omega_i' \; R\big(\mathbf{x}(t'), \omega_i, \omega_o\big) \, d\omega_i . \tag{2.13}
\]

Finally, we can write out the complete volume rendering equation used by NeRV as

the sum of Equations 2.9 and 2.13:
\[
\begin{aligned}
L(\mathbf{c}, \omega_o) = {} & \int_0^{\infty} V\big(\mathbf{x}(t), \mathbf{c}\big) \, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_{\varphi}\big(\mathbf{x}(t), \omega_i\big) \, E\big(\mathbf{x}(t), -\omega_i\big) \, R\big(\mathbf{x}(t), \omega_i, \omega_o\big) \, d\omega_i \, dt \\
& + \iint_{\mathcal{S}} \tilde{V}_{\varphi}\big(\mathbf{x}'(t''), \omega_i'\big) \, E\big(\mathbf{x}'(t''), -\omega_i'\big) \, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big) \, d\omega_i' \; R\big(\mathbf{x}(t'), \omega_i, \omega_o\big) \, d\omega_i .
\end{aligned} \tag{2.14}
\]

Figure 2-1 illustrates how the approximations made by NeRV reduce the computa-
tional complexity of computing direct and indirect illumination from quadratic and
cubic (respectively) to linear. This enables the simulation of direct illumination from
environment lighting and one-bounce indirect illumination within the training loop
of optimizing a continuous relightable volumetric scene representation.

2.3.4 Rendering

To render a camera ray x(𝑡) = c − 𝑡𝜔𝑜 passing through a NeRV, we estimate the
volume rendering integral in Equation 2.14 using the following procedure:
1. We draw 256 stratified samples along the ray and query the shape and re-
flectance MLPs for the volume densities, surface normals, and BRDF parame-
ters at each point: 𝜎 = MLP𝜃 (x(𝑡)), n = ∇x MLP𝜃 (x(𝑡)), (a, 𝛾) = MLP𝜓 (x(𝑡)).
2. We shade each point along the ray with direct illumination by estimating the
integral in Equation 2.9. First, we generate 𝐸(x(𝑡), −𝜔𝑖 ) by sampling the known
environment lighting on a 12 × 24 grid of directions 𝜔𝑖 on the sphere around
each point. We then multiply this by the predicted visibility 𝑉˜𝜑 (x(𝑡), 𝜔𝑖 ) and
microfacet BRDF values 𝑅(x(𝑡), 𝜔𝑖 , 𝜔𝑜 ) at each sampled 𝜔𝑖 , and integrate this
product over the sphere to produce the direct illumination contribution.
3. We shade each point along the ray with indirect illumination by estimating the
integral in Equation 2.13. First, we compute the expected camera ray termi-
nation depth 𝑡′ = 𝐷(c, −𝜔𝑜 ) using the density samples from Step 1. Next,
we sample 128 random directions on the upper hemisphere at x(𝑡′ ) and query
the visibility MLP for the expected termination depths along each of these
rays 𝑡′′ = 𝐷
˜ 𝜑 (x(𝑡′ ), 𝜔𝑖 ) to compute the secondary surface intersection points

x′ (𝑡′′ ) = x(𝑡′ ) + 𝑡′′ 𝜔𝑖 . We then shade each of these points with direct illu-
mination by following the procedure in Step 2. This estimates the indirect
illumination incident at x(𝑡′ ), which we then multiply by the microfacet BRDF
values 𝑅(x(𝑡′ ), 𝜔𝑖 , 𝜔𝑜 ) and integrate over the sphere to produce the indirect
illumination contribution.
4. The total reflected radiance at each point along the camera ray 𝐿𝑟 (x(𝑡), 𝜔𝑜 ) is
the sum of the quantities from Steps 2 and 3. We composite these along the ray
to compute the pixel color using the same quadrature rule [Max, 1995] used in
NeRF:

\[
L(\mathbf{c}, \omega_o) = \sum_{t} V(\mathbf{x}(t), \mathbf{c}) \, \alpha\big(\sigma(\mathbf{x}(t))\delta\big) \, L_r(\mathbf{x}(t), \omega_o) \,, \tag{2.15}
\]
\[
V(\mathbf{x}(t), \mathbf{c}) = \exp\Big( -\sum_{s<t} \sigma(\mathbf{x}(s))\delta \Big) , \qquad \alpha(z) = 1 - \exp(-z) \,, \tag{2.16}
\]

where 𝛿 is the distance between samples along the ray.

2.3.5 Training & Implementation Details

Instead of directly passing 3D coordinates x and direction vectors 𝜔 to the MLPs, we


map these inputs using NeRF’s positional encoding [Mildenhall et al., 2020, Tancik
et al., 2020], with a maximum frequency of 27 for 3D coordinates and 24 for 3D
direction vectors. The shape and reflectance MLPs each use eight fully-connected
Rectified Linear Unit (ReLU) layers with 256 channels. The visibility MLP uses
eight fully-connected ReLU layers with 256 channels each to map the encoded 3D
coordinates x to an 8-dimensional feature vector, which is concatenated with the
encoded 3D direction vector 𝜔 and processed by four fully-connected ReLU layers
with 128 channels each.
We train a separate NeRV representation from scratch for each scene, which re-
quires a set of RGB images as well as their camera poses and corresponding lighting
environments. At each training iteration, we randomly sample a batch of 512 pixel
rays ℛ from the input images and use the procedure previously described to render

these pixels from the current NeRV model. We additionally sample 256 random rays
ℛ′ per training iteration that intersect the volume, and we compute the visibility and
expected termination depth at each location, in both directions along each ray, for
use as supervision for the visibility MLP. We minimize the sum of three losses:

\[
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \tau(\tilde{L}(\mathbf{r})) - \tau(L(\mathbf{r})) \big\|_2^2
+ \lambda \sum_{\mathbf{r}' \in \mathcal{R}' \cup \mathcal{R}, \, t} \Big( \big\| \tilde{V}_{\varphi}(\mathbf{r}'(t)) - V_{\theta}(\mathbf{r}'(t)) \big\|_2^2 + \big\| \tilde{D}_{\varphi}(\mathbf{r}'(t)) - D_{\theta}(\mathbf{r}'(t)) \big\|_2^2 \Big) \,, \tag{2.17}
\]

where $\tau(x) = x/(1+x)$ is a tone mapping operator [Gharbi et al., 2019], $L(\mathbf{r})$ and $\tilde{L}(\mathbf{r})$ are
the ground truth and predicted camera ray radiance values (ground-truth values are
simply the colors of input image pixels), $\tilde{V}_{\varphi}(\mathbf{r})$ and $\tilde{D}_{\varphi}(\mathbf{r})$ are the predicted visibility

and expected termination depth from our visibility MLP given its current weights 𝜑,
𝑉𝜃 (r) and 𝐷𝜃 (r) are the estimates of visibility and termination depth implied by the
shape MLP given its current weights 𝜃, and 𝜆 = 20 is the weight of the loss terms
encouraging the visibility MLP to be consistent with the shape MLP.

Note that the visibility MLP is not supervised using any “ground truth” visibility
or termination depth: It is only optimized to be consistent with the NeRV’s current
estimate of scene geometry, by evaluating Equation 2.5 and Equation 2.6 using the
densities 𝜎 emitted by the shape MLP𝜃 . We apply a stop_gradient to 𝑉𝜃 and 𝐷𝜃
in the last two terms of the loss, so the shape MLP is not encouraged to degrade its
own performance to better match the output from the visibility MLP. We implement
our model in JAX [Bradbury et al., 2018] and optimize it using Adam [Kingma
and Ba, 2015] with a learning rate that begins at 10−5 and decays exponentially to
10−6 over the course of optimization (the other Adam hyperparameters are default
values: 𝛽1 = 0.9, 𝛽2 = 0.999, and 𝜖 = 10−8 ). Each model is trained for a million
iterations using 128 Tensor Processing Unit (TPU) cores, which takes around one
day to converge.
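
A hedged JAX sketch of this loss (Equation 2.17), including the stop_gradient on the shape MLP's visibility and depth estimates; tensor names and shapes are illustrative rather than the exact implementation:

    import jax
    import jax.numpy as jnp

    def tonemap(x):
        return x / (1.0 + x)

    def nerv_loss(pred_rgb, gt_rgb, vis_pred, depth_pred, vis_shape, depth_shape, lam=20.0):
        """Reconstruction loss plus visibility-MLP consistency losses (Eq. 2.17)."""
        recon = jnp.sum((tonemap(pred_rgb) - tonemap(gt_rgb)) ** 2)
        # Targets come from the shape MLP; stop_gradient keeps the shape MLP from
        # degrading itself to better match the visibility MLP.
        vis_target = jax.lax.stop_gradient(vis_shape)
        depth_target = jax.lax.stop_gradient(depth_shape)
        consistency = (jnp.sum((vis_pred - vis_target) ** 2)
                       + jnp.sum((depth_pred - depth_target) ** 2))
        return recon + lam * consistency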

2.4 Method: One Unknown Illumination

The input to Neural Factorization of Shape and Reflectance (NeRFactor) is assumed


to be only multi-view images of an object (and the corresponding camera poses)
lit by one unknown illumination condition. NeRFactor represents the shape and
spatially-varying reflectance of an object as a set of 3D fields, each parameterized
by Multi-Layer Perceptrons (MLPs) whose weights are optimized so as to “explain”
the set of observed input images. After optimization, NeRFactor outputs the surface
normal 𝑛, light visibility in any direction 𝑣(𝜔i ), albedo 𝑎, and reflectance 𝑧BRDF
that together explain the observed appearance at any 3D location 𝑥 on the object’s
surface1 . By recovering the object’s geometry and reflectance, NeRFactor enables
applications such as free-viewpoint relighting with shadows and material editing.

2.4.1 Shape

The input to NeRFactor is the same as what is used by Neural Radiance Fields (NeRF)
[Mildenhall et al., 2020], so we can apply NeRF to our input images to compute
initial geometry. NeRF optimizes a neural radiance field: an MLP that maps from
any 3D spatial coordinate and 2D viewing direction to the volume density at that
3D location and color emitted by particles at that location along the 2D viewing
direction. NeRFactor leverages NeRF’s estimated geometry by “distilling” it into a
continuous surface representation that we use to initialize NeRFactor’s geometry. In
particular, we use the optimized NeRF to compute the expected surface location
along any camera ray, the surface normal at each point on the object’s surface, and
the visibility of light arriving from any direction at each point on the object’s surface.
This subsection describes how we derive these functions from the optimized NeRF
and how we re-parameterize them with MLPs so that they can be finetuned after this
initialization step to improve the full re-rendering loss (Figure 2-7).

1
In this section, vectors and matrices (as well as functions that return them) are in bold; scalars
and scalar functions are not.

[Figure 2-6 diagram: NeRFactor is an all-MLP surface model that predicts the surface normal 𝑛, light visibility 𝑣, albedo 𝑎, and BRDF latent code 𝑧BRDF for each surface location 𝑥surf, as well as the lighting condition (a latitude-longitude map). 𝑥 denotes 3D locations, 𝜔i represents the light direction, 𝜔o denotes the viewing direction, and 𝜑d, 𝜃h, 𝜃d are the Rusinkiewicz coordinates. NeRFactor does not require supervision on any of the intermediate factors but rather relies only on priors and a reconstruction loss. The example factorization shown visualizes visibility as the average light visibility over all incoming directions (i.e., ambient occlusion) and 𝑧BRDF as an RGB image (same colors mean same materials).]
Figure 2-6: NeRFactor model and its example output. NeRFactor is a surface model
that factorizes, in an unsupervised manner, the appearance of a scene observed under
one unknown lighting condition. It tackles this severely ill-posed problem by using a
reconstruction loss, simple smoothness regularization, and data-driven BRDF priors.
Modeling visibility explicitly, NeRFactor is a physically-based model that supports
hard and soft shadows under arbitrary lighting.

Surface Points Given a camera and a trained NeRF, we compute the location
at which a ray 𝑟(𝑡) = 𝑜 + 𝑡𝑑 from that camera 𝑜 along direction 𝑑 is expected to
terminate according to NeRF’s optimized volume density 𝜎:
\[
x_{\text{surf}} = o + \left( \int_0^{\infty} T(t) \, \sigma\big(r(t)\big) \, t \, dt \right) d \,, \qquad T(t) = \exp\left( -\int_0^{t} \sigma\big(r(s)\big) \, ds \right) , \tag{2.18}
\]

where 𝑇 (𝑡) is the probability that the ray travels distance 𝑡 without being blocked.
Instead of maintaining a full volumetric representation, we fix the geometry to lie
on this surface distilled from the optimized NeRF. This enables much more efficient
relighting during both training and inference because we can compute the outgoing
radiance just at each camera ray’s expected termination location instead of every
point along each camera ray.

Surface Normals We compute analytic surface normals 𝑛a (𝑥) at any 3D location


as the negative normalized gradient of NeRF’s 𝜎-volume w.r.t. 𝑥. Unfortunately, the
normals derived from a trained NeRF tend to be noisy (Figure 2-7) and therefore
produce “bumpy” artifacts when used for rendering (see the supplemental video).
Therefore, we re-parameterize these normals using an MLP 𝑓n , which maps from
any location 𝑥surf on the surface to a “denoised” surface normal 𝑛: 𝑓n : 𝑥surf ↦→ 𝑛.
During the joint optimization of NeRFactor’s weights, we encourage the output of
this MLP I) to stay close to the normals produced from the pretrained NeRF, II) to
vary smoothly in the 3D space, and III) to reproduce the observed appearance of the
object. Specifically, the loss function reflecting I) and II) is:

\[
\ell_{n} = \sum_{x_{\text{surf}}} \left( \frac{\lambda_1}{3} \big\| f_{\text{n}}(x_{\text{surf}}) - n_{\text{a}}(x_{\text{surf}}) \big\|_2^2 + \frac{\lambda_2}{3} \big\| f_{\text{n}}(x_{\text{surf}}) - f_{\text{n}}(x_{\text{surf}} + \epsilon) \big\|_1 \right) , \tag{2.19}
\]

where 𝜖 is a random 3D displacement from 𝑥surf sampled from a zero-mean Gaussian


with standard deviation 0.01 (0.001 for the real scenes due to the different scene
scales), and the 𝜆1 and 𝜆2 are hyperparameters set to 0.1 and 0.05, respectively.
A similar smoothness loss on surface normals is used in the concurrent work by
Oechsle et al. [2021] for the goal of shape reconstruction. Crucially, not restricting
𝑥 to the expected surface increases the robustness of the MLP by providing a “safe
margin” where the output remains well-behaved even when the input is slightly dis-
placed from the surface. As shown in Figure 2-7, NeRFactor’s normal MLP produces
normals that are significantly higher-quality than those produced by NeRF and are
smooth enough to be used for relighting (Figure 2-9).
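
A minimal JAX sketch of the normal loss in Equation 2.19, assuming a hypothetical batched normal MLP f_n and precomputed NeRF normals n_a; it is meant to illustrate the tether-plus-smoothness structure, not to reproduce the exact implementation:

    import jax
    import jax.numpy as jnp

    def normal_loss(f_n, x_surf, n_a, key, lam1=0.1, lam2=0.05, eps_std=0.01):
        """Eq. 2.19: tether the normal MLP to NeRF normals and encourage smoothness.

        f_n:    normal MLP, maps (N, 3) points to (N, 3) unit normals
        x_surf: (N, 3) expected surface points
        n_a:    (N, 3) (noisy) normals derived from NeRF's sigma-volume
        """
        eps = eps_std * jax.random.normal(key, x_surf.shape)       # random displacement
        n_pred = f_n(x_surf)
        tether = jnp.sum((n_pred - n_a) ** 2, axis=-1)             # squared L2 per point
        smooth = jnp.sum(jnp.abs(n_pred - f_n(x_surf + eps)), axis=-1)  # L1 per point
        return jnp.sum(lam1 / 3.0 * tether + lam2 / 3.0 * smooth)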

Light Visibility We compute the visibility 𝑣a to each light source from any point
by marching through NeRF’s 𝜎-volume from the point to each light location, as in Bi
et al. [2020a]. However, similarly to the surface normal estimation described above,
the visibility estimates derived directly from NeRF’s 𝜎-volume are too noisy to be used
directly and result in rendering artifacts (see the supplemental video). We address this
by re-parameterizing the visibility function as another MLP that maps from a surface
location 𝑥surf and a light direction 𝜔i to the light visibility 𝑣: 𝑓v : (𝑥surf , 𝜔i ) ↦→ 𝑣. We
optimize the weights of 𝑓v to encourage the recovered visibility field I) to be close to
the visibility traced from the NeRF, II) to be spatially smooth, and III) to reproduce
the observed appearance. Specifically, the loss function implementing I) and II) is:

\[
\ell_{v} = \sum_{x_{\text{surf}}} \sum_{\omega_{\text{i}}} \Big( \lambda_3 \big( f_{\text{v}}(x_{\text{surf}}, \omega_{\text{i}}) - v_{\text{a}}(x_{\text{surf}}, \omega_{\text{i}}) \big)^2 + \lambda_4 \big| f_{\text{v}}(x_{\text{surf}}, \omega_{\text{i}}) - f_{\text{v}}(x_{\text{surf}} + \epsilon, \omega_{\text{i}}) \big| \Big) , \tag{2.20}
\]

where 𝜖 is the random displacement defined above, and 𝜆3 and 𝜆4 are hyperparameters
set to 0.1 and 0.05, respectively.
As the equation shows, smoothness is encouraged across spatial locations given
the same 𝜔i , not the other way around. This is by design, to avoid the visibility at a
certain location getting blurred over different light locations. Note that this is similar
to the visibility fields in NeRV, but in NeRFactor we optimize the visibility MLP
parameters to denoise the visibility derived from a pretrained NeRF and minimize
the re-rendering loss. For computing the NeRF visibility, we use a fixed set of 512
light locations given a predefined illumination resolution (to be discussed later). After
optimization, 𝑓v produces spatially smooth and realistic estimates of light visibility,

as can be seen in Figure 2-7 (II) and Figure 2-8 (C) where we visualize the average
visibility over all light directions (i.e., ambient occlusion).
In practice, before the full optimization of our model, we independently pretrain
the visibility and normal MLPs to just reproduce the visibility and normal values from
the NeRF 𝜎-volume without any smoothness regularization or re-rendering loss. This
provides a reasonable initialization of the visibility maps, which prevents the albedo or
BRDF MLP from mistakenly attempting to explain away shadows as “painted-on”
reflectance variation (see “w/o geom. pretrain.” in Figure 2-19 and
Table 2.1).

2.4.2 Reflectance

Our full Bidirectional Reflectance Distribution Function (BRDF) model 𝑅 consists


of a diffuse component (Lambertian) fully determined by albedo 𝑎 and a specular
spatially-varying BRDF 𝑓r (defined for any location on the surface 𝑥surf with incoming
light direction 𝜔i and outgoing direction 𝜔o ) learned from real-world reflectance:

\[
R(x_{\text{surf}}, \omega_{\text{i}}, \omega_{\text{o}}) = \frac{a(x_{\text{surf}})}{\pi} + f_{\text{r}}(x_{\text{surf}}, \omega_{\text{i}}, \omega_{\text{o}}) . \tag{2.21}
\]

Prior art in neural rendering has explored the use of parameterizing 𝑓r with analytic
BRDFs, such as microfacet models [Bi et al., 2020a, Srinivasan et al., 2021], within
a NeRF-like setting. Although these analytic models provide an effective BRDF pa-
rameterization for optimization to explore, no prior is imposed upon the parameters
themselves: All materials that are expressible within a microfacet model are consid-
ered equally likely a priori. Additionally, the use of an explicit analytic model limits
the set of materials that can be recovered, and this is insufficient for modelling all
real-world reflectance functions.
Instead of assuming an analytic BRDF, NeRFactor starts with a learned re-
flectance function that is pretrained to reproduce a wide range of empirically observed
real-world reflectance functions while also learning a latent space for those real-world
reflectance functions. By doing so, we learn data-driven priors on real-world BRDFs

that encourage the optimization procedure to recover plausible reflectance functions.
The use of such priors is crucial: Because all of our observed images are taken under
one (unknown) illumination, our problem is highly ill-posed, so priors are necessary
to disambiguate the most likely factorization of the scene from the set of all possible
factorizations.

Albedo We parameterize the albedo 𝑎 at any location on the surface 𝑥surf as an


MLP 𝑓a : 𝑥surf ↦→ 𝑎. Because there is no direct supervision on albedo, and our model
is only able to observe one illumination condition, we rely on simple spatial smooth-
ness priors (and light visibility) to disambiguate between, e.g., the “white-painted
surface containing a shadow” case and the “black-and-white-painted surface” case. In
addition, the reconstruction loss of the observed views also drives the optimization of
𝑓a . The loss function that reflects this smoothness prior is:

\[
\ell_{a} = \lambda_5 \sum_{x_{\text{surf}}} \frac{1}{3} \big\| f_{\text{a}}(x_{\text{surf}}) - f_{\text{a}}(x_{\text{surf}} + \epsilon) \big\|_1 \,, \tag{2.22}
\]

where 𝜖 is the same random 3D perturbation as defined above, and 𝜆5 is a hyperparam-


eter set to 0.05. The output from 𝑓a is used as albedo in the Lambertian reflectance,
but not in the non-diffuse component, for which we assume the specular highlight
color to be white. We empirically constrain the albedo prediction to [0.03, 0.8] fol-
lowing Ward and Shakespeare [1998], by scaling the network’s final sigmoid output
by 0.77 and then adding a bias of 0.03.
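
A one-line sketch of this output squashing (assuming jax.nn.sigmoid as the final activation):

    import jax

    def albedo_head(raw_output):
        """Map the albedo MLP's raw output into [0.03, 0.8] via a scaled, shifted sigmoid."""
        return 0.77 * jax.nn.sigmoid(raw_output) + 0.03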

Learning Priors From Real-World BRDFs For the specular components of the
BRDF, we seek to learn a latent space of real-world BRDFs and a paired “decoder”
that translates each latent code in the learned space 𝑧BRDF to a full 4D BRDF. To this
end, we adopt the Generative Latent Optimization (GLO) approach by Bojanowski
et al. [2018], which has been previously used by other coordinate-based models such
as Park et al. [2019a] and Martin-Brualla et al. [2021].
The 𝑓r component of our model is pretrained using the MERL dataset [Ma-
tusik et al., 2003]. Because the MERL dataset assumes isotropic materials, we pa-

rameterize the incoming and outgoing directions for 𝑓r using Rusinkiewicz coordi-
nates [Rusinkiewicz, 1998] (𝜑d , 𝜃h , 𝜃d ) (three degrees of freedom) instead of 𝜔i and
𝜔o (four degrees of freedom). Denote this coordinate conversion by 𝑔 : (𝑛, 𝜔i , 𝜔o ) ↦→
(𝜑d , 𝜃h , 𝜃d ), where 𝑛 is the surface normal at that point. We train a function 𝑓r′ (a
re-parameterization of 𝑓r ) that maps from a concatenation of a latent code 𝑧BRDF
(which represents a BRDF identity) and a set of Rusinkiewicz coordinates (𝜑d , 𝜃h , 𝜃d )
to an achromatic reflectance 𝑟:

\[
f_{\text{r}}' : \big( z_{\text{BRDF}}, (\phi_{\text{d}}, \theta_{\text{h}}, \theta_{\text{d}}) \big) \mapsto r . \tag{2.23}
\]

To train this model, we optimize both the MLP weights and the set of latent codes
𝑧BRDF to reproduce the real-world BRDFs. Simple mean squared errors are computed
on the log of the High-Dynamic-Range (HDR) reflectance values to train 𝑓r′ .

Because the color component of our reflectance model is assumed to be handled


by the albedo prediction network, we discard all color information from the MERL
dataset by converting its RGB reflectance values into achromatic ones2 . The latent
BRDF identity codes 𝑧BRDF are parameterized as unconstrained 3D vectors and ini-
tialized with a zero-mean isotropic Gaussian with a standard deviation of 0.01. No
sparsity or norm penalty is imposed on 𝑧BRDF during training.

After this pretraining, the weights of this BRDF MLP are frozen during the joint
optimization of our entire model, and we predict only 𝑧BRDF for each 𝑥surf by training
from scratch a BRDF identity MLP (Figure 2-6): 𝑓z : 𝑥surf ↦→ 𝑧BRDF . This can
be thought of as predicting spatially-varying BRDFs for all the surface points in
the plausible space of real-world BRDFs. We optimize the BRDF identity MLP to
minimize the re-rendering loss and the same spatial smoothness prior as in albedo:
\[
\ell_{z} = \lambda_6 \sum_{x_{\text{surf}}} \frac{ \big\| f_{\text{z}}(x_{\text{surf}}) - f_{\text{z}}(x_{\text{surf}} + \epsilon) \big\|_1 }{ \dim(z_{\text{BRDF}}) } \,, \tag{2.24}
\]

2
In principle, one should be able to perform diffuse-specular separation on the MERL BRDFs and
then learn priors on just the specular lobes. We experimented with this idea by using the separation
provided by Sun et al. [2018a], but this yielded qualitatively worse end results.

where 𝜆6 is a hyperparameter set to 0.01, and dim(𝑧BRDF ) denotes the dimension-
ality of the BRDF latent code (3 in our implementation because there are only 100
materials in the MERL dataset).
The final BRDF is the sum of the Lambertian component and the learned non-
diffuse reflectance (subscript of 𝑥surf dropped for brevity):

\[
R(x, \omega_{\text{i}}, \omega_{\text{o}}) = \frac{f_{\text{a}}(x)}{\pi} + f_{\text{r}}'\Big( f_{\text{z}}(x), \, g\big(f_{\text{n}}(x), \omega_{\text{i}}, \omega_{\text{o}}\big) \Big) \,, \tag{2.25}
\]

where the specular highlight color is assumed to be white.

2.4.3 Illumination

We adopt a simple and direct representation of lighting: an HDR light probe image
[Debevec, 1998] in the latitude-longitude format. In contrast to spherical harmonics
or a mixture of spherical Gaussians, this representation allows our model to represent
detailed high-frequency lighting and therefore to support hard cast shadows. That
said, the challenges of using this representation are clear: It contains a large number
of parameters, and every pixel/parameter can vary independently of all other pix-
els/parameters. This issue can be ameliorated by our use of the light visibility MLP,
which allows us to quickly evaluate a surface point’s visibility to all pixels of the light
probe. Empirically, we use a 16 × 32 resolution for our lighting environments as we do
not expect to recover higher-frequency content in the light probe image beyond that
resolution (the environment is effectively low-pass filtered by each object’s BRDFs as
discussed by Ramamoorthi and Hanrahan [2004], and the objects in our datasets are
not shiny or mirror-like).
To encourage smoother lighting, we apply a simple ℓ2 gradient penalty on the
pixels of the light probe 𝐿 along both the horizontal and vertical directions:
\[
\ell_{i} = \lambda_7 \left( \Big\| \begin{bmatrix} -1 & 1 \end{bmatrix} * L \Big\|_2^2 + \Big\| \begin{bmatrix} -1 \\ 1 \end{bmatrix} * L \Big\|_2^2 \right) , \tag{2.26}
\]

where * denotes the convolution operator, and 𝜆7 is a hyperparameter set to 5 × 10−6

(given that there are 512 pixels with HDR values). During the joint optimization,
these probe pixels get updated directly by the final reconstruction loss and the gra-
dient penalty.
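
A hedged JAX sketch of this penalty, implementing the [-1, 1] convolutions of Equation 2.26 as finite differences over a (16, 32, 3) HDR probe; boundary handling and names are illustrative:

    import jax.numpy as jnp

    def light_probe_smoothness(probe, lam7=5e-6):
        """l2 penalty on horizontal and vertical gradients of a (16, 32, 3) HDR probe."""
        dh = probe[:, 1:, :] - probe[:, :-1, :]    # horizontal finite differences
        dv = probe[1:, :, :] - probe[:-1, :, :]    # vertical finite differences
        return lam7 * (jnp.sum(dh ** 2) + jnp.sum(dv ** 2))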

2.4.4 Rendering

Given the surface normal, visibility of all light directions, albedo, and BRDF at each
point 𝑥surf , as well as the estimated environment lighting, the final physically-based,
non-learnable renderer renders an image that is then compared against the observed
image. The errors in this rendered image are backpropagated up to, but excluding,
the 𝜎-volume of the pretrained NeRF, thereby driving the joint estimation of surface
normals, light visibility, albedo, BRDFs, and illumination.
Given the ill-posed nature of the problem (largely due to our only observing one
unknown illumination), we expect the majority of useful information to be from
direct illumination rather than global illumination and therefore consider only single-
bounce direct illumination (i.e., from the light source to the object surface then to
the camera). This assumption also reduces the computational cost of evaluating our
model. Mathematically, the rendering equation in our setup is (subscript of 𝑥surf
dropped again for brevity):
\[
\begin{aligned}
L_{\text{o}}(x, \omega_{\text{o}}) &= \int_{\Omega} R(x, \omega_{\text{i}}, \omega_{\text{o}}) \, L_{\text{i}}(x, \omega_{\text{i}}) \, \big( \omega_{\text{i}} \cdot n(x) \big) \, d\omega_{\text{i}} & \text{(2.27)} \\
&= \sum_{\omega_{\text{i}}} R(x, \omega_{\text{i}}, \omega_{\text{o}}) \, L_{\text{i}}(x, \omega_{\text{i}}) \, \big( \omega_{\text{i}} \cdot f_{\text{n}}(x) \big) \, \Delta\omega_{\text{i}} & \text{(2.28)} \\
&= \sum_{\omega_{\text{i}}} \left( \frac{f_{\text{a}}(x)}{\pi} + f_{\text{r}}'\Big( f_{\text{z}}(x), \, g\big(f_{\text{n}}(x), \omega_{\text{i}}, \omega_{\text{o}}\big) \Big) \right) L_{\text{i}}(x, \omega_{\text{i}}) \, \big( \omega_{\text{i}} \cdot f_{\text{n}}(x) \big) \, \Delta\omega_{\text{i}} \,, & \text{(2.29)}
\end{aligned}
\]

where 𝐿o (𝑥, 𝜔o ) is the outgoing radiance at 𝑥 as viewed from 𝜔o , 𝐿i (𝑥, 𝜔i ) is the


incoming radiance, masked by the visibility 𝑓v (𝑥, 𝜔i ), arriving at 𝑥 along 𝜔i directly
from a light probe pixel (since we consider only single-bounce direct illumination),
and ∆𝜔i is the solid angle corresponding to the lighting sample at 𝜔i .
The final reconstruction loss ℓrecon is simply the mean squared error (with a unit
weight) between the rendering and the observed image. Therefore, our full loss func-

tion is the summation of all the previously defined losses: ℓrecon + ℓn + ℓv + ℓa + ℓz + ℓi .
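
A minimal JAX sketch of the discrete sum in Equation 2.28 at a single surface point, assuming the BRDF, visibility, and light-probe samples have already been evaluated for the 512 light directions; names and shapes are illustrative:

    import jax.numpy as jnp

    def shade_point(brdf, light, visibility, normal, omega_i, d_omega):
        """Single-bounce direct illumination at one surface point (Eq. 2.28).

        brdf:       (L, 3) BRDF values R(x, omega_i, omega_o), one per light direction
        light:      (L, 3) HDR light-probe radiance from each direction
        visibility: (L,)   predicted visibility f_v(x, omega_i) in [0, 1]
        normal:     (3,)   predicted surface normal f_n(x)
        omega_i:    (L, 3) unit incoming-light directions (one per probe pixel)
        d_omega:    (L,)   solid angles of the light-probe pixels
        """
        cosines = jnp.maximum(omega_i @ normal, 0.0)        # foreshortening term
        masked_light = light * visibility[:, None]          # mask light by visibility
        return jnp.sum(brdf * masked_light * (cosines * d_omega)[:, None], axis=0)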

2.4.5 Implementation Details

NeRFactor is implemented in TensorFlow 2 [Abadi et al., 2016]. All training uses the
Adam optimizer [Kingma and Ba, 2015] with the default hyperparameters.

Staged Training There are three stages in training NeRFactor. First, we optimize
a NeRF using the input posed images (once per scene) and train a BRDF MLP on
the MERL dataset (only once for all scenes). Both of these MLPs are frozen during
the final joint optimization since the NeRF only provides a shape initialization, and
the BRDF MLP provides a latent space of real-world BRDFs for the optimization
to explore. Future shape refinement happens in NeRFactor’s normal and visibility
MLPs, and the actual material prediction happens in NeRFactor’s albedo and BRDF
identity MLPs. Second, we use this trained NeRF to initialize our geometry by
optimizing the normal and visibility MLPs to simply reproduce the NeRF values,
without any additional smoothness loss or regularization. Finally, we jointly optimize
the albedo MLP, BRDF identity MLP, and light probe pixels from scratch, along with
the pretrained normal and visibility MLPs. Fine-tuning the normal and visibility
MLPs along with the reflectance and lighting allows the errors in NeRF’s initial
geometry to be improved in order to minimize the re-rendering loss (Figure 2-7).

Architecture and Positional Encoding We use the default architecture for


NeRF [Mildenhall et al., 2020], and all other MLPs that we introduce contain four
layers (with a skip connection from the input to the second layer), each with 128
hidden units. As in NeRF [Mildenhall et al., 2020], we apply positional encoding to
the input coordinates of all networks with 10 encoding levels for 3D locations and 4
encoding levels for directions.
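
A hedged sketch of this positional encoding in JAX; conventions differ slightly across implementations (e.g., whether the factor of π or the raw input is included), so this is illustrative rather than the exact encoding used here:

    import jax.numpy as jnp

    def positional_encoding(x, num_levels):
        """[sin(2^k pi x), cos(2^k pi x)] for k = 0, ..., num_levels - 1.

        x: (..., D) coordinates; returns (..., 2 * num_levels * D) features.
        """
        freqs = (2.0 ** jnp.arange(num_levels)) * jnp.pi    # (num_levels,)
        angles = x[..., None, :] * freqs[:, None]           # (..., num_levels, D)
        enc = jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)
        return enc.reshape(*x.shape[:-1], -1)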

Runtime We train NeRF for 2,000 epochs, which takes 6–8 hours when distributed
over four NVIDIA TITAN RTX GPUs. Prior to the final joint optimization, comput-

ing the initial surface normals and light visibility from the trained NeRF takes 30 min
per view on a single GPU for a 16 × 32 light probe (i.e., 512 light source locations).
This step can be trivially parallelized because each view is processed independently.
Geometry pretraining is performed for 200 epochs, which takes around 20 min on one
TITAN RTX. The final joint optimization is performed for 100 epochs, which takes
only 20 min on one TITAN RTX.

2.5 Results
In this section, we first explain how the datasets are constructed for our tasks (Sec-
tion 2.5.1). Next, we present the high-quality geometry achieved by NeRFactor
(Section 2.5.2) and then its joint estimation of shape, reflectance, and direct illu-
mination (Section 2.5.3). Finally, we showcase the applications of NeRFactor’s ap-
pearance factorization: free-viewpoint relighting (Section 2.5.4) and material editing
(Section 2.5.5).

2.5.1 Data

NeRV uses synthetic multi-view images of an object and their ground-truth camera
poses. NeRFactor additionally uses three sources of data: real multi-view images
of an object and their estimated camera poses, real-world measured Bidirectional
Reflectance Distribution Functions (BRDFs), and captured light probes.

Synthetic Renderings We use the synthetic Blender scenes (hotdog, drums, lego,
and ficus) released by Mildenhall et al. [2020], construct a new Blender scene
(armadillo), and replace the illumination used there with our own arbitrary, man-
made illuminations (for NeRV) or natural illuminations (for NeRFactor) taken from
real light probe images (we use publicly available light probes from hdrihaven.com,
Stumpfel et al. [2004], and Blender). This yields significantly more natural input
illumination conditions.
We also disable all non-standard post-rendering effects used by Blender Cycles

when rendering the images, such as “filmic” tone mapping, and retain only the stan-
dard linear-to-sRGB tone mapping. We render all images directly to PNGs instead
of EXRs to simulate real-world mobile phone captures where raw High-Dynamic-
Range (HDR) pixel intensities may not be available; this indeed facilitates applying
NeRFactor directly to real scenes as shown in Figure 2-10.

Real Captures NeRFactor uses mobile phone captures of two real scenes released
by Mildenhall et al. [2020]: vasedeck and pinecone. These scenes are captured
by inwards-facing cameras on the upper hemisphere. There are close to 100 images
per scene, and the camera poses are obtained by COLMAP Structure From Motion
(SFM) [Schönberger and Frahm, 2016]. NeRFactor is directly applicable because it
is designed to work with PNGs instead of EXRs.

Measured BRDFs NeRFactor uses real measured BRDFs from the MERL dataset
by Matusik et al. [2003]. The MERL dataset consists of 100 real-world BRDFs mea-
sured by a conventional gonioreflectometer. Because the color components of BRDFs
are not used by our model, we convert the RGB reflectance values to be achromatic
by converting linear RGB values to relative luminance.

2.5.2 Shape Estimation

NeRFactor jointly estimates an object’s shape in the form of surface points and their
associated surface normals as well as their visibility to each light location. Figure 2-7
visualizes these geometric properties. To visualize light visibility, we take the mean
of the 512 visibility maps corresponding to each pixel of a 16 × 32 light probe and
visualize that average map (i.e., ambient occlusion) as a grayscale image. See the
supplemental video for movies of per-light visibility maps (i.e., shadow maps). As
Figure 2-7 shows, our surface normals and light visibility are smooth and resemble
the ground truth, thanks to the joint estimation procedure that optimizes normals
and visibility to minimize re-rendering errors and encourage spatial smoothness.
If we ablate the spatial smoothness constraints and rely on only the re-rendering

[Figure 2-7 panels: I. Surface Normals; II. Light Visibility (mean). Columns: (A) Derived from NeRF, (B) Jointly Optimized, (C) NeRFactor: Jointly Optimized w/ Smoothness Constraints, (D) Ground Truth.]

Figure 2-7: High-quality geometry recovered by NeRFactor. (A) We directly derive


the surface normals and light visibility from a trained NeRF: Differentiating NeRF’s
𝜎-volume with respect to 3D location 𝑥 gives surface normals, and marching through
the same 𝜎-volume to each light location gives visibility. However, geometry derived
in this way directly from NeRF is too noisy to be used for relighting (see the supple-
mental video). (B) With the NeRF geometry as the starting point, jointly optimizing
shape and reflectance improves the geometry, but there is still significant noise (e.g.,
the stripe artifacts in [II]). (C) Joint optimization with smoothness constraints leads
to smooth surface normals and light visibility that resemble ground truth. We vi-
sualize the visibility maps as ambient occlusion maps, by taking averages over all
incoming light directions.

loss, we end up with noisy geometry that is insufficient for rendering. Although these
geometry-induced artifacts may not show up under low-frequency lighting, harsh light-
ing conditions (such as a single point light with no ambient illumination, i.e., One-
Light-at-A-Time or OLAT) reveal them as demonstrated in our supplemental video.
Perhaps surprisingly, even when our smoothness constraints are disabled, the geom-
etry estimated by NeRFactor is still significantly less noisy than the original NeRF
geometry (compare [A] with [B] of Figure 2-7 and see [I] of Table 2.1) because the
re-rendering loss encourages smoother geometry. See Section 2.6.4 for more details.

2.5.3 Joint Estimation of Shape, Reflectance, & Illumination

In this experiment, we demonstrate how NeRFactor factorizes appearance into shape,


reflectance, and illumination for scenes with complex geometry and/or reflectance.
When visualizing albedo, we adopt the convention used by the intrinsic image
literature of assuming that the absolute brightness of albedo and shading is unre-
coverable [Land and McCann, 1971], and furthermore we assume that the problem
of color constancy (solving for a global color correction that disambiguates between
the average color of the illuminant and the average color of the albedo [Buchsbaum,
1980]) is also out of scope. In accordance with these two assumptions, we visualize
our predicted albedo and measure its accuracy by first scaling each albedo channel
by a global scalar that is identified so as to minimize squared error w.r.t. the ground-
truth albedo, as is done in Barron and Malik [2014]. Unless stated otherwise, all
albedo predictions in this paper are corrected this way, and we apply gamma correc-
tion (𝛾 = 2.2) to display them properly in the figures. Our estimated light probes are
not scaled this way w.r.t. the ground truth (since illumination estimation is not the
primary goal of this work) and visualized by simply scaling their maximum intensity
to 1 and applying gamma correction (𝛾 = 2.2) to show details in the dark regions.
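
A minimal JAX sketch of this per-channel scaling (the closed-form least-squares scalar); names are illustrative:

    import jax.numpy as jnp

    def scale_albedo_to_gt(pred, gt):
        """Scale each predicted albedo channel by the scalar minimizing squared error.

        pred, gt: (N, 3) linear-RGB albedo values over all foreground pixels.
        """
        # argmin_s ||s * pred - gt||^2  per channel  =>  s = <pred, gt> / <pred, pred>
        scale = jnp.sum(pred * gt, axis=0) / (jnp.sum(pred * pred, axis=0) + 1e-8)
        return pred * scale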
As shown in Figure 2-8 (B), NeRFactor predicts high-quality and smooth surface
normals that are close to the ground truth except in regions with very high-frequency
details such as the bumpy surface of the hotdog buns. In the drums scene, we see
that NeRFactor is able to successfully reconstruct fine details such as the screw at

I. Normals II. Albedo III. View Synthesis IV. FV Relighting (point) V. FV Relighting (image)
Method Angle (°) ↓ PSNR ↑ SSIM ↑ LPIPS ↓ PSNR ↑ SSIM ↑ LPIPS ↓ PSNR ↑ SSIM ↑ LPIPS ↓ PSNR ↑ SSIM ↑ LPIPS ↓
SIRFS - 26.0204 0.9420 0.0719 - - - - - - - - -
Oxholm & Nishino† 32.0104 26.3248 0.9448 0.0870 29.8093 0.9275 0.0810 20.9979 0.8407 0.1610 22.2783 0.8762 0.1364
NeRFactor 22.1327 28.7099 0.9533 0.0621 32.5362 0.9461 0.0457 23.6206 0.8647 0.1264 26.6275 0.9026 0.0917
using microfacet 22.1804 29.1608 0.9571 0.0567 32.4409 0.9457 0.0458 23.7885 0.8642 0.1256 26.5970 0.9011 0.0925
w/o geom. pretrain. 25.5302 27.7936 0.9480 0.0677 32.3835 0.9449 0.0491 23.1689 0.8585 0.1384 25.8185 0.8966 0.1027
w/o smoothness 26.2229 27.7389 0.9179 0.0853 32.7156 0.9450 0.0405 23.0119 0.8455 0.1283 26.0416 0.8887 0.0920
using NeRF’s shape 32.0634 27.8183 0.9419 0.0689 30.7022 0.9210 0.0614 22.0181 0.8237 0.1470 24.8908 0.8651 0.1154
Reported numbers are the arithmetic means of all four synthetic scenes (hotdog, ficus, lego, and drums) over eight uniformly sampled novel views. The top three performing techniques for each metric are highlighted in red, orange, and yellow, respectively. For Tasks IV and V, we relight the scenes with 16 novel lighting conditions: eight OLAT conditions plus the eight light probes included in Blender. We apply color correction and tonemapping to the albedo estimation before computing the errors (see Section 2.5.3 for details). †Oxholm and Nishino [2014] require the ground-truth illumination, which we provide, and this baseline represents a significantly enhanced version (see Section 2.6.2).
Table 2.1: Quantitative evaluation of NeRFactor. Although ablating the smoothness constraints (“w/o smoothness”) achieves
good view synthesis quality under the original illumination, the noisy estimates lead to poor relighting performance. NeRFactor
achieves the top overall performance across most metrics, although for some metrics, it is outperformed by the microfacet
variant (“using microfacet”), which tends to either converge to the local optimum of maximum roughness everywhere or produce
non-spatially-smooth BRDFs (see the supplemental video). We are unable to present normal, view synthesis, or relighting
errors for SIRFS [Barron and Malik, 2014] as it does not support non-orthographic cameras or “world-space” geometry (though
Figure 2-14 shows that the geometry recovered by SIRFS is inaccurate).
the center of the cymbal, and the metal rims on the sides of the drums. For ficus,
NeRFactor is able to recover the complex leaf geometry of the potted plant. The
average light visibility (ambient occlusion) maps also correctly portray the average
exposure of each point in the scene to the lights. Albedo is recovered cleanly, with
barely any shadowing or shading detail inaccurately attributed to albedo variation;
note how the shading on the drums is absent in the albedo prediction. Moreover,
the predicted light probes correctly reflect the locations of the primary light sources
and the blue sky (blue pixels in [I]). In all three scenes, the predicted BRDFs are
spatially-varying and correctly reflect that different parts of the scene have different
materials, as indicated by different BRDF latent codes in (E).
Instead of representing illumination with a more sophisticated representation such
as spherical harmonics, we opt for a straightforward representation: a latitude-
longitude map whose pixels are HDR intensities. Because lighting is effectively
convolved by a low-pass filter when reflected by a moderately diffuse BRDF [Ra-
mamoorthi and Hanrahan, 2001], and our objects are not shiny or mirror-like, we do
not expect to recover illumination at a resolution higher than 16 × 32. As shown in
Figure 2-8 (I), NeRFactor estimates a light probe that correctly captures the bright
light source on the far left as well as the blue sky. Similarly, in Figure 2-8 (II), the
dominant light source location is also correctly estimated (the bright white blob on
the left).

2.5.4 Free-Viewpoint Relighting

NeRFactor estimates 3D fields of shape and reflectance, thus enabling simultaneous


relighting and view synthesis. As such, all the relighting results shown for NeRFactor
and the supplemental video are rendered from novel viewpoints. To probe the limits
of NeRFactor, we use harsh test illumination conditions that have one point light on
at a time (OLAT), with no ambient illumination. These test illuminations induce
hard cast shadows, which effectively expose rendering artifacts due to inaccurate
geometry and materials. For visualization purposes, we composite the relit results
(using NeRF’s predicted opacity) onto backgrounds whose colors are the averages

[Figure 2-8 panels: for each of three scenes (rows I–III), the ground truth (top; “N/A” where unavailable) and NeRFactor's prediction (bottom). Columns: (A) Rendering, (B) Surface Normals, (C) Light Visibility, (D) Albedo & Illum., (E) BRDF 𝑧.]

Figure 2-8: Joint estimation of shape, reflectance, and lighting by NeRFactor. Here
we visualize factorization produced by NeRFactor (bottom) alongside the ground
truth (top) on three scenes. Although our recovered surface normals, visibility, and
albedo sometimes omit some fine-grained detail, they still closely resemble the ground
truth. Although the illuminations recovered by NeRFactor are oversmoothed (due
to the effective low-pass filtering induced by observing illumination only after it has
been convolved by the object's BRDFs) and incorrect on the bottom half of the hemi-
sphere (since objects are only ever observed from the top hemisphere), the dominant
light sources and occluders are localized near their ground-truth locations in the
light probes. Note that we are unable to compare against ground-truth BRDFs as
they are defined using Blender’s shader node trees, while our recovered BRDFs are
parameterized by our learned model.

over upper halves of the light probes.
As shown in Figure 2-9 (II), NeRFactor synthesizes correct hard shadows cast
by the hot dogs under the three test OLAT conditions. NeRFactor also produces
realistic renderings of the ficus under the OLAT illuminations (I), especially when
the ficus is back-lit by the point light in (D). Note that the ground truth in (D)
appears brighter than NeRFactor’s results, because NeRFactor models only direct
illumination, whereas the ground-truth image was rendered with global illumination.
When we relight the objects with two new light probes, realistic soft shadows are
synthesized on the plate of hotdog (II).
In ficus, the specularities on the vase correctly reflect the primary light sources
in both test probes. The ficus leaves also exhibit realistic specular highlights
close to the ground truth in (F). In drums (III), the cymbals are correctly estimated
to be specular and exhibit realistic reflection, though different from the ground-truth
anisotropic reflection (D). This is as expected because all MERL BRDFs are isotropic
[Matusik et al., 2003]. Despite being unable to explain these anisotropic reflections,
NeRFactor correctly leaves them out of the albedo rather than interpreting them as
painted-on albedo, since doing so would violate the albedo smoothness constraint and contradict
those reflections’ view dependency. In lego, realistic hard shadows are synthesized
by NeRFactor for the OLAT test conditions (IV).

Relighting Real Scenes We apply NeRFactor to the two real scenes, vasedeck
and pinecone, captured by Mildenhall et al. [2020]. These captures are particularly
suitable for NeRFactor: There are around 100 multi-view images of each scene lit
by an unknown environment lighting. As in NeRF, we run COLMAP SFM [Schön-
berger and Frahm, 2016] to obtain the camera intrinsics and extrinsics for each view.
We then train a vanilla NeRF to obtain an initial shape estimate, which we distill
into NeRFactor and jointly optimize together with reflectance and illumination. As
Figure 2-10 (I) shows, the appearance is factorized into illumination (not pictured)
and 3D fields of surface normals, light visibility, albedo, and spatially-varying BRDF
latent codes that together explain the observed views. With such factorization, we

[Figure 2-9 panels: for each of four scenes (rows I–IV), the ground truth above NeRFactor's prediction. Columns: (A) View Synthesis & Original Illum., (B) OLAT 1, (C) OLAT 2, (D) OLAT 3, (E) “Courtyard”, (F) “Sunrise”.]

Figure 2-9: Free-viewpoint relighting by NeRFactor. Here we relight the object using
three OLAT illuminations and two real-world illuminations (light probes captured in
the real world). The renderings produced by our model qualitatively resemble the
ground truth and accurately exhibit challenging effects such as specularities and cast
shadows (both hard and soft).

relight the scenes by replacing the estimated illumination with novel arbitrary light
probes (Figure 2-10 [II]). Because our factorization is fully 3D, all the intermediate
buffers can be rendered from any viewpoint, and the relighting results shown are also
from novel viewpoints. Note that the scenes are bounded to avoid faraway geometry
blocking light from certain directions and casting shadows during relighting.

[Figure 2-10 panels: I. Factorizing Appearance, with columns (A) An Input View, (B) Reconstruction, (C) Albedo, (D) BRDF 𝑧, (E) Normals, (F) Visibility (mean); II. Free-Viewpoint Relighting, with columns (A) View Synthesis, (B) “Interior”, (C) “Courtyard”, (D) “Studio”, (E) “Sunrise”, (F) “Sunset”.]

Figure 2-10: NeRFactor’s results on real-world captures. (I) Given posed multi-view
images of a real-world object lit by an unknown illumination condition (A), NeR-
Factor factorizes the scene appearance into 3D fields of albedo (C), spatially-varying
BRDF latent codes (D), surface normals (E), and light visibility for all incoming
light directions, visualized here as ambient occlusion (F). Note how the estimated
albedo of the flowers is shading-free. (II) With this factorization, one can synthesize
novel views of the scene relit by any arbitrary lighting. Even on these challenging
real-world scenes, NeRFactor is able to synthesize realistic specularities and shadows
across various lighting conditions.

2.5.5 Material Editing

Since NeRFactor factorizes diffuse albedo and specular BRDF from appearance, one
can edit the albedo, non-diffuse BRDF, or both and re-render the edited object under
an arbitrary lighting condition from any viewpoint. In this subsection, we override
the estimated 𝑧BRDF with the learned latent code of pearl-paint in the MERL dataset,
and the estimated albedo with colors linearly interpolated from the turbo colormap,
spatially varying based on the surface points’ 𝑥-coordinates. As Figure 2-11 (left)
demonstrates, with the factorization by NeRFactor, we are able to realistically re-
light the original estimated materials with the two challenging OLAT conditions.
Furthermore, the edited materials are also relit with realistic specular highlights and
hard shadows by the same test OLAT conditions (Figure 2-11 [right]).
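As an illustration of the albedo edit described above, the following sketch maps surface points' 𝑥-coordinates to colors from the turbo colormap; the min–max normalization is our assumption for illustration, not necessarily the exact mapping used.

    import numpy as np
    from matplotlib import colormaps  # requires matplotlib >= 3.5

    def turbo_albedo(surface_xyz):
        """Map each surface point's x-coordinate to a turbo-colormap color (sketch).

        surface_xyz: (N, 3) surface points; returns (N, 3) RGB albedo overrides.
        """
        x = surface_xyz[:, 0]
        t = (x - x.min()) / (x.max() - x.min() + 1e-8)  # normalize x to [0, 1]
        return colormaps["turbo"](t)[:, :3]             # drop the alpha channel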

[Figure 2-11 layout: rows OLAT 1 and OLAT 2; columns Original Material (left) and Edited Material (right).]

Figure 2-11: Material editing and relighting by NeRFactor. The factorization produced
by NeRFactor can be used for material editing. Here we show the original materials of
two scenes relit by two OLAT conditions (left) alongside the edited materials relit by
the same OLAT conditions (right). Specifically, we override the predicted 𝑧BRDF with
that of pearl-paint in the MERL dataset and the predicted albedo with colors interpolated
from the turbo colormap, varying spatially based on the surface points' 𝑥-coordinates.

2.6 Discussion
In this section, we first show how NeRV compares with the baseline methods (none of
which models indirect illumination in contrast to NeRV) in free-viewpoint relighting
(Section 2.6.1). Next, we compare NeRFactor against several competitors in the tasks
of appearance factorization and free-viewpoint relighting (Section 2.6.2). We then
perform ablation studies to evaluate the importance of each major model component

of NeRV (Section 2.6.3) and NeRFactor (Section 2.6.4), and then study whether
NeRFactor predicts albedo consistently for the same object but lit by different lighting
conditions (Section 2.6.5).

2.6.1 Baseline Comparisons: Multiple Known Illuminations

Under multiple known illuminations, NeRV outperforms prior work, particularly in its ability to recover relightable scene representations from images observed under
complex lighting. We urge the reader to view our supplementary video to appreciate
NeRV’s relighting and view synthesis results. In Table 2.2, we show performance for
rendering images from novel viewpoints with lighting conditions not observed during
training.

We evaluate two versions of NeRV: NeRV with Neural Visibility Fields (“NeRV-
NVF”) and NeRV with Test-time Tracing (“NeRV-Trace”). Both methods use the
same training procedure as described above and differ only in how evaluation is per-
formed: NeRV-NVF uses the same visibility approximations used during training at
test time, while NeRV-Trace uses brute-force tracing to estimate visibility to point
light sources to render sharper shadows at test time. We compare against the follow-
ing baselines.

NLT Neural Light Transport (NLT; Section 3.4) requires an input proxy geometry
(which we provide by running marching cubes [Lorensen and Cline, 1987] on NeRFs
[Mildenhall et al., 2020] trained from images of each scene rendered with fixed light-
ing) and trains a convolutional network defined in an object’s texture atlas space
to perform simultaneous relighting and view synthesis. Whereas our method requires
only images with known but unconstrained lighting conditions for training, NLT
requires multi-view images captured OLAT, where each viewpoint is rendered mul-
tiple times, once per light source. See the supplemental material (Chapter A) for
qualitative comparisons.

NeRF+LE & NeRF+Env NeRF plus Learned Embedding (“NeRF+LE”) and
NeRF plus Fixed Environment Embedding (“NeRF+Env”) represent appearance vari-
ation due to changing lighting using latent variables. Both augment the original NeRF
model with an additional input of a 64-dimensional latent code corresponding to the
scene lighting condition. These approaches are similar to NeRF in the Wild [Martin-
Brualla et al., 2021], which also uses a latent code to describe appearance variation
due to variable lighting. NeRF+LE uses a PointNet [Qi et al., 2017a] encoder to
embed the position and color of each light, and NeRF+Env simply uses the flattened
light probe as the latent code.

Neural Reflectance Fields Neural Reflectance Fields [Bi et al., 2020a] uses a
neural volumetric representation similar to NeRV's, with the critical difference that
brute-force raymarching is used to compute visibilities. This approach is therefore
unable to consider illumination from sources other than a single point light during
training. At test time when time and memory constraints are less restrictive, it
computes visibilities to all light sources.
We train each method (other than NLT) on nine datasets. Each consists of 150
images of a synthetic scene (hotdog, lego, or armadillo) illuminated by one of these
three lighting conditions: I) “Point” containing a single white point light randomly
sampled on a hemisphere above the scene for each frame, representing a laboratory
setup similar to that of Bi et al. [2020a], II) “Colorful+Point” consisting of a randomly
sampled point light and a set of eight colorful point lights whose locations and colors
are fixed across all images in the dataset (this represents a challenging scenario with
multiple strong light sources that cast shadows and tint the scene), and III) “Ambi-
ent+Point” comprising a randomly sampled point light and a dim gray environment
map (this represents a challenging scenario where scene points are illuminated from
all directions). We separately train each method on each of these nine datasets and
measure performance on the corresponding scene’s test set, which consists of 150
images of the scene under novel lighting conditions (containing either one or eight
point sources) not seen during training, rendered from novel viewpoints not observed during training.
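For concreteness, a minimal sketch of the “Point” condition's random light sampling is shown below; the hemisphere radius and the exact sampling distribution are assumptions for illustration, not the precise protocol.

    import numpy as np

    def sample_hemisphere_point_light(rng, radius=4.0):
        """Sample a white point light uniformly on a hemisphere above the scene (sketch).

        Uniform z in [0, 1] plus uniform azimuth gives area-uniform hemisphere sampling.
        Usage: rng = np.random.default_rng(0); pos = sample_hemisphere_point_light(rng)
        """
        z = rng.uniform(0.0, 1.0)                  # cos(elevation), upper hemisphere
        phi = rng.uniform(0.0, 2.0 * np.pi)        # azimuth
        sin_t = np.sqrt(1.0 - z ** 2)
        return radius * np.array([sin_t * np.cos(phi), sin_t * np.sin(phi), z])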
NeRV outperforms all baselines in experiments that correspond to challenging
complex lighting conditions and matches the performance of prior work in experiments
with simple lighting. As visualized in Figure 2-12, the method of Bi et al. [2020a]
performs comparably to NeRV in the case it is designed for: images illuminated by a
single point source. However, their model’s performance degrades when it is trained on
datasets that have complex lighting conditions (Colorful+Point and Ambient+Point
experiments in Table 2.2), as its forward model is unable to simulate light from more
than a single source during training.

[Figure 2-12 panels: (a) Ground Truth; (b) Bi et al., Point; (c) NeRV (ours), Point; (d) Bi et al., Ambient+Point; (e) NeRV (ours), Ambient+Point.]

Figure 2-12: NeRV vs. Bi et al. [2020a]. Both Bi et al. [2020a] (b) and NeRV (c)
recover high-quality relightable models when trained on images illuminated by a single
point source. However, for more complex lighting such as “Ambient+Point,” Bi et al.
[2020a] (d) fails as its brute-force visibility computation is unable to simulate the
surrounding ambient lighting during training. Their model minimizes training loss
by making the scene transparent and is thus unable to render convincing images for
the “single point light” (Row 1) or “colorful set of point lights” (Row 2) conditions.
Because NeRV (e) correctly simulates light transport, its renderings more closely
resemble the ground truth (a).

As visualized in Figure 2-13, NeRV thoroughly outperforms both latent code base-
lines as they are unable to generalize to lighting conditions that are unlike those seen
during training. Our method generally matches or outperforms the NLT baseline,
which requires a controlled laboratory lighting setup and substantially more inputs
than all other methods (the multi-view OLAT dataset we use to train NLT contains
8× as many images as our other datasets, and the original NLT paper [Zhang et al.,
2021b] uses 150 OLAT images per viewpoint).
hotdog
Train. Illum.        Single Point       Colorful+Point     Ambient+Point      OLAT
                     PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM
NLT                  −       −          −       −          −       −          23.57   0.851
NeRF+LE              19.96   0.868      17.88   0.758      20.72   0.869      −       −
NeRF+Env             19.94   0.863      19.17   0.824      20.56   0.864      −       −
Bi et al. [2020a]    23.74   0.862      22.09   0.799      20.94   0.754      −       −
NeRV-NVF             23.93   0.860      24.37   0.885      25.14   0.892      −       −
NeRV-Trace           23.76   0.863      24.24   0.886      25.06   0.892      −       −

lego
Train. Illum.        Single Point       Colorful+Point     Ambient+Point      OLAT
                     PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM
NLT                  −       −          −       −          −       −          24.10   0.936
NeRF+LE              21.42   0.874      21.74   0.890      20.33   0.860      −       −
NeRF+Env             21.13   0.855      20.27   0.878      20.24   0.852      −       −
Bi et al. [2020a]    22.89   0.897      22.83   0.890      18.10   0.783      −       −
NeRV-NVF             22.78   0.866      23.82   0.899      23.32   0.894      −       −
NeRV-Trace           23.16   0.883      24.18   0.925      23.79   0.923      −       −

armadillo
Train. Illum.        Single Point       Colorful+Point     Ambient+Point      OLAT
                     PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM    PSNR    MS-SSIM
NLT                  −       −          −       −          −       −          21.62   0.900
NeRF+LE              20.35   0.881      18.76   0.863      17.35   0.859      −       −
NeRF+Env             19.60   0.874      17.89   0.863      17.28   0.851      −       −
Bi et al. [2020a]    22.35   0.894      21.06   0.892      19.93   0.842      −       −
NeRV-NVF             21.14   0.882      22.80   0.910      22.80   0.897      −       −
NeRV-Trace           22.14   0.897      23.02   0.921      22.81   0.895      −       −

Table 2.2: Quantitative evaluation of NeRV. For every scene, we train each method on
three datasets that contain images of the scene under different illumination conditions
and compare the metrics of all variants on the same testing dataset. Please refer to
Section 2.6.1 for details.


[Figure 2-13 panels: (a) Ground Truth; (b) NeRF + LE, Ambient+Point; (c) NeRF + Env., Ambient+Point; (d) NeRV (ours), Ambient+Point.]

Figure 2-13: NeRV vs. latent code models. Modeling appearance changes due to
lighting with a latent code does not generalize to lighting conditions unseen during
training. Here we train the two latent code baselines (b, c) and NeRV (d) on the
Ambient+Point dataset. The latent code models are unable to produce convincing
renderings at test time, while NeRV trained on the same data renders high-quality
images.

2.6.2 Baseline Comparisons: One Unknown Illumination

In this subsection, we compare NeRFactor with both classic and deep learning-based
state of the art in the tasks of appearance factorization and free-viewpoint relighting.
For quantitative evaluations, we use Peak Signal-to-Noise Ratio (PSNR), Structural
Similarity Index Measure (SSIM) [Wang et al., 2004], and Learned Perceptual Image
Patch Similarity (LPIPS) [Zhang et al., 2018a] as our error metrics.
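For reference, these three metrics can be computed with off-the-shelf packages roughly as follows (a sketch; the AlexNet LPIPS backbone and the [0, 1] image range are assumptions, not necessarily our exact settings).

    import numpy as np
    import torch
    import lpips
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def image_metrics(pred, gt):
        """Compute PSNR, SSIM, and LPIPS between two HxWx3 float images in [0, 1] (sketch)."""
        psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
        ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
        # LPIPS expects NCHW torch tensors scaled to [-1, 1].
        to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
        lp = lpips.LPIPS(net="alex")(to_t(pred), to_t(gt)).item()
        return psnr, ssim, lp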

SIRFS We compare the factorization by NeRFactor with that of the classic Shape,
Illumination, and Reflectance From Shading (SIRFS) method [Barron and Malik,
2014], both qualitatively and quantitatively. SIRFS is a single-image method that
decomposes appearance into surface normals, albedo, and shading (not shadowing)
in the input view. In contrast, NeRFactor is a multi-view approach that estimates
these properties plus BRDFs and visibility (hence, shadows) in the full 3D space alongside the unknown illumination. In other words, NeRFactor gets to observe
many more views than SIRFS, which observes only one view. Under this setup, NeR-
Factor outperforms SIRFS quantitatively as shown by Table 2.1. Figure 2-14 shows
that although SIRFS achieves reasonable albedo estimation, it produces inaccurate
surface normals likely due to its inability to incorporate multiple views or to reason
about shape in “world space.” In addition, SIRFS is unable to render the scene from
arbitrary viewpoints or synthesize shadows during relighting.

[Figure 2-14 layout: panels I (Albedo) and II (Surface Normals), each showing (A) SIRFS, (B) NeRFactor (ours), and (C) Ground Truth.]

Figure 2-14: NeRFactor vs. SIRFS. Here we compare NeRFactor against SIRFS [Bar-
ron and Malik, 2014] that also recovers normals, albedo, and shading (not shadow)
given only one illumination condition. Although the albedo estimates produced by
SIRFS are reasonable, the surface normals are highly inaccurate (likely due to SIRFS’s
inability to use multiple images to inform shape estimation).

Oxholm and Nishino [2014] Given that SIRFS is single-view, we additionally compare NeRFactor with a significantly improved version of the multi-view approach
by Oxholm and Nishino [2014] that estimates the shape and non-spatially-varying
BRDF under a known lighting condition. Due to the source code being unavailable,
we re-implemented this method, capturing the main ideas of smoothness regulariza-
tion on shape and data-driven BRDF priors, and then enhanced it with a better shape
initialization (visual hull → NeRF shape) and the ability to model spatially-varying
albedo (the original paper considers only non-spatially-varying BRDFs). Other dif-
ferences include representing the shape with a surface normal MLP instead of mesh
and expressing the predicted BRDF with a pretrained BRDF MLP instead of MERL
BRDF bases [Nishino, 2009, Nishino and Lombardi, 2011, Lombardi and Nishino,
2012]. Also note that this baseline has the advantage of receiving the ground-truth
illumination as input, whereas NeRFactor has to estimate illumination together with
shape and reflectance.
As shown in Figure 2-15 (I), even though this improved version of Oxholm and
Nishino [2014] has access to the ground-truth illumination, it struggles to remove
shadow residuals from the albedo estimation because of its inability to model visibil-
ity (hotdog and lego). As expected, these residuals in albedo negatively affect the
relighting results in Figure 2-15 (II) (e.g., the red shade on the hotdog plate). More-
over, because the BRDF estimated by this baseline is not spatially-varying, BRDFs of
the hot dog buns and the ficus leaves are incorrectly estimated to be as specular as the
plate and vase, respectively. Finally, this baseline is unable to synthesize non-local
light transport effects such as shadows (hotdog and lego), in contrast to NeRFactor
that correctly produces realistic hard cast shadows under the OLAT conditions.

Philip et al. [2019] The recent work of Philip et al. [2019] presents a technique to
relight large-scale scenes, and specifically focuses on synthesizing realistic shadows.
The input to their system is similar to ours: multi-view images of a scene lit by
an unknown lighting condition. However, their technique only supports synthesizing
images illuminated by a single primary light source, such as the Sun (in contrast to
NeRFactor, which supports any arbitrary light probe). As such, we compare it with
NeRFactor only on the task of point light relighting.
As Figure 2-16 demonstrates, NeRFactor qualitatively outperforms this baseline
and synthesizes hard shadows that better resemble the ground truth. The “yellow fog”
in the background of their results (Figure 2-16 [A]) is likely due to poor geometry
reconstruction by their method. Because their network is trained on outdoor scenes
(not images with backgrounds), we additionally compute error metrics after masking
out the yellow fog with the ground-truth object masks (“Philip et al. [2019] + Masks”)
for a more generous comparison. As the table in Figure 2-16 shows, NeRFactor
outperforms Philip et al. [2019] + Masks in terms of both PSNR and SSIM.

[Figure 2-15 layout: panels I (Albedo Estimation) and II (Relighting, another view), each showing (A) Oxholm & Nishino [2014]†, (B) NeRFactor (ours), and (C) Ground Truth.]

Figure 2-15: NeRFactor vs. Oxholm and Nishino [2014] (enhanced). (I) Their method
is unable to remove shadow residuals from albedo for hotdog and lego likely due to
its inability to model visibility or shadows, although it produces reasonable albedo
estimation for ficus wherein shading (instead of shadowing) predominates. In con-
trast, NeRFactor produces albedo maps with little to no shading. (II) As expected,
the baseline’s relighting results are negatively affected by the shadow residuals in
albedo (e.g., the red shade on the plate of hotdog). Furthermore, because their ap-
proach does not support spatially-varying BRDFs, the hot dog buns and ficus leaves
are mistakenly estimated to be as specular as the plate and the vase, respectively.
NeRFactor, on the other hand, correctly estimates different materials for different
parts of the scenes. Note also how NeRFactor is able to synthesize hard shadows
in hotdog and lego, while the baseline does not model visibility or shadows. †See
Section 2.6.2 for how we significantly enhanced the approach of Oxholm and Nishino
[2014]; in addition, we provide the baseline with the ground-truth illumination, since
unlike NeRFactor, it does not estimate the lighting condition.

Philip et al. [2019] + Masks achieves a lower (better) LPIPS score because it
renders new viewpoints by reprojecting observed images using estimated proxy ge-
ometry, as is typical of Image-Based Rendering (IBR) algorithms. Thus, it retains the
high-frequency details present in the input images, resulting in a lower LPIPS score.
However, as a physically-based (re-)rendering approach that operates fully in the 3D
space, NeRFactor synthesizes hard shadows that better match the ground truth and
supports relighting with arbitrary light probes such as “Studio,” which has four major
light sources in Figure 2-10.

Figure 2-16: NeRFactor vs. Philip et al. [2019]. Quantitatively, NeRFactor outperforms
Philip et al. [2019] + Masks (see Section 2.6.2) in PSNR and SSIM, while the perceptual
metric, LPIPS, favors the latter for its sharp images (since it is an IBR method).
However, qualitatively, shadows synthesized by NeRFactor resemble the ground truth
more. Note that unlike NeRFactor, this baseline does not support relighting with
arbitrary lighting (such as another random light probe). [Rows: OLAT 1, OLAT 2;
columns: (A) Philip et al. [2019], (B) NeRFactor (ours), (C) Ground Truth.]

Method                          PSNR ↑     SSIM ↑     LPIPS ↓
Philip et al. [2019]            20.0397    0.5000     0.1812
Philip et al. [2019] + Masks    21.8620    0.8436     0.1140
NeRFactor (ours)                22.9625    0.8592     0.1230

Same as in Table 2.1, we apply color correction to all predictions (see also Section 2.5.3).
The numbers here are averages over eight test OLAT conditions.

2.6.3 Ablation Studies: Multiple Known Illuminations

We validate NeRV’s choices of using analytic instead of MLP-predicted surface nor-


mals (“NeRV-MLPN”) and simulating one-bounce indirect illumination through the
ablation studies reported in Table 2.3. We can see that modeling indirect illumination
improves performance (Figure 2-17), even for our relatively simple scenes. Although
using analytic instead of MLP-predicted normals is less numerically significant, Fig-
ure 2-18 shows that it results in more accurate estimated surface normals, which may be important for downstream graphics tasks.

Table 2.3: Quantitative ablation studies of NeRV. These results are on the Ambient+Point
dataset. Please refer to Section 2.6.3 for details.

                 hotdog            lego              armadillo
Method           PSNR    MS-SSIM   PSNR    MS-SSIM   PSNR    MS-SSIM
NeRV-NoInd       24.43   0.861     23.06   0.888     21.27   0.878
NeRV-MLPN        25.60   0.893     23.18   0.886     22.40   0.891
NeRV-NVF         25.14   0.892     23.32   0.894     22.80   0.897
NeRV-Trace       25.06   0.892     23.79   0.923     22.81   0.895


(a) NeRV (Ours) with Indirect (b) NeRV (Ours) Direct Only

Figure 2-17: Indirect illumination in NeRV. NeRV’s ability to simulate indirect illu-
mination produces realistic details such as the additional brightness in the bulldozer’s
cab due to interreflections.

(a) NeRV (Ours) with Analytic Normals (b) NeRV (Ours) with MLP-Predicted Normals

Figure 2-18: NeRV with analytic vs. MLP-predicted normals. While obtaining surface
normals analytically (a) or as an output of the shape MLP (b) produces similar
renderings, analytic normals are much closer to the true surface normals.


2.6.4 Ablation Studies: One Unknown Illumination

In this section, we compare NeRFactor against other reasonable design alternatives by ablating each of the major model components and observing whether there is
performance drop, both qualitatively and quantitatively.

Learned vs. Microfacet BRDFs Instead of using an MLP to parametrize the
BRDF and pretraining it on an external BRDF dataset to learn data-driven priors,
one can adopt an analytic BRDF model such as the microfacet model of Walter et al.
[2007] and ask an MLP to predict spatially-varying roughness for the microfacet
BRDF. As Table 2.1 shows, this model variant achieves good performance across all
tasks but overall underperforms NeRFactor. Note that to improve this variant, we
removed the smoothness constraint on the predicted roughness because even a tiny
smoothness weight still drove the optimization to the local optimum of predicting
maximum roughness everywhere (this local optimum is a “safe” solution that renders
everything more diffuse to satisfy the ℓ2 reconstruction loss). As such, this model
variant sometimes produces noisy renderings, due to its non-smooth BRDFs, as shown
in the supplemental video.
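For context, the analytic alternative used in this ablation is a microfacet BRDF in the spirit of Walter et al. [2007]; below is a minimal sketch of the GGX normal distribution and a simplified Cook–Torrance-style specular term (the roughness remapping, Fresnel, and shadowing-masking approximations are common choices, not necessarily the exact ones we use).

    import numpy as np

    def ggx_ndf(n_dot_h, roughness):
        """GGX/Trowbridge-Reitz normal distribution (sketch).

        n_dot_h: cosine between surface normal and half vector; roughness in (0, 1].
        """
        a2 = roughness ** 4  # alpha = roughness^2 (a common remapping; an assumption here)
        denom = n_dot_h ** 2 * (a2 - 1.0) + 1.0
        return a2 / (np.pi * denom ** 2 + 1e-12)

    def microfacet_specular(n_dot_l, n_dot_v, n_dot_h, roughness, fresnel_f0=0.04):
        """Simplified specular term D * F * G / (4 (n.l)(n.v)) (sketch)."""
        d = ggx_ndf(n_dot_h, roughness)
        # Schlick Fresnel (using n.v as a simplification of h.v) and a Smith-style
        # shadowing-masking approximation with k = (roughness + 1)^2 / 8.
        f = fresnel_f0 + (1.0 - fresnel_f0) * (1.0 - n_dot_v) ** 5
        k = (roughness + 1.0) ** 2 / 8.0
        g = (n_dot_l / (n_dot_l * (1 - k) + k)) * (n_dot_v / (n_dot_v * (1 - k) + k))
        return d * f * g / (4.0 * n_dot_l * n_dot_v + 1e-12)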

With vs. Without Geometry Pretraining As shown in Figure 2-6 and discussed
in Section 2.4, we pretrain the normal and visibility MLPs to just reproduce the NeRF
values given 𝑥surf before plugging them into the joint optimization (where they are
then finetuned together with the rest of the pipeline), to prevent the albedo MLP
from mistakenly attempting to explain away the shadows. Alternatively, one can train
these two geometry MLPs from scratch together with the pipeline. As Table 2.1
shows, this variant indeed predicts worse albedo with shading residuals (Figure 2-19
[C]) and overall underperforms NeRFactor.

With vs. Without Smoothness Constraints In Section 2.4, we introduce our simple yet effective spatial smoothness constraints in the context of MLPs and their
crucial role in this underconstrained setup. Ablating these smoothness constraints
does not prevent this variant from performing well on view synthesis (similar to how
NeRF is capable of high-quality view synthesis without any smoothness constraints)
as shown in Table 2.1 but does hurt this variant’s performance on other tasks such
as albedo estimation and relighting. Qualitatively, this variant produces noisy esti-
mations insufficient for relighting (Figure 2-19 [B]).
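As a concrete illustration, one simple way to impose such a spatial smoothness prior on an MLP-parameterized field is to penalize the difference between its outputs at a surface point and at a randomly jittered nearby point; the sketch below (in PyTorch) uses a placeholder jitter scale, weight, and ℓ1 norm, which are assumptions rather than our exact hyperparameters.

    import torch

    def smoothness_loss(mlp, surface_pts, jitter_std=0.01, weight=0.1):
        """Penalize the MLP for producing different outputs at a surface point and at
        a nearby jittered point, encouraging spatially smooth predictions (sketch).

        mlp: a callable module mapping (N, 3) points to (N, C) predictions.
        """
        jittered = surface_pts + torch.randn_like(surface_pts) * jitter_std
        return weight * (mlp(surface_pts) - mlp(jittered)).abs().mean()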

I. Surface Normals

II. Albedo (color-corrected)

III. View Synthesis

IV. Illumination

(A) NeRFactor Using NeRF's Shape, (B) NeRFactor w/o Smoothness, (C) NeRFactor w/o Geometry Pretrain., (D) NeRFactor Using Microfacet, (E) NeRFactor (ours), (F) Ground Truth

Figure 2-19: Qualitative ablation studies of NeRFactor. (A) One can fix the geometry
to that of NeRF and estimate only the reflectance and illumination by ablating the
normal and visibility MLPs of NeRFactor, but the NeRF geometry is too noisy (I) to
be used for relighting (see the supplemental video). (B) Ablating the smoothness reg-
ularization leads to noisy geometry and albedo (I and II). (C) If we train the normal
and visibility MLPs from scratch during the joint optimization (i.e., no pretraining),
the recovered albedo may mistakenly attempt to explain shading and shadows (III).
(D) If we replace the learned BRDF with an MLP predicting the roughness parameter
of a microfacet BRDF, the predicted reflectance either falls into the local optimum
of maximum roughness everywhere or becomes non-smooth spatially (not pictured
here; see the supplemental video). (E) NeRFactor is able to recover plausible nor-
mals, albedo, and illumination without any direct supervision on any factor. The
illuminations recovered by NeRFactor, though oversmoothed, correctly capture the
location of the Sun. See Section 2.5.3 for the color correction and tone mapping
applied to albedo.

Estimating the Shape vs. Using NeRF’s Shape If we ablate the normal and
visibility MLPs entirely, this variant is essentially using NeRF’s normals and visibility
without improving upon them (hence “using NeRF’s shape”). As Table 2.1 and the
supplemental video show, even though the estimated reflectance is smooth (encour-
aged by the smoothness priors from the full model), the noisy NeRF normals and
visibility produce artifacts in the final rendering.

2.6.5 Estimation Consistency Across Different Illuminations

In this experiment, we study how different illumination conditions affect the albedo
estimation by NeRFactor. More specifically, we probe how consistent the estimated
albedo predictions are across different input illumination conditions. To this end, we
light the ficus scene with four drastically different lighting conditions as shown in
Figure 2-20, and then estimate the albedo with NeRFactor from these four sets of
multi-view images.

[Figure 2-20 layout: estimated albedo under each of Cases A–D (the illumination used for rendering), shown alongside the pairwise albedo PSNR matrix; the color bar spans 34.7–40.0 dB.]

Albedo PSNR (dB) ↑
     A      B      C      D
A    ∞      38.1   38.0   35.7
B    38.1   ∞      36.4   37.7
C    38.0   36.4   ∞      34.7
D    35.7   37.7   34.7   ∞

Figure 2-20: Albedo estimation of NeRFactor across different illuminations. The
albedo fields recovered by NeRFactor are largely consistent across varying illumination
conditions of the input images. Although both Case C and Case D have the Sun as
the primary light source, the performance on Case D is worse (e.g., the specularity
residuals on the vase) because it is a challenging high-frequency lighting condition
that has the Sun intensity properly measured by Stumpfel et al. [2004], while Case C
is an internet light probe that clips the Sun intensity.

As Figure 2-20 shows, NeRFactor's predictions are similar across the four input
illuminations, with pairwise PSNR ≥ 34.7 dB. Note that the performance on Case
D is worse (e.g., the specularity residuals on the vase) than on Case C, even though
both cases have the Sun as the primary light source. The reason is that
Case D had the Sun pixels properly measured by Stumpfel et al. [2004], whereas Case
C is an internet light probe that clipped the Sun pixels. Therefore, Case D has a
much higher-frequency lighting condition than Case C, making it a harder case for
NeRFactor to correctly factorize the appearance.
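The pairwise consistency numbers in Figure 2-20 can be reproduced with a simple computation like the sketch below (assuming albedo maps in [0, 1] and any per-map color correction, as in Section 2.5.3, applied beforehand).

    import numpy as np

    def pairwise_albedo_psnr(albedo_maps):
        """Pairwise PSNR (dB) between albedo maps estimated under different illuminations.

        albedo_maps: (K, H, W, 3) float images in [0, 1]; returns a (K, K) matrix with
        infinity on the diagonal (a map compared with itself).
        """
        k = len(albedo_maps)
        psnr = np.full((k, k), np.inf)
        for i in range(k):
            for j in range(k):
                if i != j:
                    mse = np.mean((albedo_maps[i] - albedo_maps[j]) ** 2)
                    psnr[i, j] = 10.0 * np.log10(1.0 / (mse + 1e-12))
        return psnr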

2.7 Conclusion

We have demonstrated Neural Reflectance and Visibility Fields (NeRV) for recovering relightable neural volumetric representations from images of scenes illu-
minated by environmental and indirect lighting, by using a visibility Multi-Layer
Perceptron (MLP) to approximate portions of the volume rendering integral that
would otherwise be intractable to estimate during training by brute-force sampling
[Srinivasan et al., 2021]. We believe that this work is an initial foray into leveraging
learned function approximation to alleviate the computational burden incurred by
using rigorous physically-based differentiable rendering for inverse rendering.
Because NeRV requires the object to be lit by multiple arbitrary but known illuminations,
we then devise Neural Factorization of Shape and Reflectance
(NeRFactor), which relaxes this capture requirement, thereby enabling appearance
decomposition under a single unknown lighting condition [Zhang et al., 2021c]. We
have presented how NeRFactor recovers an object’s shape and reflectance from multi-
view images and their corresponding camera poses. Importantly, NeRFactor recovers
these properties from images under an unknown lighting condition, while the majority
of prior work, including NeRV, requires multiple known lighting conditions.
To address the ill-posed nature of this problem, NeRFactor relies on priors to es-
timate a set of plausible shape, reflectance, and illumination that collectively explain
the observed images. These priors include simple yet effective spatial smoothness
constraints (implemented in the context of MLPs) and a data-driven prior on real-world Bidirectional Reflectance Distribution Functions (BRDFs). We demonstrate
that NeRFactor achieves high-quality geometry sufficient for relighting and view syn-
thesis, produces convincing albedo as well as spatially-varying BRDFs, and generates
lighting estimations that correctly reflect the presence or absence of dominant light
sources. With NeRFactor’s factorization, we can relight the object with point lights
or light probe images, render images from arbitrary viewpoints, and even edit the ob-
ject’s albedo and BRDF. We believe this work makes important progress towards the
goal of recovering fully-featured 3D graphics assets from casually-captured photos.
Although we demonstrate that NeRFactor outperforms baseline methods and vari-
ants with different design choices, there are a few important limitations. First, to keep
light visibility computation tractable, we limit the resolution of the light probe images
to 16 × 32, a resolution that may be insufficient for generating very hard shadows
and recovering very high-frequency BRDFs. As such, when the object is lit by a very
high-frequency illumination such as the one in Figure 2-20 (Case D) where the sun
pixels are fully High-Dynamic-Range (HDR), there might be specularity or shadow
residuals in the albedo estimation such as those on the vase.
Second, for fast rendering, NeRFactor considers only single-bounce direct illumi-
nation, so NeRFactor does not properly account for indirect illumination effects. This
is in contrast to NeRV that also models one-bounce indirect illumination in addition
to direct illumination. It is an interesting future direction to combine NeRV and NeR-
Factor into a model that handles images taken under one unknown lighting condition
as input while modeling indirect illumination.
Finally, NeRFactor initializes its geometry estimation with NeRF in contrast to
NeRV that optimizes the geometry from scratch. While NeRFactor is able to fix
errors made by NeRF (and NeRV) up to a certain degree, it can fail if NeRF estimates
particularly poor geometry in a manner that happens to not affect view synthesis. We
observe this in the real scenes, which contain faraway incorrect “floating” geometry
that is not visible from the input cameras but casts shadows on the objects of interest.
It would be desirable to have a model that optimizes geometry from scratch like
NeRV while achieving geometry as high-quality as NeRFactor's.

Chapter 3

Mid-Level Abstraction: The Light Transport Function

In this chapter, we model appearance at a middle level of abstraction using the light
transport (LT) function, without further decomposing it into the underlying shape
and reflectance. Specifically, we address the problem of interpolating the LT function
in both light and view directions from sparse samples of the LT function. By doing so,
one is able to synthesize the appearance from a novel viewpoint under any arbitrary
lighting. We start with an introduction of LT acquisition (Section 3.1) and then
review the related work in Section 3.2.
Next, we present Light Stage Super-Resolution (LSSR) [Sun et al., 2020]
that is capable of interpolating the LT function smoothly and stably in the light
direction, thereby enabling continuous, high-frequency relighting from a fixed view-
point (Section 3.3). To also support view synthesis, we further devise Neural Light
Transport (NLT) [Zhang et al., 2021b] that interpolates the LT function in both
view and light directions, thereby supporting simultaneous relighting and view syn-
thesis (Section 3.4).
In Section 3.5, we describe our experiments that evaluate how well LSSR and NLT
perform relighting and/or view synthesis and how they compare with the existing
solutions to the two tasks. We also perform additional analyses, in Section 3.6,
to study the importance of each major component of the LSSR and NLT models,

analyze the frequency band above which LSSR is needed to achieve high-frequency
relighting, test whether LSSR and NLT are applicable to smaller light stages with
fewer lights, and finally stress-test NLT with degrading quality of the input geometry.

3.1 Introduction

The light transport (LT) of a scene models how light interacts with objects in the
scene to produce an observed image. The process by which geometry and material
properties of the scene interact with global illumination to result in an image is a
complicated but well-understood consequence of physics [Pharr et al., 2016]. Much
progress in computer graphics has been made through the development of more expres-
sive and efficient mappings from scene models (geometry, materials, and lighting) to
images. In contrast, inverting this process is ill-posed and therefore more difficult:
Acquiring the LT in a scene from images requires untangling the myriad intercon-
nected effects of occlusion, shading, shadowing, interreflections, scattering, etc.
Solving this task of inferring aspects of LT from images is an active research
area, and even partial solutions have significant practical uses such as phototourism
[Snavely et al., 2006], telepresence [Orts-Escolano et al., 2016], storytelling [Kelly
et al., 2019], and special effects [Debevec, 2012]. A less obvious, but equally important
application of inferring LT from images consists of generating groundtruth data for
machine learning tasks: Many works rely on high-quality renderings of relit subjects
under arbitrary lighting conditions and from multiple viewpoints to perform relighting
[Meka et al., 2019, Sun et al., 2019], view synthesis [Pandey et al., 2019], re-enacting
[Kim et al., 2018], and alpha matting [Sengupta et al., 2020].
Previous work has shown that it is possible to construct a light stage [Debevec
et al., 2000], plenoptic camera [Levoy and Hanrahan, 1996], or gantry [Murray-
Coleman and Smith, 1990] that directly captures a subset of the LT function and
thereby enables the image-based rendering thereof. The simplest light stage com-
prises just one camera but multiple LED lights distributed (roughly) evenly over the
hemisphere. By programmatically activating and deactivating the LED lights, the light stage acquires samples of the LT function for this particular viewing direction,
which we refer to as a One-Light-at-A-Time (OLAT) image set. Because light is ad-
ditive, this OLAT scan serves as a set of “bases” that can be linearly combined to
relight the subject according to any desired light probe [Debevec et al., 2000].
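Concretely, image-based relighting from an OLAT scan reduces to a weighted sum of the basis images; a minimal sketch follows (how the per-light RGB weights are derived from an HDRI probe is omitted here, and the function name is ours).

    import numpy as np

    def relight_with_probe(olat_images, probe_weights):
        """Image-based relighting from an OLAT scan (sketch).

        olat_images: (L, H, W, 3), one image per light-stage light.
        probe_weights: (L, 3) RGB weights, e.g., obtained by integrating an HDRI light
        probe over each light's solid angle.
        """
        # Sum_l probe_weights[l, c] * olat_images[l, h, w, c] for every pixel and channel.
        return np.einsum("lc,lhwc->hwc", probe_weights, olat_images)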

These techniques are widely used in film productions and within the research
community. However, these systems can only provide sparse sampling of the LT
function limited to the numbers of cameras and lights, resulting in the inability to
produce photorealistic renderings outside the supported camera or light locations.
Specifically, in order to achieve photorealistic relighting under all possible lighting
conditions, one needs to place the lights close enough on the stage dome such that
shadows and specularities in the captured images of adjacent lights “move” by less
than one pixel. Yet, practical constraints (such as cost and the difficulty of powering
and synchronizing many lights) hinder the construction of light stages with high light
densities. Even if such a high-density light stage could be built, the time for acquiring
an OLAT set grows linearly w.r.t. the number of lights, making it difficult to capture
human subjects, who must stay stationary throughout the scan. For these reasons, even
the most sophisticated light stages in existence today comprise only a few hundred
(e.g., 330 in our case) lights that are spaced many degrees apart.

This means that the LT function is undersampled w.r.t. the angular sampling of
the lights, and that the images rendered by conventional approaches will likely con-
tain ghosting. Rendering an image for a “virtual light” that lies between the real lights
by computing a weighted average of the adjacent OLAT images does not produce soft
shadows or streaking specularities; instead, it produces the superposition of multiple
sharp shadows and specular dots (see Figure 3-1 [b]). These artifacts
are particularly problematic for human faces, which exhibit complex reflectance prop-
erties (such as specularities, scattering, etc.) [Weyrich et al., 2006] and are likely to
fall into the uncanny valley [Seyama and Nagayama, 2007].

To this end, we propose a learning-based solution, which we dub Light Stage Super-Resolution (LSSR), for super-resolving the angular resolution of light stage
scans of human faces [Sun et al., 2020]. Given an OLAT scan of a human face

(a) I₁ and I₂, captured for adjacent lights ω₁ and ω₂ on the light stage. (b) Linearly blended image, (I₁ + I₂)/2. (c) LSSR rendering I((ω₁ + ω₂)/2).

Figure 3-1: LSSR overview. (a) Although light stages are a powerful tool for capturing
and subsequently relighting human subjects, their rendering suffers from adjacent
lights on the stage being separated by some distance. (b) Producing an image for a
“virtual” light lying between the stage's physical lights with conventional image blending
techniques results in ghosting in shadowed and specular regions, as seen here on the
subject's eyes and cheek. (c) By training a deep neural network to regress
from a light direction to an image, our model is able to accurately render the subject
for an arbitrary virtual light direction – as the light moves, highlights and shadows
move smoothly rather than incorrectly blend together, thereby enabling realistic high-
frequency relighting effects.

with a finite set of images and the direction of a desired virtual light, LSSR predicts a high-
resolution RGB image that appears to have been lit by the target light, even though
that light is not present on the light stage (see Figure 3-1 [c]). Our approach can
additionally enable the construction of simpler light stages with fewer lights, thereby
reducing cost and increasing the frame rate at which subjects can be scanned. LSSR
also produces denser renderings for Machine Learning (ML) applications that require
light stage data for training (once trained, such ML systems usually take single images
as input, without requiring a light stage at test time), such as portrait relighting
[Sun et al., 2019] and shadow removal [Zhang et al., 2020].
LSSR needs to work with the inherent aliasing and regularity of the light stage
data. We address this by combining the power of deep neural networks with the ef-
ficiency and generality of conventional linear interpolation methods. Specifically, we
use an active set of the closest lights within our network (Section 3.3.1) and develop a
novel alias-free pooling approach to combine their network activations (Section 3.3.2)
using a weighting operator guaranteed to be smooth when lights enter or exit the active
set. Our network allows us to super-resolve an OLAT scan of a human face: We
can repeatedly query our trained model with thousands of light directions and treat
the resulting set of synthesized images as though they were acquired by a physically-
unconstrained light stage with an unbounded sampling density. As we will demon-
strate, these super-resolved virtual OLAT scans allow us to produce photorealistic
rendering of human faces with arbitrarily high-frequency illumination contents.
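To illustrate the kind of weighting this requires, the sketch below selects the k nearest lights to a query direction and assigns weights that fall to zero exactly as a light reaches the boundary of the active set, so weights vary continuously as lights enter or exit. This is only one simple scheme with the stated property, not LSSR's actual pooling operator.

    import numpy as np

    def active_set_weights(query_dir, light_dirs, k=8):
        """Pick the k nearest stage lights to a query direction and weight them so that
        a light's weight vanishes smoothly as it leaves the active set (sketch).

        query_dir: (3,) unit vector; light_dirs: (L, 3) unit vectors, L > k.
        Returns (indices of the k active lights, normalized weights).
        """
        d = np.arccos(np.clip(light_dirs @ query_dir, -1.0, 1.0))  # angular distances
        idx = np.argsort(d)[: k + 1]                               # k nearest + boundary light
        d_max = d[idx[-1]]                                         # distance of the (k+1)-th light
        w = np.maximum(0.0, 1.0 - d[idx[:k]] / d_max)              # -> 0 at the boundary
        return idx[:k], w / (w.sum() + 1e-12)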

Besides lights, cameras also dictate how densely one can sample the LT function
with a light stage. For reasons similar to those above, a state-of-the-art light stage
comprises only around 50–100 cameras, covering a limited number of viewing direc-
tions. This problem of “absent cameras” is arguably more concerning than that of
absent lights: With only the physical lights, renderings can still be produced, albeit with
artifacts like ghosting shadows, while with only the physical cameras, one is simply
unable to render the subject from a “virtual camera.” Indeed, traditional Image-Based
Rendering (IBR) approaches are usually designed for fixed viewpoints and are unable
to synthesize unseen (novel) views under a desired illumination.

In the remainder of this chapter, we additionally model viewpoint changes and learn to interpolate the dense LT function of a scene from sparse multi-view, OLAT
images through a semi-parametric technique that we dub Neural Light Transport
(NLT) [Zhang et al., 2021b] (Figure 3-2). Many prior works have addressed similar
tasks (as will be discussed in Section 3.2), with classic works tending to rely on
physics to recover analytical and interpretable models, and recent works using neural
networks to infer a more direct mapping from input images to an output image.

Traditional rendering methods often make simplifying assumptions when modeling geometry, Bidirectional Reflectance Distribution Functions (BRDFs), or com-
plex inter-object interactions in order to make the problem tractable. On the other
hand, deep learning approaches can tolerate geometric and reflectance imperfection,
but they often require many aspects of image formation (even those guaranteed by
physics) to be learned “from scratch,” which may necessitate a prohibitively large train-
ing set. NLT is intended to straddle this divide: We construct a classical model of the
subject being imaged (a mesh and a diffuse texture atlas per Lambertian reflectance),

[Figure 3-2 panels: (A) the 6D light transport function f(x, ω_i, ω_o) being learned, where x is a surface UV location (2D), ω_i the light direction (2D), and ω_o the viewing direction (2D); (B) capture setup: multiple views, one light at a time; (C) simultaneous relighting and view synthesis; (D) HDRI relighting.]

Figure 3-2: NLT overview. (A) NLT learns to interpolate the 6D light transport
function of a surface as a function of the UV coordinate (2 DOFs), incident light
direction (2 DOFs), and viewing direction (2 DOFs). (B) The subject is imaged from
multiple viewpoints when lit by different directional lights; a geometry proxy is also
captured using active sensors. (C) Querying the learned function at different light
and/or viewing directions enables simultaneous relighting and view synthesis of this
subject. (D) The relit renderings that NLT produces can be combined according to
HDRI maps to perform image-based relighting.

but then we embed a neural network within the parameterization provided by that
classical model, construct the inputs and outputs of the model in ways that leverage
domain knowledge of classical graphics techniques, and train that network to model
all aspects of LT—including those not captured by a classical model. By leveraging
a classical model this way, NLT is able to learn an accurate model of the complicated
LT function for a subject from a small training dataset of sparse observations.

A key novelty of NLT is that our learned model is embedded within the texture
atlas space of an existing geometric model of the subject, which provides a novel
framework for simultaneous relighting and view interpolation. We express the 6D LT
function (Figure 3-2) at each location on the surface of our geometric model as simply
the output of a neural network, which works well (as neural networks are smooth
and universal function approximators [Hornik, 1991]) and obviates the need for a
complicated parameterization of spatially-varying reflectance. We evaluate NLT on
joint relighting and view synthesis using sparse image observations of scanned human
subjects within a light stage, and show state-of-the-art results as well as compelling
practical applications.

In summary, NLT makes the following three main contributions:
• an end-to-end, semi-parametric method for learning to interpolate the 6D light
transport function per-subject from real data using convolutional neural net-
works (Section 3.4.2),
• a unified framework for simultaneous relighting and view synthesis by embed-
ding networks into a parameterized texture atlas and leveraging as input a set
of OLAT images (Section 3.4.4), and
• a set of augmented texture-space inputs and a residual learning scheme on
top of a physically accurate diffuse base (see the sketch after this list), which
together allow the network to easily learn non-diffuse, higher-order light transport
effects including specular highlights, subsurface scattering, and global illumination
(Section 3.4.1 and Section 3.4.3).
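As promised above, here is a minimal sketch of the diffuse base and the residual scheme; the texture-space map shapes and function names are illustrative, not NLT's actual interface.

    import numpy as np

    def diffuse_base(albedo_uv, normals_uv, light_dir):
        """Lambertian base rendering in texture space: albedo/pi * max(0, n . l) (sketch).

        albedo_uv, normals_uv: (H, W, 3) texture-space maps; light_dir: (3,) unit vector.
        """
        cos = np.clip(np.einsum("hwc,c->hw", normals_uv, light_dir), 0.0, None)
        return albedo_uv / np.pi * cos[..., None]

    def nlt_style_output(albedo_uv, normals_uv, light_dir, predicted_residual):
        """The network predicts a residual on top of the diffuse base, so non-diffuse
        effects (speculars, scattering, interreflections) are learned as corrections."""
        return np.clip(diffuse_base(albedo_uv, normals_uv, light_dir) + predicted_residual, 0.0, None)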
NLT allows for photorealistic free-viewpoint rendering under controllable lighting
conditions, which is not only a key ingredient of compelling user experiences in mixed
reality and special effects, but also applicable to a variety of machine learning tasks
that rely on photorealistic ground-truth data.

3.2 Related Work

The angular undersampling from the light stage relates to much work over the past
two decades on frequency analyses of light transport [Ramamoorthi and Hanrahan,
2001, Sato et al., 2003, Durand et al., 2005] and analyses of sampling rates in Image-
Based Rendering (IBR) [Adelson and Bergen, 1991] for the related problem of view
synthesis [Mildenhall et al., 2019]. This problem also bears some similarities to multi-
image super-resolution [Milanfar, 2010] and angular super-resolution in light fields
[Kalantari et al., 2016, Cheng et al., 2019], where aliased observations are combined
to produce interpolated results. In LSSR, we leverage priors and deep learning to go
beyond these sampling limits, upsampling or super-resolving the sparse lights on the
light stage to achieve continuous, high-frequency relighting.
Recently, many approaches for acquiring a sparse light transport matrix have been
developed, including methods based on compressive sensing [Peers et al., 2009, Sen and Darabi, 2009], kernel Nyström [Wang et al., 2009], optical computing [O'Toole
and Kutulakos, 2010], and neural networks [Ren et al., 2013, 2015, Kang et al., 2018].
However, these methods are not designed for the light stage and are largely orthogonal
to our approach. They seek to acquire the transport matrix for a fixed light sampling
resolution with a sparse set of patterns, while we seek to take this initial sampling
resolution and upsample it to a much higher resolution (which indeed enables con-
tinuous, high-frequency relighting). Most recently, Xu et al. [2018b] proposed a deep
learning approach for image-based relighting from only five light directions, but the
approach cannot reproduce accurate shadows. While we do use many more lights, we
achieve significantly higher-quality results with accurate shadows.

The general approach of using light stages for image-based relighting stands in
contrast to more model-based approaches. Traditionally, instead of super-resolving a
light stage scan, one could use that scan as input to a photometric stereo algorithm
[Woodham, 1980], and attempt to recover the normal and the albedo maps of the
subject. More advanced techniques were developed to produce a parametric model
of the geometry and reflectance for even highly specular objects [Tunwattanapong
et al., 2013]. There are also works that focus on recovering a parametric face model
from a single image [Sengupta et al., 2018], constructing a volumetric model for view
synthesis [Lombardi et al., 2018], or a neural representation of a scene [Tewari et al.,
2020b]. However, the complicated reflectance and geometry of human subjects are
difficult to even parameterize analytically, let alone recover. Although recent progress
may enable accurate captures of human faces using parametric models, there are ad-
ditional difficulties in capturing a complete portrait due to the complexity of hair,
eyes, ears, etc. Indeed, this complexity has motivated the use of image-based relight-
ing via light stages in the visual effects industry for many years [Tunwattanapong
et al., 2011, Debevec, 2012].

Interpolating a reflectance function has also been investigated in the literature.


Masselus et al. [2004] compare the errors of fitting the sampled reflectance function to
various basis functions and conclude that multilevel B-splines can preserve the most
features. More recently, Rainer et al. [2019] utilize neural networks to compress and interpolate sparsely sampled observations. However, these algorithms interpolate the
reflectance function independently on each pixel and do not consider local information
in the neighboring pixels. Thus, their results are smooth and consistent in the light
domain, but might not be consistent in the image domain. Fuchs et al. [2007] treat the
problem as a light super-resolution problem, similar to our work. They use heuristics
to decompose the captured images into diffuse and specular layers, and apply optical
flow and level-set algorithms to interpolate highlights and light visibility, respectively.
This approach works well on highly reflective objects, but as we will demonstrate, it
usually fails on human skin, which contains high frequency bumps and cannot be well
modeled using only the diffuse and specular terms.
In recent years, light stages have also been demonstrated to be invaluable tools
for generating training data for use in machine learning tasks [Meka et al., 2019,
Guo et al., 2019, Sun et al., 2019, Nestmeyer et al., 2019]. This enables user-facing
effects that do not require acquiring a complete light stage scan of the subject, such
as portrait relighting from a single image [Sun et al., 2019, Inc., 2017] or Virtual
Reality (VR) experiences [Guo et al., 2019]. These learning-based applications suffer
from the same undersampling issue as do conventional uses of light stage data. For
example, Sun et al. [2019] observe artifacts when relighting with light probes that
contain high-frequency contents. We believe our method can provide better training
data and significantly improve many of these methods in the future.
Unlike LSSR, NLT additionally upsamples the light stage in the number of cam-
eras. In other words, NLT addresses the problem of recovering a model of light
transport from a sparse set of images of some subject, and then predicting novel im-
ages of that subject from unseen views under novel illuminations. This is a broad
problem statement that relates to and subsumes many tasks in graphics and vision.
We now categorize the existing approaches according to the type of input they take.

3.2.1 Single Observation

The most sparse sampling is just a single image, from which one could attempt to infer
a model (geometry, reflectance, and illumination) of the physical world that resulted in that image [Barrow and Tenenbaum, 1978], usually via hand-crafted [Barron and
Malik, 2014, Li et al., 2020a] or learned priors [Saxena et al., 2008, Eigen et al., 2014,
Sengupta et al., 2018, Li et al., 2018, 2020c, Kanamori and Endo, 2018, Kim et al.,
2018, Gardner et al., 2019, LeGendre et al., 2019, Alldieck et al., 2019, Wiles et al.,
2020, Zhang et al., 2020]. Though practical, these single-image techniques leave a
significant quality gap relative to what has been demonstrated by multi-image
techniques. Indeed, none of these methods shows complex light
transport effects such as specular highlights or subsurface scattering [Kanamori and
Endo, 2018, Kim et al., 2018]. Moreover, these methods are usually limited to a single
task, such as relighting [Kanamori and Endo, 2018, Kim et al., 2018, Sengupta et al.,
2018] or view synthesis [Alldieck et al., 2019, Wiles et al., 2020, Li et al., 2020a],
and some support only a limited range of viewpoint change [Kim et al., 2018, Tewari
et al., 2020a].

3.2.2 Multiple Views

Multi-view geometry techniques recover a textured 3D model that can be rendered using conventional graphics or photogrammetry techniques [Hartley and Zisserman,
2004], but have material and shading variation baked in, and do not enable relighting.
IBR techniques such as light fields [Levoy and Hanrahan, 1996] or lumigraphs [Gortler
et al., 1996] can be used to directly sample and render the plenoptic function [Adelson
and Bergen, 1991], but the accuracy of these techniques is limited by the density of
sampled input images. For unstructured inputs, reprojection-based methods [Buehler
et al., 2001, Eisemann et al., 2008] assume the availability of a geometry proxy (as does
our work), reproject nearby views to the query view, and perform image blending in
that view. However, such methods rely heavily on the quality of the geometry proxy
and cannot synthesize pixels that are not visible in the input views. A class-specific
geometry prior (such as that of a human body [Shysheya et al., 2019]) can be used to
increase the accuracy of a geometry proxy [Carranza et al., 2003], but none of these
methods enables relighting.
Recently, deep learning techniques have been used to synthesize new images from sparse sets of input images, usually by training neural networks to synthesize some
intermediate geometric representation that is then projected into the desired image
[Zhou et al., 2018, Sitzmann et al., 2019a,b, Srinivasan et al., 2019, Flynn et al.,
2019, Mildenhall et al., 2019, 2020, Thies et al., 2020]. Some techniques even entirely
replace the rendering process with a learned “neural” renderer [Thies et al., 2019,
Martin-Brualla et al., 2018, Pandey et al., 2019, Lombardi et al., 2019, 2018, Tewari
et al., 2020b]. Though effective, these methods generally do not attempt to explicitly
model light transport and hence do not enable relighting—though they are often ca-
pable of preserving view-dependent effects for the fixed illumination condition, under
which the input images were acquired [Thies et al., 2019, Mildenhall et al., 2020].
Additionally, neural rendering often breaks “backwards compatibility” with existing
graphics systems, while our approach infers images directly in texture space that can
be re-sampled by conventional graphics software (e.g., Unity, Blender, etc.) to syn-
thesize novel viewpoints. Recently, Chen et al. [2020] proposed to learn relightable
view synthesis from dense views (200 vs. 55 in this work) under image-based light-
ing; because it uses spherical harmonics as the lighting representation, that work cannot
produce the hard shadows cast by a directional light, as this work does.

3.2.3 Multiple Illuminants

Similar to the multi-view task is the task of photometric stereo [Woodham, 1980, Basri
et al., 2007] (as cameras function analogously to illuminants in some contexts [Sen
et al., 2005]): repeatedly imaging a subject with a fixed camera but under different
illuminations and then recovering the surface normals. However, most photometric
stereo solutions assume Lambertian reflectance and do not support relighting with
non-diffuse light transport. More recently, Ren et al. [2015], Meka et al. [2019], Sun
et al. [2019], and Sun et al. [2020] show that neural networks can be applied to relight a
scene captured under multiple lighting conditions from a fixed viewpoint. Nestmeyer
et al. [2020] decompose an image into a shaded albedo (which contains no cast shadows)
and residuals, unlike this work, which models cast shadows as part of a physically
accurate diffuse base.
None of these works supports view synthesis. Xu et al. [2019] perform free-viewpoint

relighting, but unlike our approach, they require running the model of Xu et al.
[2018b] as a second stage.

3.2.4 Multiple Views & Illuminants

Garg et al. [2006] utilize the symmetry of illuminations and view directions to collect
sparse samples of an 8D reflectance field, and reconstruct a complete field using a
low-rank assumption. Perhaps the most effective approach for addressing sparsity in
light transport estimation is to circumvent this problem entirely and densely sample
whatever is needed to produce the desired renderings. The landmark work of Debevec
et al. [2000] uses a light stage to acquire the full reflectance field of a subject by
capturing a One-Light-at-A-Time (OLAT) scan of that subject, which can be used
to relight the subject by linear combination according to some High-Dynamic-Range
Imaging (HDRI) light probe. Despite its excellent results, this approach lacks an
explicit geometric model, so rendering is limited to a fixed set of viewpoints. This
limitation has been partially addressed by Ma et al. [2007], who focus on facial capture;
more recently, the system of Guo et al. [2019] builds a full volumetric relightable
model using two spherical gradient illumination conditions [Fyffe, 2009]. This system
supports relighting and view synthesis but assumes predefined BRDFs and therefore
cannot synthesize more complex light transport effects present in real images.
Zickler et al. [2006] also pose the problem of appearance synthesis as that of
high-dimensional interpolation, but they use radial basis functions on smaller-scale
data. Our work follows the convention of the nascent field of “neural rendering”
[Thies et al., 2019, Lombardi et al., 2019, 2018, Sitzmann et al., 2019a, Tewari et al.,
2020b, Mildenhall et al., 2020], in which a separate neural network is trained for each
subject to be rendered, and all images of that subject are treated as “training data.”
These approaches have shown great promise in terms of their rendering fidelity, but
they require per-subject training and are unable to generalize across subjects yet.
Unlike prior work that focuses on a specific task (e.g., relighting or view synthesis),
our texture-space formulation allows for simultaneous light and view interpolation.
Furthermore, our model is a valuable training data generator for many works that

rely on high-quality renderings of subjects under arbitrary lighting conditions and
from multiple viewpoints, such as [Meka et al., 2019, Sun et al., 2019, Pandey et al.,
2019, Kim et al., 2018, Sengupta et al., 2020].

3.3 Method: Precise, High-Frequency Relighting


A One-Light-at-A-Time (OLAT) scan of a subject captured by a light stage consists
of 𝑛 images, where each image is lit by a single light in the stage. The conventional
way to relight the captured subject with an arbitrary light direction is to linearly
blend the images captured under nearby lights in the scan. As shown in Figure 3-
1, this often results in “ghosting” artifacts on shadows and highlights. The goal of
this work is to use machine learning instead of simple linear interpolation to produce
higher-quality results.
Our model, Light Stage Super-Resolution (LSSR) [Sun et al., 2020], takes as input
a query light direction 𝜔 and a complete OLAT scan consisting of images paired with
their light directions {I𝑖 , 𝜔𝑖 }, and uses a deep neural network Φ to obtain the predicted
image I:
$$I(\omega) = \Phi\big(\{I_i, \omega_i\}_{i=1}^{n},\, \omega\big). \qquad (3.1)$$

This formalization is broad enough to describe some prior works on learning-based
relighting [Xu et al., 2018b, Meka et al., 2019]. While these methods usually operate
by training a U-Net [Ronneberger et al., 2015] to map from a sparse set of input images
to an output image, we focus on producing the highest-quality renderings possible
given the complete OLAT scan.
However, feeding all captured images into a conventional Convolutional Neural
Network (CNN) is not tractable in terms of speed or memory. In addition, this
naive approach seems somewhat excessive for practical applications involving human
faces: While complex translucency and interreflection may require multiple lights to
reproduce, it is unlikely that all images in the OLAT scan are necessary to produce
the image for any particular query light direction, especially given that barycentric
interpolation requires only three nearby lights to produce a plausible rendering.

Our work attempts to find an effective and tractable compromise between these
two extremes, in which the power of deep neural networks is combined with the
efficiency and generality of nearest neighbor approaches. This is accomplished by a
linear blending approach that, like barycentric blending, ensures the output rendering
is a smooth function of the input, where the blending is performed on the activations of
a neural network’s encoding of our input images instead of on the raw pixel intensities
of the input images.

The complete network structure of LSSR is shown in Figure 3-3. Given a query
light direction 𝜔, we identify the 𝑘 captured images in the OLAT scan whose cor-
responding light directions are nearby the query light direction, which we call the
active set A(𝜔). These OLAT images {I𝑖 }𝑘𝑖=1 and their corresponding light directions
{𝜔𝑖 }𝑘𝑖=1 are then each independently processed in parallel by the encoder Φ𝑒 (·) of our
CNN (or equivalently, they are processed as a single “batch”), thereby producing a
multi-scale set of internal network activations that describe all 𝑘 images. After that,
the set of 𝑘 activations at each layer of the network are pooled into a single set of
activations at each layer, using weighted averaging where the weights are a function
of the query light and each input light W(𝜔, 𝜔𝑖 ). This weighted average is designed
to remove the aliasing introduced by nearest neighbor sampling for the active set
construction.

Together with the query light direction 𝜔, these pooled feature maps are then
fed into the decoder Φ𝑑 (·) by means of skip links from each level of the encoder,
thereby producing the final predicted image I (𝜔). Formally, our final image synthesis
procedure is:

$$I(\omega) = \Phi_d\Bigg(\sum_{i \in \mathcal{A}(\omega)} W(\omega, \omega_i)\, \Phi_e(I_i, \omega_i),\; \omega\Bigg). \qquad (3.2)$$

This hybrid approach of nearest neighbor selection and neural network processing
allows us to learn a single neural network that produces high-quality results and
generalizes well across query light directions and across subjects in our OLAT dataset.
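To make this data flow concrete, the following Python sketch mirrors Equation 3.2; the callables phi_e, phi_d, and weight_fn are hypothetical stand-ins for the trained encoder, the trained decoder, and the alias-free weighting of Section 3.3.2 (an illustration of the formulation, not the actual implementation):

    import numpy as np

    def synthesize(query_dir, olat_images, olat_dirs, phi_e, phi_d, weight_fn, k=8):
        """Sketch of Eq. 3.2: encode the active set, blend activations, decode.

        query_dir:   (3,) unit vector of the desired light direction.
        olat_images: (n, H, W, 3) OLAT scan; olat_dirs: (n, 3) unit light directions.
        phi_e, phi_d: stand-ins for the trained encoder and decoder networks.
        weight_fn:   alias-free weighting of Section 3.3.2, returning one weight per neighbor.
        """
        # Active set: the k OLAT lights closest to the query direction.
        active = np.argsort(-olat_dirs @ query_dir)[:k]
        # Encode each neighbor independently (processed as one "batch" in practice).
        feats = [phi_e(olat_images[i], olat_dirs[i]) for i in active]
        # Alias-free weighted average of the encoder activations
        # (in the real model this pooling happens at every encoder level).
        w = weight_fn(query_dir, olat_dirs[active])
        pooled = sum(wi * fi for wi, fi in zip(w, feats))
        # Decode the pooled activations, conditioned on the query light direction.
        return phi_d(pooled, query_dir)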

Our approach for the active set construction is explained in Section 3.3.1, our
alias-free pooling is explained in Section 3.3.2, the network architecture is described

Figure 3-3: Visualization of the LSSR architecture. The encoder Φ𝑒 (·) takes as input a
concatenation of the eight nearby OLAT images in the active set and their light directions,
which are processed by a series of convolutional layers. The resulting activations
of these eight images at each level are then combined using our alias-free pooling
(described in Section 3.3.2) and skip-connected to the decoder. The decoder Φ𝑑 (·)
takes as input the query light direction 𝜔, processes it with fully connected layers,
then upsamples it (along with the skip-connected encoder activations), and finally
decodes the image using a series of transposed convolutional layers. Whether or not
a (transposed) convolutional layer alters resolution is indicated by whether its edge
spans two spatial scales.

in Section 3.3.3, and our progressive training procedure is discussed in Section 3.3.4.

3.3.1 Active Set Construction

Light stages are conventionally constructed by placing lights on a regular hexagonal


tessellation of a sphere (with some “holes” for cameras or other practical concerns),
as shown in Figure 3-4. As discussed, at test time, our model works by identifying
the OLAT images and lights that are the nearest to the desired query light direction,
and then averaging their neural activations.
However, this natural approach, when combined with the regularity in light sam-
pling on the light stage, presents a number of problems for training our model. First,
we can only supervise our super-resolution model using the physical lights that “pre-
tend” to be virtual, since these are the only light directions for which we have ground-
truth images (this is another problem when evaluating our model, as will be discussed
in Section 3.5.2). Second, this regular hexagonal sampling means that, for any given
light in the stage, the distances to its neighbors will always exhibit a highly regular

Figure 3-4: Construction of the active sets in LSSR. The LEDs on a light stage form a
regular (hexagonal here) pattern, giving highly regular light-to-light distances (a). At
test time, however, a novel light direction may not lie on this hexagonal grid, making
its distances to the neighbors irregular (c). We therefore sample a random subset of
the nearest neighbors as the active set during training (b), which forces the network
to reason about irregular distances from the query light to its neighbors.

pattern (Figure 3-4 [a]). For example, the six nearest neighbors of every light on the
hexagonal grid are guaranteed to lie at exactly the same distance from that light. In
contrast, at test time, we need to produce renderings for query light directions that
correspond to arbitrary points on the sphere, and those points will likely possess irreg-
ular distances to their neighboring lights (Figure 3-4 [c]). This presents a significant
distribution mismatch between our training and test data. As such, we would expect
poor generalization at test time if we were to naïvely train on highly regular sets of
nearest neighbors.

To address this issue, we adopt a different approach of sampling neighbors for our
active set at training time. For each training iteration, we first identify a larger set of
𝑚 nearest neighbors around the query light (which, in this case, is just one of the real
lights on the stage), and among them randomly select only 𝑘 < 𝑚 neighbors to use
in the active set (in practice, we use 𝑚 = 16 and 𝑘 = 8). As shown in Figure 3-4 (b),
this results in irregular neighbor sampling patterns during training, which simulates
our test-time scenario where the query light is at a variety of locations relative to the
real input lights.
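A minimal sketch of this training-time sampling scheme, assuming unit light directions stored in a NumPy array (function and variable names are illustrative, not the actual implementation):

    import numpy as np

    def sample_training_active_set(query_idx, light_dirs, m=16, k=8, rng=np.random):
        """Randomly pick k of the m nearest lights around a training query light.

        query_idx:  index of the real light acting as the "virtual" query.
        light_dirs: (n, 3) unit directions of all lights on the stage.
        """
        sims = light_dirs @ light_dirs[query_idx]     # cosine similarity to the query light
        candidates = np.argsort(-sims)[:m]            # m nearest lights (query itself included)
        return rng.choice(candidates, size=k, replace=False)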

This approach shares a similar motivation with dropout in neural networks
[Srivastava et al., 2014], in which network activations are randomly set to 0 during

training to prevent overfitting. Here we instead randomly remove input images, which
also has the effect of preventing the model from overfitting to the hexagonal pattern
of the light stage by forcing it to operate on more varied inputs. Notice that the
query light itself is included in the candidate set, to reflect the fact that at test
time, the virtual query light may be right next to a physical light. As we will show
in Section 3.6.1 and in the supplementary video, this active set selection approach
results in a learned model whose synthesized shadows move more smoothly and at a
more regular rate than is achieved with a naïve nearest neighbor sampling approach.

3.3.2 Alias-Free Pooling

A critical component in LSSR is the skip links from each level of the encoder to
its corresponding level of the decoder. This model component is responsible for
producing network activations from the eight images in our active set and reducing
them to one set of activations to be decoded into the target image. This requires
a pooling operator that is permutation-invariant since the images in our active set
may correspond to arbitrary light directions and be presented in any order. Standard
permutation-invariant pooling operators, such as average or max pooling, are not
sufficient for our case because they do not suppress “aliasing” as discussed below.
As the query light direction moves across the sphere, its neighboring images will
enter and leave the active set of LSSR, which will cause the network activations
within our encoder to change abruptly (see Figure 3-5). If we use simple average
or max pooling, the activations in our decoder will also vary abruptly, resulting in
flickering artifacts or temporal instability in our output as the light direction varies.
The root cause of this problem is that our active set is an aliased observation of the
input images. Analogously, a signal (e.g., an image) should be prefiltered (e.g., with a
Gaussian blur) before being point-sampled in order to suppress aliasing artifacts.
Because average or max pooling allows this aliasing to persist, we introduce an
alias-free pooling technique to address this issue. We use a weighted average as our
pooling operator where the weight of each item in our active set is a continuous
function of the query light direction and is guaranteed to be 0 when this neighbor

Figure 3-5: LSSR’s alias-free pooling. As the query light moves, neighbors leave and
enter the active set, introducing aliasing that results in jarring temporal artifacts. To
address this, we use an alias-free pooling technique where the network activations are
averaged with weights varying smoothly and becoming exactly zero when lights enter
or leave the set.

enters or leaves the active set. We define our weighting function between the query
light direction 𝜔 and each OLAT light direction 𝜔𝑖 as follows:
$$\widetilde{W}(\omega, \omega_i) = \max\Big(0,\; e^{s(\omega\cdot\omega_i - 1)} - \min_{j\in\mathcal{A}(\omega)} e^{s(\omega\cdot\omega_j - 1)}\Big), \qquad (3.3)$$

$$W(\omega, \omega_i) = \frac{\widetilde{W}(\omega, \omega_i)}{\sum_j \widetilde{W}(\omega, \omega_j)}, \qquad (3.4)$$

where 𝑠 is a learnable parameter that adjusts the decay of the weight with respect
to the distance, and each 𝜔 is a normalized vector in the 3D space. During training,
parameter 𝑠 will be automatically adjusted to balance between using just the nearest
neighbor (𝑠 = +∞) and an unweighted average of all neighbors (𝑠 = 0).
Our weighting function is an offset spherical Gaussian, similar to the normalized
Gaussian distance between the query light’s Cartesian coordinates and those of the
other lights in our active set, where we have subtracted the raw weight for the most
distant light in the active set (and clipped the resulting weights at 0). This adaptive
truncation is necessary because the lights on the light stage may be spaced irregularly
(due to holes for cameras or other reasons), which means that a fixed truncation may
be too aggressive in setting weights to 0 in regions where lights are sampled less
frequently. We instead leverage the fact that when a light exits the active set, a new
light will enter it at exactly the same time with exactly the same distance to the query
light. This allows us to truncate our Gaussian weights using the maximum distance
in the active set, which ensures that lights have a weight of exactly 0 as they leave or

enter the active set. This results in rendering that evolves smoothly as we move the
query light direction.
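The weighting of Equations 3.3 and 3.4 can be sketched in a few lines of NumPy; here s is treated as a plain float rather than a learnable variable, and all directions are assumed to be unit vectors:

    import numpy as np

    def alias_free_weights(query_dir, active_dirs, s):
        """Eqs. 3.3-3.4: truncated, normalized spherical-Gaussian weights.

        query_dir:   (3,) unit query light direction.
        active_dirs: (k, 3) unit directions of the lights in the active set.
        s:           decay parameter (learnable in the real model).
        """
        raw = np.exp(s * (active_dirs @ query_dir - 1.0))   # spherical Gaussian per neighbor
        w = np.maximum(0.0, raw - raw.min())                # subtract the most distant light's weight
        return w / w.sum()                                  # normalize (Eq. 3.4)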

3.3.3 Network Architecture

The remaining components of LSSR consist of the conventional building blocks used
in constructing CNNs, as seen in Figure 3-3. The encoder of our network consists
of 3 × 3 convolutional blocks (with a stride of 2 to halve the resolution), each of
which is followed by group normalization [Wu and He, 2018] and a PReLU [He et al.,
2015]. The number of hidden units of each layer begins at 32 and doubles after
each layer, but is clipped at 512. The input to our encoder is a set of eight RGB
input images corresponding to the nearby light directions in our active set, to each
of which we concatenate the 𝑥𝑦𝑧-coordinates of the target light (tiled to every pixel),
giving us eight 6-channel input images. These images are processed along the “batch”
dimension of our network and therefore treated identically at each level of the encoder.
These eight images are then pooled down to a single “image” (i.e., a single batch) of
activations using our alias-free pooling (Section 3.3.2), which is then concatenated to
the internal activations of the decoder.
The decoder begins with a series of fully-connected (a.k.a. dense) blocks that take
as input the query light direction 𝜔, each of which is followed by instance normal-
ization [Ulyanov et al., 2016] and a PReLU. These activations are then upsampled to
4 × 4. Each layer of the decoder consists of a 3 × 3 transposed convolutional block
(with a stride of 2 to double resolution), again followed by group normalization and
a PReLU. The input to each layer’s convolutional block is a concatenation of the
upsampled activations from the previous decoder level, with the pooled activations
skip-connected from the encoder at the same spatial resolution. The final activation
function is a sigmoid function that outputs pixel values ∈ [0, 1]. Because our network
is fully convolutional [Long et al., 2015], it can be evaluated on images of an arbitrary
resolution, with the GPU memory being the sole limiting factor. In practice, we train
on 512 × 512 images for the sake of speed, but test on 1024 × 1024 images to maximize
the image quality.
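As a rough sketch of one such encoder stage, assuming a recent TensorFlow release that ships tf.keras.layers.GroupNormalization (layer hyperparameters such as the number of groups are illustrative choices, not the exact ones used):

    import tensorflow as tf

    def encoder_block(x, filters, groups=8):
        """One LSSR-style encoder stage: strided 3x3 conv, group norm, PReLU."""
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = tf.keras.layers.GroupNormalization(groups=groups)(x)   # requires TF >= 2.11
        return tf.keras.layers.PReLU(shared_axes=[1, 2])(x)

    # Channel schedule: start at 32 and double per layer, capped at 512.
    inputs = tf.keras.Input(shape=(512, 512, 6))   # RGB image + tiled xyz of the query light
    x = inputs
    for filters in [32, 64, 128, 256, 512, 512]:
        x = encoder_block(x, filters)
    encoder = tf.keras.Model(inputs, x)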

3.3.4 Loss Functions & Training Strategy

We supervise the training of LSSR using an ℓ1 loss on pixel intensities. Formally, our
loss function is:
$$\mathcal{L}_d = \sum_i \big\| M \odot \big(I_i - I(\omega_i)\big) \big\|_1, \qquad (3.5)$$

where I𝑖 is the ground-truth image under light 𝑖, and I (𝜔𝑖 ) is our prediction. When
computing the loss over the image, we use a precomputed binary mask M to exclude
the background pixels.
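A minimal TensorFlow sketch of this masked ℓ1 loss (tensor shapes are assumptions for illustration):

    import tensorflow as tf

    def masked_l1_loss(pred, target, mask):
        """Eq. 3.5: L1 loss on pixel intensities, restricted to foreground pixels.

        pred, target: (B, H, W, 3) images; mask: (B, H, W, 1) binary foreground mask.
        """
        return tf.reduce_sum(tf.abs(mask * (target - pred)))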
During training, we construct each training instance by randomly selecting a sub-
ject in our training dataset and then one OLAT light direction 𝑖. The image cor-
responding to that light I𝑖 will be used as the ground-truth image that our model
will attempt to reconstruct, and the query light direction is the light corresponding
to that image 𝜔𝑖 . We then identify a set of eight neighboring image-light pairs to
include in our active set using the selection procedure described in Section 3.3.1. Our
only data augmentation is a randomly-positioned 512 × 512 crop in each batch.
Progressive training has been found effective for accelerating and stabilizing the
training of Generative Adversarial Networks (GANs) for high-resolution image syn-
thesis [Karras et al., 2018]. Although our model is not a GAN (but a convolutional
encoder-decoder architecture with skip connections), we found it to also benefit from
progressive training. We first inject the downsampled image input directly into a
coarse layer of our encoder and supervise training by imposing a reconstruction loss
at a coarse layer of our decoder, resulting in a shallower model that is easier to train.
As training proceeds, we add additional convolutional layers to the encoder and de-
coder, thereby gradually increasing the resolution of our model until we arrive at
the complete network and the full image resolution. In total, we train our network
for 200,000 iterations, using eight NVIDIA V100 GPUs, which takes approximately
ten hours. Please see the detailed training procedure in the supplementary material
(Chapter B).
Our model is implemented in TensorFlow [Abadi et al., 2016] and trained using
Adam [Kingma and Ba, 2015] with a batch size of 1 (the batch dimension of our

tensors is used to represent the eight images in our active set), a learning rate of
10−3 , and the default hyperparameter settings (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−7 ).

3.4 Method: Free-Viewpoint Relighting


Having presented our model for interpolating the light transport (LT) function in
just light direction to enable relighting, we now study the more complex problem of
simultaneous relighting and view synthesis, which requires us to interpolate the LT
function in both the light and view directions.
Neural Light Transport (NLT) [Zhang et al., 2021b] is a semi-parametric model
with a residual learning scheme that aims to close the gap between the diffuse render-
ing of the geometry proxy and the real input image (Figure 3-6). The semi-parametric
approach is used to fuse previously recorded observations to synthesize a novel, pho-
torealistic image under any desired illumination and viewpoint.

(Figure 3-6 panels, left to right: Relightables [Guo et al., 2019], NLT (ours), and a real image.)

Figure 3-6: Gap in photorealism that NLT attempts to close. Even when high-
quality geometry and albedo can be captured (e.g., by Guo et al. [2019]), photoreal-
istic rendering remains challenging because any geometric inaccuracy will show up as
visual artifacts (e.g., black rims or holes in the hair), and manually creating spatially-
varying, photorealistic materials is onerous, if possible at all. NLT aims to close this
gap by learning directly from real images the residuals that account for geometric
inaccuracies and non-diffuse LT, such as global illumination.

The method relies on recent advances in computer vision that have enabled ac-
curate 3D reconstructions of human subjects, such as the technique of Collet et al.

[2015], which takes as input several images of a subject and produces as output a
mesh of that subject and a UV texture map describing its albedo. At first glance,
this appears to address the entirety of our problem: Given a textured mesh, we can
perform simultaneous view synthesis and relighting by simply re-rendering that mesh
from some arbitrary camera and under an arbitrary illumination. However, this sim-
plistic model of reflectance and illumination only permits equally simplistic relighting
and view synthesis, assuming Lambertian reflectance:

$$\tilde{L}_o(\mathbf{x}, \omega_o) = \rho(\mathbf{x})\, L_i(\mathbf{x}, \omega_i)\, \big(\omega_i \cdot \mathbf{n}(\mathbf{x})\big). \qquad (3.6)$$

Here $\tilde{L}_o(\mathbf{x}, \omega_o)$ is the diffuse rendering of a point $\mathbf{x}$ with a surface normal $\mathbf{n}(\mathbf{x})$ and
albedo $\rho(\mathbf{x})$, lit by a directional light $\omega_i$ with an incoming intensity $L_i(\mathbf{x}, \omega_i)$ and
viewed from 𝜔𝑜 . This reflectance model is only sufficient for describing matte surfaces
and direct illumination. More recent methods (such as the Relightables [Guo et al.,
2019]) also make strong assumptions about materials by modeling reflectance with a
cosine lobe model.
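For reference, Equation 3.6 evaluated at every surface point amounts to a few lines of NumPy (the clamp of back-facing points is an added convention, not part of the equation as written):

    import numpy as np

    def diffuse_rendering(albedo, normals, light_dir, light_intensity=1.0):
        """Eq. 3.6: Lambertian shading of surface points under one directional light.

        albedo:    (..., 3) per-point albedo rho(x).
        normals:   (..., 3) unit surface normals n(x).
        light_dir: (3,) unit direction omega_i toward the light.
        """
        cos = np.clip(normals @ light_dir, 0.0, None)[..., None]   # clamp back-facing points
        return albedo * light_intensity * cos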

The shortcomings of these methods are obvious when compared to a more expres-
sive rendering approach, such as the rendering equation [Kajiya, 1986], which makes
far fewer simplifying assumptions:
$$L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_s(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\, \big(\omega_i \cdot \mathbf{n}(\mathbf{x})\big)\, \mathrm{d}\omega_i. \qquad (3.7)$$

From this we observe the many limitations in computing $\tilde{L}_o(\mathbf{x}, \omega_o)$: It assumes a single
directional light instead of integrating over the hemisphere of all incident directions Ω,
it approximates an object’s Bidirectional Reflectance Distribution Function (BRDF)
𝑓𝑠 (·) as a single scalar, and it ignores emitted radiance 𝐿𝑒 (·) (in addition to scattering
and transmittance, which this rendering equation does not model either). The goal
of our learning-based model is to close the gap between $L_o(\mathbf{x}, \omega_o)$ and $\tilde{L}_o(\mathbf{x}, \omega_o)$, and
furthermore between $L_o(\mathbf{x}, \omega_o)$ and the observed image.

Though not perfect for relighting, the geometry and texture atlas provided by

Guo et al. [2019] offers us a mapping from each image of a subject onto a canonical
texture atlas that is shared across all views of that subject. This motivates the high-
level approach of our model: We use this information to map the input images of
the subject from the camera space (XY pixel coordinates) to the texture space (UV
texture atlas coordinates), then use a semi-parametric neural network embedded in
this texture space to fuse multiple observations and synthesize an RGB texture atlas
for the desired relit and/or novel-view image. This is then resampled back into the
camera space of the desired viewpoint, thereby giving us an output rendering of the
subject under the desired illumination and viewpoint.
In Section 3.5.1 and Section 3.4.1, we describe our data acquisition setup and the
input data to our framework. In Section 3.4.2, we detail the texture-space, two-path
neural network architecture at the core of our model, which consists of: I) “observation
paths” that take as input a set of observed RGB images that have been warped into
the texture space and produce a set of intermediate neural activations, and II) a
“query path” that uses these activations to synthesize a texture-space rendering of
the subject according to some desired light and/or viewing direction.
The texture-space inputs encode a rudimentary geometric understanding of the
scene and correspond to the arguments of the 6D LT function (i.e., UV location on
the 3D surface x, incident light direction 𝜔𝑖 , and viewing direction 𝜔𝑜 ). By using
a skip-link between the query path’s diffuse base image and its output as described
in Section 3.4.3, our model is encouraged to learn a residual between the provided
Lambertian rendering with geometric artifacts and the real-world appearance, which
not only guarantees the physical correctness of the diffuse LT, but also directs the
network’s attention towards learning higher-order, non-diffuse LT effects. In Sec-
tion 3.4.5, we explain how our model is trained end-to-end to minimize a photometric
loss and a perceptual loss in the camera space. Our model is visualized in Figure 3-7.

3.4.1 Texture-Space Inputs

In order to perform light and view interpolation, we use as input to our model a
set of OLAT images, the subject’s diffuse base, and the dot products of the surface

normals with the desired or observed viewing directions or light directions (a.k.a.
“cosine maps”), all in the UV space. This augmented input allows our learned model
to leverage insights provided by classic graphics models, as the dot products between
the normals and the viewing or lighting directions are the standard primitives in
parametric reflectance models (Equation 3.6, Equation 3.7, etc.).


Figure 3-7: NLT Model. Our network consists of two paths. The “observation paths”
take as input 𝐾 nearby observations (as texture-space residual maps) sampled around
the target light and viewing directions, and encode them into multiscale features that
are pooled to remove the dependence on their order and number. These pooled
features are then concatenated to the feature activations of the “query path,” which
takes as input the desired light and viewing directions (in the form of cosine maps)
as well as the physically accurate diffuse base (also in the texture space). This path
predicts a residual map that is added to the diffuse base to produce the texture
rendering. With the (differentiable) UV wrapping predefined by the geometry proxy,
we then resample the texture-space rendering back into the camera space where the
prediction is compared against the ground-truth image. Because the entire network
is embedded in the texture space of a subject, the same model can be trained to
perform relighting, view synthesis, or both simultaneously, depending on the input
and supervision.

These augmented texture-space input buffers superficially resemble the “G-buffers”
used in deferred shading models [Deering et al., 1988] and used with neural networks
in Deep Shading [Nalbach et al., 2017]. But unlike Nalbach et al. [2017], our goal
is to train a model for view and light interpolation using real images, instead of
renderings from a CG model. This different goal motivates additional novelties of

our approach, such as the embedding of our model in UV space (which removes
the dependency on viewpoints and implicitly provides aligned correspondence across
multiple views) and our use of a residual learning scheme (to encourage training
to focus on higher-order LT effects). Li et al. [2019] also successfully employ deep
learning in the texture space and regress Precomputed Radiance Transfer (PRT)
coefficients for deformable objects, but they learn only predefined diffuse and glossy
light transport from synthetic rendering.
We use three types of buffers in our model, as described below.

Cosine Map Assuming directional light sources, we calculate the cosine map of
a light as the dot product between the light’s direction 𝜔 and the surface’s normal
vector n(x). For each view and light (both observed and queried), we compute two
cosine maps: a view cosine map n(x) · 𝜔𝑜 and a light cosine map n(x) · 𝜔𝑖 . Crucially,
these maps are masked by visibility computed via ray casting from each camera onto
the geometry proxy, such that the light cosines also provide a rough understanding of
cast shadows (texels with zero visibility; see Figure 3-7), leaving the network the easier
task of adding, e.g., global illumination effects to these hard shadows.
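A sketch of how such a cosine map could be computed in the texture space, assuming a per-texel normal map and a precomputed visibility mask (names are illustrative):

    import numpy as np

    def cosine_map(normal_map, direction, visibility):
        """Texture-space cosine map n(x) . omega, masked by ray-cast visibility.

        normal_map: (H, W, 3) unit normals in the UV space.
        direction:  (3,) unit light or view direction omega.
        visibility: (H, W) in {0, 1}, 1 where the texel sees the light/camera.
        """
        return visibility * np.clip(normal_map @ direction, 0.0, None)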

Diffuse Base The diffuse base is obtained by summing up all OLAT images for
each view or equivalently, illuminating the subject from all directions simultaneously
(because light is additive). These multiple views are then averaged together in the
texture space, which mitigates the view-dependent effects and produces a texture
map that resembles albedo. Note that multiplying the diffuse base by a light cosine
map produces the diffuse rendering (with hard cast shadows) for that light, $\tilde{L}_o(\mathbf{x}, \omega_i)$.

The construction of this diffuse base is visualized in the bottom middle of Figure 3-7.

Residual Map We compute the difference between each observed OLAT image
and the aforementioned diffuse base, thereby capturing the “non-diffuse and non-
local” residual content of each input image. These residual maps are available only
for the sparsely captured OLAT from fixed viewpoints. To synthesize a novel view
for any desired lighting condition, our network uses a semi-parametric approach that

interpolates previously seen observations and their residual maps, generating the final
rendering.

3.4.2 Query & Observation Networks

Our semi-parametric approach is shown in Figure 3-7: The network takes as input
multiple UV buffers in two distinct branches, namely a “query path” and “observation
paths.” The query path takes as input a set of texture maps that can be generated
from the captured geometry, i.e., view/light cosine maps and a diffuse base. The
observation paths represent the semi-parametric nature of our framework and have
access to non-diffuse residuals of the captured OLAT images. The two branches are
merged in an end-to-end fashion to synthesize an unseen lighting condition from any
desired viewpoint.
To synthesize a new image of the subject under a desired lighting and viewpoint,
we have access to potentially all the OLAT images from multiple viewpoints. The
goal of the observation paths is to combine these images and extract meaningful fea-
tures that are passed to the query path to perform the final rendering. However,
using all these observations as input is not practical during training due to memory
and computational limits. Therefore, for a desired novel view and light condition,
we randomly select only 𝐾 = 1 or 3 OLAT images from the “neighborhood” as ob-
servations (the precise meaning of “neighborhood” will be clarified in Section 3.4.5).
The random sampling prevents the network from “cheating” by memorizing fixed
neighbors-to-query mappings and encourages it to learn that for a given query, differ-
ent observation selections should lead to the same prediction (also observed by Sun
et al. [2020]).
These observed images (in the form of UV-space residual maps as shown in Fig-
ure 3-7) are then fed in parallel (i.e., processed as a “batch”) into the observation
paths of our network, which can alternatively be thought of as 𝐾 distinct networks
that all share the same weights. The resulting set of 𝐾 network activations are then
averaged across the set of images by taking their arithmetic mean (in practice, we
observe no improvement when we replace the uniform weights with the barycentric
coordinates of the query w.r.t. its 𝐾 = 3 observations), thereby becoming

invariant to their cardinality and order, and are then passed to the query path.
While the goal of the observation paths is to process input images and glean
reflectance information from them, the goal of the query path is to take as input
information that encodes the non-diffuse residuals of nearby lights/views and then
predict radiance values of the queried light and view positions at each UV location.
We therefore concatenate the aggregated activations from the observation paths to the
self-produced activations of the query path using a set of cross-path skip-connections.
The query path then decodes a texture-space rendering of the subject under the
desired light and viewing directions, which is then resampled to the perspective of
the desired viewpoint using conventional, differentiable UV wrapping.
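The overall data flow, including the residual skip described in Section 3.4.3, can be summarized with the following sketch; obs_encoder, query_decoder, and uv_to_camera are hypothetical placeholders for the trained sub-networks and the differentiable UV resampling:

    import numpy as np

    def nlt_forward(query_buffers, diffuse_base, observations,
                    obs_encoder, query_decoder, uv_to_camera):
        """Sketch of the NLT forward pass in the texture space.

        query_buffers: light/view cosine maps for the desired light and viewpoint.
        diffuse_base:  physically based diffuse texture map.
        observations:  list of K texture-space residual maps from nearby lights/views.
        obs_encoder:   returns a list of per-scale feature maps for one observation.
        """
        # Observation paths: encode each observation, then pool with an arithmetic mean.
        obs_feats = [obs_encoder(o) for o in observations]
        pooled = [np.mean(level, axis=0) for level in zip(*obs_feats)]  # per-scale mean
        # Query path: predict a residual on top of the diffuse base.
        residual = query_decoder(query_buffers, diffuse_base, pooled)
        texture_rendering = diffuse_base + residual
        # Resample from the UV space back into the desired camera view.
        return uv_to_camera(texture_rendering)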
Our proposed architecture has several advantages over a single-path network that
would take as input all the available observations, which would be prohibitively ex-
pensive in terms of memory and computation. Because our observation paths do not
depend on a fixed order or number of images, during training, we can simply select a
dynamic subset of whatever observations are best suited to the desired lighting
and viewpoint. This ability is useful because the lights and cameras in our dataset
are sampled at different rates—lights are around 4× denser than cameras. The supe-
riority of this dual-path design is demonstrated by both qualitative and quantitative
experiments in Section 3.6.1.

3.4.3 Residual Learning of High-Order Effects

When synthesizing the output texture-space image in the query path of our net-
work, we do not predict the final image directly. Instead, we have a residual skip-
link [He et al., 2016] from the input diffuse base to the output of our network.
Formally, we train our deep neural network to synthesize a residual $\Delta L$ that is
then added to our diffuse base $\tilde{L}_o(\mathbf{x}, \omega_o)$ to produce our final predicted rendering
$L_o(\mathbf{x}, \omega_o) = \Delta L + \tilde{L}_o(\mathbf{x}, \omega_o)$. This approach of adding a physically-based diffuse ren-
dering allows our network to focus on learning higher-order, non-diffuse, non-local
light transport effects (specularities, scattering, etc.) instead of having to “re-learn”
the fundamentals of image formation (basic colors, rough locations and shapes of cast
shadows, etc.). Because these residuals are the unconstrained output of a network,
this model is able to describe any output image: Positive residuals can be added to
represent specularities, and negative residuals can be added to represent shading or
shadowing. This residual approach causes our model to be implicitly regularized to-
wards a simplified but physically-plausible diffuse model – the network can “fall back”
to the diffuse base rendering by simply emitting zeros.
We demonstrate that our method is capable of modeling complicated lighting ef-
fects including specular highlights (BRDFs), subsurface scattering (BSSRDFs), and
diffuse interreflection (global illumination), in the context of relighting a toy dragon
scene. We consider a 3D model with perfect geometry and known material properties
and render it in a virtual scene similar to a light stage setup using Cycles (Blender’s
built-in, physically-based renderer). We produce a diffuse render of the scene as a
baseline, and then re-render it using both our model and Blender with three light-
ing effects: specular highlights, subsurface scattering, and diffuse interreflections, to
demonstrate that NLT is capable of modeling those effects. The results are shown in
Figure 3-8 and Figure 3-9.

Specular Highlights In Blender, we mix a glossy shader into the dragon’s diffuse
shader and re-render the scene, resulting in a render with highlights. We then train
our model to infer the residuals for relighting. In Figure 3-8 (center), we show the NLT
renderings under two novel light directions (unseen during training) alongside with
the ground-truth renderings. The residual image predicted by our model correctly
models the specular highlights, and our rendering closely resembles the ground truth.

Subsurface Scattering Our model can capture lighting effects that cannot be
captured by a BRDF, such as subsurface scattering. We mix a subsurface scattering
shader into the dragon’s diffuse shader, and then train our model to learn these effects
in relighting. As shown in Figure 3-8 (right), the NLT results are almost identical to
the ground truth.


Figure 3-8: Modeling non-diffuse BSSRDFs as residuals for relighting in NLT. A dif-
fuse base (left) captures all diffuse LT (e.g., hard shadows) under a novel point light.
By learning a residual on top of this base rendering, NLT can reproduce non-diffuse
LT (here, specularities and subsurface scattering) from the actual scene appearance.
When predicting specularities (center), NLT emits exclusively positive residuals (neg-
ative part hence not shown) to add bright highlights to the diffuse base. When
predicting scattering (right), the additive residuals represent additional illumination
provided by nearby subsurface light transport.

Figure 3-9: Modeling global illumination as residuals for relighting in NLT. The diffuse
bases are the same as in Figure 3-8. In addition to intrinsic material properties, NLT
can also learn to express global illumination (e.g., diffuse interreflection) as residuals.
Here we add a diffuse green wall to the right of the scene (left). Under Novel Light 1
(right top), the wall provides additional green indirect illumination, so the residuals
are green and mostly positive. Notably, the residuals are not necessarily all positive:
Under Novel Light 2 (right bottom), the residuals are mostly negative and high in
blue and red, effectively casting “negative purple” indirect illumination that results
in a greenish tinge.

Diffuse Interreflection To demonstrate global illumination, in Figure 3-9 we place
a matte green wall into the scene, and we see that NLT is able to accurately predict
the non-local light transport of a green glow cast by the wall onto the dragon.

3.4.4 Simultaneous Relighting & View Synthesis

Embedded in the texture space, NLT is a unified framework that can perform re-
lighting, view synthesis, or both simultaneously. The architecture described in Sec-
tion 3.4.2 takes as input the cosine maps that encode the light and viewing directions,
as well as a set of observed residual maps from nearby lights and/or views (neighbor
selection scheme in Section 3.4.5). Since there is no model design specific to relighting
or view synthesis, the model is agnostic to which task it is solving other than interpo-
lating the 6D LT function. Therefore, by varying both lights and views in the training
data, the model can be trained to render the subject under any desired illumination
from any camera position (a.k.a. simultaneously relighting and view synthesis). We
demonstrate this capability in Section 3.5.8 and the supplemental video.

3.4.5 Network Architecture, Losses, & Other Details

Both paths of our architecture are modifications of the U-Net [Ronneberger et al.,
2015], where our query path is a complete encoder-decoder architecture (with skip-
connections) that decodes the final predicted image, while our observation path is
just an encoder. Following standard conventions, each scale of the network consists
of two convolutional layers (except at the very start and end), where downsampling
and upsampling are performed using strided (possibly transposed) convolutions, and
the channel number of the feature maps is doubled after each downsampling and
halved after each upsampling. Detailed descriptions of the architectures of these
two networks are provided in Table 3.1. No normalization is used. Note that the
activations from the observation paths are appended to the query path before its
internal skip connections, meaning that observation activations are effectively skip-
connected to the decoder of the query network.

Observation Path
  ID   Operator              Output Shape
  O1   conv(16, 1×1, 1)      H × W × 16
  O2   conv(16, 3×3, 2)      H/2 × W/2 × 16
  O3   conv(16, 3×3, 1)      H/2 × W/2 × 16
  O4   conv(32, 3×3, 2)      H/4 × W/4 × 32
  O5   conv(32, 3×3, 1)      H/4 × W/4 × 32
  ...  ...                   ...
  O14  conv(1024, 3×3, 2)    H/128 × W/128 × 1024
  O15  conv(1024, 3×3, 1)    H/128 × W/128 × 1024
  O16  conv(2048, 3×3, 2)    H/256 × W/256 × 2048
  O17  conv(2048, 3×3, 1)    H/256 × W/256 × 2048

Query Path
  ID   Operator              Output Shape
  Q1   conv(16, 1×1, 1)      H × W × 16
  Q2   append(mean(O1))      H × W × 32
  Q3   conv(16, 3×3, 2)      H/2 × W/2 × 16
  Q4   conv(16, 3×3, 1)      H/2 × W/2 × 16
  Q5   append(mean(O3))      H/2 × W/2 × 32
  Q6   conv(32, 3×3, 2)      H/4 × W/4 × 32
  Q7   conv(32, 3×3, 1)      H/4 × W/4 × 32
  Q8   append(mean(O5))      H/4 × W/4 × 64
  ...  ...                   ...
  Q44  append(Q8)            H/4 × W/4 × 80
  Q45  convT(8, 3×3, 2)      H/2 × W/2 × 8
  Q46  convT(8, 3×3, 1)      H/2 × W/2 × 8
  Q47  append(Q5)            H/2 × W/2 × 40
  Q48  convT(4, 3×3, 2)      H × W × 4
  Q49  convT(4, 3×3, 1)      H × W × 4
  Q50  append(Q2)            H × W × 36
  Q51  convT(3, 1×1, 1)      H × W × 3

Notation: conv(d, w×h, s) denotes a two-dimensional convolutional layer (a.k.a. conv2D)
with d output channels, a filter size of (w × h), and a stride of s, and is always followed
by a leaky ReLU [Maas et al., 2013] activation function. convT is the transpose of conv
and is also followed by a leaky ReLU.

Table 3.1: Neural network architecture of NLT. The layers that take skip connections
from the activations of the observation path are the append(mean(O·)) rows, and the
U-Net-like skip-links within the query path are the append(Q·) rows.

We trained our model to minimize losses in the image space between the predicted
image 𝐿𝑜 (x𝑖 , 𝜔𝑜 ) and the ground-truth captured image. To this end, we first resample
the UV-space prediction back to the camera space, and then compute the total loss
as a combination of a robust photometric loss [Barron, 2019] and a perceptual loss
(LPIPS) [Zhang et al., 2018a]. We use the loss function of Barron [2019] with 𝛼 = 1
(a.k.a. pseudo-Huber loss) applied to a CDF9/7 wavelet decomposition [Cohen et al.,
1992] in the YUV color space:
$$\ell_I = \sum_i \sqrt{\left(\frac{\mathrm{CDF}\big(\mathrm{YUV}\big(L_o(\mathbf{x}_i, \omega_o) - L_o^*(\mathbf{x}_i, \omega_o)\big)\big)}{c}\right)^2 + 1} - 1. \qquad (3.8)$$

We empirically set the scale hyperparameter 𝑐 = 0.01. As was demonstrated by
Barron [2019], we found that imposing a robust loss in this YUV wavelet domain
produced reconstructions that better captured both high- and low-frequency details.
The perceptual loss ℓ𝑃 [Zhang et al., 2018a] is defined as the ℓ2 distance in feature
space extracted with a VGG network [Simonyan and Zisserman, 2015] pretrained on
ImageNet [Deng et al., 2009]. The final loss function is simply the sum of the two

losses ℓ = ℓ𝐼 + ℓ𝑃 . Empirically, we found that using the same weight for both losses
achieved the best results.
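A sketch of this combined objective, with wavelet_cdf97, rgb_to_yuv, and lpips_distance as hypothetical placeholders for the CDF9/7 decomposition, the color-space conversion, and the pretrained LPIPS network:

    import numpy as np

    def robust_photometric_loss(pred_yuv, target_yuv, wavelet_cdf97, c=0.01):
        """Eq. 3.8: pseudo-Huber (alpha = 1) loss on CDF9/7 wavelet coefficients."""
        coeffs = wavelet_cdf97(pred_yuv - target_yuv)     # placeholder decomposition
        return np.sum(np.sqrt((coeffs / c) ** 2 + 1.0) - 1.0)

    def total_loss(pred_rgb, target_rgb, rgb_to_yuv, wavelet_cdf97, lpips_distance):
        """Total NLT loss: robust photometric term plus LPIPS, equally weighted."""
        l_i = robust_photometric_loss(rgb_to_yuv(pred_rgb), rgb_to_yuv(target_rgb), wavelet_cdf97)
        l_p = lpips_distance(pred_rgb, target_rgb)        # placeholder perceptual metric
        return l_i + l_p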

We trained our model by minimizing ℓ using Adam [Kingma and Ba, 2015] with
a learning rate of 2.5 × 10−4 , a batch size of 1, and the following optimizer hyperpa-
rameters: 𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−7 . Our model is implemented in TensorFlow
[Abadi et al., 2016] and trained on a single NVIDIA Tesla P100, which takes less than
12 hours for the real scenes and much less for synthetic scenes.

Observations Our observation paths are designed to be invariant to the number
and order of observations, so that one can input as many observations into the path
as the task requires or memory constraints permit. During training, the observations
are the 𝐾 = 3 neighbors randomly sampled from a cone around the query light
direction with an apex angle of 30° for relighting, or from a pool of nearby cameras
that have large visibility overlap in the UV space (a per-texel AND operation) with the
query camera for view synthesis. In training simultaneous models, we use 𝐾 = 1
neighbor with both the nearest camera and light as the observation. At test time,
thanks to the framework’s invariance to the number and order of observations, we
use a fixed set of observations to reduce flickering caused by frequent neighborhood
switching3 and observe no visual difference with different sets of observations. One
may also use all of the physical lights and cameras as observations at test time.

Resolutions For relighting and view synthesis, our texture-space images have a
resolution of 1024 × 1024, and the camera-space images have a resolution of 1536 ×
1128. For simultaneous relighting and view synthesis, the resolutions used are 512 ×
512 in the UV space and 1024 × 752 in the camera space.

3
See Section 3.3 for how we addressed a similar issue using our active sets and alias-free pooling
in LSSR [Sun et al., 2020].

3.5 Results

In this section, we first introduce our data capture process using a light stage (Sec-
tion 3.5.1) and discuss the evaluation metrics used in our experiments (Section 3.5.2).
We then show LSSR’s capabilities of continuous directional relighting (Section 3.5.3)
and high-frequency image-based relighting (Section 3.5.4), and its application of light-
ing softness control (Section 3.5.5). Section 3.5.6 compares LSSR against its base-
lines, which do not utilize any 3D geometry either. Finally, in Section 3.5.7 and
Section 3.5.8, we present how NLT enables fixed-viewpoint relighting (which is al-
ready supported by LSSR) and additionally free-viewpoint relighting with additional
input of geometry proxy.

3.5.1 Hardware Setup & Data Acquisition

LSSR uses the One-Light-at-A-Time (OLAT) portrait dataset from Sun et al. [2019],
which contains 22 subjects with multiple facial expressions captured using a light
stage with a seven-camera system. The light stage consists of 302 LEDs uniformly
distributed on a spherical dome, and capturing a subject takes roughly 6 s. Each cap-
ture produces an OLAT scan of a specific facial expression per camera, which consists
of 302 images, and we treat the OLAT scans from different cameras as independent
OLAT scans, since we are not considering viewpoint change in LSSR.
Following previous works [Meka et al., 2019, Sun et al., 2019], we ask the subject to
stay still during the acquisition phase, which lasts about 6 s for a full OLAT sequence.
Since it is nearly impossible for the performer to stay perfectly still, we align all the
images using the optical flow technique of Meka et al. [2019]: We capture “all-lights-
on” images throughout the scan that are used as “tracking frames,” and compute 2D
flow fields between each tracking frame and a reference tracking frame taken from
the middle of the sequence. These flow fields are then interpolated from the tracking
frames to the rest of the images to produce a complete alignment. As such, for a
given camera, the captured images in each OLAT scan are aligned and only differ in
lighting directions.

For LSSR, we manually select four OLAT scans with a mixture of subjects and
views as our validation set, and select another 16 scans with good coverage of genders
and skin tones as training data. Our 16 training scans cover only five of the seven
cameras, as the remaining two are covered by the validation data. We train the
LSSR network using all lights from our OLAT data in a canonical, global lighting
coordinate frame, which allows us to train a single network for all viewpoints in our
training data. We train a single model for all subjects in our training dataset, which
we found matches the performance of training an individual network for each subject.

Aiming to simultaneously handle viewpoint and lighting change, NLT relies on
multi-view OLAT images in the form of texture-space UV buffers. This requires us
to acquire training images under known illumination conditions alongside a param-
eterized geometric model to obtain the UV buffers. We use a light stage similar to
what LSSR uses to acquire OLAT images where only one (known) directional light
source is active in each image. For each session, we captured 331 OLAT images for
64 RGB cameras placed around the performer. When a light is pointing towards a
given camera or gets blocked by the subject, the resultant image is either “polluted”
by the glare or is overly dark. As such, for a given camera, there are approximately
130 usable OLAT images. These OLAT images are sparse samples of the 6D light
transport function, which NLT learns to interpolate. We visualize samples of these
OLAT images in Figure 3-10.

For NLT, we additionally acquired a base mesh for use as the geometry proxy that
NLT requires. Following the approach of Guo et al. [2019], we use 32 high-resolution
active IR cameras and 16 custom dot illuminator projectors to construct a high-
quality parameterized base mesh of each subject fully automatically. These data are
critical to our approach, as the estimated geometry provided by this system provides
the substrate that our learned model is embedded within in the form of a texture
atlas. However, this captured 3D model is far from perfect due to approximations
in the mesh model (that cannot accurately model fine structures such as hair) and
hand-crafted priors in the reflectance estimation (that relies on a cosine lobe Bidi-
rectional Reflectance Distribution Function [BRDF] model). This is demonstrated in

Figure 3-10: Sample images used for training NLT, shown here for two cameras
(Camera 1 and Camera 2) under three lights (Lights 1-3). These multi-view, One-
Light-at-A-Time (OLAT) images are sparse samples of the 6D light transport function
that NLT interpolates. A proxy of the underlying geometry is also required by NLT,
but it can be as rough as 500 vertices (see Section 3.6.4).

Figure 3-6. Our model overcomes these issues and enables photorealistic renderings,
as demonstrated in Section 3.5.7 and Section 3.5.8. Additionally, we demonstrate in
Section 3.6.4 that our neural rendering approach is robust to geometric degradation
and can work with geometry proxies of as few as 500 vertices.

We collect a dataset of 70 human subjects with fixed poses, each of which provides
around 18,000 frames under 331 lighting conditions and 55 viewpoints (before filtering
out glare-polluted and overly dark frames, as aforementioned). We randomly hold out
six lighting conditions and two viewpoints for validation. The subjects are selected to
maximize diversity in terms of clothing, skin color, and age. By training our model
to reproduce held-out images from these light stage scans, we are able to learn a
general LT function that can be used to produce rendering for arbitrary viewpoints
and illuminations. Because our scans do not share the same UV parameterization,
we train a separate model for each subject.

3.5.2 Evaluation Metrics

Empirically evaluating our models presents a significant challenge: Both LSSR and
NLT attempt to super-resolve an undersampled scan from a light stage, which means
that the only ground truth available for benchmarking is also undersampled in both
light and view directions. In other words, the goals are to accurately synthesize images
for virtual lights and/or cameras in between the physical lights and/or cameras on
the stage, but we do not have ground-truth images that correspond to those virtual
lights and/or cameras. For this reason, qualitative results (figures and videos) are
preferred, and we encourage the reader to view them.

For the quantitative results to be presented, we use held-out real images lit by
physical lights from physical cameras on our light stage as a validation set. When
evaluating on one of these validation images, LSSR does not use the active-set selec-
tion technique of Section 3.3.1, but instead just samples the 𝑘 = 8 nearest neighbors
(excluding the validation image itself from the input). Holding out the validation
image from the input is critical, as otherwise the model could simply reproduce the
input image as an error-free output. This held-out validation approach is not ideal, as
all such evaluations will follow the same regular sampling pattern of our light stage.
This evaluation task is therefore biased relative to the real task of predicting images
away from the sampling pattern of the light stage.

It is not straightforward either to select an appropriate metric for measuring the
image synthesis accuracy for our tasks. Conventional image interpolation techniques
often result in ghosting artifacts or duplicated highlights, which are perceptually
salient but often do not get penalized heavily by traditional image metrics such as
Mean Square Error (MSE). We therefore evaluate the image quality using multiple im-
age metrics: Root MSE (RMSE) or Peak Signal-to-Noise Ratio (PSNR), the Sobolev
𝐻 1 norm [Ng et al., 2003], Structural (Dis)similarity (SSIM or DSSIM) [Wang et al.,
2004], and (Robust) Learned Perceptual Image Patch Similarity (LPIPS [Zhang et al.,
2018a] or E-LPIPS [Kettunen et al., 2019]). Among these metrics, RMSE or PSNR
measures pixel-wise errors, 𝐻 1 norm emphasizes image gradient errors, (D)SSIM fo-

cuses on structures in the images, and (E-)LPIPS captures perceptual differences.
Again, images and videos may be more informative about the quality achieved.

3.5.3 Precise Directional Relighting

Traditional image-based relighting methods produce accurate results when the target
lighting is concentrated around the physical lights of the stage, but may introduce
ghosting artifacts or inaccurate shadows when no physical light is nearby. In Figure 3-
11, we interpolate between two physical lights on the stage. As shown in Figure 3-11
(b, c), linear blending or Xu et al. [2018b] with adaptive sampling fails to produce
realistic results and always contains multiple superposed shadows or highlights. The
shadows produced by Meka et al. [2019] are sharp, but do not move smoothly
as the light moves. In contrast, LSSR is able to produce sharp and realistic
images for arbitrary light directions: Highlights and cast shadows move smoothly
as we change the light direction, and our results have comparable sharpness to the
(non-interpolated) ground-truth images that are available.
Figure 3-11: Interpolation by LSSR between two physical lights. Each row shows a
captured image under light A, interpolations between the two captured lights, and a
captured image under light B. Here LSSR produces the interpolated images corre-
sponding to “virtual” lights between two real lights on the light stage. (a) LSSR
produces images where sharp shadows and accurate highlights move realistically.
(b, c) Linear blending and Xu et al. [2018b] with adaptive sampling result in ghosting
artifacts and duplicated highlights. (d) The results from Meka et al. [2019] contain
blurry highlights and shadows with unrealistic motion.
3.5.4 High-Frequency Image-Based Relighting

OLAT scans captured by a light stage can be linearly blended to reproduce images
that appear to have been captured under some environmental lighting. The pixel
values of a light probe are usually distributed to the nearest or neighboring lights on
the light stage for blending. This traditional approach causes ghosting artifacts in
shadows and specularities, due to the finite sampling of light directions on the light
stage. Although this ghosting is hardly noticeable when the lighting is low-frequency,
it can be significant when the lighting contains high-frequency contents, such as the
sun in the sky. These ghosting artifacts can be ameliorated by using LSSR. Given
a light probe, our algorithm predicts an image corresponding to the light direction
of each pixel in the light probe. By taking a linear combination of all such images
(weighted by their pixel values and solid angles), we are able to produce a rendering
that matches the sampling resolution of the light probe. As shown in Figure 3-12,
this approach produces images with sharp shadows and minimal ghosting when given
a high-frequency light probe, whereas linear blending does not. In this example, we
use 256 × 128 light probes, corresponding to a super-resolved light stage with 32,768
lights. Please see our video for more image-based relighting results.
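
The weighting step can be made concrete with a short NumPy sketch. Here `render_olat(direction)` is a hypothetical stand-in for the super-resolved prediction of an image lit by a single directional light, and the probe is assumed to be an equirectangular (latitude-longitude) HDR image, so each pixel subtends a solid angle proportional to sin θ; the loop is written for clarity rather than speed.

```python
import numpy as np


def relight_with_probe(probe, render_olat):
    """Image-based relighting: weighted sum of per-direction renderings.

    probe:       (H, W, 3) equirectangular light probe in linear HDR.
    render_olat: function mapping a unit light direction (3,) to an image.
    """
    H, W, _ = probe.shape
    # Polar angle theta in (0, pi) and azimuth phi in [0, 2*pi), at pixel centers.
    theta = (np.arange(H) + 0.5) / H * np.pi
    phi = (np.arange(W) + 0.5) / W * 2.0 * np.pi
    # Solid angle of each probe pixel: sin(theta) * dtheta * dphi.
    d_omega = np.sin(theta)[:, None] * (np.pi / H) * (2.0 * np.pi / W)

    rendering = None
    for i in range(H):
        for j in range(W):
            # Unit direction corresponding to this probe pixel.
            d = np.array([np.sin(theta[i]) * np.cos(phi[j]),
                          np.sin(theta[i]) * np.sin(phi[j]),
                          np.cos(theta[i])])
            weight = probe[i, j] * d_omega[i, j]   # per-channel RGB weight
            img = render_olat(d) * weight          # weighted single-light rendering
            rendering = img if rendering is None else rendering + img
    return rendering
```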

3.5.5 Lighting Softness Control

LSSR’s ability to render images under arbitrary light directions also allows us to
control the softness of the shadow. Given a light direction, we can densely synthesize
images corresponding to the light directions around it and average those images to
produce rendering with realistic soft shadows (the sampling radius of these lights
determines the softness of the resulting shadow). As shown in Figure 3-13, LSSR is
able to synthesize realistic shadows with controllable softness, which is not possible
using traditional linear blending methods.
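
A minimal sketch of this softness control follows, again treating the per-direction relighting model as a black-box function `render_olat` (a hypothetical name). The cone-sampling scheme shown here is one simple way to draw directions within a given angular radius; it is illustrative and not necessarily the exact sampling we used.

```python
import numpy as np


def render_soft_shadow(center_dir, radius_rad, render_olat, n_samples=32, seed=0):
    """Average renderings over light directions inside a cone to soften shadows."""
    rng = np.random.default_rng(seed)
    c = np.asarray(center_dir, dtype=float)
    c /= np.linalg.norm(c)
    # Build an orthonormal frame (u, v, c) around the cone axis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(c[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(c, helper)
    u /= np.linalg.norm(u)
    v = np.cross(c, u)

    acc = None
    for _ in range(n_samples):
        # Uniformly sample a direction inside the cone of half-angle radius_rad.
        cos_t = rng.uniform(np.cos(radius_rad), 1.0)
        sin_t = np.sqrt(1.0 - cos_t ** 2)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        d = cos_t * c + sin_t * (np.cos(angle) * u + np.sin(angle) * v)
        img = render_olat(d)
        acc = img if acc is None else acc + img
    return acc / n_samples  # a larger radius_rad gives a softer shadow
```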

(a) With Super-Resolution by LSSR (b) Without Super-Resolution

Figure 3-12: High-frequency image-based relighting by LSSR. (a) Our model is able
to produce accurate relighting results under high-frequency environment lighting, by
super-resolving the light stage before performing image-based relighting [Debevec
et al., 2000]. (b) Using the light stage data as-is results in ghosting.

(a) Our Model; (b) Linear Blending

Figure 3-13: Controlling lighting softness with LSSR. Soft shadows rendered by LSSR
(a) are more realistic and contain fewer ghosting artifacts than those rendered using
linear blending (b).

3.5.6 Geometry-Free Relighting

We compare our LSSR results against the existing relighting approaches. The linear
blending baseline in Table 3.2 produces competitive results, despite being very simple:
just linearly blending the input images according to our alias-free weights. Because
linear blending directly interpolates aligned pixel values, it is often able to retain
accurate high-frequency details in the flat region, and this strategy works well in
minimizing the error metrics. However, linear blending produces significant ghosting
artifacts in shadows and highlights, as shown in Figure 3-14. Although these errors
are easy to detect visually, they appear hard to measure empirically.

Table 3.2: Relighting errors of LSSR. While LSSR w/ naïve neighbors achieves the
lowest error, the full model performs better on our real test data where the synthesized
light does not lie on the regular hexagons (see Section 3.5.2 and Figure 3-20 for
details). We report the arithmetic mean of each metric and highlight the top three in
red, orange, and yellow for each metric.

Method                      RMSE     𝐻1       DSSIM    E-LPIPS
LSSR (ours)                 0.0160   0.0203   0.0331   0.00466
  w/ naïve neighbors        0.0156   0.0199   0.0322   0.00449
  w/ average pooling        0.0203   0.0241   0.0413   0.00579
Linear Blending             0.0191   0.0232   0.0366   0.00503
Fuchs et al. [2007]         0.0195   0.0258   0.0382   0.00485
Photometric Stereo          0.0284   0.0362   0.0968   0.00895
Xu et al. [2018b]
  w/ eight optimal lights   0.0410   0.0437   0.1262   0.01666
  w/ adaptive input         0.0259   0.0291   0.1156   0.00916
Meka et al. [2019]          0.0505   0.0561   0.1308   0.01482

Comparisons Against Fuchs et al. [2007] We evaluate against the layer-based
technique of Fuchs et al. [2007] by decomposing an OLAT into diffuse, specular,
and visibility layers, and interpolating the illumination individually for each layer.
Although the method works well on specular objects as shown in the original paper,
it performs less well on OLATs of human subjects, as shown in Table 3.2. This appears
to be due to the complex specularities on the human skin not being tracked accurately
by the optical flow algorithm of Fuchs et al. [2007]. Additionally, the interpolation of
the visibility layer sometimes contains artifacts, resulting in incorrect cast shadows.
That said, the algorithm results in fewer ghosting artifacts than the linear blending
algorithm, as shown in Figure 3-14 and as reflected by E-LPIPS.

(a) Ours (full image); (b) Groundtruth; (c) Ours; (d) Linear blending; (e) Fuchs et al.
[2007]; (f) Photometric stereo; (g) Xu et al. [2018] w/ optimal sample; (h) Xu et al.
[2018] w/ adaptive sample; (i) Meka et al. [2019]

Figure 3-14: Relighting by LSSR and the baselines. Here we present a qualitative
comparison between our method and other light interpolation algorithms. Traditional
methods (linear blending, Fuchs et al. [2007], photometric stereo) retain details but
suffer from ghosting artifacts in the shadowed regions. Rendering by Xu et al. [2018b]
and Meka et al. [2019] exhibits noticeable oversmoothing and brightness changes. Our
method retains details and synthesizes shadows that resemble the ground truth.

Comparisons Against Photometric Stereo Using the layer decomposition pro-
duced by Fuchs et al. [2007], we additionally perform photometric stereo on the OLAT
data by simple linear regression to estimate a per-pixel albedo image and normal map.
Using this predicted normal map and albedo image, we can then use the Lambertian
reflectance to render a new diffuse image corresponding to the query light direction,
which we add to the specular layer by Fuchs et al. [2007] to produce the final rendering.
As shown in Table 3.2, this approach underperforms that of Fuchs et al. [2007], likely
due to the reflectance of human faces being non-Lambertian. Additionally, the
scattering effect of human hair is poorly modeled in terms of a per-pixel albedo and
normal vector. These limiting assumptions result in overly sharpened and incorrect
shadow predictions, as shown in Figure 3-14.

In contrast to this photometric stereo approach and the layer-based approach of
Fuchs et al. [2007], LSSR does not attempt to factorize the human subject into a
predefined reflectance model wherein interpolation can be explicitly performed. Our
model is instead trained to identify a latent vector space of network activations in
which naive linear interpolation results in accurate non-linearly interpolated images,
producing more accurate rendering.

Comparisons Against Xu et al. [2018b] The technique of Xu et al. [2018b]
(retrained on our training data) represents another possible candidate for addressing
our problem. This technique, though, does not natively solve our problem: To find the
optimal lighting directions for relighting, it requires as input all 302 high-resolution
images in each OLAT scan in the first step, which significantly exceeds the memory
constraints of modern GPUs. To address this, we first jointly train their Sample-Net
and Relight-Net on our images (downsampled by 4× due to memory constraints) to
identify the eight optimal directions from the 302 directions.
Using those eight optimal directions, we then retrain Relight-Net using the full-
resolution images from our training data, as prescribed by Xu et al. [2018b]. Ta-
ble 3.2 shows that this approach works poorly on our task. This may be because
this technique is built around eight fixed input images and is naturally disadvantaged
compared with our approach that is able to use any of the 302 light stage images as
input. We therefore also evaluate a variant of Xu et al. [2018b], where we use the
same active set selection as used by LSSR to train their Relight-Net. By using our
active set (Section 3.3.1), this enhanced baseline is able to better reason about local
information, which improves the performance as shown in Table 3.2. However, this
baseline still results in flickering artifacts when rendering with moving lights, because
unlike LSSR, it is sensitive to the aliasing induced when images leave and enter the
active set.

Comparisons Against DRF We also evaluate Deep Reflectance Fields (DRF)
[Meka et al., 2019] for our task, which also underperforms LSSR. This is likely because
their model is specifically designed for fast and approximate video relighting, and uses
only two images as input, while our model has access to the entire OLAT scan and
is designed to prioritize rendering quality.

3.5.7 Geometry-Based Relighting

Although NLT can interpolate the LT function in both light and view directions,
here we demonstrate that NLT achieves relighting results similar to those of LSSR
if we query NLT at just novel light directions 𝜔𝑖. Because NLT requires a geometry
proxy as additional input (to support free-viewpoint relighting; see Section 3.5.8), for
fixed-viewpoint relighting, image-based methods that use no 3D geometry, such as
LSSR, suffice and are more convenient.
First, we quantitatively evaluate our model against the state-of-the-art relighting
solutions and ablations of our model, and report our results in Table 3.3 in terms
of PSNR, SSIM [Wang et al., 2004], and LPIPS [Zhang et al., 2018a]. We see that
NLT outperforms all baselines and ablations, although simple baselines such as diffuse
rendering and barycentric blending also obtain high scores. This appears to be due to
these metrics under-emphasizing high-frequency details and high-order light transport
effects. These results are more easily interpreted using the visualization in Figure 3-
15, where we see that the renderings produced by our approach more closely resemble
the ground truth than those of other models. In particular, our method synthesizes
shadows, specular highlights, and self-occlusions with higher precision when compared
against simple barycentric blending, as well as state-of-art neural rendering algorithms
such as Nalbach et al. [2017] and Xu et al. [2018b]. Our approach also produces more
realistic results than the geometric 3D capture pipeline of Guo et al. [2019]. See the
supplemental video for more examples.

Table 3.3: NLT Relighting. NLT (or its variant) outperforms all baselines in terms of
PSNR and LPIPS. Although barycentric blending achieves a similar SSIM score, it
produces inaccurate renderings. Ablating LPIPS slightly increases PSNR, but degrades
rendering quality (such as the facial specularities shown in Figure 3-15). The numbers
are means and 95% confidence intervals.

Method              PSNR ↑        SSIM ↑        LPIPS ↓
Diffuse Base        30.21 ± .08   .878 ± .003   .102 ± .003
Bary. Blending      34.28 ± .20   .942 ± .002   .051 ± .002
Deep Shading        33.67 ± .27   .918 ± .009   .106 ± .012
Xu et al. [2018b]   31.94 ± .09   .923 ± .003   .089 ± .003
Relightables        31.03 ± .08   .891 ± .003   .090 ± .003
NLT (ours)          33.99 ± .19   .942 ± .002   .045 ± .002
NLT w/o res.        33.65 ± .25   .928 ± .006   .063 ± .007
NLT w/o obs.        32.56 ± .15   .925 ± .002   .064 ± .002
NLT w/o LPIPS       34.43 ± .24   .939 ± .005   .066 ± .006

Columns: 1. Ground Truth, 2. NLT (ours), 3. Nearest Light, 4. Barycentric Blending,
5. Deep Shading [2017], 6. Xu et al. [2018], 7. Relightables [2019]

Figure 3-15: NLT relighting with a directional light. Here we visualize the perfor-
mance of NLT for the task of relighting using directional lights. We show represen-
tative examples of full-body subjects with zoom-ins focusing on cast shadows (A, C)
and facial specular highlights (B). Note how NLT is able to outperform all the other
approaches with sharper and ghosting-free results that are drastically different from
the nearest neighbors.

HDRI Relighting Our model’s ability to synthesize images corresponding to arbi-
trary light directions allows us to render subjects under arbitrary HDRI environment
maps. To do this, we synthesize 331 directional OLAT images that cover the
whole light stage dome. These images are then converted to light stage weights by
approximating each light with a Gaussian around its center, and we produce the
HDRI relighting results by simply using a linear combination of the rendered OLAT
images [Debevec et al., 2000]. As shown in Figure 3-16, we are able to reproduce
view-dependent effects as well as specular highlights with high fidelity, and generate
compelling composites of the subjects in virtual scenes. See the supplemental video
for more examples.
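
A sketch of this probe-to-weights conversion is given below. The Gaussian width `sigma`, the per-sample normalization, and the helper names are illustrative assumptions rather than the exact settings of our pipeline; the final relighting is simply the weighted sum of the rendered OLAT images.

```python
import numpy as np


def probe_to_light_weights(probe_dirs, probe_vals, light_dirs, sigma=0.1):
    """Convert a light probe into per-light RGB weights for the OLAT bases.

    probe_dirs: (P, 3) unit directions of the probe samples.
    probe_vals: (P, 3) linear radiance of those samples, pre-scaled by solid angle.
    light_dirs: (L, 3) unit directions of the (synthesized) light stage lights.
    sigma:      angular width, in radians, of the Gaussian around each light.
    """
    cosines = np.clip(probe_dirs @ light_dirs.T, -1.0, 1.0)   # (P, L)
    angles = np.arccos(cosines)
    gauss = np.exp(-0.5 * (angles / sigma) ** 2)               # (P, L)
    # Each probe sample distributes its energy over the nearby lights.
    gauss /= np.maximum(gauss.sum(axis=1, keepdims=True), 1e-8)
    return gauss.T @ probe_vals                                # (L, 3)


def relight_hdri(weights, olat_images):
    """Linear combination of rendered OLAT images [Debevec et al., 2000]."""
    # olat_images: (L, H, W, 3); weights: (L, 3).
    return np.einsum('lc,lhwc->hwc', weights, olat_images)
```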

Figure 3-16: HDRI relighting by NLT. Because NLT can relight a subject with any
directional light, it can be used to render OLAT “bases” that can then be linearly
combined to relight the scene for a given HDRI map (shown as insets) [Debevec et al.,
2000]. The relit subjects exhibit realistic specularities and shadows.

3.5.8 Changing the Viewpoint

So far, we have shown how LSSR and NLT both support relighting from the original
viewpoint. Now we focus on simultaneous relighting and view synthesis, by querying
our NLT model also at novel viewing directions.
A quantitative analysis is presented in Table 3.4, where we see that our approach
outperforms the baselines and is comparable with Thies et al. [2019], which (unlike our
technique) only performs view synthesis and does not enable relighting. A qualitative
analysis is visualized in Figure 3-17. We see that the inferred residuals produced by
NLT are able to account for the non-diffuse, non-local light transport and mitigate the
majority of artifacts in the diffuse base caused by geometric inaccuracy. We see that
renderings from NLT exhibit accurate specularities and sharper details, especially
when compared with other machine learning methods, thereby demonstrating that
our model is able to capture view-dependent effects. See the supplementary video for
more examples.

Table 3.4: View synthesis errors of NLT. NLT outperforms all baselines (top) and
ablations (bottom) in LPIPS. BB and DNR are competitive in PSNR and SSIM, but
BB is unable to hallucinate missing pixels, and DNR does not provide a means for
relighting. For fair comparisons, we circumvent the need to learn viewpoints for BB
and DS by using the UV space. The numbers are means and 95% confidence intervals.
“BB” means Barycentric Blending, and “DS” stands for Deep Shading.

Method          PSNR ↑         SSIM ↑        LPIPS ↓
Diffuse Base    31.45 ± .268   .917 ± .005   .070 ± .004
BB (UV)         34.97 ± .273   .960 ± .002   .035 ± .002
DS (UV)         34.77 ± .405   .950 ± .009   .058 ± .009
DNR             35.49 ± .315   .966 ± .002   .039 ± .002
Relightables    32.24 ± .246   .922 ± .005   .065 ± .004
NLT (ours)      34.83 ± .259   .959 ± .002   .030 ± .001
NLT w/o res.    34.49 ± .258   .958 ± .002   .032 ± .001
NLT w/o obs.    34.43 ± .286   .953 ± .003   .037 ± .002
NLT w/o LPIPS   34.36 ± .273   .959 ± .003   .041 ± .002

Simultaneous Relighting & View Synthesis In Figure 3-18, we show the unique
ability of our model to synthesize illumination and viewpoints simultaneously with
an unprecedented quality for human capture. Note that our model’s ability to natu-
rally handle this simultaneous task is a direct consequence of embedding our neural
network within the UV space of the subject. All that is required to enable simultane-
ous relighting and view interpolation is interleaving the training data for both tasks
and training a single instance of our network (more details in Section 3.4.4). Fig-
ure 3-18 shows that our method accurately models shadows and global illumination,
while correctly capturing high-frequency details such as specular highlights. See the


Columns: 1. Diffuse Base, 2. Pred. Residuals (+ only), 3. NLT (ours), 4. Ground Truth,
5. DNR [2019], 6. Deep Shading (UV) [2017], 7. Relightables [2019]

Figure 3-17: View synthesis by NLT. NLT is able to handle view-dependent specu-
larities (eyes, nose tips, cheeks), high-frequency geometry variation (Subjects B’s and
D’s hair), and global illumination (Subjects A, B, and C’s shirts). We see a sub-
stantial improvement over the state-of-the-art view synthesis method of Thies et al.
[2019] (Column 5), which tends to produce blurry results (the missing specularities
in Subject B’s eyes), and over the recent geometric approach of Guo et al. [2019]
(Column 7), which lacks non-Lambertian material effects.


The recent work of Mildenhall et al. [2020], Neural Radiance Fields (NeRF),
achieves impressive view synthesis given approximately 100 views of an object. Here
we qualitatively compare NLT against NeRF with 10 levels of positional encoding
for the location and 4 for the viewing direction. NeRF does not require any proxy
geometry, but in this particular setting, it has to work with a limited number of
views (around 55), which are insufficient to capture the full volume. As Figure 3-
19 (left) shows, NLT synthesizes more realistic facial and eye specularity as well as
higher-frequency hair details.

We also attempt to extend NeRF to perform simultaneous relighting and view


synthesis, and compare NLT with this extension, “NeRF+Light.” To this end, we
additionally condition the output radiance on the light direction (with 4 levels of
positional encoding) along with the original viewing direction. As shown in Figure 3-
19 (right), NeRF+Light struggles to synthesize hard shadows or specular highlights,
and produces significantly more blurry results than NLT, which demonstrates the
importance of a proxy geometry when there is lighting variation and only sparse
viewpoints.
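
For reference, the sketch below shows the NeRF-style positional encoding assumed for this extension, with 4 frequency bands for the light direction as in our experiment; the π factor follows the original paper, and implementations sometimes drop it.

```python
import numpy as np


def positional_encoding(x, num_bands):
    """[sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0 .. num_bands - 1."""
    x = np.asarray(x, dtype=float)
    feats = []
    for k in range(num_bands):
        feats.append(np.sin((2.0 ** k) * np.pi * x))
        feats.append(np.cos((2.0 ** k) * np.pi * x))
    return np.concatenate(feats, axis=-1)


# In "NeRF+Light", the radiance head is conditioned on both encodings, e.g.:
# view_feat = positional_encoding(view_dir, num_bands=4)
# light_feat = positional_encoding(light_dir, num_bands=4)
```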

3.6 Discussion

In this section, we present the ablation studies that demonstrate the importance of
each major model component in LSSR and NLT (Section 3.6.1). We then attempt to
answer the interesting question in Section 3.6.2: Above which frequency band does
one need LSSR to achieve high-quality image-based relighting? Section 3.6.3 then
addresses a related question: whether LSSR and NLT can super-resolve a smaller light
stage. Finally, we explore how NLT’s performance degrades as the quality of the
input geometry proxy deteriorates (Section 3.6.4).


Figure 3-18: Simultaneous relighting and view synthesis by NLT. NLT is able to
perform simultaneous relighting and view synthesis, and produces accurate renderings
(including view- and light-dependent effects) for unobserved viewpoints and light
directions. Along the 𝑥-axis we vary illumination, and along the 𝑦-axis we vary the
view. This functionality is enabled by our decision to embed our neural network
architecture within the texture atlas of a subject.


Figure 3-19: Comparing NLT against NeRF and NeRF+Light. In view synthesis
(left), NeRF struggles to synthesize realistic facial specularities, high-frequency hair
details, and specularity in the eyes (red boxes in A & B). In simultaneous relighting
and view synthesis (right), the NeRF+Light extension fails to synthesize facial details
(red boxes in C & D) and hard shadows (yellow boxes in C).

3.6.1 Ablation Studies

We first evaluate some ablated versions of LSSR, with results shown in Table 3.2. We
then present comparisons between the full NLT model and its ablated versions, to
demonstrate the contribution of each major model component.

LSSR With Naïve Neighbors In this ablation, we use the 𝑘 = 8 nearest neigh-
bors as our active set during training. This variant leads to a match in the sampling
pattern between our training and validation data, thereby achieving better numeri-
cal performance (Table 3.2). This apparent performance improvement is misleading,
since the validation set has the same regular light layout as the training set, but the
test set presents an irregular sampling pattern (see Section 3.5.2). As such, this variant
suffers from significant overfitting during our real test-time scenario, where the query
light does not fall on the regular hexagonal grid of the light stage. In Figure 3-20,
we visualize the output of this variant and LSSR as a function of the query light
direction. We see that LSSR is able to synthesize a cast shadow that is a smooth
linear function of the query light angle (after accounting for foreshortening, etc.). The
variant, however, fails to synthesize this linearly-varying shadow, due to the aliasing

and overfitting problems described earlier. See the supplemental video for additional
visualization.

(a) One rendering, for reference; (b) Our model; (c) Our model w/ naive neighbors;
(d) Our model w/ avg pooling

Figure 3-20: Continuous directional relighting by LSSR. (a) We show an LSSR ren-
dering for some virtual light with a horizontal angle of 𝜃, and highlight one image
strip that includes a horizontal cast shadow. (b) We repeatedly query our model with
𝜃 values that should induce a linear horizontal translation of the shadow boundary
in the image plane. By stacking these image strips, we see this linear trend emerge
(highlighted in red). (c, d) We do the same for the ablated models without our active
set selection procedure or alias-free pooling. We observe that the resulting shadow
boundary does not vary smoothly or linearly.

LSSR With Average Pooling In this ablation, we replace the alias-free pooling
of our model with simple average pooling. As shown in Table 3.2, ablating this com-
ponent hurts the performance quantitatively and, more importantly, causes flickering
in the real test-time scenario where we smoothly vary our light source (the quantita-
tive evaluation approach cannot reflect this). Because average pooling assigns non-zero
weights to images as they enter and exit the active set, renderings from this model
variant contain significant temporal instability. See the supplemental video for examples.

NLT Without Observation Paths Instead of our two-path query and observation
network (Section 3.4.2), we can just train the query path of our network without any
observation. As shown in Figure 3-21, this ablation struggles to synthesize details for
each possible view and lighting condition, and produces oversmoothed results.

NLT Without Residual Learning Instead of using our residual learning ap-
proach (Section 3.4.3), we can allow our network to directly predict the output image.


Figure 3-21: NLT and its ablated variants for relighting. Removing different compo-
nents of NLT reduces rendering quality: No direct access to the diffuse base makes it
more challenging for the network to learn hard shadows, having no observation path
deprives the network of information from nearby views or lights, and removing the
perceptual loss of Zhang et al. [2018a] blurs the shadow boundary.

As shown in Figure 3-21, not using the diffuse base at all reduces the quality of the
rendered image, likely because the network is then forced to waste its capacity on
inferring shadows and albedo.

The middle ground between no diffuse base and our full method is using the diffuse
bases only as network input, but not for the skip link. Comparing the “Deep Shading”
rows and “NLT w/o obs.” rows of Table 3.3 and Table 3.4 reveals the importance
of the skip connection to diffuse bases: In both relighting and view synthesis, NLT
without observations (which has the skip link) outperforms Deep Shading (which uses
the diffuse bases only as network input) in LPIPS. Our proposed residual learning
scheme allows our model to focus on learning higher-order light transport effects,
which results in more realistic renderings.
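
Schematically, the difference between the full model and this ablation is only where the diffuse base enters the output; the sketch below uses placeholder names and is not our actual network code.

```python
def nlt_style_output(diffuse_base, query_inputs, net):
    """Full scheme: the network predicts a residual that is added to the diffuse base."""
    return diffuse_base + net(query_inputs)   # skip link carries the diffuse base


def no_residual_output(query_inputs, net):
    """'NLT w/o res.' ablation: the network must predict the full image directly."""
    return net(query_inputs)
```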

NLT Without Perceptual Loss We find that adding a perceptual loss as pro-
posed by Zhang et al. [2018a] helps the network produce higher-frequency details
(such as the hard shadow boundary in Figure 3-21). Quantitative evaluations ver-
ify this observation: Full NLT with the perceptual loss achieves the best perceptual
scores in both tasks of relighting and view synthesis.

3.6.2 Image-Based Relighting Under Varying Light Frequency

We now analyze the image quality gain achieved by LSSR w.r.t. the light frequency.
Specifically, we evaluate for which environmental lighting and at what frequency,
LSSR will be necessary for accurate rendering, and conversely how it performs under
low-frequency lighting where previous solutions are adequate. For this purpose, we
use one OLAT scan and render it under 380 high-quality indoor and outdoor lighting
environments (light probes downloaded from hdrihaven.com) using both LSSR and
Linear Blending. We then measure the image quality gain from our model by computing
DSSIM between our rendering and that of Linear Blending.
environmental lighting by decomposing it into spherical harmonics (up to degree 50)
and finding the degree below which 90% of the energy can be recovered.
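
Given the spherical-harmonic coefficients of a light probe (degrees 0 through 50), the frequency measure reduces to a cumulative-energy search. The sketch below assumes the coefficients are already available as a list of per-degree arrays (an assumed layout; any SH projection routine could produce them).

```python
import numpy as np


def lighting_frequency(coeffs, energy_fraction=0.9):
    """Smallest SH degree whose cumulative energy reaches the given fraction.

    coeffs: list where coeffs[l] holds the 2l + 1 coefficients of degree l
            (for one color channel), for l = 0 .. max_degree.
    """
    per_degree = np.array([np.sum(np.abs(c) ** 2) for c in coeffs])
    cumulative = np.cumsum(per_degree) / np.sum(per_degree)
    return int(np.searchsorted(cumulative, energy_fraction))
```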

As shown in Figure 3-22, the benefit of using our model becomes more significant
as the frequency of the lighting increases. For low-frequency lighting (up to degree
15 spherical harmonics), our model produces almost identical results compared with
the traditional Linear Blending method. This is a desired property, showing that our
method reduces gracefully to Linear Blending for low-frequency lighting, and thus
produces high-quality results for both low- and high-frequency lighting. As the fre-
quency of the lighting becomes higher, LSSR’s rendering contains sharper and more
accurate shadows without ghosting artifacts. Note that there is some variation among
the environment maps as expected; even a very high-frequency environment could co-
incidentally have its brightest lights aligned with one of the lights on the stage, leading
to low errors in Linear Blending and comparable results to our method. Neverthe-
less, the trend is clear in Figure 3-22 with many high-frequency probes requiring our
algorithm for lighting super-resolution.

According to the plot in Figure 3-22, we conclude that LSSR is necessary when the
lighting frequency is equal to or greater than a degree of about 20 (that is, more than
21² = 441 basis functions). This number is on the same order as the number of lights
on the stage (𝑛 = 302). Therefore, our frequency analysis is consistent with intuition:
If the lighting cannot be recovered using the limited light bases on the light stage,
then LSSR is required to generate denser bases to accurately render the shadows and
highlights.

Figure 3-22: Quality gain by LSSR w.r.t. lighting frequency. In the top plot, each
blue dot represents a light probe. We render a portrait under this lighting using both
linear blending and LSSR, and measure the image differences using SSIM to evaluate
the quality gain achieved by our algorithm. The improvement becomes more apparent
when the lighting contains more high-frequency contents. In the bottom figure, we
compare the rendered images using LSSR and linear blending under lighting with
different frequencies. Our model produces similar results to linear blending when
the lighting variation is low-frequency (left two columns). As the lighting becomes
higher-frequency, LSSR produces better rendering with fewer artifacts and sharper
shadows (right two columns).


3.6.3 Subsampling the Light Stage

An interesting question in LT acquisition is how many images (light samples) are
needed to reconstruct the full LT function. To address this question, we present an
experiment where we remove some lights from our training set and use only these
subsampled data for training. We reduce the number of lights on the light stage 𝑛
(while maintaining a uniform distribution on the sphere) to [250, 200, 150, 100], while
also changing the number of candidates 𝑚 and the active set size 𝑘 to [14, 12, 10, 8]
and [7, 6, 5, 4], respectively. The image quality on the complete validation set (with
all 302 lights) is plotted against the number of subsampled training lights in Figure 3-23.
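
One simple way to select such a roughly uniform subset of lights is greedy farthest-point sampling on the sphere, sketched below; this is an illustrative stand-in rather than the exact selection procedure we used.

```python
import numpy as np


def farthest_point_subset(light_dirs, n_keep, start=0):
    """Greedily pick n_keep light directions that stay spread out on the sphere.

    light_dirs: (N, 3) unit vectors of the light stage lights.
    """
    dirs = np.asarray(light_dirs, dtype=float)
    chosen = [start]
    # Angular distance from every light to its closest chosen light so far.
    min_dist = np.arccos(np.clip(dirs @ dirs[start], -1.0, 1.0))
    for _ in range(n_keep - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist = np.arccos(np.clip(dirs @ dirs[nxt], -1.0, 1.0))
        min_dist = np.minimum(min_dist, dist)
    return np.array(chosen)
```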

Figure 3-23: LSSR vs. linear blending: relighting errors w.r.t. light density (𝑥-axis:
number of lights on the light stage; 𝑦-axis: DSSIM). The relighting quality degrades
as we ablate lights from the light stage. However, LSSR is able to retain the quality
to a greater extent with sparser lights than naïve linear blending.

As expected, the relighting quality decreases as we remove the lights, but we
see that the rendering quality of LSSR decreases more slowly than that of Linear
Blending. This can also be observed in Figure 3-24 where we present relit rendering
using these subsampled light stages. We see that removing lights reduces accuracy for
both methods, but that our synthesized shadows remain relatively sharp: Ghosting
artifacts appear only at 𝑛 = 100. In comparison, Linear Blending produces ghosting
artifacts near shadow boundaries across all values of 𝑛. At test time, LSSR can also
produce accurate shadows and sharp highlights. Please refer to our supplementary
video for the qualitative comparison.

Figure 3-24: LSSR vs. linear blending: relighting with sparser lights. As we decrease
the number of available lights from 𝑛 = 302 to 100, the quality of LSSR’s rendered
shadow degrades slowly. Linear blending, in contrast, is unable to produce an accurate
rendering even with access to all lights.

For NLT, we perform a similar analysis where we artificially downsample the lights
on the light stage to study the effects of light density on NLT’s relighting performance.
We use only 60% of the lights to train a relighting model, which translates to around
75 lights per camera. Although the model is still able to relight the person as shown
in Figure 3-25, inspection reveals that the relit image has ghosting shadows like those
often observed in Barycentric Blending.

Figure 3-25: NLT relighting with sparser lights. When only 60% of the lights are used
to train a relighting model, we observe ghosting shadows in our relit rendering (yellow
arrow), similar to those produced by barycentric blending.

3.6.4 Degrading the Input Geometry Proxy

Here we analyze how our model performs with respect to different factors. We show
that as the geometry degrades, our neural rendering approach consistently outper-
forms traditional reprojection-based methods, which heavily rely on the geometry
quality. In relighting, we show that our model performs reasonably when the number
of illuminants is reduced, demonstrating the potential applicability of NLT to smaller
light stages.
Because NLT leverages a geometry proxy to generate a texture parameterization,
we study its robustness against geometry degradation in the context of view synthesis.
We decimate our mesh progressively from the original 100,000 vertices down to only
500 vertices (bottom left of Figure 3-26). At each mesh resolution, we train one
NLT model with 𝐾 = 3 nearby views and evaluate it on the held-out views. With
the geometry proxy, one can also reproject nearby observed views to the query view,
followed by different types of blending [Buehler et al., 2001, Eisemann et al., 2008].
We compare NLT against Eisemann et al. [2008] at each decimation level.
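
The decimation step itself can be reproduced with an off-the-shelf quadric simplifier. The sketch below uses Open3D (assumed installed); its simplifier targets a triangle count rather than a vertex count, so the mapping from the vertex budgets in Figure 3-26 (roughly two triangles per vertex for a closed mesh) is approximate and illustrative, and "proxy.obj" is a placeholder path.

```python
import open3d as o3d


def decimate_mesh(path, target_vertices):
    """Progressively simplify a geometry proxy with quadric edge collapse."""
    mesh = o3d.io.read_triangle_mesh(path)
    # A closed triangle mesh has roughly twice as many faces as vertices.
    target_faces = max(4, 2 * target_vertices)
    simplified = mesh.simplify_quadric_decimation(
        target_number_of_triangles=target_faces)
    simplified.compute_vertex_normals()
    return simplified


# Proxies at decreasing resolution (vertex budgets as in Figure 3-26):
# for n in [100_000, 50_000, 10_000, 5_000, 1_000, 500]:
#     decimate_mesh("proxy.obj", n)
```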

(Left plot: LPIPS of NLT [ours] and Floating Textures [2008] at mesh resolutions of
0.5k, 1k, 5k, 10k, 50k, and 100k vertices. Right panels: qualitative comparisons at 0.5k,
5k, and 100k [original] vertices.)

Figure 3-26: Performance of NLT w.r.t. quality of the geometry proxy. As we dec-
imate the geometry proxy from 100, 000 vertices down to only 500 vertices, NLT
remains performant in terms of LPIPS (lower is better; bands indicate 95% confi-
dence intervals), while Floating Textures, a reprojection-based method, suffers from
the low quality of the geometry proxy, producing missing pixels (e.g., in the hair) and
misplaced high-frequency patterns (e.g., shadow boundaries), as highlighted by the
yellow arrows. Both NLT and Floating Textures use the same three nearby views.

As Figure 3-26 shows, even at the extreme decimation level of 500 vertices, NLT
produces reasonable rendering with no missing pixels, because it has learned to hallu-
cinate pixels that are non-visible from any of the nearby views. In contrast, Floating
Textures [Eisemann et al., 2008] leaves missing pixels unfilled (e.g., in the hair) due
to reprojection errors stemming from the rough geometry proxy. As the geometry
proxy gets more accurate, Floating Textures improves but still struggles to render
high-frequency patterns correctly (such as the shadow boundary beside the nose,
highlighted by a yellow arrow), even at the original mesh resolution. In comparison,
the high-frequency patterns in the NLT rendering match the ground truth. Quanti-
tatively, NLT also outperforms Floating Textures in terms of LPIPS (lower is better)
across all mesh resolutions.

3.7 Conclusion

The light stage is a crucial tool for enabling the image-based relighting of human
subjects in novel environments, but as we have demonstrated, light stage scans are
undersampled w.r.t. the angle of incident light, which means that synthesizing virtual
lights by simply combining images would result in ghosting in shadows and specu-
lar highlights. We have presented a learning-based solution, Light Stage Super-
Resolution (LSSR) [Sun et al., 2020], for super-resolving light stage scans, thereby
allowing us to create a “virtual” light stage with a much higher angular lighting res-
olution and therefore render accurate shadows and highlights under high-frequency
lighting.
Our network works by embedding input images from the light stage into a learned
space where network activations can then be averaged, and then decoding those acti-
vations according to some query light direction to reconstruct an image. In construct-
ing LSSR, we have identified two critical issues: an overly regular sampling pattern
in the light stage training data and aliasing introduced when pooling activations of a
set of nearest neighbors. These issues are addressed through our use of dropout-like
supersampling of neighbors in our active set and our alias-free pooling technique. By
combining ideas from conventional linear interpolation with the expressive power of
deep neural networks, LSSR is able to produce renderings where shadows and high-
lights move smoothly as a function of the light direction.
This work is by no means the final word for the task of light stage super-resolution
or image-based relighting. Approaches similar to ours could be applied to other
general light transport acquisition problems, to other physical scanning setups, or to
other kinds of objects besides human subjects. Although our network can work on
inputs with different image resolutions, GPU memory has been a major bottleneck
in applying our approach to images with much higher resolutions, such as 4K. A much
more memory-efficient approach to light stage super-resolution would be needed for
production-level use in the visual effects industry.
Although we exclusively pursue the One-Light-at-A-Time (OLAT) scanning ap-
proach with our light stage, alternative patterns where multiple lights are active
simultaneously could be explored, which may enable a sparser light stage design. De-
spite the undersampling of the light stage being self-evident in our visualizations, it
may be interesting to develop a formal theory of this undersampling w.r.t. materials
and camera resolution, so as to understand what degree of undersampling can be
tolerated in the limit. We have made a first step in this direction with the graph in
Figure 3-22. We believe that LSSR represents an exciting direction for future research
and has the potential to further cut the cost for reproducing accurate high-frequency
relighting effects.
What remains unaddressed by LSSR is viewpoint change, i.e., the task of view
synthesis. We therefore proposed Neural Light Transport (NLT) [Zhang et al.,
2021b], a semi-parametric deep learning framework that supports both light and
viewpoint interpolation, thereby enabling simultaneous relighting and view synthesis
of full-body scans of human subjects.
Our approach is enabled by prior work [Guo et al., 2019] that provides a method
for recovering geometric models and texture atlases, and uses as input OLAT images
captured by a light stage. Our model works by embedding a deep neural network into
the UV texture space provided by a mesh and texture atlas, and then training that
model to synthesize texture-space RGB images corresponding to observed light and
viewing directions. Our model consists of a dual-path neural network architecture for
aggregating information from observed images and synthesizing new images, which
is further enhanced through the use of augmented texture-space inputs that leverage
insights from conventional graphics techniques and a residual learning scheme that
allows training to focus on higher-order light transport effects such as highlights, scat-
tering, and global illumination. Multiple comparisons and experiments demonstrate
clear improvement over previous specialized relighting or view synthesis solutions,
and our approach additionally enables simultaneous relighting and view synthesis.
Our method has occasional failure modes as shown in Figure 3-27, where complex
light transport effects, such as the ones on the glittery chain, are hard to synthesize,
and the final renderings lack high-frequency details.

Figure 3-27: A failure case of NLT’s view synthesis. NLT may fail to synthesize views
of complicated light transport effects such as those on the glittery chain.

Similar to recent neural rendering approaches [Lombardi et al., 2018, 2019, Thies
et al., 2019, Sitzmann et al., 2019a, Mildenhall et al., 2020], NLT must be trained
individually per scene, and generalizing to unseen scenes is an important future step
for the field. In addition, neural rendering of dynamic scenes is desirable, especially
in this case of human subjects. Using a fixed texture atlas may directly enable our
method to work for dynamic performers.
Additionally, the fixed 1024×1024 resolution of our texture-space model limits our
model’s ability to synthesize higher-frequency contents, especially when the camera
zooms very close to the subject, or when an image patch is allocated too few texels
(see the hair artifact in Figure 3-17 [D]). This could be solved by training on higher-
resolution images, but this would increase memory requirements and likely require
significant engineering effort.

Chapter 4

High-Level Abstraction: Data-Driven Shape Reconstruction

In this chapter, we study “shape from appearance” at a high level of abstraction,


using a data-driven approach without modeling reflectance or lighting. Specifically,
we address the problem of 3D shape reconstruction from single images. We start with
an introduction of two common problems in the existing solutions to this problem:
Supervised training tends to produce blurry mean shapes, and the models cannot
generalize to novel shape categories unseen during training (Section 4.1). We then
review the related work in Section 4.2.
Next, we present ShapeHD [Wu et al., 2018] that is capable of reconstructing
high-quality 3D shapes with structural details, addressing the first problem with an
adversarially learned naturalness loss (Section 4.3). To address the second problem,
we further devise Generalizable Reconstruction (GenRe) [Zhang et al., 2018b]
that generalizes to novel shape categories, including non-rigid shapes, despite being
trained only on cars, chairs, and airplanes (Section 4.4). Because the community lacks a high-
quality evaluation dataset of images and their corresponding 3D models, we present
how we built our own dataset, Pix3D [Sun et al., 2018b], that provides accurate
alignment of the 3D models to the images (Section 4.5).
In Section 4.6, we describe our experiments that study the characteristics of Pix3D,
evaluate how well ShapeHD completes 3D shapes from a single-view depth map and
reconstructs 3D shapes from a single-view RGB image, and demonstrate how GenRe
is capable of reconstructing shapes from novel class categories unseen during training.
We also perform additional analyses, in Section 4.7, to study if any object detector
emerges naturally from training ShapeHD’s network for shape reconstruction, how the
naturalness loss adds structural details to the ShapeHD output, when ShapeHD tends
to fail, how the input viewpoint affects GenRe’s generalization power, and finally
whether GenRe is able to reconstruct non-rigid shapes and simple shape primitives
when trained only on cars, chairs, and airplanes.

4.1 Introduction

In this chapter, we aim to push the limits of 3D shape completion from a single depth
image and of 3D shape reconstruction from a single color image. Specifically, our
goals are to develop models that achieve high-quality reconstruction with structural
details (ShapeHD [Wu et al., 2018]) and generalize beyond the training shape classes
to unseen shape categories (GenRe [Zhang et al., 2018b]). Towards these goals, we
built Pix3D [Sun et al., 2018b], a real-world dataset of images and the 3D shapes
inside with pixel-level alignment.
Recently, researchers have made impressive progress on these tasks [Choy
et al., 2016, Tulsiani et al., 2017, Dai et al., 2017], making use of gigantic 3D datasets
[Chang et al., 2015, Xiang et al., 2014, 2016]. Many of these methods tackle the ill-
posed nature of the problem by using deep convolutional networks to regress possible
3D shapes. Leveraging the power of deep networks, their systems learn to avoid
producing implausible shapes (Figure 4-1 [b]). However, from Figure 4-1 (c) we
see that there is still ambiguity that a supervised network fails to model: From
just one view (Figure 4-1 [a]), there exist multiple natural shapes that explain the
observation equally well. In other words, there is no deterministic ground truth for
each observation. Through pure supervised learning, the network tends to generate
blurry “mean shapes” that minimize the loss, precisely because of this ambiguity.
To tackle this issue and enable higher-quality reconstruction with structural de-

[Figure 4-1 panels: (a) Observation, (b) Unnatural Shapes, (c) Natural Shapes]

Figure 4-1: Two levels of ambiguity in single-view 3D shape perception. For each
2D observation (a), there exist many possible 3D shapes that explain this observa-
tion equally well (b, c), but only a small fraction of them correspond to real, daily
shapes (c). Methods that exploit deep networks for recognition reduce, to a certain
extent, ambiguity on this level. By using an adversarially learned naturalness model,
ShapeHD aims to model ambiguity on the next level: Even among the realistic shapes,
there are still multiple shapes explaining the observation well (c).

tails, we propose ShapeHD that completes or reconstructs a 3D shape by combining


deep volumetric convolutional networks with shape priors learned in an adversarial
manner [Wu et al., 2018]. The learned shape priors penalize the model only if the
generated shape is unrealistic, not when it deviates from the ground truth. This over-
comes the difficulty discussed above. Our model characterizes this naturalness loss
through adversarial learning, a research topic that has received immense attention
in recent years and is still rapidly growing [Goodfellow et al., 2014, Radford et al.,
2016, Wu et al., 2016]. Experiments on multiple synthetic and real datasets suggest
that ShapeHD performs well on single-view 3D shape completion and reconstruction.
Further analyses reveal that the network learns to attend to meaningful object parts,
and the naturalness module indeed helps in characterizing shape details over time.
All methods aforementioned, including ShapeHD, learn a parametric function
𝑓2D→3D , implemented as deep neural networks, that maps a 2D image to its corre-
sponding 3D shape. Essentially, 𝑓2D→3D encodes shape priors (“what realistic shapes
look like”), often learned from large shape repositories such as ShapeNet [Chang et al.,

2015]. Because the problem is well-known to be ill-posed—there exist many 3D expla-
nations for any 2D visual observation—modern systems have explored looping in var-
ious structures into this learning process. For example, Wu et al. [2017] use intrinsic
images or 2.5D sketches [Marr, 1982] as an intermediate representation, and concatenate
two learned mappings for shape reconstruction: 𝑓2D→3D = 𝑓2.5D→3D ∘ 𝑓2D→2.5D .
These methods, however, ignore the fact that mapping a 2D image or a 2.5D
sketch to a 3D shape involves complex but deterministic geometric projections (see
Section 1.1.4). Simply using a neural network to approximate these projections,
instead of modeling this mapping explicitly, leads to inference models that are over-
parametrized (and hence subject to overfitting training classes). It also misses valu-
able inductive biases that can be wired in through such projections. Both of these
factors contribute to poor generalization to unseen classes.
In contrast to these artificial systems, humans can imagine, from just a single
image, the full 3D shape of a novel object that they have never seen before. Vision researchers
have long argued that the key to this ability may be a sophisticated hierarchy of
representations, extending from images through surfaces to volumetric shape, which
process different aspects of shape in different representational formats [Marr, 1982].
In the remainder of this chapter, we explore how these ideas can be integrated into
single-image 3D shape reconstruction to enable generalization to novel shape classes
unseen during training.
To this end, we propose to disentangle geometric projections from shape recon-
struction to better generalize to unseen shape categories. Building upon the MarrNet
framework [Wu et al., 2017], we further decompose 𝑓2.5D→3D into a deterministic ge-
ometric projection 𝑝 from 2.5D to a partial 3D model and a learnable completion 𝑐
of the 3D model. A straightforward version of this idea would be to perform shape
completion in the 3D voxel grid: 𝑓2.5D→3D = 𝑐3D→3D ∘ 𝑝2.5D→3D . However, shape com-
pletion in 3D is challenging, as the manifold of plausible shapes is sparser in 3D than
in 2D, and empirically this fails to reconstruct shapes well.
Instead we perform completion based on spherical maps. Spherical maps are
surface representations defined on the UV coordinates of a unit sphere, where the

value at each coordinate is calculated as the minimal distance travelled from this point
to the 3D object surface along the sphere’s radius. Such a representation combines
appealing features of 2D and 3D: Spherical maps are a form of 2D images, on which
neural inpainting models work well; but they have a semantics that allows them to
be projected into 3D to recover full shape geometry. They essentially allow us to
complete non-visible object surfaces from visible ones, as a further intermediate step
to full 3D reconstruction. We now have 𝑓2.5D→3D = 𝑝S→3D ∘ 𝑐S→S ∘ 𝑝2.5D→S , where S
stands for spherical maps.

Our full model, named Generalizable Reconstruction (GenRe), thus com-


prises three cascaded, learnable modules connected by fixed geometric projections.
First, a single-view depth estimator predicts depth from a 2D image (𝑓2D→2.5D ); the
depth map is then projected into a spherical map (𝑝2.5D→S ). Second, a spherical map
inpainting network inpaints the partial spherical map (𝑐S→S ); the inpainted spherical
map is then projected into 3D voxels (𝑝2.5D→3D ). Finally, we introduce an additional
voxel refinement network to refine the estimated 3D shape in voxel space. Our neural
modules only have to model object geometry for reconstruction, without having to
learn geometric projections. This enhances generalizability, along with several other
factors: During training, our modularized design forces each module of the network
to use features from the previous module, instead of directly memorizing shapes from
the training classes; also, each module only predicts outputs that are in the same
domain as its inputs (image or voxel space), which leads to more regular mappings.

GenRe achieves state-of-the-art performance on reconstructing shapes both within


and outside training classes. Figure 4-2 shows examples of our model reconstructing
a table and a bed from single images, after training only on cars, chairs, and air-
planes. We also present detailed analyses of how each component contributes to the
final prediction. With GenRe, we emphasize the task of generalizable single-image
3D shape reconstruction, disentangle geometric projections from shape reconstruc-
tion, and integrate spherical maps with differentiable, deterministic projections into
a neural model.


Figure 4-2: GenRe overview. We study the task of generalizable single-image 3D


reconstruction, aiming to reconstruct the 3D shape of an object outside training
classes. Here we show a table and a bed reconstructed from single RGB images by
GenRe trained on cars, chairs, and airplanes. GenRe learns to reconstruct objects
outside the training classes.

4.2 Related Work

In this section, we review the related work on 3D shape completion, single-image


3D reconstruction, 2.5D sketch recovery, perceptual losses and adversarial learning,
spherical projections, zero- and few-shot recognition, and 3D shape datasets.

4.2.1 3D Shape Completion

Shape completion is an essential task in geometry processing and has wide applica-
tions. Traditional methods have attempted to complete shapes with local surface
primitives, or to formulate it as an optimization problem [Nealen et al., 2006, Sorkine
and Cohen-Or, 2004], e.g., Poisson surface reconstruction solves an indicator function
on a voxel grid via the Poisson equation [Kazhdan and Hoppe, 2013, Kazhdan et al.,
2006]. Recently, there have also been a growing number of papers on exploiting shape
structures and regularities [Mitra et al., 2006, Thrun and Wegbreit, 2005] and papers
on leveraging strong database priors [Sung et al., 2015, Li et al., 2015, Brock et al.,
2016]. These methods, however, often require the database to contain exact parts of
the shape, and thus have limited generalization power.
With the advances in large-scale shape repositories like ShapeNet [Chang et al.,
2015], researchers began to develop fully data-driven methods, some building upon
deep convolutional networks. To name a few, Voxlets [Firman et al., 2016] employs
random forests for predicting unknown voxel neighborhoods. Wu et al. [2015] use
a deep belief network to obtain a generative model for a given shape database, and

Thanh Nguyen et al. [2016] extend the method for mesh repairing.

Probably the most related paper to ShapeHD is 3D-EPN [Dai et al., 2017], which
achieves impressive results on 3D shape completion from partial depth scans by leverag-
ing 3D convolutional networks and non-parametric patch-based shape synthesis meth-
ods. ShapeHD has advantages over 3D-EPN in two aspects. First, with a naturalness
loss, ShapeHD can choose among multiple hypotheses that explain the observation,
therefore reconstructing a high-quality 3D shape with fine details; in contrast, the
output from 3D-EPN without non-parametric shape synthesis is often blurry. Sec-
ond, our completion takes a single feed-forward pass without any postprocessing, and
is thus much faster (<100 ms) than 3D-EPN.

4.2.2 Single-Image 3D Reconstruction

The problem of recovering the object shape from a single image is challenging, as it
requires both powerful recognition systems and prior shape knowledge. As an early
attempt, Huang et al. [2015] propose to borrow shape parts from existing Computer-
Aided Design (CAD) models. With the development of large-scale shape repositories
like ShapeNet [Chang et al., 2015] and methods like deep convolutional networks,
researchers have built more scalable and efficient models in recent years [Kar et al.,
2015, Choy et al., 2016, Girdhar et al., 2016, Rezende et al., 2016, Tatarchenko et al.,
2016, Wu et al., 2016, Yan et al., 2016, Häne et al., 2017, Novotny et al., 2017,
Tulsiani et al., 2017, Wu et al., 2017]. While most of these approaches encode objects
in voxels from vision, there have also been attempts to reconstruct objects in point
clouds [Fan et al., 2017, Groueix et al., 2018] or octrees [Riegler et al., 2017a,b,
Tatarchenko et al., 2017]. The shape priors learned in these approaches, however, are
in general only applicable to their training classes, with very limited generalization
power for reconstructing shapes from unseen categories. In contrast, GenRe exploits
2.5D sketches and spherical representations for better generalization to objects outside
training classes.

4.2.3 2.5D Sketch Recovery

A related direction is to estimate 2.5D sketches (e.g., depth and surface normal maps)
from an RGB image. The origin of intrinsic image estimation dates back to the early
years of computer vision [Barrow and Tenenbaum, 1978]. Over the years, researchers
have explored recovering 2.5D sketches from texture, shading, or color images [Horn
and Brooks, 1989, Zhang et al., 1999, Weiss, 2001, Tappen et al., 2003, Bell et al., 2014,
Barron and Malik, 2014]. With the development of depth sensors [Izadi et al., 2011]
and larger-scale RGB-D datasets [Silberman et al., 2012, Song et al., 2017, McCormac
et al., 2017], there have also been papers on estimating depth [Eigen and Fergus, 2015,
Chen et al., 2016], surface normals [Wang et al., 2015, Bansal and Russell, 2016],
and other intrinsic images [Janner et al., 2017, Shi et al., 2017] with deep networks.
Inspired by MarrNet [Wu et al., 2017], we reconstruct 3D shapes via modeling 2.5D
sketches but incorporating a naturalness loss for much higher quality, and focus on
reconstructing shapes from novel shape categories unseen during training.

4.2.4 Perceptual Losses & Adversarial Learning

Researchers recently proposed to evaluate the quality of 2D images using perceptual


losses [Johnson et al., 2016, Dosovitskiy and Brox, 2016]. The idea has been applied to
many image tasks like style transfer and super-resolution [Johnson et al., 2016, Ledig
et al., 2017]. Furthermore, the idea has been extended to learn a perceptual loss func-
tion with Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. GANs
incorporate an adversarial discriminator into the procedure of generative modeling,
and achieve impressive performance on tasks like image synthesis [Radford et al.,
2016]. Isola et al. [2016] and Zhu et al. [2016] use GANs for image translation with
and without supervision, respectively.
In 3D vision, Wu et al. [2016] extend GANs for 3D shape synthesis. However, their
model for shape reconstruction (3D-VAE-GAN) often produces a noisy, incomplete
shape given an RGB image. This is because training GANs jointly with recogni-
tion networks could be highly unstable. Many other researchers have also noticed

this issue: Although adversarial modeling of 3D shape space may resolve the ambi-
guity discussed earlier, its training could be challenging [Dai et al., 2017]. Due to
this issue, when Gwak et al. [2017] explored adversarial networks for single-image
3D reconstruction, they opted to use GANs to model 2D projections instead of 3D
shapes. This weakly supervised setting, however, hampers their reconstructions. In
ShapeHD, we develop our naturalness loss by adversarial modeling of the 3D shape
space, outperforming the state of the art significantly.

4.2.5 Spherical Projections

Spherical projections have been shown effective in 3D shape retrieval [Esteves et al.,
2018], classification [Cao et al., 2017], and finding possible rotational as well as reflec-
tive symmetries [Kazhdan et al., 2004, 2002]. Recent papers [Cohen et al., 2018, 2017]
have studied differentiable, spherical convolution on spherical projections, aiming to
preserve rotational equivariance within a neural network. These designs, however,
perform convolution in the spectral domain with limited frequency bands, causing
aliasing and loss of high-frequency information. In particular, convolution in the
spectral domain is not suitable for shape reconstruction where the quality highly de-
pends on the high-frequency components. In addition, the ringing effects caused by
aliasing would introduce undesired artifacts.

4.2.6 Zero- & Few-Shot Recognition

In computer vision, abundant attempts have been made to tackle the problem of
few-shot recognition. We refer readers to the review article [Xian et al., 2017] for a
comprehensive list. A number of earlier papers have explored sharing features across
categories to recognize new objects from a few examples [Bart and Ullman, 2005,
Torralba et al., 2007, Farhadi et al., 2009, Lampert et al., 2009]. More recently, many
researchers have begun to study zero- or few-shot recognition with deep networks
[Antol et al., 2014, Akata et al., 2016, Wang and Hebert, 2016, Hariharan and Gir-
shick, 2017, Wang et al., 2017]. In particular, Peng et al. [2015] explored the idea of

learning to recognize novel 3D models via domain adaptation.
While these proposed methods are for recognizing and categorizing images or
shapes, in GenRe we explore reconstructing the 3D shape of an object from unseen
classes. This problem has received little attention in the past, possibly due to its
considerable difficulty. A few imaging systems have attempted to recover 3D shapes
from single shots by making use of special cameras [Proesmans et al., 1996, Sagawa
et al., 2011]. In contrast, we study 3D reconstruction from a single RGB image.
Very recently, researchers have begun to look at the generalization power of 3D re-
construction algorithms [Rock et al., 2015, Jayaraman et al., 2018, Funk and Liu,
2017, Shin et al., 2018]. Here we present a novel approach that makes use of spherical
representations for better generalization.

4.2.7 3D Shape Datasets

For decades, researchers have been building datasets of 3D objects, either as a reposi-
tory of 3D CAD models [Bogo et al., 2014, Shilane et al., 2004, Bronstein et al., 2008]
or as images of 3D shapes with pose annotations [Leibe and Schiele, 2003, Savarese
and Fei-Fei, 2007]. Both directions have witnessed the rapid development of web-scale
databases: ShapeNet [Chang et al., 2015] was proposed as a large repository of more
than 50k models covering 55 categories, and Xiang et al. [2014] built Pascal 3D+ and
ObjectNet3D [Xiang et al., 2016], two large-scale datasets with alignment between
2D images and the 3D shapes inside. While these datasets have helped in advancing
the field of 3D shape modeling, they have their respective limitations: Datasets like
ShapeNet or Elastic2D3D [Lahner et al., 2016] do not have real images, and recent 3D
reconstruction challenges using ShapeNet have to be exclusively on synthetic images
[Yi et al., 2017]; Pascal 3D+ and ObjectNet3D have only rough alignment between
images and shapes, because objects in the images are matched to a pre-defined set of
CAD models, not their actual shapes. This has limited their usage as a benchmark
for 3D shape reconstruction [Tulsiani et al., 2017].
With depth sensors like Kinect [Izadi et al., 2011, Janoch et al., 2011], the commu-
nity has built various RGB-D or depth-only datasets of objects and scenes. We refer

readers to the review article from Firman [Firman, 2016] for a comprehensive list.
Among those, many object datasets are designed for benchmarking robot manipula-
tion [Calli et al., 2015, Hodan et al., 2017, Lai et al., 2011, Singh et al., 2014]. These
datasets often contain a relatively small set of hand-held objects in front of clean back-
grounds. Tanks and Temples [Knapitsch et al., 2017] is an exciting new benchmark
with 14 scenes, designed for high-quality, large-scale, multi-view 3D reconstruction.
In comparison, our dataset, Pix3D [Sun et al., 2018b], focuses on reconstructing a 3D
object from a single image, and contains much more real-world objects and images.
Probably the dataset closest to Pix3D is the large collection of object scans from
Choi et al. [2016], which contains a rich and diverse set of shapes, each with an RGB-
D video. Their dataset, however, is not ideal for single-image 3D shape modeling for
two reasons. First, the object of interest may be truncated throughout the video;
this is especially the case for large objects like sofas. Second, their dataset does
not explore the various contexts that an object may appear in, as each shape is only
associated with one scan. In Pix3D, we address both problems by leveraging powerful
web search engines and crowdsourcing.
Another closely related benchmark is IKEA [Lim et al., 2013], which provides
accurate alignment between images of IKEA objects and 3D CAD models. This
dataset is therefore particularly suitable for fine pose estimation. However, it contains
only 759 images and 90 shapes, relatively small for shape modeling (only 90 of the 219
shapes in the IKEA dataset have associated images). In contrast, Pix3D contains
10,069 images (13.3×) and 395 shapes (4.4×) of greater variations.

4.3 Method: Learning & Using Shape Priors

In this section, we briefly present how ShapeHD [Wu et al., 2018] achieves high-quality
single-image 3D shape reconstruction with fine details by incorporating adversarially
learned priors into MarrNet [Wu et al., 2017].
ShapeHD consists of three components: a 2.5D sketch estimator and a 3D shape
estimator that predicts a 3D shape from an RGB image via 2.5D sketches (Figure 4-3
[I, II], inspired by MarrNet), and a deep naturalness model that penalizes the shape
estimator if the predicted shape is unnatural (Figure 4-3 [III]). Models trained with a
supervised reconstruction loss alone often generate blurry mean shapes. Our learned
naturalness model helps in avoiding this issue.

[Figure 4-3 diagram: 2D image → (I) 2.5D Sketch Estimation → 2.5D depth → (II) 3D Shape Completion → 3D shape → (III) Shape Naturalness, with (a) voxel and (b) naturalness losses]

Figure 4-3: ShapeHD model. For single-view shape reconstruction, ShapeHD com-
prises three components: (I) a 2.5D sketch estimator that predicts depth, surface
normal, and silhouette images from a single image, (II) a 3D shape completion mod-
ule that regresses 3D shapes from silhouette-masked depth and surface normal images,
and (III) an adversarially pretrained convolutional network that serves as the natu-
ralness loss function. While finetuning the 3D shape completion network, we use two
losses: a supervised loss on the output shape and a naturalness loss offered by the
pretrained discriminator.

2.5D Sketch Estimation Network Our 2.5D sketch estimator has an encoder-
decoder structure that predicts the object’s depth, surface normals, and silhouette
from an RGB image (Figure 4-3 [I]). We use a ResNet-18 [He et al., 2016] to encode
a 256 × 256 image into 512 feature maps of size 8 × 8. The decoder consists of four
transposed convolutional layers with a kernel size of 5×5, and a stride and padding of
2. The predicted depth and surface normal images are then masked by the predicted
silhouette and used as the input to our shape completion network.

3D Shape Completion Network Our 3D estimator (Figure 4-3 [II]) is an encoder-


decoder network that predicts a 3D shape in the canonical view from 2.5D sketches.
The encoder is adapted from ResNet-18 [He et al., 2016] to encode a four-channel

256 × 256 image (one for depth and three for surface normals) into a 200-D latent
vector. The vector then goes through a decoder of five transposed convolutional and
ReLU layers to generate a 128×128×128 voxelized shape. Binary cross-entropy losses
between predicted and target voxels are used as the supervised loss ℓvoxel .
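
To make the decoder structure concrete, below is a minimal PyTorch sketch of a five-layer transposed-convolution decoder that maps a 200-D latent vector to a 128×128×128 occupancy grid trained with binary cross-entropy. The channel widths, the initial 4³ feature grid, and the linear bridge from the latent vector are our assumptions for illustration; only the overall structure follows the description above.

```python
# A minimal sketch of the voxel decoder and supervised loss (assumed widths).
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, latent_dim=200):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4 * 4)   # 200-D latent -> 4^3 feature grid
        chans = [256, 128, 64, 32, 16]
        layers = []
        for c_in, c_out in zip(chans, chans[1:] + [1]):
            # each layer doubles the spatial resolution: 4 -> 8 -> 16 -> 32 -> 64 -> 128
            layers += [nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        layers[-1] = nn.Sigmoid()        # final activation gives [0, 1] occupancy
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 4, 4, 4)
        return self.deconv(x)            # (B, 1, 128, 128, 128)

decoder = VoxelDecoder()
z = torch.randn(2, 200)                  # latent vectors from the ResNet-18 encoder
pred = decoder(z)
target = torch.randint(0, 2, pred.shape).float()
loss_voxel = nn.functional.binary_cross_entropy(pred, target)   # supervised voxel loss
```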

4.3.1 Shape Naturalness Network

Due to the inherent uncertainty of single-view 3D shape reconstruction, shape com-


pletion networks with only a supervised loss usually predict unrealistic mean shapes.
By doing so, they minimize the loss when there exist multiple possible ground truth
shapes. We instead introduce an adversarially trained deep naturalness regularizer
that penalizes the network for such unrealistic shapes.
We pretrain a 3D Generative Adversarial Network (GAN) [Goodfellow et al., 2014]
to determine whether a shape is realistic. Its generator synthesizes a 3D shape from a
randomly sampled vector, and its discriminator distinguishes generated shapes from
real ones. Therefore, the discriminator has the ability to model the real shape distri-
bution and can be used as a naturalness loss for the shape completion network. The
generator is not involved in our later training process. Following 3D-GAN [Wu et al.,
2016], we use five transposed convolutional layers with batch normalization and Rec-
tified Linear Unit (ReLU) for the generator, and five convolutional layers with leaky
ReLU for the discriminator.
Due to the high dimensionality of 3D shapes (128 × 128 × 128), training a GAN
becomes highly unstable. To deal with this issue, we follow Gulrajani et al. [2017]
and use the Wasserstein GAN loss with a gradient penalty to train our adversarial
generative network. Specifically,

$$\ell_{\text{WGAN}} = \mathop{\mathbb{E}}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathop{\mathbb{E}}_{x \sim P_r}[D(x)] + \lambda \mathop{\mathbb{E}}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right], \qquad (4.1)$$

where 𝐷 is the discriminator, and 𝑃𝑔 and 𝑃𝑟 are the distributions of generated shapes and real
shapes, respectively. The last term is the gradient penalty from Gulrajani et al. [2017], in which $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated shapes (defining $P_{\hat{x}}$).
During training, the discriminator attempts to minimize the overall loss ℓWGAN , while

the generator attempts to maximize the loss via the first term in Equation 4.1, so
we can define our naturalness loss as $\ell_{\text{natural}} = -\mathbb{E}_{\tilde{x} \sim P_c}[D(\tilde{x})]$, where $P_c$ is the distribution of shapes reconstructed by our completion network.
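
The sketch below illustrates Equation 4.1 and the derived naturalness loss in PyTorch. The toy discriminator and tensor sizes are placeholders rather than the exact ShapeHD architecture; the interpolation used to realize $P_{\hat{x}}$ and $\lambda = 10$ follow standard WGAN-GP practice [Gulrajani et al., 2017].

```python
# A sketch of the WGAN-GP discriminator loss (Eq. 4.1) and naturalness loss.
import torch
import torch.nn as nn

# a toy 3D discriminator standing in for the five-layer ShapeHD discriminator
D = nn.Sequential(nn.Conv3d(1, 8, 4, stride=4), nn.LeakyReLU(0.2),
                  nn.Conv3d(8, 1, 4, stride=4), nn.Flatten(), nn.Linear(512, 1))

def gradient_penalty(D, real, fake, lam=10.0):
    # x_hat lies on straight lines between real and generated shapes (P_x_hat)
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def wgan_discriminator_loss(D, real, fake):
    # Equation 4.1: E[D(x_tilde)] - E[D(x)] + lambda * gradient penalty
    fake = fake.detach()
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def naturalness_loss(D, completed):
    # l_natural = -E_{x_tilde ~ P_c}[D(x_tilde)] on shapes from the completion network
    return -D(completed).mean()

real = torch.rand(2, 1, 128, 128, 128)
fake = torch.rand(2, 1, 128, 128, 128)
print(wgan_discriminator_loss(D, real, fake).item(), naturalness_loss(D, fake).item())
```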

4.3.2 Training Paradigm

We train our network in two stages. We first pretrain the three components of our
model separately. The shape completion network is then fine-tuned with both voxel
and naturalness losses.

Our 2.5D sketch estimation network and 3D completion network are trained with
images rendered with ShapeNet [Chang et al., 2015] objects (see Section 4.6.1 and
Section 4.6.5 for details). We train the 2.5D sketch estimator using an ℓ2 loss and
Stochastic Gradient Descent (SGD) with a learning rate of 0.001 for 120 epochs.
We only use the supervised loss ℓvoxel for training the 3D estimator at this stage,
again with SGD, a learning rate of 0.1, and a momentum of 0.9 for 80 epochs. The
naturalness network is trained in an adversarial manner, where we use Adam [Kingma
and Ba, 2015] with a learning rate of 0.001 and a batch size of 4 for 80 epochs. We
set 𝜆 = 10 as suggested by Gulrajani et al. [2017].

We then fine-tune our completion network with both voxel loss and naturalness
losses as ℓ = ℓvoxel + 𝛼ℓnatural . We compare the scale of gradients from the losses and
train our completion network with 𝛼 = 2.75 × 10⁻¹¹ using SGD for 80 epochs. Our
model is robust to these parameters; they are only for ensuring gradients of various
losses are of the same magnitude.
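
A compact sketch of one fine-tuning step is shown below, combining the supervised voxel loss with the frozen discriminator's naturalness loss using the 𝛼 above. The function and argument names are ours, the networks are assumed to be given, and the completion network is assumed to end in a sigmoid so that binary cross-entropy applies.

```python
# A minimal sketch of the fine-tuning objective l = l_voxel + alpha * l_natural.
import torch

def finetune_step(completion_net, discriminator, optimizer, sketches_25d, gt_voxels,
                  alpha=2.75e-11):
    """One fine-tuning step; the discriminator's weights stay frozen."""
    pred = completion_net(sketches_25d)                      # (B, 1, 128, 128, 128) in [0, 1]
    l_voxel = torch.nn.functional.binary_cross_entropy(pred, gt_voxels)
    l_natural = -discriminator(pred).mean()                  # naturalness loss from Section 4.3.1
    loss = l_voxel + alpha * l_natural
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```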

An alternative is to jointly train the naturalness module with the completion


network from scratch using both losses. It seems tempting, but in practice we find
that Wasserstein GANs have large losses and gradients, resulting in unstable outputs.
We therefore adopt the pretraining-then-finetuning setup described above.

4.4 Method: Generalizing to Unseen Classes
Single-image reconstruction algorithms learn a parametric function 𝑓2D→3D that maps
a 2D image to a 3D shape. We tackle the problem of generalization to novel shape
classes unseen during training, by regularizing 𝑓2D→3D . The key regularization that we
impose is to factorize 𝑓2D→3D into geometric projections and learnable reconstruction
modules.
Our Generalizable Reconstruction (GenRe) model [Zhang et al., 2018b] consists of
three learnable modules, connected by geometric projections as shown in Figure 4-4.
The first module is a single-view depth estimator 𝑓2D→2.5D (Figure 4-4 [a]), which takes
a color image as input and estimates its depth map. As the depth map can be
interpreted as the visible surface of the object, the reconstruction problem becomes
predicting the object’s complete surface given this partial estimate.

[Figure 4-4 diagram: RGB image → (a) depth → partial spherical map → (b) inpainted spherical map → projected voxels → (c) final 3D shape; stages alternate between geometric projections and network modules]

Figure 4-4: GenRe model. Our model for generalizable single-image 3D reconstruc-
tion (GenRe) has three components: (a) a depth estimator that predicts depth in
the original view from a single RGB image, (b) a spherical inpainting network that
inpaints a partial, single-view spherical map, and (c) a voxel refinement network that
integrates two backprojected 3D shapes (from the inpainted spherical map and from
depth) to produce the final output.

As 3D surfaces are hard to parametrize efficiently, we use spherical maps as a


surrogate representation. A geometric projection module (𝑝2.5D→S ) converts the esti-
mated depth map into a spherical map, referred to as the partial spherical map. It
is then passed to the spherical map inpainting network (𝑐S→S ; Figure 4-4 [b]) to pre-
dict an inpainted spherical map, representing the object’s complete surface. Another
projection module (𝑝S→3D ) projects the inpainted spherical map to the voxel space.
As spherical maps only capture the outermost surface towards the sphere, they
cannot handle self-occlusion along the sphere’s radius. We use a voxel refinement

module (Figure 4-4 [c]) to tackle this problem. It takes two 3D shapes as input, one
projected from the inpainted spherical map and the other from the estimated depth
map, and outputs a final 3D shape.

4.4.1 Single-View Depth Estimator

The first component of our network predicts a depth map from an image with a clean
background. Using depth as an intermediate representation facilitates the reconstruc-
tion process by distilling essential geometric information from the input image [Wu
et al., 2017].
Further, depth estimation is a class-agnostic task: Shapes from different classes
often share common geometric structure, despite distinct visual appearances. Take
beds and cabinets as examples. Although they are of different anatomy in general,
both have perpendicular planes and hence similar patches in their depth images. We
demonstrate this both qualitatively and quantitatively in Section 4.6.6.

4.4.2 Spherical Map Inpainting Network

With spherical maps, we cast the problem of 3D surface completion into 2D spherical
map inpainting. Empirically we observe that networks trained to inpaint spherical
maps generalize well to new shape classes (Figure 4-5). Also, compared with voxels,
spherical maps are more efficient to process, as 3D surfaces are sparse in nature;
quantitatively, as we demonstrate in Section 4.6.7 and Section 4.6.8, using spherical
maps results in better performance.
As spherical maps are signals on the unit sphere, it is tempting to use network
architectures based on spherical convolution [Cohen et al., 2018]. They are however
not suitable for our task of shape reconstruction. This is because spherical convolu-
tion is conducted in the spectral domain. Every conversion to and from the spectral
domain requires capping the maximum frequency, causing extra aliasing and infor-
mation loss. For tasks such as recognition, the information loss may be negligible
compared with the advantage of rotational invariance offered by spherical convolu-


Figure 4-5: GenRe’s spherical inpainting module generalizing to new classes. Trained
on chairs, cars, and planes, the module completes the partially visible leg of the table
(red boxes) and the unseen cabinet bottom (purple boxes) from partial spherical maps
projected from ground-truth depth.

tion. But for reconstruction, the loss leads to blurred output with only low-frequency
components. We empirically find that standard convolution works much better than
spherical convolution under our setup.

4.4.3 Voxel Refinement Network

Although an inpainted spherical map provides a projection of an object’s surface onto


the unit sphere, the surface information is lost when self-occlusion occurs. We use a
refinement network that operates in the voxel space to recover the lost information.
This module takes two voxelized shapes as input, one projected from the estimated
depth map and the other from the inpainted spherical map, and predicts the final
shape. As the occluded regions can be recovered from local neighboring regions,
this network only needs to capture local shape priors and is therefore class-agnostic.
As shown in the experiments, when provided with ground-truth depth and spherical
maps, this module performs consistently well across training and unseen classes.

4.4.4 Technical Details

Here we describe the technical details about the implementation of GenRe.

Single-View Depth Estimator Following Wu et al. [2017], we use an encoder-


decoder network for depth estimation. Our encoder is a ResNet-18 [He et al., 2016],

encoding a 256 × 256 RGB image into 512 feature maps of size 1 × 1. The decoder is a
mirrored version of the encoder, replacing all convolution layers with transposed con-
volution layers. In addition, we adopt the U-Net structure [Ronneberger et al., 2015]
and feed the intermediate outputs of each block of the encoder to the corresponding
block of the decoder. The decoder outputs the depth map in the original view at the
resolution of 256 × 256. We use an ℓ2 loss between predicted and target images.

Spherical Map Inpainting Network The spherical map inpainting network has
a similar architecture as the single-view depth estimator. To reduce the gap between
standard and spherical convolutions, we use periodic padding to both inputs and
training targets in the longitude dimension, making the network aware of the periodic
nature of spherical maps.
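
As a minimal illustration of this periodic padding, the snippet below wraps a spherical map around its longitude axis with PyTorch's circular padding; the (batch, channel, latitude, longitude) layout and the padding width are assumptions for the example.

```python
# Periodic (circular) padding along the longitude axis of a spherical map.
import torch
import torch.nn.functional as F

def pad_longitude(spherical_map: torch.Tensor, pad: int = 8) -> torch.Tensor:
    # wrap the left and right borders around; leave the latitude axis untouched
    return F.pad(spherical_map, (pad, pad, 0, 0), mode="circular")

x = torch.rand(1, 1, 160, 320)       # e.g., a 160 x 320 partial spherical map
print(pad_longitude(x).shape)        # torch.Size([1, 1, 160, 336])
```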

Voxel Refinement Network Our voxel refinement network takes as input voxels
projected from the estimated, original-view depth and from the inpainted spherical
map, and recovers the final shape in voxel space. Specifically, the encoder takes as
input a two-channel 128 × 128 × 128 voxel (one for coarse shape estimation and the
other for surface estimation), and outputs a 320-D latent vector. In decoding, each
layer takes an extra input directly from the corresponding level of the encoder.

Geometric Projections We make use of three geometric projections: a depth to


spherical map projection, a depth map to voxel projection, and a spherical map to
voxel projection. For the depth to spherical map projection, we first convert depth
into 3D point clouds using camera parameters, and then turn them into surfaces with
the Marching Cubes algorithm [Lorensen and Cline, 1987, Lewiner et al., 2003]. Then,
the spherical representation is generated by casting rays from each UV coordinate on
the unit sphere to the sphere’s center. This process is not differentiable. To project
depth or spherical maps into voxels, we first convert them into 3D point clouds.
Then, a grid of voxels is initialized, where the value of each voxel is determined by
the average distance from all the points inside it to its center. Then, for all the
voxels that contain points, we negate their values and add 1. This projection process

is fully differentiable.
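
The following NumPy sketch illustrates the point-cloud-to-voxel projection just described: each occupied voxel stores one minus the average distance from the points it contains to its center, and empty voxels remain zero. The unit-cube normalization, resolution, and distance scaling are our own illustrative assumptions.

```python
# A sketch of the point-cloud-to-voxel projection (points assumed in [0, 1)^3).
import numpy as np

def points_to_voxels(points: np.ndarray, res: int = 128) -> np.ndarray:
    """points: (N, 3) array in [0, 1)^3 -> (res, res, res) voxel grid."""
    grid = np.zeros((res, res, res), dtype=np.float32)
    counts = np.zeros_like(grid)
    idx = np.clip((points * res).astype(int), 0, res - 1)       # voxel index of each point
    centers = (idx + 0.5) / res                                  # voxel centers in [0, 1)^3
    dist = np.linalg.norm(points - centers, axis=1) * res        # distance in voxel units
    np.add.at(grid, tuple(idx.T), dist)                          # accumulate distances
    np.add.at(counts, tuple(idx.T), 1.0)
    occupied = counts > 0
    grid[occupied] = 1.0 - grid[occupied] / counts[occupied]     # negate the average and add 1
    return grid

pts = np.random.rand(10000, 3)
vox = points_to_voxels(pts)
print(vox.shape, vox.max())
```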

Training We train our network with viewer-centered 3D supervision, where the 3D


shape is rotated to match the object’s pose in the input image. This is in contrast to
object-centered approaches, where the 3D supervision is always in a predefined pose
regardless of the object’s pose in the input image. Object-centered approaches are
less suitable for reconstructing shapes from new categories, as predefined poses are
unlikely to generalize across categories.
We first train the 2.5D sketch estimator with RGB images and their correspond-
ing depth images, all rendered with ShapeNet [Chang et al., 2015] objects (see Sec-
tion 4.6.1 and the supplemental material—Chapter D—for details). We then train
the spherical map inpainting network with single-view (partial) spherical maps and
the ground-truth full spherical maps as supervision. Finally, we train the voxel re-
finement network on coarse shapes predicted by the inpainting network as well as
3D surfaces backprojected from the estimated 2.5D sketches, with the corresponding
ground-truth shapes as supervision. We then jointly fine-tune the spherical inpainting
module and the voxel refinement module with both 3D shape and 2D spherical map
supervision.

4.5 Method: Building a Real-World Dataset

Looking into Figure 4-6, we realize existing datasets have limitations for the task of
modeling a 3D object from a single image. ShapeNet [Chang et al., 2015] is a large
dataset for 3D models, but does not come with real images; Pascal 3D+ [Xiang et al.,
2014] and ObjectNet3D [Xiang et al., 2016] have real images, but the image-shape
alignment is rough because the 3D models do not match the objects in images; IKEA
[Lim et al., 2013] has high-quality image-3D alignment, but it only contains 90 3D
models and 759 images.
We desire a dataset that has all three merits—a large-scale dataset of real images
and ground-truth shapes with precise 2D-3D alignment. Our dataset, named Pix3D,

[Figure 4-6 panels: our dataset, Pix3D, with well-aligned images and shapes vs. existing datasets with mismatched 3D shapes and imprecise pose annotations]

Figure 4-6: Pix3D vs. existing datasets. We present Pix3D, a new large-scale dataset
of diverse image-shape pairs. Each 3D shape in Pix3D is associated with a rich and
diverse set of images, each with an accurate 3D pose annotation to ensure precise
2D-3D alignment. In comparison, existing datasets have limitations: 3D models may
not match the objects in images, pose annotations may be imprecise, or the dataset
size may be relatively small.

has 395 3D shapes of nine object categories [Sun et al., 2018b]. Each shape is asso-
ciated with a set of real images, capturing the exact object in diverse environments.
Further, the 10,069 image-shape pairs have precise 3D annotations, giving pixel-level
alignment between shapes and their silhouettes in the images.

Building such a dataset, however, is highly challenging. For each object, it is


difficult to simultaneously collect its high-quality geometry and in-the-wild images.
We can crawl many images of real-world objects, but we do not have access to their
shapes; 3D Computer-Aided Design (CAD) repositories offer object geometry, but
do not come with real images. Further, for each image-shape pair, we need a precise
pose annotation that aligns the shape with its projection in the image.

We overcome these challenges by constructing Pix3D in three steps. First, we


collect a large number of image-shape pairs by crawling the web and performing 3D
scans ourselves. Second, we collect 2D keypoint annotations of objects in the images
on Amazon Mechanical Turk (AMT), with which we optimize for 3D poses that align
shapes with image silhouettes. Third, we filter out image-shape pairs with a poor
alignment and, at the same time, collect attributes (i.e., truncation, occlusion) for
each instance, again by crowdsourcing.

In addition to high-quality data, we need a proper metric to objectively evaluate

the reconstruction results. A well-designed metric should reflect the visual quality
of the reconstructions. In this chapter, we calibrate commonly used metrics, including
Intersection over Union (IoU), Chamfer Distance (CD), and Earth Mover’s Distance
(EMD), on how well they capture human perception of shape similarity. Based on
this, we benchmark state-of-the-art algorithms for 3D object modeling on Pix3D to
demonstrate their strengths and weaknesses.
Figure 4-7 summarizes how we build Pix3D. We collect images from search engines
and shapes from 3D repositories; we also take pictures and scan shapes ourselves.
Finally, we use labeled keypoints on both images and 3D shapes to align them.

[Figure 4-7 diagram: image-shape pairs from two data sources (extending IKEA; scanning and taking pictures ourselves) → keypoint labeling → initial pose estimation via Efficient PnP → final pose estimation via Levenberg-Marquardt]

Figure 4-7: Building Pix3D. We build Pix3D in two steps. First, we collect image-
shape pairs by crawling web images of IKEA furniture as well as scanning objects
and taking pictures ourselves. Second, we align the shapes with their 2D silhouettes
by minimizing the 2D coordinates of the keypoints and their projected positions from
3D, using the Efficient PnP and the Levenberg-Marquardt algorithm.

4.5.1 Collecting Image-Shape Pairs

We obtain the raw image-shape pairs in two ways. One is to crawl images of IKEA
furniture from the web and align them with CAD models provided in the IKEA
dataset [Lim et al., 2013]. The other is to directly scan 3D shapes and take pictures.

Extending IKEA The IKEA dataset [Lim et al., 2013] contains 219 high-quality
3D models of IKEA furniture, but has only 759 images for 90 shapes. Therefore, we
choose to keep the 3D shapes from IKEA dataset, but expand the set of 2D images
using online image search engines and crowdsourcing.
For each 3D shape, we first search for its corresponding 2D images through Google,
Bing, and Baidu, using its IKEA model name as the keyword. We obtain 104,220
images for the 219 shapes. We then use AMT to remove irrelevant ones. For each
image, we ask three AMT workers to label whether this image matches the 3D shape
or not. For images whose three responses differ, we ask three additional workers and
decide whether to keep them based on majority voting. We end up with 14,600 images
for the 219 IKEA shapes.

3D Scan We scan non-IKEA objects with a Structure Sensor (https://structure.io) mounted on an iPad.


We choose to use the Structure Sensor because its mobility enables us to capture a
wide range of shapes. The iPad RGB camera is synchronized with the depth sensor at
30 Hz, and calibrated by the Scanner App provided by Occipital Inc. (https://occipital.com). The resolution
of RGB frames is 2592 × 1936, and the resolution of depth frames is 320 × 240. For
each object, we take a short video and fuse the depth data to get its 3D mesh by
using the fusion algorithm provided by Occipital Inc. We also take 10–20 images for each
scanned object in front of various backgrounds from different viewpoints, making sure
the object is neither cropped nor occluded. In total, we have scanned 209 objects and
taken 2,313 images. Combining these with the IKEA shapes and images, we have 418
shapes and 16,913 images altogether.

4.5.2 Image-Shape Alignment

To align a 3D model with its projection in a 2D image, we need to solve for its 3D
pose (translation and rotation) and the camera parameters used to capture the image.
We use a keypoint-based method inspired by Lim et al. [2013]. Denote the key-
points’ 2D coordinates as x2D = {𝑥1 , 𝑥2 , · · · , 𝑥𝑛 } and their corresponding 3D coor-
dinates as X3D = {𝑋1 , 𝑋2 , · · · , 𝑋𝑛 }. We solve for camera parameters and 3D poses
that minimize the reprojection error of the keypoints. Specifically, we want to find
the projection matrix 𝑃 that minimizes:

$$\mathcal{L}(P;\, \mathbf{X}_{\text{3D}}, \mathbf{x}_{\text{2D}}) = \sum_{i} \big\|\mathrm{Proj}_P(X_i) - x_i\big\|_2^2, \qquad (4.2)$$

where Proj𝑃 (·) is the projection function.


Under the central projection assumption (zero-skew, square pixel, and the optical
center is at the center of the frame), we have 𝑃 = 𝐾[𝑅|𝑇 ], where 𝐾 is the camera
intrinsic matrix; $R \in \mathbb{R}^{3 \times 3}$ and $T \in \mathbb{R}^{3}$ represent the object’s 3D rotation and 3D
translation, respectively. We know
$$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix},$$
where $f$ is the focal length, and $w$
and ℎ are the width and height of the image. Therefore, there are altogether seven
parameters to be estimated: rotations 𝜃, 𝜑, 𝜓, translations 𝑥, 𝑦, 𝑧, and focal length 𝑓
(Rotation matrix 𝑅 is determined by 𝜃, 𝜑, and 𝜓).
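
As a small worked example of this seven-parameter model, the NumPy snippet below builds 𝐾 and 𝑃 from (𝜃, 𝜑, 𝜓, 𝑥, 𝑦, 𝑧, 𝑓) and evaluates the reprojection error of Equation 4.2; the Euler-angle convention is an arbitrary choice for illustration.

```python
# Reprojection error of Equation 4.2 under the central-projection assumption.
import numpy as np

def rotation_matrix(theta, phi, psi):
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(theta), -np.sin(theta)],
                   [0, np.sin(theta),  np.cos(theta)]])
    Ry = np.array([[ np.cos(phi), 0, np.sin(phi)],
                   [0, 1, 0],
                   [-np.sin(phi), 0, np.cos(phi)]])
    Rz = np.array([[np.cos(psi), -np.sin(psi), 0],
                   [np.sin(psi),  np.cos(psi), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx            # one possible Euler-angle convention

def reprojection_error(params, X3d, x2d, w, h):
    theta, phi, psi, tx, ty, tz, f = params
    K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1]])
    P = K @ np.hstack([rotation_matrix(theta, phi, psi), [[tx], [ty], [tz]]])
    X_h = np.hstack([X3d, np.ones((len(X3d), 1))])           # homogeneous 3D keypoints
    proj = (P @ X_h.T).T
    proj = proj[:, :2] / proj[:, 2:3]                        # perspective division
    return np.sum(np.linalg.norm(proj - x2d, axis=1) ** 2)   # Equation 4.2
```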
To solve Equation 4.2, we first calculate a rough 3D pose using the Efficient PnP
algorithm [Lepetit et al., 2009] and then refine it using the Levenberg-Marquardt
algorithm [Levenberg, 1944, Marquardt, 1963], as shown in Figure 4-7. Details of
each step are described below.

Efficient PnP Perspective-n-Point (PnP) is the problem of estimating the pose of


a calibrated camera given paired 3D points and 2D projections. The Efficient PnP
(EPnP) algorithm solves the problem using virtual control points [Lepetit et al., 2009].
Because EPnP does not estimate the focal length, we enumerate the focal length 𝑓
from 300 to 2,000 with a step size of 10, solve for the 3D pose with each 𝑓 , and choose
the one with the minimum projection error.

The Levenberg-Marquardt Algorithm (LMA) We take the output of EPnP


with 50 random disturbances as the initial states and then run LMA on each of them.
Finally, we choose the solution with the minimum projection error.
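
A sketch of this two-stage fitting using OpenCV is given below: EPnP is run for each enumerated focal length and the pose is then refined with Levenberg-Marquardt. It assumes at least four visible keypoints and OpenCV's solvePnP/solvePnPRefineLM interface (OpenCV 4.1 or later); the 50 random-disturbance restarts described above are omitted for brevity.

```python
# EPnP over an enumerated focal length, followed by LM refinement (OpenCV sketch).
import numpy as np
import cv2

def fit_pose(X3d, x2d, w, h):
    """X3d: (N, 3) 3D keypoints, x2d: (N, 2) 2D annotations -> (f, rvec, tvec)."""
    X3d = np.ascontiguousarray(X3d, dtype=np.float64)
    x2d = np.ascontiguousarray(x2d, dtype=np.float64)
    dist = np.zeros(4)                                           # no lens distortion
    best = None
    for f in range(300, 2001, 10):                               # enumerate focal length
        K = np.array([[f, 0, w / 2], [0, f, h / 2], [0, 0, 1]], dtype=np.float64)
        ok, rvec, tvec = cv2.solvePnP(X3d, x2d, K, dist, flags=cv2.SOLVEPNP_EPNP)
        if not ok:
            continue
        rvec, tvec = cv2.solvePnPRefineLM(X3d, x2d, K, dist, rvec, tvec)  # LM refinement
        proj, _ = cv2.projectPoints(X3d, rvec, tvec, K, dist)
        err = np.sum((proj.reshape(-1, 2) - x2d) ** 2)           # reprojection error
        if best is None or err < best[0]:
            best = (err, f, rvec, tvec)
    return best[1:]
```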

Implementation Details For each 3D shape, we manually label its 3D keypoints.
The number of keypoints ranges from 8 to 24. For each image, we ask three AMT
workers to label if each keypoint is visible on the image, and if so, where it is. We
only consider visible keypoints during the optimization.

The 2D keypoint annotations are noisy, which severely hurts the performance of
the optimization algorithm. We try two methods to increase its robustness. The first
is to use RANdom SAmple Consensus (RANSAC). The second is to use only a subset
of 2D keypoint annotations. For each image, denote 𝐶 = {𝑐1 , 𝑐2 , 𝑐3 } as its three sets
of human annotations. We then enumerate the seven nonempty subsets 𝐶𝑘 ⊆ 𝐶; for
each keypoint, we compute the median of its 2D coordinates in 𝐶𝑘 . We apply our
optimization algorithm on every subset 𝐶𝑘 and keep the output with the minimum
projection error. After that, we let three AMT workers choose, for each image, which
of the two methods offers better alignment, or neither performs well. At the same
time, we also collect attributes (i.e., truncation, occlusion) for each image. Finally, we
finetune the annotations ourselves using the Graphical User Interface (GUI) offered
in ObjectNet3D [Xiang et al., 2016]. Altogether there are 395 3D shapes and 10,069
images. Sample 2D-3D pairs are shown in Figure 4-8.

4.6 Results

In this section, we analyze our real-world dataset—Pix3D, evaluate which shape er-
ror metric matches human perception the best, and finally describe our experiments
evaluating ShapeHD and GenRe. Specifically, Section 4.6.4 and Section 4.6.5 show
how ShapeHD is capable of high-fidelity 3D shape completion from a single-view
depth map and 3D shape reconstruction from a single-view RGB image, respectively.
Section 4.6.6, Section 4.6.7, and Section 4.6.8 demonstrate how GenRe is able to ac-
curately estimate depth for novel shape categories unseen during training, reconstruct
novel objects from the training classes, and reconstruct objects from novel test classes
unseen during training, respectively.


Figure 4-8: Sample images and shapes in Pix3D. From left to right: 3D shapes, 2D
images, and 2D-3D alignment. Rows 1–2 show some chairs we scanned, Rows 3–4
show a few IKEA objects, and Rows 5–6 show some objects of other categories we
scanned.

4.6.1 Data

Here we describe our synthetic data for training and the real data for testing.

Synthetic Data We render each of the ShapeNet Core55 [Chang et al., 2015] ob-
jects in 20 random, fully unconstrained views. For each view, we randomly set the
azimuth and elevation angles of the camera, but the camera’s up vector is fixed to be
the world +𝑦 axis, and the camera always looks at the object center. The focal length
is fixed at 50 mm on a 35 mm film. In ShapeHD, to boost the realism of the rendered
RGB images, we put three different types of backgrounds behind the object during
rendering. One third of the images are rendered in a clean white background; one
third are rendered in High-Dynamic-Range (HDR) backgrounds with illumination
channels that produce realistic lighting; we render the remaining one third images
onto backgrounds randomly sampled from the SUN database [Xiao et al., 2010]. We
use Mitsuba [Jakob, 2010], a physically-based rendering engine, for all our renderings.
We use 90% of the data for training and 10% for testing.

For our experiments on shape completion, we additionally render the ground-truth


depth images of each object from all 20 views. Depth values are measured from the
camera center (i.e., ray depth), rather than from the image plane. To approximate
data from a depth scanner, we also generate the accompanying ground-truth surface
normal images from the raw depth data, as surface normal maps are a common
byproduct of depth scanning. All our rendered surface normal vectors are defined in
the camera space.

For the generalization experiments for GenRe, we train our models on the three
largest ShapeNet classes (car, chair, and airplane), and test them on the next 10
largest classes: bench, vessel, rifle, sofa, table, phone, cabinet, speaker, lamp,
and display. Besides ShapeNet renderings, we also test GenRe on non-rigid shapes
such as humans and horses [Bronstein et al., 2008] (Section 4.7.5) and on highly
regular shape primitives (Section 4.7.6).

Real Data We also test our models, trained only on synthetic data, on real images
from PASCAL 3D+ [Xiang et al., 2014] and Pix3D [Sun et al., 2018b].
We now present some statistics of Pix3D and contrast it with its predecessors.
Figure 4-9 shows the category distributions of 2D images and 3D shapes in Pix3D.
Our dataset covers a large variety of shapes, each of which has a large number of in-
the-wild images. Chairs account for a significant portion of Pix3D, because they are common,
highly diverse, and well-studied by recent literature [Dosovitskiy et al., 2017, Tulsiani
et al., 2017, Gwak et al., 2017].

[Figure 4-9 bar charts: per-category counts of images (crawled, self-taken, from IKEA) and shapes (self-scanned, from IKEA) over bed, bookcase, chair, desk, misc, sofa, table, tool, and wardrobe]

Figure 4-9: Image and shape distributions across categories of Pix3D. Each shape
in Pix3D is associated with multiple images providing various contexts, in which the
shape is likely to appear.

As a quantitative comparison on the quality of Pix3D and other datasets, we


randomly select 25 chair and 25 sofa images from PASCAL 3D+ [Xiang et al., 2014],
ObjectNet3D [Xiang et al., 2016], IKEA [Lim et al., 2013], and Pix3D. For each
image, we render the projected 2D silhouette of the shape using its pose annotation
provided by the dataset. We then manually annotate the ground truth object masks
in these images, and calculate Intersection over Union (IoU) between the projections
and the ground truth. For each image-shape pair, we also ask 50 AMT workers
whether they think the image is picturing the 3D ground-truth shape provided by
the dataset. From Table 4.1, we see that Pix3D has much higher IoU than PASCAL
3D+ and ObjectNet3D, and slightly higher IoU compared with the IKEA dataset.
The AMT workers also feel IKEA and Pix3D have matched images and shapes, but
not PASCAL 3D+ or ObjectNet3D. In addition, we observe that many 3D models in
the IKEA dataset are of an incorrect scale, making it challenging to align the shapes

with images. For example, there are only 15 unoccluded and untruncated images of
sofas in IKEA, while Pix3D has 1,092.

                   chair              sofa
                IoU    Match?     IoU    Match?
PASCAL 3D+     0.514    0.00     0.813    0.00
ObjectNet3D    0.570    0.16     0.773    0.08
IKEA           0.748    1.00     0.918    1.00
Pix3D (ours)   0.835    1.00     0.926    1.00

Table 4.1: Dataset quality of Pix3D. We compute the Intersection over Union (IoU)
between manually annotated 2D masks and the 2D projections of 3D shapes. We also
ask humans to judge whether the object in the images matches the provided shape.

           IoU     EMD     CD     Human
IoU        1       0.55    0.60   0.32
EMD        0.55    1       0.78   0.43
CD         0.60    0.78    1      0.49
Human      0.32    0.43    0.49   1

Table 4.2: Correlation between different shape metrics and human judgments. IoU,
EMD, and CD have Spearman’s rank correlation coefficients of 0.32, 0.43, and 0.49
with human judgments, respectively.

4.6.2 Metrics

Designing a good evaluation metric is important to encourage researchers to design


algorithms that reconstruct high-quality 3D geometry, rather than low-quality 3D
reconstruction that overfits a certain metric.
Many 3D reconstruction papers use Intersection over Union (IoU) to evaluate
the similarity between ground truth and reconstructed 3D voxels, which may signifi-
cantly deviate from human perception. In contrast, metrics like shortest distance and
geodesic distance are more commonly used than IoU for matching meshes in graphics
[Kreavoy et al., 2007, Jain and Zhang, 2006]. Here we conduct behavioral studies
to calibrate IoU, Chamfer Distance (CD) [Barrow et al., 1977], and Earth Mover’s
Distance (EMD) [Rubner et al., 2000] on how well they reflect human perception.

Definitions The definition of IoU is straightforward. For CD and EMD, we first


convert voxels to point clouds as follows, and then compute CD and EMD between
pairs of point clouds. We first extract the isosurface of each predicted voxel using
the Lewiner Marching Cubes algorithm [Lewiner et al., 2003]. In practice, we use 0.1
as a universal surface value for extraction. We then uniformly sample points on the

surface meshes and create the densely sampled point clouds. Finally, we randomly
sample 1,024 points from each point cloud and normalize them into a unit cube for
distance calculation.
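
The conversion just described can be sketched as follows with scikit-image, whose measure.marching_cubes routine defaults to Lewiner's method; the area-weighted triangle sampling and the toy input volume are our own illustrative choices.

```python
# Voxel grid -> normalized 1,024-point cloud via marching cubes and surface sampling.
import numpy as np
from skimage import measure

def voxels_to_points(vox: np.ndarray, n_points: int = 1024) -> np.ndarray:
    verts, faces, _, _ = measure.marching_cubes(vox, level=0.1)   # isovalue 0.1
    tri = verts[faces]                                            # (F, 3, 3) triangle vertices
    # sample triangles proportionally to their area
    areas = 0.5 * np.linalg.norm(np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    choice = np.random.choice(len(tri), size=n_points, p=areas / areas.sum())
    # draw a uniform point inside each chosen triangle via barycentric coordinates
    r1, r2 = np.random.rand(n_points, 1), np.random.rand(n_points, 1)
    sqrt_r1 = np.sqrt(r1)
    a, b, c = tri[choice, 0], tri[choice, 1], tri[choice, 2]
    pts = (1 - sqrt_r1) * a + sqrt_r1 * (1 - r2) * b + sqrt_r1 * r2 * c
    # normalize into a unit cube
    pts -= pts.min(axis=0)
    pts /= pts.max()
    return pts

vox = np.zeros((32, 32, 32), dtype=np.float32)
vox[8:24, 8:24, 8:24] = 1.0                                       # a toy solid cube
print(voxels_to_points(vox).shape)                                # (1024, 3)
```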
The CD between $S_1, S_2 \subseteq \mathbb{R}^3$ is defined as:

$$\text{CD}(S_1, S_2) = \frac{1}{|S_1|}\sum_{x\in S_1}\min_{y\in S_2}\|x-y\|_2 + \frac{1}{|S_2|}\sum_{y\in S_2}\min_{x\in S_1}\|x-y\|_2. \qquad (4.3)$$

For each point in each cloud, CD finds the nearest point in the other point set and
sums up all distances. CD has been used in shape retrieval challenges [Yi et al., 2017].
For EMD, we follow the definition in Fan et al. [2017]. The EMD between $S_1, S_2 \subseteq \mathbb{R}^3$ (of equal size, i.e., $|S_1| = |S_2|$) is:

$$\text{EMD}(S_1, S_2) = \frac{1}{|S_1|}\min_{\varphi:\, S_1 \to S_2}\sum_{x\in S_1}\|x - \varphi(x)\|_2, \qquad (4.4)$$

where 𝜑 : 𝑆1 → 𝑆2 is a bijection. We divide EMD by the size of the point cloud


for normalization. In practice, calculating the exact EMD value is computationally
expensive; we instead use a (1 + 𝜖) approximation algorithm [Bertsekas, 1985].
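
For reference, a minimal implementation of the Chamfer Distance in Equation 4.3 using k-d trees is shown below; EMD is omitted here since, as noted, it is computed with an approximation algorithm in practice.

```python
# Chamfer Distance (Equation 4.3) between two point clouds via nearest-neighbor queries.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(s1: np.ndarray, s2: np.ndarray) -> float:
    """s1, s2: (N, 3) and (M, 3) point clouds."""
    d12, _ = cKDTree(s2).query(s1)      # nearest neighbor in s2 for each point of s1
    d21, _ = cKDTree(s1).query(s2)      # nearest neighbor in s1 for each point of s2
    return d12.mean() + d21.mean()

a, b = np.random.rand(1024, 3), np.random.rand(1024, 3)
print(chamfer_distance(a, b))
```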

Which Metric Is the Best? We then conduct two user studies to compare these
metrics and benchmark how they capture human perception.
We run three shape reconstruction algorithms (3D-R2N2 [Choy et al., 2016], DRC
[Tulsiani et al., 2017], and 3D-VAE-GAN [Wu et al., 2016]) on 200 randomly selected
images of chairs. We then, for each image and every pair of its three reconstructions,
ask three AMT workers to choose the one that looks the closest to the object in
the image. We also compute how each pair of objects rank in each metric. Finally,
we calculate the Spearman’s rank correlation coefficients between different metrics
(i.e., IoU, EMD, CD, and human perception). Table 4.2 suggests that EMD and CD
correlate better with human ratings.

4.6.3 Baselines

We organize the baselines based on the shape representation they use.

Voxels Voxels are arguably the most common representation for 3D shapes in the
deep learning era due to their amenability to 3D convolution. For this representation,
we consider the 3D Recurrent Reconstruction Neural Network (3D-R2N2) [Choy et al.,
2016], Differentiable Ray Consistency (DRC) [Tulsiani et al., 2017], MarrNet [Wu
et al., 2017], and Octree Generating Network (OGN) [Tatarchenko et al., 2017] as
baselines. Our model uses 128³ voxels of [0, 1] occupancy. All these baselines and our
ShapeHD take a single image as input, without requiring any object mask.
In ShapeHD, we compare with a state-of-the-art shape completion method: 3D-
Encoder Predictor Network (3D-EPN) [Dai et al., 2017]. To ensure a fair comparison,
we convert depth maps to partial surfaces registered in a canonical global coordinate
defined by ShapeNet Core55 [Chang et al., 2015], which is required by 3D-EPN.
While the original 3D-EPN paper generates their partial observations by rendering
and fusing multi-view depth maps, our method takes a single-view depth map as
input and is solving a more challenging problem.

Mesh & Point Clouds Considering the cubic complexity of the voxel representa-
tion, recent papers have explored meshes [Groueix et al., 2018, Yao et al., 2018] and
point clouds [Fan et al., 2017] in the context of neural networks. In this work, we
consider AtlasNet [Groueix et al., 2018] and Point Set Generation Network (PSGN)
[Fan et al., 2017] as baselines. Like GenRe, both PSGN and AtlasNet require object
silhouettes as input in addition to the single RGB image.

Multi-View Maps Another way of representing 3D shapes is to use a set of multi-


view depth images [Soltani et al., 2017, Shin et al., 2018, Jayaraman et al., 2018]. We
compare with the model from Shin et al. [2018] in this regime.

Spherical Maps As introduced in Section 4.1, one can also represent 3D shapes as
spherical maps. We include two baselines with spherical maps: first, a one-step base-
line that predicts final spherical maps directly from RGB images (“GenRe-1step”);
second, a two-step baseline that first predicts single-view spherical maps from RGB

images and then inpaints them (“GenRe-2step”). Both baselines use the aforemen-
tioned “U-ResNet” image-to-image network architecture.
To provide justification for using spherical maps, we provide a baseline (“3D Com-
pletion”) that directly performs 3D shape completion in voxel space. This baseline
first predicts depth from an input image and then projects the depth map into the
voxel space. A completion module takes the projected voxels as input and predicts
the final result.
To provide a performance upper bound for our spherical inpainting and voxel
refinement networks (Figure 4-4 [b, c]), we also include the results when our model
has access to ground-truth depth in the original view (“GenRe-Oracle”) and to ground-
truth full spherical maps (“GenRe-SphOracle”).

4.6.4 Single-View Shape Completion

For 3D shape completion from a single depth image, we only use the last two modules
of ShapeHD: the 3D shape estimator and deep naturalness network.

Qualitative Results In Figure 4-10 and Figure 4-11, we show 3D shapes predicted
by ShapeHD from single-view depth images. While common encoder-decoder structures
usually generate mean shapes with few details, our ShapeHD predicts shapes
with large variance and fine details. In addition, even when there is strong occlusion
in the depth image, our model can predict a high-quality, plausible 3D shape that
looks good perceptually, and infer parts not present in the input images.
We now show results of ShapeHD on real depth scans. We capture six depth
maps of different chairs using a Structure sensor4 and use the captured depth maps
to evaluate our model. All the corresponding normal maps used as input are estimated
from depth measurements. Figure 4-12 shows that ShapeHD completes 3D shapes
well given a single-view depth map. ShapeHD is more flexible than 3D-EPN, as we
do not need any camera intrinsics or extrinsics to register depth maps. In our case,
none of these parameters is known, so 3D-EPN cannot be applied.
4 http://structure.io


Figure 4-10: 3D shape completion from single-view depth by ShapeHD. From left to
right: input depth maps, shapes reconstructed by ShapeHD in the canonical view
and a novel view, and ground-truth shapes in the canonical view. Assisted by the
adversarially learned naturalness losses, ShapeHD recovers highly accurate 3D shapes
with fine details. Sometimes the reconstructed shape deviates from the ground truth
but can be viewed as another plausible explanation of the input (e.g., the airplane on
the left, third row).


Figure 4-11: 3D shape completion by ShapeHD. Our results contain more details than
3D-EPN. We observe that the adversarially trained naturalness losses help fix errors,
add details (e.g., the plane wings in Row 3, car seats in Row 6, and chair arms in
Row 8), and smooth planar surfaces (e.g., the sofa back in Row 7).


Figure 4-12: 3D shape completion by ShapeHD using real depth data. ShapeHD is
able to reconstruct the shape well from just a single view. From left to right: input
depth, two views of our results, and color images of the objects.

Ablation When using the naturalness loss, the network is penalized for generating
mean shapes that are unreasonable but minimize the supervised loss. In Figure 4-11,
we show reconstructed shapes from our ShapeHD with and without naturalness loss
(i.e., before fine-tuning with 𝐿natural ), together with ground truth shapes and shapes
predicted by 3D-EPN [Dai et al., 2017]. Our results contain finer details compared
with those from 3D-EPN. Also, the performance of ShapeHD improves greatly with
the naturalness loss, which leads to more reasonable and complete shapes.

Quantitative Results We present quantitative results in Table 4.3. Our ShapeHD
outperforms the state of the art by a large margin in all metrics. Our method outputs
shapes at a resolution of 128³, while shapes produced by 3D-EPN are of resolution
32³. Therefore, for a fair comparison, we downsample our predicted shapes to 32³
and report results of both methods at that resolution. The original 3D-EPN paper
suggests a post-processing step that retrieves similar patches from a shape database
for results of a higher resolution. In practice, we find this step takes 18 hours for a
single image. We therefore report results without post-processing for both methods.
Table 4.3 also suggests that the naturalness loss improves the completion results,
achieving comparable IoU and better (lower) CD.

                          IoU (32³)                       CD
Methods                   chair   car    plane   Avg      chair   car    plane   Avg
3D-EPN                    .147    .274   .155    .181     .227    .200   .125    .192
ShapeHD w/o 𝐿natural      .466    .698   .488    .529     .112    .083   .071    .093
ShapeHD                   .488    .698   .452    .529     .096    .078   .068    .084

Table 4.3: Average shape completion errors of ShapeHD on ShapeNet. Our model
outperforms the state of the art by a large margin. The learned naturalness losses
consistently improve the CD between our results and the ground truth.

As aforementioned, CD is better at capturing human perception of shape quality
than IoU.
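For reference, one common way to compute these two metrics is sketched below; the exact point counts, normalization, and thresholds behind the numbers in this chapter are not reproduced here, so treat this as an assumed, minimal form rather than the evaluation code.

    import numpy as np

    def iou(vox_a, vox_b, thresh=0.5):
        """Intersection over Union between two occupancy grids of the same shape."""
        a, b = vox_a > thresh, vox_b > thresh
        return (a & b).sum() / max((a | b).sum(), 1)

    def chamfer_distance(pts_a, pts_b):
        """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets."""
        d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
        return d.min(axis=1).mean() + d.min(axis=0).mean()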

4.6.5 Single-View Shape Reconstruction

We now evaluate ShapeHD on 3D shape reconstruction from a single color image.

Results on Synthetic Data We first evaluate on our renderings of ShapeNet objects
[Chang et al., 2015]. We present reconstructed 3D shapes and quantitative
results in Figure 4-13. All these models are trained on our rendering of the largest 13
ShapeNet categories (those with at least 1,000 models) with ground-truth 3D shapes
as supervision. In general, our ShapeHD is able to predict 3D shapes that closely
resemble the ground-truth shapes, giving fine details that make the reconstructed
shapes more realistic. It also performs better quantitatively.

Generalization to Novel Categories An important aspect of evaluating shape
reconstruction methods is how well they generalize. Here we train our model and
baselines on the largest three ShapeNet classes (car, chair, and plane), again with
ground-truth shapes as supervision, and test them on the next largest ten. Figure 4-
14 shows our ShapeHD outperforms DRC (3D) and is comparable with AtlasNet.
However, note that AtlasNet requires ground-truth silhouettes as additional input,
while ShapeHD works on raw images.


Methods bench boat cabin car chair disp lamp phone plane rifle sofa speak table Avg
DRC (3D) .122 .131 .127 .077 .128 .128 .168 .102 .166 .107 .106 .138 .138 .126
AtlasNet† .123 .130 .169 .107 .141 .162 .171 .138 .105 .096 .131 .172 .161 .139
ShapeHD .121 .103 .126 .066 .125 .124 .157 .084 .073 .053 .102 .141 .124 .108

Figure 4-13: 3D shape reconstruction by ShapeHD on ShapeNet. Our rendering
of ShapeNet is more challenging than that from Choy et al. [2016]; as such, the
numbers of the other methods may differ from those in their original papers. All methods
are trained with full 3D supervision on our rendering of the largest 13 ShapeNet
categories. †DRC and ShapeHD take a single image as input, while AtlasNet requires
ground-truth object masks as additional input.

(a) Testing on Training Categories. (b) Testing on Novel Categories.

Methods bench boat cabin disp lamp phone rifle sofa speak table Avg
DRC (3D) .175 .161 .189 .278 .225 .268 .153 .149 .203 .221 .202
AtlasNet† .155 .114 .202 .244 .261 .263 .121 .126 .206 .262 .195
ShapeHD .166 .129 .182 .252 .235 .229 .232 .133 .193 .199 .195

Figure 4-14: 3D shape reconstruction by ShapeHD on novel categories. All methods
are trained with full 3D supervision on our rendering of ShapeNet car, chair, and
airplane, and tested on the next 10 largest categories. †DRC and ShapeHD take
a single image as input, while AtlasNet requires ground-truth object silhouettes as
additional input.

Results on Real Data We then evaluate ShapeHD on two real datasets, PASCAL
3D+ [Xiang et al., 2014] and Pix3D [Sun et al., 2018b]. Here we train our model on
our synthetic ShapeNet rendering and use the trained models released by the authors
as baselines. All methods take ground-truth 3D shapes as supervision during training.
As shown in Figure 4-15 and Figure 4-16, ShapeHD works well, inferring a reasonable
shape even in the presence of strong self-occlusion. In particular, in Figure 4-15, we
compare our reconstructions against the best-performing alternatives (DRC on chair
and airplane, and AtlasNet on car). In addition to preserving details, our model
captures the shape variations of the objects, while the competitors produce similar
reconstructions across instances.


Figure 4-15: 3D shape reconstruction by ShapeHD on PASCAL 3D+. From left to
right: input, two views of reconstructions from ShapeHD, and reconstructions by
the best alternative in Table 4.4. Assisted by the learned naturalness loss, ShapeHD
recovers accurate 3D shapes with fine details.

Quantitatively, Table 4.4 and Table 4.5 suggest that ShapeHD performs signifi-
cantly better than the other methods in almost all metrics. The only exception is the

(a) Input (b) AtlasNet (c) DRC (3D) (d) ShapeHD (e) GT

Figure 4-16: 3D shape reconstruction by ShapeHD on Pix3D. For each input im-
age, we show reconstructions by AtlasNet, DRC, and our ShapeHD alongside
the ground truth. ShapeHD reconstructs complete 3D shapes with fine details that
resemble the ground truth.

CD on PASCAL 3D+ cars, for which OGN performs the best. However, as PASCAL
3D+ has only around 10 CAD models for each object category as the ground-truth
3D shapes, the ground-truth labels and scores can be inaccurate, failing to reflect
human perception [Tulsiani et al., 2017].

(a) CD on PASCAL 3D+:

Methods      chair   car     airplane   Avg
3D-R2N2      0.238   0.305   0.305      0.284
DRC (3D)     0.158   0.099   0.112      0.122
OGN          -       0.087   -          -
ShapeHD      0.137   0.129   0.094      0.119

(b) Human Study Results: a histogram of the number of test examples (0–50)
against the number of users (of 10) that prefer ours (0–10).

Table 4.4: 3D shape reconstruction by ShapeHD on PASCAL 3D+. (a) We compare
our ShapeHD against 3D-R2N2, DRC, and OGN. PSGN and AtlasNet are not
evaluated because they require object masks as additional input, but PASCAL 3D+
has only inaccurate masks. (b) In the behavioral study, most of the users prefer our
reconstructions on most images. Overall, our reconstructions are preferred 64.5% of
the time to OGN’s.

Table 4.5: 3D shape reconstruction by ShapeHD on Pix3D. All methods are trained
with full 3D supervision on rendered images of ShapeNet objects.

              3D-R2N2   DRC (3D)   PSGN†   AtlasNet†   ShapeHD
IoU (32³)     0.136     0.265      -       -           0.284
IoU (128³)    0.089     0.185      -       -           0.205
CD            0.239     0.160      0.199   0.126       0.123

†3D-R2N2, DRC, and ShapeHD take a single image as input, while PSGN and
AtlasNet require the ground-truth mask as input. PSGN and AtlasNet generate
surface point clouds without guaranteeing watertight meshes and therefore cannot
be evaluated in IoU.

We therefore conduct an additional user study, where we show an input image
and its two reconstructions (from ShapeHD and from OGN, each in two views) to
users on Amazon Mechanical Turk (AMT), and ask them to choose the shape that
looks closer to the object in the image. For each image, we collect 10 responses from
“Masters” (workers who have demonstrated excellence across a wide range of “HITs”).
Table 4.4 (b) suggests that on most images, most users prefer our reconstructions to
OGN’s. In general, our reconstructions are preferred 64.5% of the time.

4.6.6 Estimating Depth for Novel Shape Classes

We show qualitative and quantitative results of GenRe on depth estimation quality
across categories. As shown in Figure 4-17, our depth estimator effectively learns
the concept of near and far, generalizes well to unseen categories, and does not show
statistically significant deterioration as the novel test class gets increasingly dissimilar
to the training classes, laying the foundation for the generalization power of GenRe.
Formally, the dissimilarity from test class 𝐶test to training classes 𝐶train is defined as

    (1/|𝐶test|) ∑_{𝑥 ∈ 𝐶test} min_{𝑦 ∈ 𝐶train} CD(𝑥, 𝑦).
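A direct translation of this definition into code could look like the following; the function and argument names are our own, and chamfer_distance stands for any Chamfer distance implementation (e.g., the sketch given earlier).

    def class_dissimilarity(test_shapes, train_shapes, chamfer_distance):
        """Mean, over test shapes, of the CD to the nearest training shape."""
        return sum(
            min(chamfer_distance(x, y) for y in train_shapes)
            for x in test_shapes
        ) / len(test_shapes)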


Figure 4-17: GenRe’s depth estimator generalizing to novel shape classes. Left: Our
single-view depth estimator, trained on car, chair, and airplane, generalizes to
novel classes: bus, train, and table. Right: As the novel test class gets increas-
ingly dissimilar to the training classes (left to right), depth prediction does not show
statistically significant degradation (𝑝 > 0.05).

4.6.7 Reconstructing Novel Objects From Training Classes

We present results on generalizing to novel objects from the training classes. All
models are trained on car, chair, and airplane, and tested on unseen objects from
the same three categories.
As shown in Table 4.6 (Seen), GenRe is the best-performing viewer-centered
model. It also outperforms most of the object-centered models except AtlasNet.
GenRe's performance is impressive given that object-centered models tend to per-
form much better on objects from seen classes [Shin et al., 2018]. This is because
object-centered models, by exploiting the concept of canonical views, actually solve
an easier problem. The performance drop from the object-centered DRC to the

viewer-centered DRC supports this empirically. However, for objects from unseen
classes, the concept of canonical views is no longer well-defined. As we will see in
Section 4.6.8, this hurts the generalization power of the object-centered methods.

Models                            Seen    Unseen
                                          bch    vsl    rfl    sfa    tbl    phn    cbn    spk    lmp    dsp    Avg
Object-Centered
  DRC                             .072    .112   .100   .104   .108   .133   .199   .168   .164   .145   .188   .142
  AtlasNet                        .059    .102   .092   .088   .098   .130   .146   .149   .158   .131   .173   .127
Viewer-Centered
  DRC                             .092    .120   .109   .121   .107   .129   .132   .142   .141   .131   .156   .129
  MarrNet                         .070    .107   .094   .125   .090   .122   .117   .125   .123   .144   .149   .120
  Shin et al. [2018]              .065    .092   .092   .102   .085   .105   .110   .119   .117   .142   .142   .111
  3D Completion                   .076    .102   .099   .121   .095   .109   .122   .131   .126   .138   .141   .118
  GenRe-1step                     .063    .104   .093   .114   .084   .108   .121   .128   .124   .126   .151   .115
  GenRe-2step                     .061    .098   .094   .117   .084   .102   .115   .125   .125   .118   .118   .110
  GenRe (ours)                    .064    .089   .092   .112   .082   .096   .107   .116   .115   .124   .130   .106
  GenRe-Oracle                    .045    .050   .048   .031   .059   .057   .054   .076   .077   .060   .060   .057
  GenRe-SphOracle                 .034    .032   .030   .021   .044   .038   .037   .044   .045   .031   .040   .036

Table 4.6: 3D shape reconstruction by GenRe on training and novel classes. The
novel classes are ordered from the most to the least similar to the training classes.
Our model is viewer-centered by design but achieves performance on par with the
object-centered state of the art (AtlasNet) in reconstructing the seen classes. As for
generalization to novel classes, our model outperforms the state of the art across 9
out of the 10 classes in terms of CD.

4.6.8 Reconstructing Objects From Unseen Classes

We show how GenRe generalizes to novel shape classes unseen during training.

Synthetic Rendering We use the 10 largest ShapeNet classes other than chair,
car, and airplane as our test set. Table 4.6 (Unseen) shows that our model consis-
tently outperforms the state of the art, except for rifle, in which AtlasNet performs
the best. Qualitatively, GenRe produces reconstructions that are much more consis-
tent with input images, as shown in Figure 4-18. In particular, on unseen classes,
our results still attain good consistency with the input images, while the competitors
either lack structural details present in the input (e.g., 5) or retrieve shapes from the
training classes (e.g., 4, 6, 7, 8, 9).


Figure 4-18: GenRe’s reconstruction within and beyond training classes. Each row
from left to right: the input image, two views from the best-performing baseline for
each testing object (1–4, 6–9: AtlasNet; 5, 10: Shin et al. [2018]), two views of our
GenRe predictions, and the ground truth. All models are trained on the same dataset
of cars, chairs, and airplanes.

Comparing our model with its variants, we find that the two-step approaches
(GenRe-2step and GenRe) outperform the one-step approach across all novel cate-
gories. This empirically supports the advantage of our two-step modeling strategy
that disentangles geometric projections from shape reconstruction.

Real Images We further compare how GenRe, AtlasNet, and Shin et al. [2018]
perform on real images from Pix3D. Here, all models are trained on ShapeNet car,
chair, and airplane, and tested on real images of bed, bookcase, desk, sofa, table,
and wardrobe.
Quantitatively, Table 4.7 shows that GenRe outperforms the two competitors
across all novel classes except bed, for which Shin et al. [2018] perform the best. For
chair, one of the training classes, the object-centered AtlasNet leverages the canon-
ical view and outperforms the two viewer-centered approaches. Qualitatively, our
reconstruction preserves the details present in the input (e.g., the hollow structures
in the second row of Figure 4-19).
Because neither depth maps nor spherical maps provide information inside the
shapes, our model predicts only surface voxels that are not guaranteed watertight.
Consequently, IoU cannot be used as an evaluation metric. We hence evaluate the
reconstruction quality using CD. For models that output voxels, including DRC and

AtlasNet Shin GenRe
chair .080 .089 .093
bed .114 .106 .113
bookcase .140 .109 .101
desk .126 .121 .109
sofa .095 .088 .083
table .134 .124 .116
wardrobe .121 .116 .109

Table 4.7: Reconstruction errors of GenRe on Pix3D. GenRe outperforms the two
baselines across all unseen classes except bed. For the seen class, chair, the
object-centered AtlasNet performs the best by leveraging the canonical view.

Figure 4-19: GenRe's reconstruction on real images from novel classes. The real
images are from Pix3D. The baseline is either AtlasNet or Shin et al. [2018],
depending on which performs better. All models are trained on car, chair, and
airplane, so the test classes are all novel.

our GenRe model, we sweep voxel thresholds from 0.3 to 0.7 with a step size of 0.05 for
isosurfaces, compute CD with 1,024 points sampled from all isosurfaces, and report
the best average CD for each object class.
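A minimal sketch of this per-shape evaluation step is given below, assuming scikit-image's marching cubes for isosurface extraction and the chamfer_distance sketched earlier; for brevity it samples surface points from mesh vertices and omits rescaling vertices from voxel indices to the canonical cube, so it is illustrative rather than the exact protocol.

    import numpy as np
    from skimage import measure

    def best_cd_over_thresholds(pred_vox, gt_pts, chamfer_distance, n_pts=1024):
        """Sweep isosurface thresholds 0.3-0.7 (step 0.05) and keep the lowest CD."""
        best = np.inf
        for level in np.arange(0.3, 0.7 + 1e-6, 0.05):
            if not (pred_vox.min() < level < pred_vox.max()):
                continue  # no isosurface exists at this level
            verts, _, _, _ = measure.marching_cubes(pred_vox, level=level)
            sel = np.random.choice(len(verts), size=n_pts,
                                   replace=len(verts) < n_pts)
            best = min(best, chamfer_distance(verts[sel], gt_pts))
        return best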
Shin et al. [2018] report that object-centered supervision produces better recon-
structions for objects from the training classes, whereas viewer-centered supervision
has an advantage in generalizing to novel classes. Therefore, for DRC and AtlasNet, we
train each network with both types of supervision. Note that AtlasNet, when trained
with viewer-centered supervision, tends to produce unstable predictions that render
CD meaningless. Hence, we only present CD for the object-centered AtlasNet.

4.7 Discussion

For ShapeHD, we visualize what the network is learning (Section 4.7.1), analyze
the effects of the naturalness loss over time (Section 4.7.2), and discuss common
failure modes (Section 4.7.3). For GenRe, we study how the input viewpoint affects
the model’s ability to generalize to unseen shape classes (Section 4.7.4), if a model
trained on rigid shapes is able to generalize to non-rigid ones (Section 4.7.5), and

finally whether the model can reconstruct simple, regular shapes well (Section 4.7.6).

4.7.1 Network Visualization

As ShapeHD successfully reconstructs object shapes and parts, it is natural to ask
whether it implicitly learns object or part detectors. To this end, we visualize the top activating
regions across all validation images for units in the last convolutional layer of the
encoder in our 3D completion network, using the method proposed by Zhou et al.
[2014]. As shown in Figure 4-20, the network indeed learns a diverse and rich set of
object and part detectors. There are detectors that attend to the car wheels, chair
backs, chair arms, chair legs, and airplane engines. Also note that many detectors
respond to certain patterns (e.g., strided) in particular, which is probably contributing
to the fine details in the reconstruction. Additionally, there are units that respond
to generic shape patterns across different categories, like the curve detector in the
bottom right.
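As a rough sketch of this visualization procedure (in the spirit of Zhou et al. [2014], not their exact code), one can rank validation images by each unit's maximum response and then crop the highly responsive regions:

    import numpy as np

    def top_activating_images(activations, k=8):
        """activations: {image_id: (C, H, W) feature map from the last encoder conv
        layer}. For each unit c, return the k images with the highest max response;
        the responsive regions can then be obtained by thresholding and upsampling
        that unit's feature map."""
        ids = list(activations)
        per_unit_max = np.stack([activations[i].max(axis=(1, 2)) for i in ids])  # (N, C)
        order = np.argsort(-per_unit_max, axis=0)
        return {c: [ids[n] for n in order[:k, c]]
                for c in range(per_unit_max.shape[1])}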

Figure 4-20: Visualization of how ShapeHD attends to details in the depth maps.
Row 1: car wheel detectors; Row 2: chair back and leg detectors. The left responds
to the strided pattern in particular. Row 3: chair arm and leg detectors; Row 4:
airplane engine and curved surface detectors. The right responds to a specific pattern
across different classes.

4.7.2 Training With the Naturalness Loss Over Time

We study the effect of the naturalness loss over time. In Figure 4-21, we plot the
loss of the completion network with respect to the fine-tuning epoch. We observe that
the voxel loss goes down slowly but consistently. If we visualize the reconstructed
examples at different timestamps, we clearly see that details are being added to the
shape. These fine details occupy a small region in the voxel grid, so training with the
supervised loss alone is unlikely to recover them. In contrast, with the adversarially
learned perceptual loss, ShapeHD recovers the details successfully.

(Plot: training loss, from about 0.071 down to 0.067, versus fine-tuning epoch 0–80,
annotated with example reconstructions (a)–(d) at increasing epochs.)

Figure 4-21: How ShapeHD improves over time with the naturalness loss. The pre-
dicted shape becomes increasingly realistic as details are being added.

4.7.3 Failure Cases

We present the failure cases of ShapeHD in Figure 4-22. We observe our model has
these common failure modes: It sometimes gets confused by deformable object parts
(e.g., wheels on the top left); it may miss uncommon object parts (top right, the ring
above the wheels); it has difficulty in recovering very thin structure (bottom right),
and may generate other patterns instead (bottom left). While the voxel representation
makes it possible to incorporate the naturalness loss, intuitively, it also encourages
the network to focus on thicker shape parts, as they carry more weight in the loss
function.


Figure 4-22: Common failure modes of ShapeHD. Top left: The model sometimes gets
confused by deformable object parts (e.g., wheels). Top right: The model might miss
uncommon object parts (the ring above the wheels). Bottom row: The model has
difficulty in recovering very thin structure and may generate other structure patterns
instead.

4.7.4 Effects of Viewpoints on Generalization

The generic viewpoint assumption states that the observer is not in a special position
relative to the object [Freeman, 1994]. This makes us wonder if the “accidentalness”
of the viewpoint affects the quality of GenRe’s reconstruction.
As a quantitative analysis, we test our model trained on ShapeNet chair, car, and
airplane on 100 randomly sampled ShapeNet tables, each rendered in 200 different
views sampled uniformly on a sphere. We then compute, for each of the 200 views,
the median CD of the 100 reconstructions. Finally, in Figure 4-23, we visualize these
median CDs as a heatmap over an elevation-azimuth view grid. As the heatmap
shows, our model makes better predictions when the input view is generic than when
it is accidental, consistent with the intuition.
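The view sampling behind this heatmap can be sketched as follows; the uniform-on-the-sphere sampler via normalized Gaussians is a standard trick, and the function names and binning details are our own assumptions.

    import numpy as np

    def sample_views(n=200, seed=0):
        """Sample viewpoints uniformly on the sphere; return (elevation, azimuth)."""
        rng = np.random.default_rng(seed)
        d = rng.normal(size=(n, 3))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        elev = np.arcsin(d[:, 2])            # in [-pi/2, pi/2]
        azim = np.arctan2(d[:, 1], d[:, 0])  # in [-pi, pi]
        return elev, azim

    # For each sampled view, the median CD over the 100 test tables is accumulated
    # into an elevation-azimuth grid and rendered as the heatmap in Figure 4-23.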

4.7.5 Generalizing to Non-Rigid Shapes

We probe the generalization limit of GenRe by testing it on unseen non-rigid shapes,
such as horses and humans. As the focus is mainly on the spherical map inpainting
network (Figure 4-4 [b]) and the voxel refinement network (Figure 4-4 [c]), we assume
our model has access to the ground-truth single-view depth (i.e., GenRe-Oracle) in
this experiment. As demonstrated in Figure 4-24, our model not only retains the
visible details in the original view, but also completes the unseen surfaces using the

Figure 4-23: Reconstruction errors of GenRe across different input viewpoints. The
vertical (horizontal) axis represents elevation (azimuth). Accidental views (dark blue
box) lead to large errors, while generic views (green box) result in smaller errors. Er-
rors are computed for 100 tables; these particular tables are for visualization purposes
only.

generic shape priors learned from rigid objects (ShapeNet car, chair, and airplane).

Figure 4-24: Single-view completion of non-rigid shapes from depth by GenRe. The
GenRe model is trained on car, chair, and airplane.

Figure 4-25: Completion of highly regular shapes (primitives) by GenRe. The GenRe
model is trained on car, chair, and airplane.

4.7.6 Generalizing to Highly Regular Shapes

We further explore whether GenRe captures global shape attributes by testing it on
highly regular shapes that can be parametrized by only a few attributes (such as
cones and cubes). Similar to Section 4.7.5, the model has seen only ShapeNet car,

chair, and airplane during training, and we assume our model has access to the
ground-truth single-view depth (i.e., GenRe-Oracle).
As Figure 4-25 shows, although our model hallucinates the unseen parts of these
shape primitives, it fails to exploit global shape symmetry to produce correct predic-
tions. This is not surprising given that our network design does not explicitly model
such regularity. A possible future direction is to incorporate priors that facilitate
learning high-level concepts such as symmetry.

4.8 Conclusion
We have presented Pix3D [Sun et al., 2018b], a large-scale dataset of well-aligned 2D
images and 3D shapes, and also explored how three commonly used metrics corre-
spond to human perception through behavioral studies. With this high-quality test set
and informative error metrics, we then continued to develop two models for single-
image 3D shape reconstruction: ShapeHD [Wu et al., 2018] aimed at high-fidelity
reconstruction with structural details and Generalizable Reconstruction (GenRe)
[Zhang et al., 2018b] generalizing to novel shape classes unseen during training.
For ShapeHD, we proposed to use learned shape priors to overcome the 2D-
3D ambiguity and to learn from the multiple hypotheses that explain a single-view
observation. Our model achieves state-of-the-art results with structural details on
3D shape completion and reconstruction. We hope our results will inspire further
research in 3D shape modeling, in particular on explaining the ambiguity behind
partial observations.
For GenRe, we have studied the problem of generalizable single-image 3D re-
construction. We exploit various image and shape representations including 2.5D
sketches, spherical maps, and voxels. We have presented a novel viewer-centered
model that integrates these representations for generalizable, high-quality 3D shape
reconstruction. The experiments demonstrate that GenRe achieves state-of-the-art
performance on shape reconstruction for both seen and unseen classes. We hope our
system will inspire future research along this challenging but rewarding direction.

Chapter 5

High-Level Abstraction: Data-Driven Lighting Recovery

In this chapter, we study “lighting from appearance” at a high level of abstraction,
using a data-driven approach without explicitly modeling shape, reflectance, or the
rendering process. Specifically, we address the problem of recovering lighting from
a single-pixel observation of the object illuminated, under the special context of the
Earth as the light source and the Moon as the illuminated (see Figure 5-1). We start
with an introduction of how light transports in the Sun-Earth-Moon system, present
how we formulate the problem of estimating the Earth’s appearance from a single-
pixel Moon observation as that of recovering lighting—the Earth—from the Earth-lit
part of the Moon, and finally explain why a Generative Adversarial Network (GAN)
solution is suitable for this problem (Section 5.1). We then review the related work
in Section 5.2.
Next, we present the Generative Adversarial Network for the Earth (EarthGAN),
capable of generating photorealistic images of the Earth that are likely to
be responsible for the given Moon appearance captured by a consumer camera such
as that of a mobile phone. Specifically, the generation is conditioned on the obser-
vation timestamp (respecting the fact that date and time dictate the Earth’s conti-
nental/oceanic configuration according to the Earth rotation) and the average color
of the Moon, which informs us of the atmospheric conditions that effectively “mask”

the light source—the Earth (Section 5.3). Then, EarthGAN’s task is to generate the
most likely Earth image that, as the light source, gave rise to the Moon observation
at that particular timestamp. The fact that we rely on only the average color of the
Moon is crucial for everyday users to be potentially able to apply our method to their
casual “backyard capture.”
In Section 5.4, we start by describing the characteristics and multi-modal nature of
our data. We then test EarthGAN at the actual timestamps for which we have ground-
truth Earth images, to gauge the quality of the Earth image generated by EarthGAN
given an observation timestamp and an average Moon color. Since EarthGAN requires
a dataset of Earth images, one could alternatively run simple nearest neighbor-based
algorithms to retrieve the “best” Earth image given the same input. These non-
parametric methods have the advantage of excellent simplicity and interpretability.
As such, we also evaluate two simple models from this category against EarthGAN
and demonstrate their limitations, the most important of which is their inability to
“hallucinate” novel contents unseen in their entirety during training.
Next, we “super-resolve” the Earth rotation in time, by querying EarthGAN at
a finer time resolution of 5 min. Through this experiment, we demonstrate that
EarthGAN has learned from the data to synthesize photorealistic, continuous Earth
rotation despite having seen only snapshots of the Earth that are 1h+ apart during
training. We then frame the modeling of atmospheric conditions such as clouds and
other nuances as multi-modal generation commonly seen in Generative Adversarial
Networks (GANs), and demonstrate how EarthGAN controls these nuances with a
random vector. Finally, we perform analyses on whether we can rely on just the times-
tamp without observing the Moon at all, whether the choice of the GAN backbone
changes the results, and how other time encoding schemes affect the generation.

5.1 Introduction

Leaving the source, light travels to and interacts with the object, resulting in the
object’s appearance that we observe. More formally, appearance (filtered signal)

can be thought of as a convolution of the object’s reflectance (filter) over lighting
(signal) [Ramamoorthi and Hanrahan, 2004]. As such, a signal processing formulation
of lighting recovery often involves explicit modeling of the filter, i.e., the object’s
reflectance [Ramamoorthi and Hanrahan, 2001]. Such approaches, however, may
not be viable for cases where we do not have an accurate model for the object’s
reflectance, or where physically-based rendering is challenging (e.g., due to additional
nuances not exhaustively modeled in the rendering process). The Moon-Earth case
that this chapter studies is exactly one such case.

In this chapter, we study the problem of lighting recovery from the appearance
of the illuminated object. Specifically, we investigate a special case where the Earth
serves as the light source that we aim to recover, and the Moon is the illuminated
object that we observe. This is a simplification of the actual light transport: In
reality, light from the Sun travels to the Earth, and the Earth reflects the light, in a
spatially-varying manner, to the Moon. Because part of the Moon is also lit directly
by the Sun at a much higher light intensity, only the dark side of the Moon carries
signals about the (indirect) illumination from the Earth. In our simplified setup, the
Sun is removed, and the Earth is made emissive to directly provide illumination to
the Moon.

As previously alluded to, this Moon-Earth setup makes it challenging to solve the
problem using a physically-based approach like the low-level abstraction presented
in Chapter 2. There are mainly three reasons. Firstly, although Bidirectional Re-
flectance Distribution Function (BRDF) approximations such as Hapke [1981] are
available for the Moon surface, it is still hard to obtain an accurate reflectance model
on top of the complex Moon surface geometry. Secondly, the image formation process
involves various factors that have effects but are hard to model and even unknown,
such as atmospheric turbulence. Finally, a physically-based model will not be able to
recover high-frequency lighting when given just a single-pixel observation of the illu-
minated object. Furthermore, whether physically-based modeling is necessary in this
case is debatable: While physically-based modeling is general and can be applied to
recover any “rare” lighting, the lighting in our case, (emissive) images of the Earth, is

highly specialized and possesses many strong regularities: The Earth is always mostly
round, and consists of blue pixels for the ocean, yellow or green pixels for the conti-
nents, and white pixels for the clouds. Furthermore, there are abundant high-quality
Earth images, taken by a spacecraft camera, available online.

In light of these observations, we tackle this problem at a high level of abstraction,
using a data-driven approach that frees us from modeling the complex image
formation process and allows us to exploit the regularities by learning from the data.
Specifically, we propose the Generative Adversarial Network for the Earth (EarthGAN),
a conditional generative model that takes as input the Moon observation and its times-
tamp, and produces an Earth image representing the Earth lighting that is likely to
be responsible for the Moon observation. Importantly, EarthGAN is designed to
work with a single-pixel Moon observation (i.e., the mean color) and therefore can
be potentially applied to “backyard” images taken by a mobile phone camera. Once
trained, EarthGAN learns the regularities in the Earth images and a mapping from
the timestamp and Moon observation to the Earth appearance. With this mapping,
we are able to “super-resolve” the Earth appearance in time, by querying the trained
EarthGAN at a time granularity finer than the actual capture intervals, to reveal the
underlying continuous Earth rotation when given just discrete snapshots. In addition,
EarthGAN learns to model the Earth appearance due to other nuances such as cloud
patterns with its randomness vector, capable of synthesizing multiple possible Earth
appearances, e.g., with different cloud patterns, given the same timestamp and Moon
observation.

Because our dataset contains pairs of Moon observations and their correspond-
ing Earth images, we can alternatively train an image-to-image Convolutional Neural
Network (CNN) in a supervised way. However, this supervised alternative is un-
able to properly handle the one-to-many mappings in our case: There are multiple
possible Earth appearances, e.g., with different cloud patterns, that can give rise to
the observed Moon appearance. Imposing a supervised loss on the CNN would lead
to blurry “mean” images that satisfy the ℓ2 loss. Therefore, a GAN-based approach is
proper for our problem as it has been shown capable of high-resolution, multi-modal

generation [Wang et al., 2018, Karras et al., 2019, Park et al., 2019b].
We compare EarthGAN against two nearest neighbor baselines and demonstrate
EarthGAN’s superiority of being able to synthesize novel pixels (cf. the baselines
retrieving only seen snapshots). Qualitative and quantitative experiments justify our
design choices including the utilization of the (low-resolution) Moon observation, the
GAN architecture, and our timestamp encoding scheme.

5.2 Related Work

Our work is related to several areas in computer vision and graphics. In this section,
we organize the relevant literature into three categories: Non-Line-of-Sight (NLOS)
imaging, lighting recovery, and Generative Adversarial Networks (GANs).

5.2.1 Non-Line-of-Sight Imaging

Because our goal is to image the Earth from the Earth, Non-Line-of-Sight (NLOS)
imaging is a relevant topic. That said, our approach can barely be called an “imaging”
system, given its reliance on mostly a database (strong priors) and only a single-
pixel observation of the Moon (weak observation).
Broadly, NLOS approaches can be categorized into active and passive methods.
Active methods rely on energy-emitting sensors such as a Time-of-Flight (TOF) cam-
era, whereas passive approaches rely on just energy-absorbing devices such as a con-
ventional RGB camera. In the active category, Heide et al. [2014], Shin et al. [2015,
2016], Laurenzis et al. [2016] use a laser to shine at a point visible to both the observable
and hidden scenes and measure the time that the light takes to return [Pandharkar
et al., 2011, Shin et al., 2016]. By measuring the time of flight and intensity, one can
infer depth, shape, and reflectance of the hidden objects [Shin et al., 2014]. However,
TOF cameras have the limitations of requiring specialized hardware setup, being un-
able to introduce additional light in uncontrollable cases like our Earth-Moon scene,
and being vulnerable to ambient lighting.

In the passive category, Bouman et al. [2017] turn corners into cameras and re-
cover a video of the hidden scene from the computer-observable intensity change near
the corners. Other works have also considered turning naturally existing structures
into cameras. For instance, Cohen [1982], Torralba and Freeman [2012] have used
naturally occurring pinholes (such as windows) and pinspecks for NLOS imaging.
In addition, Nishino and Nayar [2006] have extracted environment lighting from the
specular reflections off human eyes. Also related is the work of Wu et al. [2012] that
visualizes small, imperceptible color changes in videos.
The most relevant is probably the work by Hasinoff et al. [2011] who used occlusion
geometry to improve the conditioning of the diffuse light transport inversion, turning
the Moon into a diffuse reflector to make an image of the Earth. Freeman [2020]
spoke about several attempts to photograph the Earth using the Moon as a camera
and the computational imaging projects resulting from those attempts.
In contrast to these methods that are mostly physically-based, our EarthGAN
approach is data-driven and operates without modeling the actual image forma-
tion process. EarthGAN exploits the fact that there are strong regularities in our
lighting—the Earth images—and learns data-driven priors of what an Earth image
should look like. With such strong priors, it is able to recover an Earth image given
just a single-pixel observation of the Moon (and the corresponding timestamp).

5.2.2 Lighting Recovery

Inferring lighting from images is a longstanding problem in computer vision [Horn,
1974]. The seminal work by Debevec [1998] demonstrated that virtual objects can be
convincingly placed in real photographs by relighting the objects with High-Dynamic-
Range (HDR) light probes captured with a chrome ball. Using machine learning,
Lalonde et al. [2009], Gardner et al. [2017], Hold-Geoffroy et al. [2017], Calian et al.
[2018], LeGendre et al. [2019], Sengupta et al. [2019] estimate HDR light probes
from a regular low-dynamic-range photograph. These approaches, however, are not
sufficient when we have near-field lighting, or when lighting varies drastically w.r.t.
the 3D location in the scene. To address this issue, many works estimate spatially-

varying lighting by predicting a separate light probe for each pixel of the input image
[Shelhamer et al., 2015, Garon et al., 2019, Li et al., 2020c]. Karsch et al. [2014]
recover 3D area lights by detecting visible light sources and retrieving non-visible
lights from a labeled dataset.
These spatially-varying lighting estimation approaches, though, do not ensure
that the estimated lighting is spatially coherent in the 3D space. To handle this
problem, Song and Funkhouser [2019] first obtain 3D understanding of the scene,
project observed pixels to the target light probe according to the query location, and
finally inpaint the missing regions of the light probe using a neural network. Although
this method ensures spatial coherence of the estimated lighting for observed lights,
the unseen light sources that are inpainted may not be spatially consistent. The
recent work by Gardner et al. [2019] assumes a fixed number of light sources and
regresses those lights’ colors, intensities, and positions using a neural network. This
work ensures that the estimated lighting is spatially consistent. Another recent work
is Lighthouse [Srinivasan et al., 2020], which takes as input perspective and spherical
panorama images, and outputs spatially-coherent and -varying lighting.
As we see, many prior works in this domain concern the spatially-varying and
-coherent nature of lighting. These properties, however, are of little relevance under
our setup, where the lighting to estimate lies on a monitor-like plane far from the
illuminated object. That said, the machine learning approaches that some of these
works take resemble our data-driven recovery at a high level.

5.2.3 Generative Adversarial Networks

The influential work of Kingma and Welling [2014] presented Variational Autoen-
coders (VAEs) as a method that learns to encode images into low-dimensional latent
codes that can get reconstructed back to the input images. Similarly, Goodfellow
et al. [2014] proposed GANs where the generator learns to synthesize images that are
indistinguishable from the real images, while the discriminator aims to improve its
ability to tell a generated image from a real one. As such, both models are capable
of synthesizing images when fed with random vectors.

Yet the user often wants to generate images based on some input rather than
randomly. Known as “conditional image synthesis,” the task is to generate photore-
alistic images that satisfy the given condition. The simplest form of a condition is
probably a class label, e.g., cat, while image conditions such as a segmentation map
are also of interest. Researchers have proposed class-conditional models that can
synthesize images satisfying the input class labels [Mirza and Osindero, 2014, Odena
et al., 2017, Brock et al., 2018, Mescheder et al., 2018, Miyato and Koyama, 2018].
Text-conditional models have also been proposed to, e.g., generate an image based on
a caption [Reed et al., 2016, Zhang et al., 2017, Hong et al., 2018, Xu et al., 2018a].
When both the input and output are images, researchers have devised image-to-image
models such as Karacan et al. [2016], Liu et al. [2017], Isola et al. [2017], Zhu et al.
[2017a,b], Huang et al. [2018], Karacan et al. [2019].
In enabling image-to-image models to generate high-resolution output, Wang et al.
[2018] developed pix2pixHD that produces 2048 × 2048 images with a multi-scale
generator and discriminator. Observing that normalization layers often “wash away”
semantic information when the condition maps are passed as input through the net-
work layers, Park et al. [2019b] proposed SPADE that uses the input condition map
to modulate the activations after the normalization, achieving high-quality image gen-
eration given a condition image such as a segmentation map. In this work, we build
upon SPADE and make it an “imaging” network that takes as input a weak observa-
tion of the Moon (as well as the corresponding timestamp) and makes an image of
the Earth that is likely responsible for the observation, relying on strong priors about
the Earth regularities learned from our data.

5.3 Method

During training, the input to the Generative Adversarial Network for the Earth
(EarthGAN) is a collection of Earth images associated with their timestamps and average
colors of the corresponding Moon images. The generator of the EarthGAN then learns
a mapping from the timestamp and average Moon color to a full Earth image, while

the discriminator aims to differentiate the generated image from the real Earth im-
age given the timestamp and the average Moon color. Because the input timestamp
and average color do not fully dictate the Earth’s appearance (e.g., different cloud
configurations overlaid on a “clean Earth” may lead to the same Moon observation),
we additionally condition our generation on a randomness vector 𝑧 given by encoding
the ground-truth image during training, as in SPADE [Park et al., 2019b].
At test time, we randomly sample a 𝑧 from its target distribution, to which the
image encoder has learned to stay close during training, and ask the generator to
produce a full-resolution Earth image given 𝑧, the test timestamp, and the average
Moon color. With a trained EarthGAN, one can explore the regularities of the Earth
images learned from the image collection: By querying EarthGAN at actual times-
tamps for which we have ground-truth Earth images captured, we evaluate how well
EarthGAN recovers highly regular lighting from a single-pixel observation; by query-
ing the model at time intervals finer than the capture granularity, we probe whether
EarthGAN has learned the underlying evolution of lighting as a function of time or
has just memorized some snapshots seen during training; by varying 𝑧 while keeping
the other conditions constant, we investigate what aspects of the Earth images are
explained by randomness rather than the timestamp and average color.

5.3.1 Data & Simulation

Here we describe how to collect a dataset of Earth images, our scene setup for data
generation, and the rendering specifics.

Earth Photographs The Earth images are crawled from the data release by NASA
[2015], taken by the Earth Polychromatic Imaging Camera (EPIC) on the Deep Space
Climate Observatory (DSCOVR) spacecraft. The DSCOVR spacecraft was launched
to the Earth-Sun Lagrange-1 (L-1) point in February of 2015, with the mission of
monitoring space weather events such as geomagnetic storms. EPIC is the on-board
camera capturing 2048 × 2048 images of the entire Sun-lit face of the Earth through
a Cassegrain telescope with a 30 cm aperture.

EPIC takes images every 60 min to 100 min, so there are usually around ten images
available per day. There are two types of images that EPIC takes: the “natural color”
images that are created using bands lying within the human visual range and the
“enhanced color” images that are additionally processed to enhance land features. We
use the natural color images as they simulate what a conventional camera would have
produced [NASA, 2015]. For rendering, we use these images as the Earth plane texture
at their original resolution of 2048 × 2048, while we downsample them to 256 × 256 for
training the EarthGAN. Each photograph is associated with a timestamp specifying
on which date and at what time it was captured (e.g., 2016-01-08 00:55:16). We
use all images from 2016 to 2020 (inclusive) as our training data and all 2021 images
up to May 31 as the test data.
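The chronological split can be expressed with a few lines of standard-library code; the sample tuple layout below is our own assumption about how the crawled images and timestamps are stored.

    from datetime import datetime

    def split_by_timestamp(samples):
        """samples: list of (timestamp_string, image_path) pairs, with timestamps
        formatted like '2016-01-08 00:55:16'. Train: 2016-2020 (inclusive);
        test: 2021 images up to May 31."""
        train, test = [], []
        for ts, path in samples:
            t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
            if 2016 <= t.year <= 2020:
                train.append((ts, path))
            elif t.year == 2021 and t <= datetime(2021, 5, 31, 23, 59, 59):
                test.append((ts, path))
        return train, test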

Scene Setup We set up a Blender1 scene where we use a Lambertian sphere with
albedo by NASA [2019] to approximate the Moon, and an emissive plane (like a
computer monitor) to display the Earth images. Although one could alternatively use
the actual Moon topographic map, such as the one released by NASA [2019] or USGS
[2020], as a displacement map on top of the sphere geometry, and a similar sphere-like
geometry onto which the Earth texture is mapped, the gain by this “physically correct”
setup should be negligible in our data-driven approach that operates at a high level
of abstraction, without explicitly modeling the geometry. Similarly for reflectance,
it is reasonable to just use a Lambertian approximation (instead of the Bidirectional
Reflectance Distribution Function [BRDF] by Hapke [1981]) for the Moon and make
the Earth plane a textured emitter (rather than a non-emitter that reflects light from
the Sun to Moon in a spatially-varying manner).
Despite the shape and reflectance approximations, we respect the Moon-Earth
distance by placing the two objects at a distance in scale w.r.t. their sizes. The
radius of the Moon sphere is set to the actual Moon radius, and the Earth plane is
sized such that the texture, when mapped onto the plane, makes the Earth's apparent
radius roughly correct as viewed from the Moon. The Earth plane is perpendicular to the
1 https://www.blender.org/

Earth-Moon line. The camera sits at the world origin and looks at the Moon center.
Figure 5-1 shows the actual Sun-Earth-Moon system, our simplified version thereof,
and a screenshot of the scene setup in Blender.

Figure 5-1: Our simplification of the Sun-Earth-Moon system (not drawn to scale;
Moon photograph © Philipp Salzgeber). We simplify the actual light transport from
the Sun to the Earth and then to (the dark side of) the Moon into light coming
directly to the Moon from a “glowing Earth.” Therefore, the emissive Earth plane is
the lighting we aim to recover, and the Moon sphere is the Earth-lit object that we
observe.

Rendering Rendering is performed with Blender Cycles, a physically-based rendering
engine built into Blender. We render the Moon images at 256 × 256 resolution
with 512 samples per pixel. Rendering one image takes
around 30 s on a 24-core CPU machine. All images are directly rendered into PNGs
of discretized pixel values rather than EXRs of raw values, for easier adaptation to
real-world images that may not have raw values stored.
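For reference, the relevant Cycles settings can be configured through Blender's Python API roughly as follows; this is a minimal sketch run inside Blender, and the output path is hypothetical rather than taken from our actual pipeline.

    import bpy  # available only inside Blender

    scene = bpy.context.scene
    scene.render.engine = 'CYCLES'
    scene.cycles.samples = 512                        # samples per pixel
    scene.render.resolution_x = 256
    scene.render.resolution_y = 256
    scene.render.image_settings.file_format = 'PNG'   # discretized pixels, not EXR
    scene.render.filepath = '/tmp/moon_render.png'    # hypothetical output path
    bpy.ops.render.render(write_still=True)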
Figure 5-2 shows two examples of the Earth illumination and the corresponding
Moon response. Note that to demonstrate how the Moon responds differently to
different Earth illuminations, we show the differential images with the mean Moon
image subtracted off. As shown in Figure 5-2, when the Earth illumination has the
African continent (or mostly ocean) illuminating the Moon, the Moon appearance

has additional yellow (or blue) tint compared with the mean Moon image. In other
words, when we take averages of such Moon images, the average colors do inform us
of some characteristics of the Earth illuminations.

(Panel labels: Earth Illumination; Moon Response − Mean Moon Image; Example 1,
Example 2.)

Figure 5-2: How the Moon responds differently to distinct Earth illuminations. When
the African continent (or mostly ocean) illuminates the Moon in Example 1 (or
Example 2), the Moon appearance has an additional yellow (or blue) tint compared
with the average Moon image. This proves the existence of signals that we can use
to estimate the Earth appearance by observing the Moon.

5.3.2 Nearest Neighbor-Based Recovery

Because we aim to recover lighting with a data-driven approach at a high level of
abstraction, a dataset of Earth images is required and can be collected as described
in Section 5.3.1. With such a dataset, one can alternatively use nearest neighbor-
based (NN-based) approaches to retrieve the “best” Earth reconstruction given the
Moon observation. Such non-parametric baselines have the advantages of excellent
simplicity and interpretability.
We consider two NN-based methods: the “NN-time” baseline that relies on just
timestamps (i.e., no Moon observation) and returns the Earth image taken at roughly
the same time of the day but one year prior to the query timestamp, and the “NN-obs”
baseline that additionally considers multiple years prior and returns the one with the
closest mean Moon color. Timestamp closeness is measured based on the time of day

on the same date of year, and color distance is simply the ℓ2 distance between two
RGB tuples. Figure 5-3 provides pictorial descriptions of these two baselines.
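Both baselines reduce to a few lines over a pre-built candidate pool (training images captured at around the same time of day on the same date of other years); the tuple layout and function names below are our own assumptions.

    import numpy as np

    def nn_time(pool):
        """pool: list of (timestamp, moon_mean_rgb, earth_image) candidates.
        NN-time returns the candidate from the nearest prior year; ISO-style
        timestamp strings sort chronologically, so max() suffices."""
        return max(pool, key=lambda c: c[0])[2]

    def nn_obs(query_moon_mean, pool):
        """NN-obs returns the candidate whose mean Moon color is closest in l2."""
        dists = [np.linalg.norm(np.asarray(c[1]) - np.asarray(query_moon_mean))
                 for c in pool]
        return pool[int(np.argmin(dists))][2]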

(Example query timestamp: 2016-01-01 00:31:52; candidate timestamps shown:
2017-01-01 02:02:38, 2018-01-01 00:45:50, 2019-01-01 01:23:02, 2020-01-01 00:35:12,
2021-01-01 00:55:15.)

Figure 5-3: Illustration of the nearest neighbor baselines for EarthGAN. Right: Given
the query timestamp, we generate the pool of NN candidates (training images captured
at around the same time on the same day of other years). Left: NN-time (blue box)
finds the NN based only on timestamps. NN-obs (green box) additionally computes
the mean colors of the Moon observations and then returns the NN candidate with
the closest Moon mean to the observed mean.

5.3.3 Generative Adversarial Network-Based Recovery

Because the NN baselines have to “snap” the query timestamp to one of the train-
ing timestamps, which are almost 2 h apart (Section 5.3.1), they produce discrete
snapshots instead of the desired continuous Earth rotation when we query them at
a continuous series of timestamps. In other words, they are unable to interpolate
between two timestamps or synthesize novel pixels that are not seen during training.
As such, we desire a generative model that learns a continuous function of the
Earth appearance w.r.t. the timestamp and Moon observation, such that the synthe-
sized Earth appearance evolves in a photorealistic and smooth way when we query the
model between two adjacent capture timestamps. This generative model must also be

conditional: Instead of generating random samples as an unconditional GAN does, it
needs to condition its output Earth image on the timestamp and Moon observation.
Intuitively, the model is trained to synthesize the Earth lighting that has given rise
to the appearance of the Moon at that particular timestamp.

Our EarthGAN model is one such model, based on SPADE [Park et al., 2019b],
that models the Earth appearance as a smooth function of the timestamp 𝑡, the
one-pixel observation (mean color) of the Moon 𝑜, and a randomness vector 𝑧. The
generator 𝐺 takes as input the condition c = (𝑡, 𝑜, 𝑧) and generates an Earth image
𝐺(c). The discriminator then learns to differentiate a real pair (c, x) from a generated
fake one (c, 𝐺(c)). Formally, EarthGAN models the conditional distribution of the
Earth appearance given the conditions via the following minimax game [Goodfellow
et al., 2014, Wang et al., 2018]:

\[
\min_G \max_D \;\; \mathbb{E}_{(\mathbf{c}, \mathbf{x}) \sim p_{\text{data}}(\mathbf{c}, \mathbf{x})}\bigl[\log D(\mathbf{c}, \mathbf{x})\bigr] + \mathbb{E}_{\mathbf{c} \sim p_{\text{data}}(\mathbf{c})}\bigl[\log\bigl(1 - D(\mathbf{c}, G(\mathbf{c}))\bigr)\bigr], \tag{5.1}
\]

where 𝑝data (·)’s are the data distributions, effectively our collection of condition-image
pairs. We defer the implementation and loss details to the end of this section.

Although timestamp 𝑡 seemingly fully dictates what the Earth looks like (e.g.,
which pixels belong to America and which to the Pacific Ocean) according to as-
tronomy, atmospheric conditions such as clouds add randomness to the actual Earth
appearance. Intuitively, the additional observation 𝑜 helps disambiguate certain cases,
e.g., whether the Earth appears more gray (likely due to overlaid clouds) or more blue (when the atmospheric conditions allow us to see the ocean directly). However, 𝑜 does
not solve the problem entirely: Different cloud patterns may still give the same 𝑜 at
𝑡, and furthermore, there may be other nuances that EarthGAN is not yet modeling.
This sounds like the familiar problem of multi-modal generation often encountered
in GANs: Given the same condition (e.g., a segmentation map), the model needs the
capability of generating multiple plausible images (e.g., different RGB images that
all satisfy the segmentation map).

To this end, we prepend an image encoder to our GAN model, following how multi-

modal generation is achieved in SPADE [Park et al., 2019b], to regress (parameters
of) a multivariate Gaussian distribution from the ground-truth Earth image during
training. Intuitively, the image encoder and 𝐺 form a Variational Autoencoder (VAE)
[Kingma and Welling, 2014], with the encoder trying to capture the “style” of the
image. With this design, we can sample a randomness vector 𝑧 from the learned latent
space and have it capture factors and nuances (other than 𝑜 or 𝑡) that might affect
the Earth appearance. At test time, we sample 𝑧 from the prior distribution—a zero-
mean, unit-variance multivariate Gaussian, which was used in the Kullback–Leibler
divergence during training. When we explore how 𝑡 and 𝑜 affect the generation, we
sample just one 𝑧 and use that throughout. On the other hand, when we explore
how 𝑧 affects the generation, we keep 𝑡 and 𝑜 constant, and vary only 𝑧. Figure 5-4
visualizes our EarthGAN model.

Figure 5-4: EarthGAN model. Given the Moon observation, we compute the mean as our observation 𝑜. We also encode the string timestamp into a 3-vector 𝑡, respecting the time and date semantics. We repeat the concatenation of 𝑜 and 𝑡 across the spatial dimensions, producing a condition “map” at the same resolution as the Earth image. The Earth image is encoded into (parameters of) a multivariate Gaussian distribution encouraged to stay close to the zero-mean, unit-variance Gaussian. The SPADE generator aims to generate an Earth image given the condition map and a random sample from the multivariate Gaussian. The multi-scale discriminator then tries to tell whether the pair of the condition map and Earth image is real. We use the same loss function as in SPADE.

Timestamp Encoding To perform numerical operations on string timestamps like


2016-01-08 00:55:16, we need to convert them into numbers for EarthGAN. One

might consider using a string encoder from the Natural Language Processing (NLP)
literature, but this would destroy the semantics that we understand well: the peri-
odicities in dates and time (e.g., time is periodic with a cycle of 24 h), their bounds
(e.g., July 31 → August 1 instead of July 32), etc. In addition, training an additional
string encoder would add another layer of obscurity to the already inexplicable GAN
dynamics.
As such, we opt for a simple encoding scheme that not only preserves the seman-
tics (which facilitates learning of the continuous Earth rotation, as we will show in
Section 5.4.3), but also is compatible with the input format required by SPADE-like
models. Specifically, we normalize month 𝑚 ∈ {1, 2, . . . , 12} as 𝑚′ = (𝑚 − 1)/11, day² 𝑑 ∈ {1, 2, . . . , 31} as 𝑑′ = (𝑑 − 1)/30, and second in the day 𝑠 ∈ {1, 2, . . . , 86400} as 𝑠′ = (𝑠 − 1)/86399. Then, each element of the encoded timestamp 𝑡 = (𝑚′, 𝑑′, 𝑠′) is in [0, 1];
equivalently, 𝑡 falls inside a unit cube just like the RGB observation 𝑜.
We leave out the year information on purpose, so two timestamps that are dif-
ferent only in year will be mapped to the same 𝑡. This timestamp encoding scheme
is beneficial because it ensures that EarthGAN observes drastically different Earth appearances for similar timestamps, as shown in Figure 5-5. Consequently, EarthGAN
is motivated to explain these appearance variations using the randomness vector 𝑧.
Additionally, the year bit is too sparse when mapped to the real axis (i.e., 2015 → 0,
2016 → 1/6, . . . , 2021 → 1).
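
As a concrete illustration, the encoding amounts to only a few lines; the minimal sketch below assumes the timestamp string format shown in the examples above.

from datetime import datetime

def encode_timestamp(ts_string: str):
    # Encode a string like "2016-01-08 00:55:16" into t = (m', d', s') in [0, 1]^3,
    # discarding the year by design.
    ts = datetime.strptime(ts_string, "%Y-%m-%d %H:%M:%S")
    m_prime = (ts.month - 1) / 11.0
    d_prime = (ts.day - 1) / 30.0
    s = ts.hour * 3600 + ts.minute * 60 + ts.second + 1   # s in {1, ..., 86400}
    s_prime = (s - 1) / 86399.0
    return (m_prime, d_prime, s_prime)

# Timestamps that differ only in year map to the same encoding.
assert encode_timestamp("2016-01-01 00:55:15") == encode_timestamp("2020-01-01 00:55:15")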

Condition “Maps” Because SPADE and other image-to-image translation models


require the input condition (usually segmentation maps) to be of the same resolution as the image, we concatenate 𝑡 and 𝑜 into a 6D condition vector and repeat it across the two spatial dimensions, obtaining a 6-channel condition “map” of the same resolution as the Earth image but with the same 6D vector at every pixel location. Intuitively, this operation conditions every pixel of the generated Earth image on the same 𝑜; when a Moon image is available as the observation (instead of the current single-pixel mean), it is worth exploring whether conditioning different Earth pixels on different observed Moon pixels leads to higher-quality results.
²We ignore the variable number of days for different months.

Figure 5-5: Different Earth appearances at similar timestamps. Panels (A)–(C) are training captures (2016-01-01 00:55:15, 2017-01-01 00:36:34, and 2018-01-01 00:17:51), and panel (D) is a test capture (2021-01-01 00:27:12). These four timestamps are close in time of day and on the same date of year (January 1), so the continental and oceanic patterns remain roughly the same across these timestamps. However, the final Earth appearances are drastically different because of the non-stationary cloud patterns. These appearance variations motivate EarthGAN to control the clouds and other nuances with 𝑧. This figure also demonstrates that nearest neighbors may not reconstruct the test image well due to varying cloud patterns.

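Returning to the condition-map construction above, the tiling operation is straightforward; a minimal PyTorch sketch (with illustrative tensor shapes) is:

import torch

def build_condition_map(t, o, height, width):
    # t: encoded timestamps, shape (B, 3); o: mean Moon colors, shape (B, 3).
    # Returns a (B, 6, height, width) condition "map" carrying the same
    # 6-vector at every pixel location.
    cond = torch.cat([t, o], dim=1)
    return cond[:, :, None, None].expand(-1, -1, height, width)

# Example: a batch of two conditions tiled to 256 x 256.
cond_map = build_condition_map(torch.rand(2, 3), torch.rand(2, 3), 256, 256)
assert cond_map.shape == (2, 6, 256, 256)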

Implementation Details & Losses We follow the SPADE design by Park et al.
[2019b] in both network architectures and loss functions. Specifically, the generator is
a series of SPADE residual blocks, at each of which the condition maps are injected. The discriminator is a multi-resolution convolutional network based on PatchGAN [Isola et al., 2017, Wang et al., 2018]. As in SPADE, EarthGAN uses the loss function from pix2pixHD [Wang et al., 2018] except that the least-squares loss term
[Mao et al., 2017] is replaced with the hinge loss [Miyato et al., 2018, Zhang et al.,
2019]. We train EarthGAN on one NVIDIA TITAN RTX for around eight hours.
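
For reference, the hinge adversarial terms used in place of the least-squares loss take the standard form sketched below. This covers only the adversarial part; the full objective also inherits pix2pixHD's feature-matching and perceptual terms, which are omitted here.

import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Discriminator scores on real pairs (condition map, real Earth image)
    # and on fake pairs (condition map, generated Earth image).
    return torch.mean(F.relu(1.0 - d_real)) + torch.mean(F.relu(1.0 + d_fake))

def g_hinge_loss(d_fake):
    # The generator tries to push the discriminator's scores on fakes upward.
    return -torch.mean(d_fake)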

5.4 Results
In this section, we first explain how we generate test conditions to evaluate Earth-
GAN’s performance. We then perform qualitative and quantitative evaluations of
how well EarthGAN recovers the Earth lighting in 2021, for which we have ground-

truth Earth photos captured by EPIC. Next, we probe EarthGAN’s limit by testing
it on timestamps at a finer granularity (5 min apart) than the actual timestamps
(almost 120 min apart), answering the question whether EarthGAN learns any un-
derlying regularities in these Earth images. We also compare EarthGAN’s synthesis
against that of the Nearest Neighbor (NN) baselines, demonstrating how EarthGAN
outperforms the NN baselines significantly when the query timestamp is far from
the nearby timestamps, or when the ground-truth image contains novel pixels unseen
during training. We demonstrate how EarthGAN learns to model clouds and other
nuances as a function of the randomness vector 𝑧. Finally, we perform an ablation
study where we demonstrate the importance of each major design choice.

5.4.1 Test Data & Evaluation Metrics

Generation of our test data is similar to the training data generation described in
Section 5.3.1. There are three test sets that are designed to reveal I) whether the
lighting recovery is high-quality, II) whether EarthGAN learns the underlying data
regularities, and III) what aspect of the Earth appearance EarthGAN learns to control
with 𝑧, respectively.
For I), we generate our test set with all 2021 timestamps together with their
corresponding Moon observations (recall that EarthGAN is trained on the 2016–2020
data). For these test points, we have the ground-truth Earth images captured by
EPIC, so computing quantitative errors is straightforward. For II), we generate our
test set by uniformly generating timestamps separated by 5 min within the entire
day. For convenience, the two date dimensions (month and day) are fixed to May
31 (the last 2021 date that we consider), the mean color is also fixed to that of the
same day, and 𝑧 is set to all 0’s. We do not have ground truth for these timestamps
since they do not correspond to the actual capture timestamps, but we expect a
good model to produce smooth evolution of the Earth appearance given the Earth’s
underlying rotation. For III), we generate our test set by randomly sampling 𝑧 from
the standard Gaussian while keeping timestamp 𝑡 and observation 𝑜 constant. Again,
these constant values are taken from May 31 for convenience.

In terms of evaluation, we use three metrics: Peak Signal-to-Noise Ratio (PSNR),
structural similarity (SSIM) [Wang et al., 2004], and the Learned Perceptual Image
Patch Similarity (LPIPS) [Zhang et al., 2018a]. Because per-pixel error metrics such
as PSNR fail to capture the perceptual quality of the synthesis, we highly recommend that the reader view the qualitative results in the figures and the supplemental video.
This is also observed in Chapter 3; see Section 3.5.2 for more discussion.
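
For concreteness, PSNR reduces to the following computation (a sketch assuming images normalized to [0, 1]; SSIM and LPIPS are computed with standard implementations of the cited works).

import numpy as np

def psnr(pred, target, max_value=1.0):
    # Peak Signal-to-Noise Ratio between two images with values in [0, max_value].
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_value ** 2 / mse)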

5.4.2 Earth Recovery Given the Moon

We query the trained EarthGAN with the actual 2021 timestamps for which we
have ground-truth Earth photos (Test Set I in Section 5.4.1). As Figure 5-6 shows,
EarthGAN is able to generate photorealistic Earth images given the timestamps and
Moon observations. The generated continental and oceanic patterns resemble those
of the ground truth, relying mostly on the time part of the conditions. The cloud
patterns are controlled by additionally the Moon observation and the randomness
vector 𝑧. Although these cloud patterns do not replicate the ground-truth patterns,
they are perceptually photorealistic.
Although Figure 5-6 seems to suggest the two NN baselines, NN-time that re-
trieves the NN using just timestamps and NN-obs that additionally uses mean Moon
observations, are on par with EarthGAN, the good performance of the NN baselines
relies heavily on EPIC’s regular sampling pattern: It takes photos at a mostly fixed
time interval, making it easy to find neighbors that are captured roughly at the same
time of day given the query timestamp. As such, the NN baselines are able to capture
correctly the continental and oceanic patterns, by simply retrieving neighbors that
are also captured around the same time of day.
This sampling regularity breaks entirely when we start querying EarthGAN at
arbitrary timestamps as in Section 5.4.3, where we ask EarthGAN to synthesize the
Earth appearance at an interval of 5 min. As shown in Section 5.4.3, these NN-based
methods will “snap” the query timestamp to a timestamp from EPIC’s sampling
pattern for several consecutive frames and suddenly switch to the next timestamp,
producing temporally unstable snapshots of the Earth appearance.

Figure 5-6: Earth recovery by EarthGAN. Rows (I)–(III) show three examples; columns are (A) NN-time, (B) NN-obs, (C) ours, and (D) ground truth. EarthGAN (C) generates photorealistic Earth images that resemble the ground-truth photos (D) captured by EPIC, given the query timestamps and average Moon observations. Relying on timestamps, NN-time (A) is able to retrieve a continental/oceanic pattern similar to that of the ground truth, but NN-obs (B) sometimes retrieves wrong continental patterns because the additional Moon observation may bias the NN to a more distant time of day. Although NN-time seems on par with EarthGAN, its good performance heavily relies on EPIC's regular sampling (see the text for details). However, this sampling pattern no longer holds when we query EarthGAN at arbitrary timestamps, in which cases NN-time fails miserably as shown in Section 5.4.3.

Quantitatively, Table 5.1 also suggests that NN-time is the best performing model
because of the issue discussed above. We encountered exactly the same problem in
Chapter 3, where the regular pattern of the light stage lights makes the baseline
approaches look performant quantitatively since we can compute errors only on the
physical lights that do fall onto this regular pattern, but the baseline methods fail at
test time when the query light no longer falls onto the regular pattern. Please see
Section 3.5.2 for an extensive discussion. Again, we strongly encourage the reader
to view the video comparison between EarthGAN and NN-time when we query both
models with novel timestamps that do not fall onto EPIC’s sampling pattern.

Table 5.1: Quantitative evaluation of EarthGAN. The NN-time baseline seemingly achieves the best performance quantitatively but fails when queried at arbitrary timestamps (more details in text; see Figure 5-7). EarthGAN outperforms its variants that use no Moon observation at all, have StyleGAN as the backbone, or encode timestamps using other schemes. Red, orange, and yellow cells in the original table indicate the top three performing methods, respectively. Errors are means over around 2,000 test images.

Method              PSNR ↑   SSIM ↑   LPIPS ↓
NN-time              20.21    0.736     0.076
NN-obs               19.75    0.730     0.092
EarthGAN             20.18    0.734     0.079
  w/o obs.           20.03    0.727     0.084
  w/ StyleGAN        19.96    0.725     0.086
  w/ time enc. I     19.69    0.724     0.096
  w/ time enc. II    20.05    0.731     0.084

5.4.3 Learning the Continuous Earth Rotation

We test the trained EarthGAN on Test Set II as specified in Section 5.4.1: novel times-
tamps at a finer granularity (5 min apart) than the actual timestamps (almost 120 min
apart). This task can be thought of as an attempt to “super-resolve” the Earth ro-
tation in time. As Figure 5-7 (top) shows, when we query the trained EarthGAN at
dense novel timestamps that are only 5 min apart, EarthGAN generates photorealis-
tic, smooth evolution of the Earth appearance that corresponds to the elapsed time
of 5 min, despite having seen only discrete snapshots separated by up to 2 h.
To clearly show these subtle but non-trivial appearance changes, we additionally
show zoom-in visualization of the same crop of the Earth images across different query
timestamps in Figure 5-8. The two boundary timestamps, for which we synthesize

Figure 5-7: Continuous Earth rotation learned by EarthGAN. Top: Despite having seen only snapshots that are almost 2 h apart, EarthGAN learns the underlying continuous Earth rotation and synthesizes smooth evolution of the Earth appearance at an interval of 5 min for the query timestamps 2021-01-01 13:30:00, 13:35:00, and 13:40:00. Bottom: The NN-time baseline retrieves the same training image (2019-01-01 12:44:50) for the first two query timestamps (13:30:00 and 13:35:00). In other words, the query timestamps get “snapped” to the same training timestamp. Because the training data are almost 2 h apart, NN-time then “jumps” from 12:44:50 directly to 14:32:53 even though we advance the query timestamp by only 5 min, from 13:35:00 to 13:40:00.
the Earth images in Figure 5-8 (top), are 115 min apart to simulate the time inter-
val between real EPIC captures. EarthGAN is able to “interpolate” from the start
timestamp to the end timestamp, synthesizing the Earth appearance every 5 min in
between, as shown in Figure 5-8 (bottom). Close inspection of Figure 5-8 (bottom)
reveals that the synthesis results do not stay stationary and suddenly jump to the
next pattern (as done by the NN baselines), but rather evolve smoothly, with the
Australian continent moving gradually from the left of the zoom-in window to the
right. Crucially, EarthGAN has no knowledge of the Earth rotation mechanics and is
not instructed to produce smoothly-varying synthesis, but rather learns this underly-
ing Earth motion from data, by observing discrete snapshots (2 h apart) of the Earth
appearance.

Figure 5-8: Smooth evolution of the Earth appearance learned by EarthGAN. Top: The Earth images synthesized by EarthGAN for 02:30:00 and 04:25:00 are 115 min apart, just like two adjacent training images. Bottom: EarthGAN is able to synthesize the smooth evolution of the Earth appearance every 5 min in between (the in-between outputs are shown in row-major order). We observe that the Australian continent moves smoothly from the left of the zoom-in window to the right in our synthesis.
We compare EarthGAN against the NN baselines qualitatively in Figure 5-7 (bot-
tom). In contrast to EarthGAN that synthesizes a smooth evolution of the Earth
appearance (Figure 5-7 [top]), NN-time “snaps” the query timestamps, 13:30:00 and
13:35:00, both to the nearest training timestamp, 12:44:50, hence mistakenly pro-
ducing the same reconstruction for different timestamps (yellow box of Figure 5-7).
When we advance the query timestamp by just another 5 min to 13:40:00, NN-time
abruptly “jumps” to the next training timestamp, 14:32:53, that is almost 1 h ahead
of the query timestamp, thereby producing the wrong continental pattern (green box
of Figure 5-7).

5.4.4 Multi-Modal Generation & the Clouds

We ask the trained EarthGAN to synthesize multiple Earth images from randomly
sampled 𝑧’s but fixed timestamp 𝑡 and observation 𝑜. This test (using Test Set III
of Section 5.4.1) sheds light on what aspects of the Earth appearance EarthGAN
models with 𝑧 when it is not explicitly asked to. Recall that with our time encoding
scheme that discards the year information by design, EarthGAN observes multiple
possible Earth appearances for a given timestamp, as demonstrated in Figure 5-
5. This encourages EarthGAN to model appearance variance even for the same
timestamp and (similar) Moon observation.

We randomly sample six 𝑧 vectors from the zero-mean, unit-variance multivariate


Gaussian and ask EarthGAN to generate six Earth images given these six 𝑧’s, the
same timestamp 𝑡, the same Moon observation 𝑜 (as aforementioned, 𝑡 and 𝑜 are fixed
to those of an EPIC capture on May 31, 2021). As shown in Figure 5-9, EarthGAN
learns to model different Earth appearances (partially due to the dynamic cloud
patterns) using 𝑧. The six random samples of 𝑧 lead to the same continental/oceanic
patterns, as expected because the timestamp is kept constant, but different cloud
configurations. Both our data encoding scheme and SPADE’s capability of multi-
modal generation contribute to EarthGAN’s ability to model the clouds with 𝑧.

Figure 5-9: How EarthGAN learns to model the clouds. The figure shows the ground truth, the NN-time retrieval, and six random samples of our generation. Besides the timestamp and average Moon color, atmospheric conditions such as clouds and other nuances also affect the Earth appearance. EarthGAN learns to model appearance variations due to these factors with its randomness vector 𝑧. By sampling different 𝑧 vectors, we generate multiple possible Earth appearances for the same timestamp and Moon observation.

5.4.5 Ablation Studies

We now study the importance of the major design choices in developing EarthGAN.
Specifically, we investigate whether we need to observe the Moon at all to make an
image of the Earth when we already have the timestamp, whether the choice of the
GAN architecture affects the generation quality, and why the current time encoding
scheme is superior to the alternatives.

Without Observing the Moon Given the clear regularities in our Earth images
and the strong dependency of the Earth appearance on the timestamp, we study
whether one needs a single-pixel observation of the Moon at all. To this end, we train
an EarthGAN that conditions the Earth appearance only on the timestamp 𝑡 and the
randomness vector 𝑧, without having access to the mean Moon color.
Although Figure 5-10 (D, E, F) suggests that not observing the Moon at all
does not degrade the visual quality of the generation in that all images in (D, E)
look photorealistic, the quantitative evaluation in Table 5.1 proves that additionally
observing the single-pixel Moon yields more accurate reconstruction across all three
error metrics.

Choice of the GAN Architecture Because StyleGAN has been proven successful
in generating high-resolution photorealistic human faces [Karras et al., 2019, 2020],
we study whether the previous results still hold if we replace our SPADE backbone with a StyleGAN backbone. As Figure 5-10 (A, E, F) shows, EarthGAN with a StyleGAN backbone still produces photorealistic results, but upon viewing the video of this model variant's generation, we notice that it fails to learn the continuous Earth rotation as learned by EarthGAN with a SPADE backbone. It remains future work to understand why a similar GAN backbone fails to learn the continuous Earth rotation.

Choice of the Time Encoding Scheme We have demonstrated in Chapter 4 how


the choice of shape representation can be crucial for reconstruction quality and gen-
eralizability, because different representations (e.g., depth maps) have semantics built

Figure 5-10: Ablation studies of EarthGAN's design choices. Rows (I)–(II) show two examples; columns are (A) ours using StyleGAN, (B) ours using time encoding I, (C) ours using time encoding II, (D) ours without the Moon observation, (E) ours, and (F) ground truth. (A) EarthGAN with a StyleGAN backbone fails to learn the continuous Earth rotation (not pictured here). (B, C) Other timestamp encoding alternatives may lead to inaccurate continental/oceanic patterns. (D, E, F) Not observing the Moon at all still produces photorealistic synthesis but hurts the reconstruction accuracy per the quantitative evaluation (Table 5.1).

in and are more suitable for a specific task (e.g., 3D surface completion) than other
representations. Similarly, the timestamp representation—how we encode timestamp
strings into numerical values—is crucial for EarthGAN’s performance because dif-
ferent encoding schemes have different semantics built in: For instance, encoding
January 1 to 0 and January 31 to 1 provides semantics about the periodicity and boundedness of the day of month.

In this experiment, we evaluate three timestamp encoding schemes: I) a single


scalar obtained by encoding the first second of the first day of all timestamps to 0
and the final second of the final day to 1, II) a 2-vector containing time of day and
day of year, obtained by encoding 00:00:00 to 0 and 23:59:59 to 1, and encoding
January 1 to 0 and December 31 to 1, and III) a 3-vector containing time of day,
month of year, and day of month, obtained by the same method as in II) but further
splitting day of year into day of month and month of year. EarthGAN eventually
uses III) because of its superior performance over the other two schemes as shown in
Table 5.1.

5.5 Conclusion
We have presented Generative Adversarial Networks for the Earth (Earth-
GANs) for recovering the Earth appearance, as the light source, from a single-pixel
observation of the Moon. Specifically, EarthGAN takes as input the timestamp and a
single-pixel observation (mean color) of the Moon, and then outputs an Earth image
that is likely responsible for the Moon observation. EarthGAN learns the strong reg-
ularities present in the Earth images (captured by a spacecraft camera) and produces
photorealistic Earth images indistinguishable from real photographs. Importantly,
EarthGAN learns the smooth evolution of the Earth appearance due to the underly-
ing Earth rotation, despite having seen only discrete snapshots.
The main idea behind EarthGAN is learning the strong priors on what an Earth
image should look like from a collection of around 23,000 Earth images, their cor-
responding Moon renders due to the Earth as the light source, and the timestamps.
This data-driven approach potentially allows one to use as input single images taken
with a mobile phone camera. Conditioned on the timestamp and just the average
color of the Moon observation, EarthGAN recovers the Earth as the light source ac-
curately and learns to control the unexplained Earth appearance variations with a
randomness vector, with which EarthGAN learns to associate cloud patterns.
Although EarthGAN achieves promising results, it is not without limitations.
Firstly, the imaging system has been simplified such that the Earth directly serves as
the light source, emitting light from a monitor-like plane, whereas in reality, light from
the Sun hits the Earth and gets reflected, in a spatially-varying manner (e.g., land
and ocean have different reflectance properties), to the Moon. Other simplifications
include that the Moon is modeled as a Lambertian sphere in the scene without using
its actual topography and reflectance. Secondly, we compute the average Moon color
using all of the Moon pixels, while in practice, only the dark side of the Moon should
be used since the bright side is also lit by the Sun. Yet the bright side may provide
useful calibration signals for the phone camera at hand. Finally, it remains unclear
whether EarthGAN can be readily applied to real-world images.

Chapter 6

Conclusion & Discussion

In this dissertation, we have presented the broad problem of inverse rendering and
further discussed four subtopics thereof: I) joint shape, reflectance, and lighting
from appearance (Chapter 2), II) light transport function from appearance (Chapter 3), III) shape from appearance (Chapter 4), and IV) lighting from appearance (Chapter 5).
These four instances represent three levels of abstraction to tackle inverse
rendering. I) At a low level of abstraction, we have proposed methods that fully fac-
torize the object appearance into shape, reflectance, and illumination, which then get
re-rendered back to the RGB images in a physically-based manner (though simpli-
fied) [Srinivasan et al., 2021, Zhang et al., 2021c]. Although challenging, such low-level
decomposition explicitly solves for every term in the rendering equation, thereby sup-
porting further applications that a mid- or high-level solution is incapable of, such as
editing and exporting of geometry or material. II) At a middle level, we have shown
how to interpolate the light transport function from sparse samples thereof to enable
relighting, view synthesis, or both tasks simultaneously [Sun et al., 2020, Zhang et al.,
2021b]. This abstraction level properly conceals the underlying complex BXDFs and
ray bounces, and suffices for high-quality relighting and view synthesis. III) At a high
level of abstraction, we have trained deep learning models to learn direct mappings
from a single image to shape (Chapter 4) [Sun et al., 2018b, Wu et al., 2018, Zhang
et al., 2018b] or lighting (Chapter 5), without modeling the other scene constituents or

the rendering process. Relying on data-driven priors learned from large-scale datasets,
these high-level methods circumvent the need for exhaustive modeling of the image
formation process and enable applications to single images.

Next, we outline some high-level challenges and future directions around these
four subtopics of this dissertation.

While our models achieved full appearance decomposition (Chapter 2), we made
many simplifying assumptions about the scene elements and the rendering process
itself. For instance, Neural Reflectance and Visibility Fields (NeRV) assumes known
lighting [Srinivasan et al., 2021], Neural Factorization of Shape and Reflectance (NeR-
Factor) assumes direct illumination only [Zhang et al., 2021c], and both NeRV and
NeRFactor consider just Bidirectional Reflectance Distribution Functions (BRDFs;
cf. the more general BXDFs) and non-emissive objects. Furthermore, both methods
follow the trend of expressing everything with function approximators such as Multi-
Layer Perceptrons (MLPs). It remains unclear what the optimal way is to export
these “neural models” into a traditional graphics pipeline. One straightforward way
is meshing the geometry and converting the neural reflectance into an analytic model
(or even producing a reflectance look-up table by repeatedly querying the trained
networks), but this approach is clearly suboptimal and may defeat the purpose of
using neural models in the first place.

Although we have demonstrated how to interpolate the light transport function


for high-quality relighting and view synthesis (Chapter 3), both Light Stage Super-
Resolution (LSSR) and Neural Light Transport (NLT) require a capturing device as
expensive as a light stage [Sun et al., 2020, Zhang et al., 2021b]. It remains a challenge to achieve the same quality of results from just casual captures, possibly by a
mobile phone camera. Furthermore, similar to all other “neural rendering” approaches
[Lombardi et al., 2018, 2019, Thies et al., 2019, Sitzmann et al., 2019a, Mildenhall
et al., 2020], NLT learns one model per-subject and does not generalize to novel
subjects [Zhang et al., 2021b]. This issue of generalization is a gap to be closed in
matching neural rendering to traditional rendering, where the renderer is often entirely general and transfers to novel scenes without issue.

In single-image 3D shape reconstruction (Chapter 4), we have mostly used voxels
as our shape representation, and the network architectures are designed or selected ac-
cordingly [Sun et al., 2018b, Wu et al., 2018, Zhang et al., 2018b]. Recently, there have
been significant advances in representing shapes using implicit functions [Sitzmann
et al., 2019a,b, Park et al., 2019a, Sitzmann et al., 2020, Mildenhall et al., 2020], as
discussed in Section 1.1.1. This representation switch may lead to a paradigm shift
in how we design the shape reconstruction networks and shape operations such as
ray casting. Despite a potential paradigm shift, we believe what we learned from
Generalizable Reconstruction (GenRe) is transferable to future methods: One should
hardcode the physical processes that we understand well, such as geometric projec-
tions, instead of learning them from scratch for better generalizability [Zhang et al.,
2018b].
For data-driven recovery of the Earth appearance from the Moon observation
(Chapter 5), we demonstrated only simulation results and acknowledge that there may
be additional practical challenges in applying Generative Adversarial Networks for
the Earth (EarthGANs) to real-world images of the Moon. For example, in practice,
useful signals lie only in the dark region of the Moon since the rest is Sun-lit with
the weak signals from the Earth overwhelmed by the Sun illumination. As such, one
might need a specific phase of the Moon to perform this data-driven “imaging.” In
addition, because we will be capturing the dark side of the Moon, there might be
challenges in capturing the dark region with a high signal-to-noise ratio. That said,
EarthGAN is still promising given that it requires only a single-pixel observation of
the Moon, thanks to the high-level data-driven approach.
As a closing remark, we have witnessed the paradigm shift of personal computing
devices, which brought us to the current era of pervasive mobile phones and laptops.
Will Extended Reality (XR) be the next mode of working, gaming, communicating,
etc.? Either way, we hope that this dissertation contributes to the upcoming new
technology, by accelerating and democratizing 3D content capture and creation.

Appendix A

Supplement: Neural Reflectance and


Visibility Fields (NeRV)

This appendix chapter contains additional implementation details for Neural Re-
flectance and Visibility Fields (NeRV) [Srinivasan et al., 2021] and additional quali-
tative results from the experiments discussed in Chapter 2.
Please view our supplementary video for a brief overview of NeRV, qualitative
results with smoothly-moving novel light and camera paths, and demonstrations of
additional graphics applications.

A.1 BRDF Parameterization

We use the microfacet Bidirectional Reflectance Distribution Function (BRDF) de-


scribed by Walter et al. [2007] as our reflectance function and incorporate some of
the simplifications discussed in the BRDF implementations of the Filament [Guy and
Agopian, 2018] and Unreal Engine [Karis and Games, 2013] rendering engines. The
BRDF 𝑅(x, 𝜔𝑖 , 𝜔𝑜 ) that we use is defined for any 3D location x, incoming lighting
direction 𝜔𝑖 , and outgoing reflection direction 𝜔𝑜 as:

\[
R(\mathbf{x}, \omega_i, \omega_o) = \frac{D(\mathbf{h}, \mathbf{n}, \gamma)\, F(\omega_i, \mathbf{h})\, G(\omega_i, \omega_o, \gamma)}{4\,(\mathbf{n} \cdot \omega_o)} + (\mathbf{n} \cdot \omega_i)\bigl(1 - F(\omega_i, \mathbf{h})\bigr)\left(\frac{\mathbf{a}}{\pi}\right), \tag{A.1}
\]
\[
D(\mathbf{h}, \mathbf{n}, \gamma) = \frac{\rho^2}{\pi\bigl((\mathbf{n} \cdot \mathbf{h})^2 (\rho^2 - 1) + 1\bigr)^2}, \tag{A.2}
\]
\[
F(\omega_i, \mathbf{h}) = F_0 + (1 - F_0)\bigl(1 - (\omega_i \cdot \mathbf{h})\bigr)^5, \tag{A.3}
\]
\[
G(\omega_i, \omega_o, \gamma) = \frac{(\mathbf{n} \cdot \omega_o)(\mathbf{n} \cdot \omega_i)}{\bigl((\mathbf{n} \cdot \omega_o)(1 - k) + k\bigr)\bigl((\mathbf{n} \cdot \omega_i)(1 - k) + k\bigr)}, \tag{A.4}
\]
\[
\rho = \gamma^2, \qquad \mathbf{h} = \frac{\omega_o + \omega_i}{\lVert \omega_o + \omega_i \rVert}, \qquad k = \frac{\gamma^4}{2}, \tag{A.5}
\]

where a is the diffuse albedo, 𝛾 is the roughness, and n is the surface normal at 3D
point x. We use 𝐹0 = 0.04, which is the typical value of dielectric (non-conducting)
materials. Note that our definition of the BRDF includes the multiplication by the
Lambert cosine term (n · 𝜔𝑖 ) in order to simplify the equations in Chapter 2.
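
For readers who prefer code, the sketch below is a direct NumPy transcription of Equations A.1 through A.5 for a single surface point and direction pair; the actual implementation evaluates the same expressions on batches of network-predicted albedo, roughness, and normals.

import numpy as np

def microfacet_brdf(albedo, roughness, normal, w_i, w_o, f0=0.04):
    # albedo: (3,) diffuse albedo a; roughness: scalar gamma;
    # normal, w_i, w_o: unit 3-vectors (surface normal, incoming light direction,
    # outgoing reflection direction). Returns the RGB BRDF value of Eq. A.1,
    # which includes the Lambert cosine term (n . w_i).
    h = w_i + w_o
    h = h / np.linalg.norm(h)                      # half vector (Eq. A.5)
    rho = roughness ** 2                           # Eq. A.5
    k = roughness ** 4 / 2.0                       # Eq. A.5

    n_h, n_i, n_o, i_h = (np.dot(normal, h), np.dot(normal, w_i),
                          np.dot(normal, w_o), np.dot(w_i, h))

    d = rho ** 2 / (np.pi * ((n_h ** 2) * (rho ** 2 - 1.0) + 1.0) ** 2)   # Eq. A.2
    f = f0 + (1.0 - f0) * (1.0 - i_h) ** 5                                # Eq. A.3
    g = (n_o * n_i) / ((n_o * (1.0 - k) + k) * (n_i * (1.0 - k) + k))     # Eq. A.4

    return d * f * g / (4.0 * n_o) + n_i * (1.0 - f) * (albedo / np.pi)   # Eq. A.1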

A.2 Additional Qualitative Results


Figure A-1 and Figure A-2 show additional renderings from NeRV and other baseline
methods. We see that NeRV is able to recover effective relightable 3D scene repre-
sentations from images of scenes with complex illumination conditions. Prior work
such as Bi et al. [2020a] is unable to recover accurate representations from images
with lighting conditions more complex than a single point light. Latent code meth-
ods (representative of “NeRF in the Wild” [Martin-Brualla et al., 2021]) are unable
to generalize to simulate lighting conditions unlike those seen during training.

A.3 Limitations
Recovering a NeRV is a straightforward optimization problem: We optimize the pa-
rameters of the Multi-Layer Perceptrons (MLPs) that comprise a NeRV scene repre-
sentation to minimize the error of re-rendering the input images. NeRV currently does
not incorporate any priors into the optimization problem, so a promising direction for
future work would be to integrate priors on geometry and reflectance (such as learned
priors or simple hand-crafted priors to encourage smooth geometry or reflectance pre-
dictions) into the NeRV optimization so that a relightable 3D scene representation

Figure A-1: NeRV vs. NLT (rows: ground truth, NeRV [ours], and NLT). Neural Light Transport (NLT) [Zhang et al., 2021b] uses a controlled laboratory lighting setup with eight times as many images as used by NeRV, and an input proxy geometry (which is recovered by training a NeRF on a set of images with fixed illumination). The artifacts seen in the shadows of NLT's renderings demonstrate the difference between recovering geometry that works well for view synthesis (as NLT does) and recovering geometry that works well for both view synthesis and relighting (as NeRV does).

Figure A-2: Additional results and baseline comparisons for NeRV (row labels: Ground Truth, NeRV [Ours], Bi et al., NeRF + LE, NeRF + Env; training illuminations: Single Point, Colorful + Point, Ambient + Point). NeRV is able to render convincing images from novel viewpoints under novel lighting conditions. The method of Bi et al. [2020a] is unable to recover accurate models when trained with illumination more complex than a single point light (Columns 3–6). Methods that use latent codes to explain variation in appearance due to lighting (NeRF+LE and NeRF+Env) are unable to generalize to lighting conditions different than those seen during training.

could be recovered from fewer viewpoints or fewer observed lighting conditions.
Successfully recovering a NeRV representation relies on jointly optimizing the
geometry, reflectance, and visibility MLPs. We have noticed failure cases where the
reflectance MLP seems to converge faster than the geometry and visibility MLPs and
is stuck in a local minimum. For example, in cases where the scene is observed under
very few illumination conditions, the reflectance MLP sometimes quickly converges
to include shadows and light tints in the recovered albedo, and is not able to recover
even after the visibility MLP catches up to correctly explain those shadows. Further
investigations into the optimization landscape and dynamics of NeRV could help shed
light on this issue.
Finally, the NeRV optimization problem trains a geometry MLP along with a
visibility MLP that is meant to approximate integrals of the geometry MLP’s output.
Although we impose a loss that encourages these two MLPs to be consistent with
each other, there is no guarantee that these two MLPs will be exactly consistent.
Investigating potential strategies to enforce such consistency may be helpful.

Appendix B

Supplement: Light Stage


Super-Resolution (LSSR)

In this appendix chapter, we provide details on the network architecture and pro-
gressive training scheme of Light Stage Super-Resolution (LSSR) [Sun et al., 2020]
introduced in Chapter 3. We also provide more results and baseline comparisons.

B.1 Progressive Training


To train our model, we use a progressive approach similar to that of Karras et al.
[2018]. Instead of simply training our model in one “stage” to minimize some loss
between the full-resolution output image I (𝜔𝑖 ) and the true image from the light
stage I𝑖 , we train our model using multiple stages in a coarse-to-fine approach, where
our model is progressively trained from low resolutions to high resolutions. To do
this, we use an auxiliary set of 1 × 1 convolutional layers from the decoder branch
of our network that produce a 3-channel image from the higher-dimensional neural
activations at each level of the decoder (see Figure B-1).
Let I (𝜔𝑖 , 𝑑) be the auxiliary predicted image for each level 𝑑 and the full-resolution
“auxiliary” image at the very end of the decoder be just the final predicted image itself:
I (𝜔𝑖 ) = I (𝜔𝑖 , 0). Here 𝑑 simultaneously indicates the depth of our encoder/decoder,
the stage of our progressive training, and the degree of spatial downsampling. During

Figure B-1: Network architecture and progressive training scheme of LSSR. The 𝛼𝑑
parameters control the progressive training and growing of the network for each scale
𝑑 of the network by modulating the resolution at which input images are used, and
output images are compared to the ground truth.

the 𝑑’th stage of training, we use a convex combination of the auxiliary image at level
𝑑 and an upsampled version of the auxiliary image at level 𝑑 + 1 as the current model
prediction. Our loss in stage 𝑑 is imposed between that combined image and the true
image, downsampled to the native resolution of level 𝑑 of our network. This approach
ensures that the internal activation of our decoder at level 𝑑 is sufficient to enable the
reconstruction of an accurate RGB image (via the auxiliary branch), which means
that the training of stage 𝑑 results in network weights that are well-suited to initialize
the as-yet-untrained model weights on level 𝑑 − 1 of the decoder in the next stage.

Formally, our loss at level 𝑑 is:

\[
\mathcal{L}_d = \bigl\lVert \mathcal{D}(\mathbf{I}_i, d) - \bigl(\alpha_d\, \mathbf{I}(\omega_i, d) + (1 - \alpha_d)\, \mathcal{U}\bigl(\mathbf{I}(\omega_i, d + 1), 1\bigr)\bigr) \bigr\rVert_1, \tag{B.1}
\]

where 𝒟(·, 𝑑) is bilinear downsampling by a factor of 2^𝑑, and 𝒰(·, 𝑑) is bilinear upsampling by a factor of 2^𝑑. When computing the loss over the image, we mask out
pixels that are known to belong to the background of the subject. For each stage,
the blending factor 𝛼𝑑 is linearly interpolated from 0 to 1, which means that at the
beginning of that stage’s training the loss is imposed entirely on an upsampled ver-

sion of the last stage’s predicted image, but at the end of that stage’s training the
loss is imposed entirely on the current stage’s predicted image. These 𝛼𝑑 factors also
modulate the input to the encoder: As indicated in Figure B-1, the input to each
level of the encoder is a weighted average of the output from the earlier level and
a downsampled version of the input images. This means that the annealing of each
𝛼𝑑 value has a similar effect on the progressive growing of the encoder as it does for
the decoder: The deeper layers of the decoder are trained first using downsampled
images, and then each finer layer of the decoder is added and blended in at each stage
of training.
Our model is trained using a single optimizer instance with four stages, each of
which corresponds to a spatial scale. For the first three stages, we train the model
in two parts: 30,000 iterations at that stage’s spatial resolution, followed by 20,000
iterations as 𝛼𝑑 is linearly interpolated from that scale to the next. At our final stage,
we train the model for 50,000 iterations. At each stage 𝑑, our model minimizes only
ℒ𝑑 . Note that this gradual annealing of each 𝛼𝑑 during each scale means that the loss
is always a continuous function of the optimization iteration, as ℒ𝑑 at the beginning
of training for stage 𝑑 equals ℒ𝑑+1 at the end of training for stage 𝑑 + 1. In total, we
train our network for 200,000 iterations.
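
Putting Equation B.1 and the annealing schedule together, a minimal sketch of the stage-𝑑 loss could look as follows. Tensor shapes are illustrative, and the actual implementation also uses the same 𝛼𝑑 to modulate the encoder inputs, which is not shown here.

import torch
import torch.nn.functional as F

def stage_loss(pred_d, pred_d_plus_1, target_full, mask_full, d, alpha_d):
    # pred_d:        auxiliary prediction at level d,     shape (B, 3, H/2^d, W/2^d)
    # pred_d_plus_1: auxiliary prediction at level d + 1, shape (B, 3, H/2^(d+1), W/2^(d+1))
    # target_full:   ground-truth light stage image,      shape (B, 3, H, W)
    # mask_full:     foreground mask,                     shape (B, 1, H, W)
    # alpha_d:       blending weight, annealed linearly from 0 to 1 within stage d
    down = lambda x: F.interpolate(x, scale_factor=1.0 / 2 ** d, mode="bilinear",
                                   align_corners=False)
    target, mask = down(target_full), down(mask_full)
    upsampled = F.interpolate(pred_d_plus_1, scale_factor=2, mode="bilinear",
                              align_corners=False)
    blended = alpha_d * pred_d + (1.0 - alpha_d) * upsampled
    return torch.mean(mask * torch.abs(target - blended))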

B.2 Baseline Comparisons


In Figure B-2, we present the full-resolution comparisons between our model and the
baselines.

Figure B-2: More comparisons between LSSR and the baselines. Each example compares (a) ground truth, (b) ours, (c) linear blending, (d) Fuchs et al. [2007], (e) photometric stereo, (f) Xu et al. [2018] with optimal samples, (g) Xu et al. [2018] with adaptive samples, and (h) Meka et al. [2019], at full resolution.
Appendix C

Supplement: Pix3D

In this appendix chapter, we elaborate on how we evaluate single-image 3D shape re-


construction algorithms, compare the three shape metrics in terms of nearest neighbor
retrieval, and provide more sample data of Pix3D [Sun et al., 2018b], the real-world
dataset of 3D shapes and their (pixel-aligned) images built by ourselves in Chapter 4.

C.1 Evaluation Procedure

Here we explain in detail our evaluation protocol for single-image 3D shape reconstruc-
tion algorithms. As different voxelization methods may result in objects of different
scales in the voxel grid, for a fair comparison, we preprocess all voxels and point
clouds before calculating Intersection over Union (IoU), Chamfer Distance (CD), and
Earth Mover’s Distance (EMD).
For IoU, we first find the bounding box of the object with a threshold of 0.1,
pad the bounding box into a cube, and then use trilinear interpolation to resample
the cube to the desired resolution (323 ). Some algorithms reconstruct shapes at a
resolution of 1283 . In this case, we first apply a 4× max pooling before trilinear
interpolation because without the max pooling, the sampling grid can be too sparse
to capture thin structures. After the resampling of both the output voxel and the
ground-truth voxel, we search for the optimal threshold that maximizes the average
IoU score over all objects, from 0.01 to 0.50 with a step size of 0.01.
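
A sketch of the threshold search (after both voxel grids have been resampled to 32³ as described above) follows. Whether the swept threshold is also applied to the resampled ground truth is an implementation detail; here it is applied to both grids.

import numpy as np

def voxel_iou(pred, gt, threshold):
    # Binarize both resampled grids at the given threshold and compute IoU.
    p, g = pred > threshold, gt > threshold
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / max(union, 1)

def best_average_iou(pred_voxels, gt_voxels):
    # Sweep the threshold from 0.01 to 0.50 with a step of 0.01 and return the
    # threshold that maximizes the average IoU over all objects.
    thresholds = np.arange(0.01, 0.51, 0.01)
    scores = [np.mean([voxel_iou(p, g, t) for p, g in zip(pred_voxels, gt_voxels)])
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]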

For CD and EMD, we first sample a point cloud from the voxelized reconstructions.
For each shape, we compute its isosurface with a threshold of 0.1 and then sample
1,024 points from the surface. All point clouds are then translated and scaled such
that the bounding box of the point cloud is centered at the origin with its longest
side being 1. We then compute CD and EMD for each pair of point clouds.
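
The normalization and the Chamfer distance can be sketched as follows; several variants of CD exist in the literature, and the sketch below uses one common symmetric form.

import numpy as np

def normalize_point_cloud(points):
    # Center the bounding box at the origin and scale its longest side to 1.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    return (points - (mins + maxs) / 2.0) / (maxs - mins).max()

def chamfer_distance(a, b):
    # Symmetric Chamfer distance between (N, 3) and (M, 3) point clouds.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()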

C.2 Nearest Neighbors With Different Metrics

In Section 4.6.2, we have compared three different metrics using two human studies.
We here compare them in yet another way: For a 3D shape, we retrieve the three
nearest neighbors from Pix3D according to IoU, EMD and CD. As Figure C-1 shows,
EMD and CD perform slightly better than IoU in this task.

Figure C-1: Retrieving nearest neighbors in Pix3D using different metrics (columns: the query shape, followed by the top-3 retrieval results under IoU, EMD, and CD). Here we show the three nearest neighbors retrieved from Pix3D using the three different metrics. EMD and CD work slightly better than IoU.

C.3 Sample Data in Pix3D
We supply more sample data in Figure C-2, Figure C-3, and Figure C-4. Figure C-2
shows that each shape in Pix3D is associated with a rich set of 2D images. Figure C-3
and Figure C-4 show the diversity of 3D shapes and the quality of 2D-3D alignment
in Pix3D.

Figure C-2: Diverse images associated with the same shape in Pix3D. Each row shows a 3D shape followed by pairs of a 2D image and its 2D-3D alignment.

Figure C-3: Sample images and their corresponding shapes in Pix3D. From left to right: 3D shapes, 2D images, and 2D-3D alignment. Rows 1–2 are beds, Rows 3–4 are bookshelves, Rows 5–6 are scanned chairs, Rows 7–8 are chairs whose 3D shapes come from IKEA [Lim et al., 2013], and Rows 9–10 are desks.

Figure C-4: More sample images and their corresponding shapes in Pix3D. From left to right: 3D shapes, 2D images, and 2D-3D alignment. Rows 1–2 are miscellaneous objects, Rows 3–4 are sofas, Rows 5–6 are tables, Rows 7–8 are tools, and Rows 9–10 are wardrobes.

Appendix D

Supplement: Generalizable
Reconstruction (GenRe)

In this appendix chapter, we provide the details about data preparation and model
architecture for Generalizable Reconstruction (GenRe) [Zhang et al., 2018b], intro-
duced in Chapter 4.

D.1 Data Preparation

We describe how we prepare our data for network training and testing.

Scene Setup The camera is fully specified by its azimuth and elevation angles as
its distance from the object is fixed at 2.2, its up vector is always the world +𝑦 axis,
and it always looks at the world origin, where the object center lies. The focal length
of our camera is fixed at 50 mm on a 35 mm film. The depth values are measured
from the camera center (i.e., ray depth), rather than from the image plane.
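
As an illustration of this setup, the camera pose can be constructed from the two angles roughly as follows. This is only a sketch: the spherical-coordinate axis convention is an assumption and may differ from the actual renderer configuration.

import numpy as np

def camera_pose(azimuth_deg, elevation_deg, distance=2.2):
    # Place the camera at the given distance from the world origin and build a
    # look-at rotation with the world +y axis as "up".
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    eye = distance * np.array([np.cos(el) * np.sin(az),
                               np.sin(el),
                               np.cos(el) * np.cos(az)])
    forward = -eye / np.linalg.norm(eye)            # camera looks at the origin
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right = right / np.linalg.norm(right)
    up = np.cross(right, forward)
    # Rows are the camera axes expressed in world coordinates.
    world_to_camera = np.stack([right, up, -forward])
    return eye, world_to_camera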

Rendering We render 20 images of random views (or 200 fixed views in the view-
point study) for each object of interest. To boost the rendering realism and diversity,
we use three types of background: the SUN backgrounds [Xiao et al., 2010], High-
Dynamic-Range (HDR) environment lighting crawled on the web, and pure white

backgrounds. Specifically, for each rendering, we randomly sample a background
type and then a random instance of that type. We use Mitsuba [Jakob, 2010] for our
rendering.

Data Augmentation For network training, we augment our RGB images with
three techniques: color jittering, adding lighting noise, and color normalization. In
color jittering, we multiply the brightness, contrast, and saturation, one by one in a
random order, by a random factor uniformly sampled from [0.6, 1.4]. We then add
AlexNet-style lighting noise [Krizhevsky et al., 2012] and perform the standard color
normalization with statistics derived from the ImageNet dataset [Deng et al., 2009].
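
A sketch of this augmentation pipeline using torchvision is given below. The PCA eigenvalues, eigenvectors, and noise scale are the commonly used ImageNet values associated with AlexNet-style lighting noise; they are assumptions for illustration rather than values taken from our code.

import torch
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

class PCALightingNoise:
    # AlexNet-style lighting noise: add a random combination of the RGB principal
    # components of ImageNet to every pixel of a (3, H, W) tensor in [0, 1].
    eigval = torch.tensor([0.2175, 0.0188, 0.0045])
    eigvec = torch.tensor([[-0.5675, 0.7192, 0.4009],
                           [-0.5808, -0.0045, -0.8140],
                           [-0.5836, -0.6948, 0.4203]])

    def __call__(self, img):
        alpha = torch.randn(3) * 0.1
        rgb_shift = (self.eigvec * (alpha * self.eigval)).sum(dim=1)
        return img + rgb_shift[:, None, None]

augment = transforms.Compose([
    # Brightness/contrast/saturation factors drawn uniformly from [0.6, 1.4],
    # applied by torchvision in a random order.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    PCALightingNoise(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])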

D.2 Model Details


We implement all of our networks in PyTorch 0.3.

D.2.1 Single-View Depth Estimator

We adopt an encoder-decoder architecture, where the encoder is a ResNet-18 [He


et al., 2016] that encodes a 256 × 256 RGB image into 512 feature maps of size 1 × 1.
Specifically, it consists of, in a sequential order:

Conv2d(3, 64, kernel=7, stride=2, pad=3)


BatchNorm2d(64, eps=1e-05, momentum=0.1)
ReLU(inplace)
MaxPool2d(kernel=3, stride=2, pad=1, dilation=1)
BasicBlock(
(conv1): Conv2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)

)
BasicBlock(
(conv1): Conv2d(64, 128, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(64, 128, kernel=1, stride=2)
BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(128, 256, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(128, 256, kernel=1, stride=2)
BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(256, 512, kernel=3, stride=2, pad=1)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(downsample):
Conv2d(256, 512, kernel=1, stride=2)
BatchNorm2d(512, eps=1e-05, momentum=0.1)
)
BasicBlock(
(conv1): Conv2d(512, 512, kernel=3, stride=1, pad=1)

(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1)
).

The decoder is a mirrored version of the encoder, with all convolution layers
replaced by transposed convolution layers. Additionally, we adopt the U-Net structure
[Ronneberger et al., 2015] by feeding the intermediate outputs of each encoder block
to the corresponding decoder block. The decoder outputs an image of relative depth
values in the original view at the same resolution as input. Specifically, the decoder
comprises:
RevBasicBlock(
(deconv1): ConvTranspose2d(512, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(256, 256, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(512, 256, kernel=1, stride=2, out_pad=1)
BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(256, 256, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(256, 256, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(512, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(128, 128, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(512, 128, kernel=1, stride=2, out_pad=1)
BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(128, 128, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(128, 128, kernel=3, stride=1, pad=1)

(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(256, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=2, pad=1, out_pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(256, 64, kernel=1, stride=2, out_pad=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(128, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(upsample):
ConvTranspose2d(128, 64, kernel=1, stride=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
RevBasicBlock(
(deconv1): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1)
(relu): ReLU(inplace)
(deconv2): ConvTranspose2d(64, 64, kernel=3, stride=1, pad=1)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1)
)
ConvTranspose2d(128, 64, kernel=3, stride=2, pad=1, out_pad=1)
BatchNorm2d(64, eps=1e-05, momentum=0.1)
ReLU(inplace)
ConvTranspose2d(64, 1, kernel=8, stride=2, pad=3, out_pad=0).

Relative depth values provided by the predicted depth images are insufficient for
conversions to spherical maps or voxels as there are still two degrees of freedom
undetermined: the minimum and maximum (or equivalently, the scale). Therefore,
we have an additional branch decoding, also from the 512 feature maps, the minimum

and maximum of the depth values. Specifically, this branch decoder contains:

Conv2d(512, 512, kernel=2, stride=2)


Conv2d(512, 512, kernel=4, stride=1)
ViewAsLinear()
Linear(in_features=512, out_features=256, bias=True)
BatchNorm1d(256, eps=1e-05, momentum=0.1)
ReLU(inplace)
Linear(in_features=256, out_features=128, bias=True)
BatchNorm1d(128, eps=1e-05, momentum=0.1)
ReLU(inplace)
Linear(in_features=128, out_features=2, bias=True).

Using the pretrained ResNet-18 as our network initialization, we then train this
network with supervision on both the depth image (relative) and the minimum as well
as maximum values. Under this setup, our network predicts effectively the absolute
depth values of the input view, which allows us to project these depth values to the
spherical representation or voxel grid.
This network is trained with a batch size of 4. We use Adam [Kingma and Ba,
2015] with a learning rate of 1 × 10−3 , 𝛽1 = 0.5, and 𝛽2 = 0.9 for the optimization.

D.2.2 Spherical Map Inpainting Network

Our inpainting network shares the same architecture as the single-view depth estimator.
To mimic the boundary conditions of spherical maps, we use replication padding for
the vertical dimension (elevation) and periodic padding for the horizontal dimension
(azimuth). The padding size is 16 for both dimensions.
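
These boundary conditions can be realized with standard padding primitives applied
before the convolutions; a minimal sketch under our assumptions (the function name is
ours; the padding of 16 follows the text above):

import torch.nn.functional as F

def pad_spherical(x, pad=16):
    """Pad a spherical map of shape (N, C, H, W); H is elevation, W is azimuth.

    The azimuth wraps around, so it is padded periodically (circular), whereas
    the elevation does not, so its boundary rows are replicated instead.
    """
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")   # left/right: azimuth
    x = F.pad(x, (0, 0, pad, pad), mode="replicate")  # top/bottom: elevation
    return x

Padding the azimuth first means the replicated top and bottom rows already include
the wrapped columns, which keeps the corners consistent.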
This network is trained with a batch size of 4. We use Adam with a learning rate
of 1 × 10⁻⁴, 𝛽₁ = 0.5, and 𝛽₂ = 0.9 for the optimization.

D.2.3 Voxel Refinement Network

Our voxel refinement network adopts the U-Net structure [Ronneberger et al., 2015]
and uses a sequence of 3D convolution and transposed convolution layers. The input
tensor of batch size 𝑁 has shape 𝑁 × 2 × 128 × 128 × 128, where one channel contains
voxels projected from the predicted original-view depth map, and the other contains
voxels projected from the inpainted spherical map. After fusion, the output tensor is
of shape 𝑁 × 1 × 128 × 128 × 128. Specifically, the network is structured as:
Unet(
Conv3d_block(
Conv3d(2, 20, kernel=8, stride=2, pad=3)
BatchNorm3d(20, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(20, 40, kernel=4, stride=2, pad=1)
BatchNorm3d(40, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(40, 80, kernel=4, stride=2, pad=1)
BatchNorm3d(80, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(80, 160, kernel=4, stride=2, pad=1)
BatchNorm3d(160, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(160, 320, kernel=4, stride=2, pad=1)
BatchNorm3d(320, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Conv3d_block(
Conv3d(320, 640, kernel=4, stride=1)
BatchNorm3d(640, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
full_conv_block(
Linear(in_features=640, out_features=640, bias=True)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(1280, 320, kernel=4, stride=1)
BatchNorm3d(320, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(640, 160, kernel=4, stride=2, pad=1)
BatchNorm3d(160, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(320, 80, kernel=4, stride=2, pad=1)
BatchNorm3d(80, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(160, 40, kernel=4, stride=2, pad=1)
BatchNorm3d(40, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(80, 20, kernel=8, stride=2, pad=3)
BatchNorm3d(20, eps=1e-05, momentum=0.1)
LeakyReLU(negative_slope=0.01)
)
Deconv3d_skip(
ConvTranspose3d(40, 1, kernel=4, stride=2, pad=1)
)
).
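
The doubled input channel counts of the Deconv3d_skip stages (e.g., 1280 = 640 + 640)
come from concatenating the corresponding encoder feature map with the decoder input
before each transposed convolution. A minimal sketch of such a stage (our
reconstruction; the class name, constructor signature, and the bare final stage are
assumptions):

import torch
import torch.nn as nn

class Deconv3dSkip(nn.Module):
    """One decoder stage of the voxel refinement U-Net: concatenate the encoder
    skip features with the decoder input, then upsample with ConvTranspose3d."""

    def __init__(self, in_ch, out_ch, kernel, stride, pad=0, final=False):
        super().__init__()
        layers = [nn.ConvTranspose3d(in_ch, out_ch, kernel_size=kernel,
                                     stride=stride, padding=pad)]
        if not final:  # the last stage in the listing has no BatchNorm or activation
            layers += [nn.BatchNorm3d(out_ch), nn.LeakyReLU(0.01)]
        self.block = nn.Sequential(*layers)

    def forward(self, x, skip):
        # in_ch counts both inputs, e.g., 640 (decoder) + 640 (encoder) = 1280.
        return self.block(torch.cat([x, skip], dim=1))

# For example, the first decoder stage in the listing above:
stage = Deconv3dSkip(1280, 320, kernel=4, stride=1)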

This network is trained with a batch size of 4. We use Adam with a learning rate
of 1 × 10⁻⁵, 𝛽₁ = 0.5, and 𝛽₂ = 0.9 for the optimization.
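
For reference, these optimizer settings map directly onto PyTorch's Adam; a sketch
(the placeholder module below stands in for the refinement network):

import torch
import torch.nn as nn

model = nn.Conv3d(2, 1, kernel_size=3, padding=1)  # placeholder for the refinement network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.5, 0.9))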

Bibliography

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensor-
Flow: A System for Large-Scale Machine Learning. In USENIX Symposium on
Operating Systems Design and Implementation (OSDI), 2016. 79, 126, 138

Edward H Adelson and James R Bergen. The Plenoptic Function and the Elements
of Early Vision. Computational Models of Visual Processing, 1991. 113, 116

Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. Two-Shot SVBRDF Capture for
Stationary Materials. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
60

Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-Cue Zero-
Shot Learning With Strong Supervision. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2016. 177

Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor.
Tex2Shape: Detailed Full Human Body Geometry From a Single Image. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 116

Stanislaw Antol, C Lawrence Zitnick, and Devi Parikh. Zero-Shot Learning via Visual
Abstraction. In European Conference on Computer Vision (ECCV), 2014. 177

Aayush Bansal and Bryan Russell. Marr Revisited: 2D-3D Alignment via Surface
Normal Prediction. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2016. 176

Jonathan T Barron. A General and Adaptive Robust Loss Function. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 137

Jonathan T Barron and Jitendra Malik. Shape, Illumination, and Reflectance
From Shading. IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 37(8):1670–1687, 2014. 25, 40, 57, 83, 84, 95, 96, 116, 176

Harry G Barrow and Jay M Tenenbaum. Recovering Intrinsic Scene Characteristics
From Images. Computer Vision Systems, 1978. 25, 40, 57, 116, 176

Harry G Barrow, Jay M Tenenbaum, Robert C Bolles, and Helen C Wolf. Parametric
Correspondence and Chamfer Matching: Two New Techniques for Image Matching.
In International Joint Conference on Artificial Intelligence (IJCAI), 1977. 196

Evgeniy Bart and Shimon Ullman. Cross-Generalization: Learning Novel Classes
From a Single Example by Feature Replacement. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2005. 177

Ronen Basri, David Jacobs, and Ira Kemelmacher. Photometric Stereo With General,
Unknown Lighting. International Journal of Computer Vision (IJCV), 72(3):239–
257, 2007. 117

Sean Bell, Kavita Bala, and Noah Snavely. Intrinsic Images in the Wild. ACM
Transactions on Graphics (TOG), 33(4):159, 2014. 40, 176

Dimitri P Bertsekas. A Distributed Asynchronous Relaxation Algorithm for the As-
signment Problem. In IEEE Conference on Decision and Control (CDC), 1985.
197

Sai Bi, Zexiang Xu, Pratul P Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš
Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural
Reflectance Fields for Appearance Acquisition. arXiv, 2020a. 19, 54, 58, 59, 62,
63, 73, 74, 92, 93, 94, 250, 252

Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David
Kriegman, and Ravi Ramamoorthi. Deep Reflectance Volumes: Relightable Re-
constructions From Multi-View Photometric Images. In European Conference on
Computer Vision (ECCV), 2020b. 58

James F Blinn. Models of Light Reflection for Computer Synthesized Pictures. In
SIGGRAPH, 1977. 32

Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. FAUST: Dataset
and Evaluation for 3D Mesh Registration. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2014. 178

Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing
the Latent Space of Generative Networks. In International Conference on Machine
Learning (ICML), 2018. 75

Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik
Lensch. NeRD: Neural Reflectance Decomposition From Image Collections. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 59

Katherine L Bouman, Vickie Ye, Adam B Yedidia, Frédo Durand, Gregory W Wor-
nell, Antonio Torralba, and William T Freeman. Turning Corners Into Cameras:
Principles and Methods. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 222

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary,
Dougal Maclaurin, and Skye Wanderman-Milne. JAX: Composable Transforma-
tions of Python+NumPy Programs. http://github.com/google/jax, 2018. 69

Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and
Discriminative Voxel Modeling With Convolutional Neural Networks. arXiv, 2016.
174

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for
High Fidelity Natural Image Synthesis. In International Conference on Learning
Representations (ICLR), 2018. 224

Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical Geometry
of Non-Rigid Shapes. Springer Science & Business Media, 2008. 178, 194

Gershon Buchsbaum. A Spatial Processor Model for Object Colour Perception. Jour-
nal of the Franklin Institute, 310(1):1–26, 1980. 83

Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen.
Unstructured Lumigraph Rendering. In SIGGRAPH, 2001. 116, 163

Michael Bunnell. Dynamic Ambient Occlusion and Indirect Lighting. In NVIDIA
GPU Gems 2, 2004. 60

Dan A Calian, Jean-François Lalonde, Paulo Gotardo, Tomas Simon, Iain Matthews,
and Kenny Mitchell. From Faces to Outdoor Light Probes. Computer Graphics
Forum (CGF), 37(2):51–61, 2018. 222

Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and
Aaron M Dollar. Benchmarking in Manipulation Research: Using the Yale-CMU-
Berkeley Object and Model Set. IEEE Robotics and Automation Magazine (RAM),
22(3):36–52, 2015. 179

Zhangjie Cao, Qixing Huang, and Karthik Ramani. 3D Object Classification via
Spherical Projections. In International Conference on 3D Vision (3DV), 2017. 177

Joel Carranza, Christian Theobalt, Marcus A Magnor, and Hans-Peter Seidel. Free-
Viewpoint Video of Human Actors. In SIGGRAPH, 2003. 116

Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang,
Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet:
An Information-Rich 3D Model Repository. arXiv, 2015. 170, 171, 174, 175, 178,
182, 187, 194, 198, 203

Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-Image Depth Perception
in the Wild. In Advances in Neural Information Processing Systems (NeurIPS),
2016. 176

Zhang Chen, Anpei Chen, Guli Zhang, Chengyuan Wang, Yu Ji, Kiriakos N Ku-
tulakos, and Jingyi Yu. A Neural Rendering Framework for Free-Viewpoint Re-
lighting. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2020. 58, 117

Zhen Cheng, Zhiwei Xiong, Chang Chen, and Dong Liu. Light Field Super-Resolution:
A Benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition Workshops (CVPRW), 2019. 113

Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A Large Dataset
of Object Scans. arXiv, 2016. 179

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese.
3D-R2N2: A Unified Approach for Single and Multi-View 3D Object Reconstruc-
tion. In European Conference on Computer Vision (ECCV), 2016. 170, 175, 197,
198, 204

Adam Lloyd Cohen. Anti-Pinhole Imaging. Optica Acta: International Journal of
Optics, 29(1):63–67, 1982. 222

Albert Cohen, Ingrid Daubechies, and J-C Feauveau. Biorthogonal Bases of Com-
pactly Supported Wavelets. Communications on Pure and Applied Mathematics,
45(5):485–560, 1992. 137

Daniel Cohen and Zvi Sheffer. Proximity Clouds—An Acceleration Technique for 3D
Grid Traversal. The Visual Computer, 11(1):27–38, 1994. 52

Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Convolutional Networks
for Spherical Signals. arXiv, 2017. 177

Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In
International Conference on Learning Representations (ICLR), 2018. 177, 184

Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Cal-
abrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-Quality Streamable
Free-Viewpoint Video. ACM Transactions on Graphics (TOG), 34(4):1–13, 2015.
127

Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape Completion Using
3D-Encoder-Predictor CNNs and Shape Synthesis. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2017. 170, 175, 177, 198, 202

Paul Debevec. Rendering Synthetic Objects Into Real Scenes: Bridging Traditional
and Image-Based Graphics With Global Illumination and High Dynamic Range
Photography. In SIGGRAPH, 1998. 77, 222

Paul Debevec. The Light Stages and Their Applications to Photoreal Digital Actors.
In SIGGRAPH Asia, 2012. 108, 114

Paul Debevec, Tim Hawkins, Chris Tchou, Haarm-Pieter Duiker, Westley Sarokin,
and Mark Sagar. Acquiring the Reflectance Field of a Human Face. In SIGGRAPH,
2000. 108, 109, 118, 145, 151

Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil Hunt. The
Triangle Processor and Normal Vector Shader: A VLSI System for High Perfor-
mance Graphics. In SIGGRAPH, 1988. 130

Boyang Deng, J P Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton,
Mohammad Norouzi, and Andrea Tagliasacchi. Neural Articulated Shape Approx-
imation. In European Conference on Computer Vision (ECCV), 2020. 58

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A
Large-Scale Hierarchical Image Database. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2009. 137, 266

Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. Appearance-
From-Motion: Recovering Spatially Varying Surface Reflectance Under Unknown
Lighting. ACM Transactions on Graphics (TOG), 33(6):1–12, 2014. 58

Alexey Dosovitskiy and Thomas Brox. Generating Images With Perceptual Similarity
Metrics Based on Deep Networks. In Advances in Neural Information Processing
Systems (NeurIPS), 2016. 176

Alexey Dosovitskiy, Jost Springenberg, Maxim Tatarchenko, and Thomas Brox.
Learning to Generate Chairs, Tables and Cars With Convolutional Networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(4):692–
705, 2017. 195

Frédo Durand, Nicolas Holzschuch, Cyril Soler, Eric Chan, and François X Sillion. A
Frequency Analysis of Light Transport. In SIGGRAPH, 2005. 113

David Eigen and Rob Fergus. Predicting Depth, Surface Normals and Semantic
Labels With a Common Multi-Scale Convolutional Architecture. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2015. 176

David Eigen, Christian Puhrsch, and Rob Fergus. Depth Map Prediction From a
Single Image Using a Multi-Scale Deep Network. In Advances in Neural Information
Processing Systems (NeurIPS), 2014. 116

Martin Eisemann, Bert De Decker, Marcus Magnor, Philippe Bekaert, Edilson
De Aguiar, Naveed Ahmed, Christian Theobalt, and Anita Sellent. Floating Tex-
tures. Computer Graphics Forum (CGF), 27(2):409–418, 2008. 116, 163, 164

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis.
Learning SO(3) Equivariant Representations With Spherical CNNs. In European
Conference on Computer Vision (ECCV), 2018. 177

Haoqiang Fan, Hao Su, and Leonidas Guibas. A Point Set Generation Network for
3D Object Reconstruction From a Single Image. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2017. 175, 197, 198

Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing Objects by
Their Attributes. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2009. 177

Michael Firman. RGBD Datasets: Past, Present and Future. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016.
179

Michael Firman, Oisin Mac Aodha, Simon Julier, and Gabriel J Brostow. Structured
Completion of Unobserved Voxels From a Single Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 174

John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan
Overbeck, Noah Snavely, and Richard Tucker. DeepView: View Synthesis With
Learned Gradient Descent. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2019. 117

Sing Choong Foo. A Gonioreflectometer for Measuring the Bidirectional Reflectance of
Material for Use in Illumination Computation. Master’s thesis, Cornell University,
2015. 60

William T Freeman. The Generic Viewpoint Assumption in a Framework for Visual
Perception. Nature, 368(6471):542, 1994. 214

William T Freeman. TUM AI Lecture Series – The Moon Camera (Bill Freeman).
https://www.youtube.com/watch?v=Ytkkl917paM, 2020. Accessed: 08/25/2021.
222

Martin Fuchs, Hendrik PA Lensch, Volker Blanz, and Hans-Peter Seidel. Superreso-
lution Reflectance Fields: Synthesizing Images for Intermediate Light Directions.
Computer Graphics Forum (CGF), 26(3):447–456, 2007. 115, 146, 147

Christopher Funk and Yanxi Liu. Beyond Planar Symmetry: Modeling Human Per-
ception of Reflection and Rotation Symmetries in the Wild. In IEEE/CVF Inter-
national Conference on Computer Vision (ICCV), 2017. 178

Graham Fyffe. Cosine Lobe Based Relighting From Gradient Illumination Pho-
tographs. In SIGGRAPH Posters, 2009. 118

Duan Gao, Guojun Chen, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. Deferred
Neural Lighting: Free-Viewpoint Relighting From Unstructured Photographs.
ACM Transactions on Graphics (TOG), 39(6):1–15, 2020. 58

Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gam-
baretto, Christian Gagné, and Jean-François Lalonde. Learning to Predict Indoor
Illumination From a Single Image. ACM Transactions on Graphics (TOG), 36(6):
1–14, 2017. 222

Marc-André Gardner, Yannick Hold-Geoffroy, Kalyan Sunkavalli, Christian Gagné,
and Jean-François Lalonde. Deep Parametric Indoor Lighting Estimation. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 116,
223

Gaurav Garg, Eino-Ville Talvala, Marc Levoy, and Hendrik P Lensch. Symmetric
Photography: Exploiting Data-Sparseness in Reflectance Fields. In Eurographics
Symposium on Rendering Techniques (EGSR), 2006. 118

Mathieu Garon, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, and Jean-François
Lalonde. Fast Spatially-Varying Indoor Lighting Estimation. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2019. 223

Kyle Genova, Forrester Cole, Aaron Sarna Daniel Vlasic, William T Freeman, and
Thomas Funkhouser. Learning Shape Templates With Structured Implicit Func-
tions. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
58

Michaël Gharbi, Tzu-Mao Li, Miika Aittala, Jaakko Lehtinen, and Frédo Durand.
Sample-Based Monte Carlo Denoising Using a Kernel-Splatting Network. ACM
Transactions on Graphics (TOG), 38(4):1–12, 2019. 69

Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning
a Predictable and Generative Vector Representation for Objects. In European
Conference on Computer Vision (ECCV), 2016. 175

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets.
In Advances in Neural Information Processing Systems (NeurIPS), 2014. 171, 176,
181, 223, 230

Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The
Lumigraph. In SIGGRAPH, 1996. 116

Paul Green, Jan Kautz, and Frédo Durand. Efficient Reflectance and Visibility Ap-
proximations for Environment Map Rendering. Computer Graphics Forum (CGF),
26(3):495–502, 2007. 60

Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu
Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 175, 198

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron
Courville. Improved Training of Wasserstein GANs. In Advances in Neural Infor-
mation Processing Systems (NeurIPS), 2017. 181, 182

Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen,
Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. The
Relightables: Volumetric Performance Capture of Humans With Realistic Relight-
ing. ACM Transactions on Graphics (TOG), 38(6):1–19, 2019. 115, 118, 127, 128,
129, 140, 149, 153, 165

Romain Guy and Mathias Agopian. Physically Based Rendering in Filament. https:
//google.github.io/filament/Filament.html, 2018. Accessed: 08/25/2021.
249

JunYoung Gwak, Christopher B Choy, Manmohan Chandraker, Animesh Garg, and
Silvio Savarese. Weakly Supervised 3D Reconstruction With Adversarial Con-
straint. In International Conference on 3D Vision (3DV), 2017. 177, 195

Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical Surface Pre-
diction for 3D Object Reconstruction. In International Conference on 3D Vision
(3DV), 2017. 175

Pat Hanrahan and Wolfgang Krueger. Reflection From Layered Surfaces Due to
Subsurface Scattering. In SIGGRAPH, 1993. 32

Bruce Hapke. Bidirectional Reflectance Spectroscopy: 1. Theory. Journal of Geo-
physical Research: Solid Earth, 86(B4):3039–3054, 1981. 219, 226

Bharath Hariharan and Ross Girshick. Low-Shot Visual Recognition by Shrinking
and Hallucinating Features. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 177

Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vi-
sion. Cambridge University Press, 2004. 36, 59, 116

Samuel W Hasinoff, Anat Levin, Philip R Goode, and William T Freeman. Diffuse
Reflectance Imaging With Astronomical Applications. In IEEE/CVF International
Conference on Computer Vision (ICCV), 2011. 222

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep Into
Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2015. 125

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning
for Image Recognition. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2016. 133, 180, 185, 266

Felix Heide, Lei Xiao, Wolfgang Heidrich, and Matthias B Hullin. Diffuse Mirrors:
3D Reconstruction From Diffuse Indirect Illumination Using Inexpensive Time-
of-Flight Sensors. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2014. 221

Tomáš Hodan, Pavel Haluza, Štepán Obdržálek, Jiri Matas, Manolis Lourakis, and
Xenophon Zabulis. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-
Less Objects. In IEEE/CVF Winter Conference on Applications of Computer Vi-
sion (WACV), 2017. 179

Yannick Hold-Geoffroy, Kalyan Sunkavalli, Sunil Hadap, Emiliano Gambaretto, and
Jean-François Lalonde. Deep Outdoor Illumination Estimation. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 222

Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring Se-
mantic Layout for Hierarchical Text-to-Image Synthesis. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2018. 224

Berthold K P Horn. Shape From Shading: A Method for Obtaining the Shape of a
Smooth Opaque Object From One View. Technical report, Massachusetts Institute
of Technology, 1970. 25, 57

Berthold K P Horn. Determining Lightness From an Image. Computer Graphics and
Image Processing, 3(4):277–299, 1974. 57, 222

Berthold K P Horn. Obtaining Shape From Shading Information. The Psychology of
Computer Vision, pages 115–155, 1975. 42

Berthold K P Horn and Michael J Brooks. Shape From Shading. MIT press, 1989.
176

Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural
Networks, 4(2):251–257, 1991. 112

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedforward
Networks Are Universal Approximators. Neural Networks, 2(5):359–366, 1989. 30

Qixing Huang, Hai Wang, and Vladlen Koltun. Single-View Reconstruction via Joint
Analysis of Image and Shape Collections. ACM Transactions on Graphics (TOG),
34(4):87, 2015. 175

Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal Unsupervised
Image-to-Image Translation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2018. 224

Zhuo Hui, Kalyan Sunkavalli, Joon-Young Lee, Sunil Hadap, Jian Wang, and Aswin C
Sankaranarayanan. Reflectance Capture Using Univariate Sampling of BRDFs. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2017. 60

David S Immel, Michael F Cohen, and Donald P Greenberg. A Radiosity Method for
Non-Diffuse Environments. In SIGGRAPH, 1986. 37

Apple Inc. Use Portrait Mode on Your iPhone. https://support.apple.com/en-
us/HT208118, 2017. 115

Apple Inc. Introducing Object Capture. https://developer.apple.com/
augmented-reality/object-capture/, 2021. Accessed: 08/21/2021. 25

Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning Visual
Groups From Co-Occurrences in Space and Time. In International Conference on
Learning Representations Workshops (ICLRW), 2016. 176

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-Image Trans-
lation With Conditional Adversarial Networks. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2017. 224, 233

Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard A Newcombe,
Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew J Davi-
son, and Andrew W Fitzgibbon. KinectFusion: Real-Time 3D Reconstruction and
Interaction Using a Moving Depth Camera. In ACM Symposium on User Interface
Software and Technology (UIST), 2011. 176, 178

Varun Jain and Hao Zhang. Robust 3D Shape Correspondence in the Spectral Do-
main. In IEEE International Conference on Shape Modeling and Applications
(SMI), 2006. 196

Wenzel Jakob. Mitsuba Renderer. http://www.mitsuba-renderer.org, 2010. 194,
266

Michael Janner, Jiajun Wu, Tejas Kulkarni, Ilker Yildirim, and Joshua B Tenenbaum.
Self-Supervised Intrinsic Image Decomposition. In Advances in Neural Information
Processing Systems (NeurIPS), 2017. 40, 176

Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T Barron, Mario Fritz, Kate
Saenko, and Trevor Darrell. A Category-Level 3-D Object Dataset: Putting the
Kinect to Work. In IEEE/CVF International Conference on Computer Vision
Workshops (ICCVW), 2011. 178

Dinesh Jayaraman, Ruohan Gao, and Kristen Grauman. ShapeCodes: Self-
Supervised Feature Learning by Lifting Views to Viewgrids. In European Con-
ference on Computer Vision (ECCV), 2018. 178, 198

Henrik Wann Jensen, Stephen R Marschner, Marc Levoy, and Pat Hanrahan. A
Practical Model for Subsurface Light Transport. In SIGGRAPH, 2001. 32

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time
Style Transfer and Super-Resolution. In European Conference on Computer Vision
(ECCV), 2016. 176

James T Kajiya. The Rendering Equation. In SIGGRAPH, 1986. 37, 128

James T Kajiya and Brian P Von Herzen. Ray Tracing Volume Densities. SIG-
GRAPH, 1984. 62, 63

Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-Based
View Synthesis for Light Field Cameras. ACM Transactions on Graphics (TOG),
35(6):1–10, 2016. 113

Yoshihiro Kanamori and Yuki Endo. Relighting Humans: Occlusion-Aware Inverse
Rendering for Full-Body Human Images. ACM Transactions on Graphics (TOG),
37(6):1–11, 2018. 116

Kaizhang Kang, Zimin Chen, Jiaping Wang, Kun Zhou, and Hongzhi Wu. Effi-
cient Reflectance Capture Using an Autoencoder. ACM Transactions on Graphics
(TOG), 37(4):1–10, 2018. 114

Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-
Specific Object Reconstruction From a Single Image. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2015. 175

Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. Learning to Gen-
erate Images of Outdoor Scenes From Attributes and Semantic Layouts. arXiv,
2016. 224

Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. Manipulating
Attributes of Natural Scenes via Hallucination. ACM Transactions on Graphics
(TOG), 39(1):1–17, 2019. 224

Brian Karis and Epic Games. Real Shading in Unreal Engine 4. Proc. Physically
Based Shading Theory Practice, 4(3), 2013. 249

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of
GANs for Improved Quality, Stability, and Variation. In International Conference
on Learning Representations (ICLR), 2018. 126, 255

Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for
Generative Adversarial Networks. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2019. 221, 242

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo
Aila. Analyzing and Improving the Image Quality of StyleGAN. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 242

Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte,
Michael Sittig, and David Forsyth. Automatic Scene Inference for 3D Object Com-
positing. ACM Transactions on Graphics (TOG), 33(3):1–15, 2014. 223

Michael Kazhdan and Hugues Hoppe. Screened Poisson Surface Reconstruction. ACM
Transactions on Graphics (TOG), 32(3):29, 2013. 174
Michael Kazhdan, Bernard Chazelle, David Dobkin, Adam Finkelstein, and Thomas
Funkhouser. A Reflective Symmetry Descriptor. In European Conference on Com-
puter Vision (ECCV), 2002. 177
Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Symmetry De-
scriptors and 3D Shape Matching. In Symposium on Geometry Processing (SGP),
2004. 177
Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson Surface Reconstruc-
tion. In Symposium on Geometry Processing (SGP), 2006. 174
Sean Kelly, Samantha Cordingley, Patrick Nolan, Christoph Rhemann, Sean Fanello,
Danhang Tang, Jude Osborn, Jay Busch, Philip Davidson, Paul Debevec, et al.
AR-ia: Volumetric Opera for Mobile Augmented Reality. In SIGGRAPH Asia
XR, 2019. 108
Markus Kettunen, Erik Härkönen, and Jaakko Lehtinen. E-LPIPS: Robust Perceptual
Image Similarity via Random Transformation Ensembles. arXiv, 2019. 142
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias
Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian
Theobalt. Deep Video Portraits. ACM Transactions on Graphics (TOG), 37(4):
1–14, 2018. 108, 116, 119
Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
In International Conference on Learning Representations (ICLR), 2015. 69, 79,
126, 138, 182, 270
Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In Interna-
tional Conference on Learning Representations (ICLR), 2014. 223, 231
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples:
Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics
(TOG), 36(4):78, 2017. 179
Vladislav Kreavoy, Dan Julius, and Alla Sheffer. Model Composition From Inter-
changeable Components. In Pacific Conference on Computer Graphics and Appli-
cations (PG), 2007. 196
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification
With Deep Convolutional Neural Networks. In Advances in Neural Information
Processing Systems (NeurIPS), 2012. 266
Zorah Lahner, Emanuele Rodola, Frank R Schmidt, Michael M Bronstein, and Daniel
Cremers. Efficient Globally Optimal 2D-to-3D Deformable Shape Matching. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2016. 178

Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A Large-Scale Hierarchical
Multi-View RGB-D Object Dataset. In IEEE International Conference on Robotics
and Automation (ICRA), 2011. 179

Jean-François Lalonde, Alexei A Efros, and Srinivasa G Narasimhan. Estimating
Natural Illumination From a Single Outdoor Image. In IEEE/CVF International
Conference on Computer Vision (ICCV), 2009. 222

Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to Detect
Unseen Object Classes by Between-Class Attribute Transfer. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2009. 177

Edwin H Land and John J McCann. Lightness and Retinex Theory. Journal of the
Optical Society of America, 61(1):1–11, 1971. 57, 83

Martin Laurenzis, Andreas Velten, and Jonathan Klein. Dual-Mode Optical Sensing:
Three-Dimensional Imaging and Seeing Around a Corner. Optical Engineering, 56
(3):031202, 2016. 221

Jason Lawrence, Szymon Rusinkiewicz, and Ravi Ramamoorthi. Efficient BRDF
Importance Sampling Using a Factored Representation. ACM Transactions on
Graphics (TOG), 23(3):496–505, 2004. 33

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham,
Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang,
et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversar-
ial Network. In IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2017. 176

Chloe LeGendre, Wan-Chun Ma, Graham Fyffe, John Flynn, Laurent Charbonnel,
Jay Busch, and Paul Debevec. DeepLight: Learning Illumination for Unconstrained
Mobile Mixed Reality. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019. 116, 222

Bastian Leibe and Bernt Schiele. Analyzing Appearance and Contour Based Methods
for Object Categorization. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2003. 178

Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An Accurate O(n)
Solution to the PnP Problem. International Journal of Computer Vision (IJCV),
81(2):155, 2009. 191

Kenneth Levenberg. A Method for the Solution of Certain Non-Linear Problems in
Least Squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944. 191

Marc Levoy and Pat Hanrahan. Light Field Rendering. In SIGGRAPH, 1996. 108,
116

Thomas Lewiner, Hélio Lopes, Antônio Wilson Vieira, and Geovan Tavares. Efficient
Implementation of Marching Cubes’ Cases With Topological Guarantees. Journal
of Graphics Tools, 8(2):1–15, 2003. 186, 196

Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nießner. Database-Assisted
Object Retrieval for Real-Time 3D Reconstruction. Computer Graphics Forum
(CGF), 34(2):435–446, 2015. 174

Yikai Li, Jiayuan Mao, Xiuming Zhang, William T Freeman, Joshua B Tenenbaum,
Noah Snavely, and Jiajun Wu. Multi-Plane Program Induction With 3D Box Priors.
In Advances in Neural Information Processing Systems (NeurIPS), 2020a. 116

Yue Li, Pablo Wiedemann, and Kenny Mitchell. Deep Precomputed Radiance Trans-
fer for Deformable Objects. Proceedings of the ACM on Computer Graphics and
Interactive Techniques (PACMCGIT), 2(1):1–16, 2019. 131

Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the Plenoptic
Function. In European Conference on Computer Vision (ECCV), 2020b. 59

Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, Kalyan Sunkavalli, and Manmohan
Chandraker. Learning to Reconstruct Shape and Spatially-Varying Reflectance
From a Single Image. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.
57, 116

Zhengqin Li, Mohammad Shafiei, Ravi Ramamoorthi, Kalyan Sunkavalli, and Man-
mohan Chandraker. Inverse Rendering for Complex Indoor Scenes: Shape,
Spatially-Varying Lighting and SVBRDF From a Single Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2020c. 57, 116,
223

Joseph J Lim, Hamed Pirsiavash, and Antonio Torralba. Parsing IKEA Objects: Fine
Pose Estimation. In IEEE/CVF International Conference on Computer Vision
(ICCV), 2013. 179, 187, 189, 190, 195, 262

Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt.
Neural Sparse Voxel Fields. In Advances in Neural Information Processing Systems
(NeurIPS), 2020. 58

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised Image-to-Image Transla-
tion Networks. In Advances in Neural Information Processing Systems (NeurIPS),
2017. 224

Stephen Lombardi and Ko Nishino. Reflectance and Natural Illumination From a
Single Image. In European Conference on Computer Vision (ECCV), 2012. 97

Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. Deep Appearance
Models for Face Rendering. ACM Transactions on Graphics (TOG), 37(4):1–13,
2018. 114, 117, 118, 166, 246

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas
Lehrmann, and Yaser Sheikh. Neural Volumes: Learning Dynamic Renderable
Volumes From Images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
117, 118, 166, 246
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks
for Semantic Segmentation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2015. 125
William E Lorensen and Harvey E Cline. Marching Cubes: A High Resolution 3D
Surface Construction Algorithm. In SIGGRAPH, 1987. 30, 91, 186
Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and
Paul Debevec. Rapid Acquisition of Specular and Diffuse Normal Maps from Po-
larized Spherical Gradient Illumination. In Eurographics Symposium on Rendering
Techniques (EGSR), 2007. 118
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier Nonlinearities Improve
Neural Network Acoustic Models. In International Conference on Machine Learning
(ICML), 2013. 137
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen
Paul Smolley. Least Squares Generative Adversarial Networks. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 233
Donald W Marquardt. An Algorithm for Least-Squares Estimation of Nonlinear
Parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):
431–441, 1963. 191
David Marr. Vision. W H Freeman and Company, 1982. 31, 172
Stephen R Marschner. Inverse Rendering for Computer Graphics. Cornell University,
1998. 57
Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan
Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter
Lincoln, et al. LookinGood: Enhancing Performance Capture With Real-Time
Neural Re-Rendering. ACM Transactions on Graphics (TOG), 37(6):1–14, 2018.
117
Ricardo Martin-Brualla, Noha Radwan, Mehdi S M Sajjadi, Jonathan T Barron,
Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance
Fields for Unconstrained Photo Collections. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2021. 59, 75, 92, 250
Vincent Masselus, Pieter Peers, Philip Dutré, and Yves D Willemsy. Smooth Recon-
struction and Compact Representation of Reflectance Functions for Image-Based
Relighting. In Eurographics Symposium on Rendering Techniques (EGSR), 2004.
114

Wojciech Matusik, Hanspeter Pfister, Matt Brand, and Leonard McMillan. A Data-
Driven Reflectance Model. ACM Transactions on Graphics (TOG), 22(3):759–769,
2003. 60, 75, 81, 87

Tim Maughan. Virtual Reality: The Hype, the Problems and the
Promise. https://www.bbc.com/future/article/20160729-virtual-reality-
the-hype-the-problems-and-the-promise, 2016. Accessed: 08/04/2021. 25

Nelson Max. Optical Models for Direct Volume Rendering. IEEE Transactions on
Visualization and Computer Graphics (TVCG), 1(2):99–108, 1995. 62, 68

John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison.
SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training
on Indoor Segmentation? In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 176

Abhimitra Meka, Christian Haene, Rohit Pandey, Michael Zollhöfer, Sean Fanello,
Graham Fyffe, Adarsh Kowdle, Xueming Yu, Jay Busch, Jason Dourgarian, et al.
Deep Reflectance Fields: High-Quality Facial Reflectance Field Inference From
Color Gradient Illumination. ACM Transactions on Graphics (TOG), 38(4):1–12,
2019. 108, 115, 117, 119, 139, 143, 146, 147, 148

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which Training Meth-
ods for GANs Do Actually Converge? In International Conference on Machine
Learning (ICML), 2018. 224

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas
Geiger. Occupancy Networks: Learning 3D Reconstruction in Function Space. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019. 58

Moustafa Meshry, Dan B Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey,
Noah Snavely, and Ricardo Martin-Brualla. Neural Rerendering in the Wild. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019. 59

Peyman Milanfar. Super-Resolution Imaging. CRC Press, 2010. 113

Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari,
Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local Light Field Fusion: Practical
View Synthesis With Prescriptive Sampling Guidelines. ACM Transactions on
Graphics (TOG), 38(4):1–14, 2019. 113, 117

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi
Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields
for View Synthesis. In European Conference on Computer Vision (ECCV), 2020.
30, 50, 54, 58, 61, 68, 70, 79, 80, 81, 87, 91, 117, 118, 154, 166, 246, 247

Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. arXiv,
2014. 224

Niloy J Mitra, Leonidas J Guibas, and Mark Pauly. Partial and Approximate Sym-
metry Detection for 3D Geometry. ACM Transactions on Graphics (TOG), 25(3):
560–568, 2006. 174

Takeru Miyato and Masanori Koyama. cGANs With Projection Discriminator. In
International Conference on Learning Representations (ICLR), 2018. 224

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral
Normalization for Generative Adversarial Networks. In International Conference
on Learning Representations (ICLR), 2018. 233

Tomas Möller and Ben Trumbore. Fast, Minimum Storage Ray-Triangle Intersection.
Journal of Graphics Tools, 2(1):21–28, 1997. 29

JF Murray-Coleman and AM Smith. The Automated Measurement of BRDFs and
Their Application to Luminaire Modeling. Journal of the Illuminating Engineering
Society, 19(1):87–99, 1990. 108

Oliver Nalbach, Elena Arabadzhiyska, Dushyant Mehta, H-P Seidel, and Tobias
Ritschel. Deep Shading: Convolutional Neural Networks for Screen Space Shading.
Computer Graphics Forum (CGF), 36(4):65–78, 2017. 130, 149

Giljoo Nam, Joo Ho Lee, Diego Gutierrez, and Min H Kim. Practical SVBRDF Ac-
quisition of 3D Objects With Unstructured Flash Photography. ACM Transactions
on Graphics (TOG), 37(6):1–12, 2018. 58

NASA. DSCOVR: EPIC – Earth Polychromatic Imaging Camera. https://epic.
gsfc.nasa.gov/, 2015. Accessed: 08/05/2021. 225, 226

NASA. CGI Moon Kit. https://svs.gsfc.nasa.gov/cgi-bin/details.cgi?aid=
4720, 2019. Accessed: 08/05/2021. 226

Andrew Nealen, Takeo Igarashi, Olga Sorkine, and Marc Alexa. Laplacian Mesh Opti-
mization. In ACM International Conference on Computer Graphics and Interactive
Techniques in Australasia and Southeast Asia (GRAPHITE), 2006. 174

Thomas Nestmeyer, Iain Matthews, Jean-François Lalonde, and Andreas M
Lehrmann. Structural Decompositions for End-to-End Relighting. arXiv, 2019.
115

Thomas Nestmeyer, Jean-François Lalonde, Iain Matthews, Epic Games, Andreas
Lehrmann, and AI Borealis. Learning Physics-Guided Face Relighting Under Di-
rectional Light. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2020. 117

Ren Ng, Ravi Ramamoorthi, and Pat Hanrahan. All-Frequency Shadows Using Non-
Linear Wavelet Lighting Approximation. ACM Transactions on Graphics (TOG),
22(3):376–381, 2003. 142

Jannik Boll Nielsen, Henrik Wann Jensen, and Ravi Ramamoorthi. On Optimal, Min-
imal Brdf Sampling for Reflectance Acquisition. ACM Transactions on Graphics
(TOG), 34(6):1–11, 2015. 60

Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differen-
tiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D
Supervision. In IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2019. 58

Ko Nishino. Directional Statistics BRDF Model. In IEEE/CVF International Con-
ference on Computer Vision (ICCV), 2009. 97

Ko Nishino and Stephen Lombardi. Directional Statistics-Based Reflectance Model for
Isotropic Bidirectional Reflectance Distribution Functions. Journal of the Optical
Society of America A, 28(1):8–18, 2011. 97

Ko Nishino and Shree K Nayar. Corneal Imaging System: Environment From Eyes.
International Journal of Computer Vision (IJCV), 70(1):23–40, 2006. 222

David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3D Object Categories
by Looking Around Them. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2017. 175

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional Image Syn-
thesis With Auxiliary Classifier GANs. In International Conference on Machine
Learning (ICML), 2017. 224

Michael Oechsle, Songyou Peng, and Andreas Geiger. UNISURF: Unifying Neural Im-
plicit Surfaces and Radiance Fields for Multi-View Reconstruction. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2021. 72

Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kow-
dle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong
Dou, et al. Holoportation: Virtual 3D Teleportation in Real-Time. In ACM Sym-
posium on User Interface Software and Technology (UIST), 2016. 108

Matthew O’Toole and Kiriakos N Kutulakos. Optical Computing for Fast Light
Transport Analysis. ACM Transactions on Graphics (TOG), 29(6):1–12, 2010. 114

Geoffrey Oxholm and Ko Nishino. Multiview Shape and Reflectance From Natural
Illumination. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2014. 19, 58, 84, 96, 97, 98

Rohit Pandey, Anastasia Tkach, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor,
Ricardo Martin-Brualla, Andrea Tagliasacchi, George Papandreou, Philip David-
son, Cem Keskin, Shahram Izadi, and Sean Fanello. Volumetric Capture of Humans
With a Single RGBD Camera via Semi-Parametric Learning. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2019. 108, 117,
119

Rohit Pandharkar, Andreas Velten, Andrew Bardagjy, Everett Lawson, Moungi
Bawendi, and Ramesh Raskar. Estimating Motion and Size of Moving Non-Line-of-
Sight Objects in Cluttered Environments. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2011. 221

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven
Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape
Representation. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2019a. 29, 30, 58, 75, 247

Jeong Joon Park, Aleksander Holynski, and Steve Seitz. Seeing the World in a Bag
of Chips. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2020. 57

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image
Synthesis With Spatially-Adaptive Normalization. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2019b. 221, 224, 225, 230, 231,
233

Pieter Peers, Dhruv K Mahajan, Bruce Lamond, Abhijeet Ghosh, Wojciech Matusik,
Ravi Ramamoorthi, and Paul Debevec. Compressive Light Transport Sensing.
ACM Transactions on Graphics (TOG), 28(1):1–18, 2009. 113

Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning Deep Object
Detectors from 3D Models. In IEEE/CVF International Conference on Computer
Vision (ICCV), 2015. 177

Alex Paul Pentland. A New Sense for Depth of Field. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 9(4):523–531, 1987. 42

Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically Based Rendering: From
Theory to Implementation. Morgan Kaufmann Publishers Inc., 3rd edition, 2016.
51, 52, 108

Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A Efros, and George Drettakis.
Multi-View Relighting Using a Geometry-Aware Network. ACM Transactions on
Graphics (TOG), 38(4):1–14, 2019. 19, 97, 99

Marc Proesmans, Luc Van Gool, and André Oosterlinck. One-Shot Active 3D Shape
Acquisition. In International Conference on Pattern Recognition (ICPR), 1996.
178

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning
on Point Sets for 3D Classification and Segmentation. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2017a. 29, 92

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep
Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in
Neural Information Processing Systems (NeurIPS), 2017b. 29

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation
Learning With Deep Convolutional Generative Adversarial Networks. In Inter-
national Conference on Learning Representations (ICLR), 2016. 171, 176

Gilles Rainer, Wenzel Jakob, Abhijeet Ghosh, and Tim Weyrich. Neural BTF Com-
pression and Interpolation. Computer Graphics Forum (CGF), 38(2):235–244, 2019.
114

Ravi Ramamoorthi and Pat Hanrahan. A Signal-Processing Framework for Inverse
Rendering. In SIGGRAPH, 2001. 57, 85, 113, 219

Ravi Ramamoorthi and Pat Hanrahan. A Signal-Processing Framework for Reflection.
ACM Transactions on Graphics (TOG), 23(4):1004–1042, 2004. 77, 219

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and
Honglak Lee. Generative Adversarial Text to Image Synthesis. In International
Conference on Machine Learning (ICML), 2016. 224

Peiran Ren, Jiaping Wang, Minmin Gong, Stephen Lin, Xin Tong, and Baining Guo.
Global Illumination With Radiance Regression Functions. ACM Transactions on
Graphics (TOG), 32(4):1–12, 2013. 114

Peiran Ren, Yue Dong, Stephen Lin, Xin Tong, and Baining Guo. Image-Based
Relighting Using Neural Networks. ACM Transactions on Graphics (TOG), 34(4):
1–12, 2015. 114, 117

Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jader-
berg, and Nicolas Heess. Unsupervised Learning of 3D Structure From Images. In
Advances in Neural Information Processing Systems (NeurIPS), 2016. 175

Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. OctNetFusion:
Learning Depth Fusion From Data. In International Conference on 3D Vision
(3DV), 2017a. 175

Gernot Riegler, Ali Osman Ulusoys, and Andreas Geiger. OctNet: Learning Deep
3D Representations at High Resolutions. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2017b. 175

Tobias Ritschel, Thorsten Grosch, Jan Kautz, and Stefan Müller. Interactive Illumi-
nation With Coherent Shadow Maps. In Eurographics Symposium on Rendering
Techniques (EGSR), 2007. 60

Tobias Ritschel, Thorsten Grosch, Min H Kim, H-P Seidel, Carsten Dachsbacher,
and Jan Kautz. Imperfect Shadow Maps for Efficient Computation of Indirect
Illumination. ACM Transactions on Graphics (TOG), 27(5):1–8, 2008. 60

Tobias Ritschel, Thomas Engelhardt, Thorsten Grosch, H-P Seidel, Jan Kautz, and
Carsten Dachsbacher. Micro-Rendering for Scalable, Parallel Final Gathering.
ACM Transactions on Graphics (TOG), 28(5):1–8, 2009. 60

Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and
Derek Hoiem. Completing 3D Object Shape From One Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 178

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Net-
works for Biomedical Image Segmentation. In International Conference on Medical
Image Computing and Computer Assisted Intervention (MICCAI), 2015. 119, 136,
186, 268, 270

Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The Earth Mover’s Distance as
a Metric for Image Retrieval. International Journal of Computer Vision (IJCV),
40(2):99–121, 2000. 196

Szymon M Rusinkiewicz. A New Change of Variables for Efficient BRDF Repre-
sentation. In Eurographics Symposium on Rendering Techniques (EGSR), 1998.
76

Ryusuke Sagawa, Hiroshi Kawasaki, Shota Kiyota, and Ryo Furukawa. Dense One-
Shot 3D Reconstruction by Detecting Continuous Regions With Parallel Line Pro-
jection. In IEEE/CVF International Conference on Computer Vision (ICCV),
2011. 178

Hanan Samet. Implementing Ray Tracing With Octrees and Neighbor Finding. Com-
puters & Graphics, 13(4):445–460, 1989. 52

Shen Sang and Manmohan Chandraker. Single-Shot Neural Relighting and SVBRDF
Estimation. In European Conference on Computer Vision (ECCV), 2020. 57

Imari Sato, Takahiro Okabe, Yoichi Sato, and Katsushi Ikeuchi. Appearance Sampling
for Obtaining a Set of Basis Images for Variable Illumination. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2003. 113

Yoichi Sato, Mark D Wheeler, and Katsushi Ikeuchi. Object Shape and Reflectance
Modeling From Observation. In SIGGRAPH, 1997. 57

Silvio Savarese and Li Fei-Fei. 3D Generic Object Categorization, Localization and
Pose Estimation. In IEEE/CVF International Conference on Computer Vision
(ICCV), 2007. 178

Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D Scene Structure
From a Single Still Image. IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 31(5):824–840, 2008. 116
Carolin Schmitt, Simon Donne, Gernot Riegler, Vladlen Koltun, and Andreas Geiger.
On Joint Estimation of Pose, Geometry and SVBRDF From a Handheld Scanner.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020. 58
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-From-Motion Re-
visited. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2016. 56, 81, 87
Steven M Seitz and Charles R Dyer. Photorealistic Scene Reconstruction by Voxel
Coloring. International Journal of Computer Vision (IJCV), 35(2):151–173, 1999.
59
Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski.
A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2006. 42
Pradeep Sen and Soheil Darabi. Compressive Dual Photography. Computer Graphics
Forum (CGF), 28(2):609–618, 2009. 113
Pradeep Sen, Billy Chen, Gaurav Garg, Stephen R Marschner, Mark Horowitz, Marc
Levoy, and Hendrik PA Lensch. Dual Photography. ACM Transactions on Graphics
(TOG), 24(3):745–755, 2005. 37, 117
Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs.
SfSNet: Learning Shape, Refectance and Illuminance of Faces in the Wild. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 114, 116
Soumyadip Sengupta, Jinwei Gu, Kihwan Kim, Guilin Liu, David W Jacobs, and
Jan Kautz. Neural Inverse Rendering of an Indoor Scene From a Single Image. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 57, 222
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira
Kemelmacher-Shlizerman. Background Matting: The World Is Your Green Screen.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020. 108, 119
Jun’ichiro Seyama and Ruth S Nagayama. The Uncanny Valley: Effect of Realism
on the Impression of Artificial Human Faces. Presence, 16(4):337–351, 2007. 109
Evan Shelhamer, Jonathan T Barron, and Trevor Darrell. Scene Intrinsics and Depth
From a Single Image. In IEEE/CVF International Conference on Computer Vision
Workshops (ICCVW), 2015. 223

Jian Shi, Yue Dong, Hao Su, and Stella X Yu. Learning Non-Lambertian Object
Intrinsics across ShapeNet Categories. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2017. 176

Philip Shilane, Patrick Min, Michael Kazhdan, and Thomas Funkhouser. The Prince-
ton Shape Benchmark. In IEEE International Conference on Shape Modeling and
Applications (SMI), 2004. 178

Daeyun Shin, Charless C Fowlkes, and Derek Hoiem. Pixels, Voxels, and Views:
a Study of Shape Representations for Single View 3D Object Shape Prediction.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2018. 178, 198, 208, 209, 210, 211

Dongeek Shin, Ahmed Kirmani, Vivek K Goyal, and Jeffrey H Shapiro. Computa-
tional 3D and Reflectivity Imaging With High Photon Efficiency. In International
Conference on Image Processing (ICIP), 2014. 221

Dongeek Shin, Ahmed Kirmani, Vivek K Goyal, and Jeffrey H Shapiro. Photon-
Efficient Computational 3-D and Reflectivity Imaging With Single-Photon Detec-
tors. IEEE Transactions on Computational Imaging, 1(2):112–125, 2015. 221

Dongeek Shin, Feihu Xu, Dheera Venkatraman, Rudi Lussana, Federica Villa, Franco
Zappa, Vivek K Goyal, Franco NC Wong, and Jeffrey H Shapiro. Photon-Efficient
Computational Imaging With a Single-Photon Camera. In Computational Optical
Sensing and Imaging, pages CW5D–4. Optical Society of America, 2016. 221

Aliaksandra Shysheya, Egor Zakharov, Kara-Ali Aliev, Renat Bashirov, Egor Burkov,
Karim Iskakov, Aleksei Ivakhnenko, Yury Malkov, Igor Pasechnik, Dmitry Ulyanov,
et al. Textured Neural Avatars. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2019. 116

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor Segmen-
tation and Support Inference From RGBD Images. In European Conference on
Computer Vision (ECCV), 2012. 176

Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for
Large-Scale Image Recognition. In International Conference on Learning Repre-
sentations (ICLR), 2015. 137

Arjun Singh, James Sha, Karthik S Narayan, Tudor Achim, and Pieter Abbeel. Big-
BIRD: A Large-Scale 3D Database of Object Instances. In IEEE International
Conference on Robotics and Automation (ICRA), 2014. 179

Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein,
and Michael Zollhofer. DeepVoxels: Learning Persistent 3D Feature Embeddings.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2019a. 117, 118, 166, 246, 247

Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene Representation
Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Ad-
vances in Neural Information Processing Systems (NeurIPS), 2019b. 58, 117, 247

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon
Wetzstein. Implicit Neural Representations With Periodic Activation Functions.
In Advances in Neural Information Processing Systems (NeurIPS), 2020. 58, 247

Peter-Pike Sloan, Jan Kautz, and John Snyder. Precomputed Radiance Transfer
for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments. In
SIGGRAPH, 2002. 60

Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo Tourism: Exploring Photo
Collections in 3D. ACM Transactions on Graphics (TOG), 25(3):835–846, 2006.
59, 108

Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B
Tenenbaum. Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and
Silhouettes With Deep Generative Networks. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2017. 198

Shuran Song and Thomas Funkhouser. Neural Illumination: Lighting Prediction for
Indoor Environments. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019. 223

Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas
Funkhouser. Semantic Scene Completion from a Single Depth Image. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 176

Olga Sorkine and Daniel Cohen-Or. Least-Squares Meshes. In IEEE International
Conference on Shape Modeling and Applications (SMI), 2004. 174

Pratul P Srinivasan, Richard Tucker, Jonathan T Barron, Ravi Ramamoorthi, Ren
Ng, and Noah Snavely. Pushing the Boundaries of View Extrapolation With Multi-
plane Images. In IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2019. 117

Pratul P Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T Barron, Richard
Tucker, and Noah Snavely. Lighthouse: Predicting Lighting Volumes for Spatially-
Coherent Illumination. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2020. 223

Pratul P Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Milden-
hall, and Jonathan T Barron. NeRV: Neural Reflectance and Visibility Fields for
Relighting and View Synthesis. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), 2021. 26, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41,
45, 49, 53, 61, 74, 104, 245, 246, 249

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks From Over-
fitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
122

Jessi Stumpfel, Chris Tchou, Andrew Jones, Tim Hawkins, Andreas Wenger, and
Paul Debevec. Direct HDR Capture of the Sun and Sky. In AFRIGRAPH, 2004.
80, 103, 104

Tiancheng Sun, Henrik Wann Jensen, and Ravi Ramamoorthi. Connecting Measured
BRDFs to Analytic BRDFs by Data-Driven Diffuse-Specular Separation. ACM
Transactions on Graphics (TOG), 37(6):1–15, 2018a. 76

Tiancheng Sun, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham
Fyffe, Christoph Rhemann, Jay Busch, Paul Debevec, and Ravi Ramamoorthi.
Single Image Portrait Relighting. ACM Transactions on Graphics (TOG), 38(4):
1–12, 2019. 108, 110, 115, 117, 119, 139

Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul
Debevec, Yun-Ta Tsai, Jonathan T Barron, and Ravi Ramamoorthi. Light Stage
Super-Resolution: Continuous High-Frequency Relighting. ACM Transactions on
Graphics (TOG), 39(6):1–12, 2020. 26, 35, 38, 41, 45, 107, 109, 117, 119, 132, 138,
164, 245, 246, 255

Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tian-
fan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3D: Dataset and Meth-
ods for Single-Image 3D Shape Modeling. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2018b. 27, 30, 36, 46, 169, 170, 179, 188,
195, 205, 216, 245, 247, 259

Minhyuk Sung, Vladimir G Kim, Roland Angst, and Leonidas Guibas. Data-Driven
Structural Priors for Shape Completion. ACM Transactions on Graphics (TOG),
34(6):175, 2015. 174

Richard Szeliski. Computer Vision: Algorithms and Applications. Springer Science
& Business Media, 2010. 36, 42

Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin
Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng.
Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional
Domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
58, 68

Marshall F Tappen, William T Freeman, and Edward H Adelson. Recovering Intrinsic
Images From a Single Image. In Advances in Neural Information Processing Systems
(NeurIPS), 2003. 40, 176

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-View 3D Models
From Single Images With a Convolutional Network. In European Conference on
Computer Vision (ECCV), 2016. 175

Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree Generating Net-
works: Efficient Convolutional Architectures for High-Resolution 3D Outputs. In
IEEE/CVF International Conference on Computer Vision (ICCV), 2017. 29, 175,
198

Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Sei-
del, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. StyleRig: Rigging
StyleGAN for 3D Control Over Portrait Images. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2020a. 116

Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi,
Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias
Nießner, et al. State of the Art on Neural Rendering. Computer Graphics Forum
(CGF), 39(2):701–727, 2020b. 114, 117, 118

Duc Thanh Nguyen, Binh-Son Hua, Khoi Tran, Quang-Hieu Pham, and Sai-Kit Ye-
ung. A Field Model for Repairing 3D Shapes. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2016. 175

Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred Neural Rendering:
Image Synthesis Using Neural Textures. ACM Transactions on Graphics (TOG),
38(4):1–12, 2019. 117, 118, 152, 153, 166, 246

Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias
Nießner. Image-Guided Neural Object Rendering. In International Conference on
Learning Representations (ICLR), 2020. 117

Sebastian Thrun and Ben Wegbreit. Shape From Symmetry. In IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), 2005. 174

Antonio Torralba and William T Freeman. Accidental Pinhole and Pinspeck Cameras:
Revealing the Scene Outside the Picture. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2012. 222

Antonio Torralba, Kevin P Murphy, and William T Freeman. Sharing Visual Features
for Multiclass and Multiview Object Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 29(5), 2007. 177

Yun-Ta Tsai and Rohit Pandey. Portrait Light: Enhancing Portrait Light-
ing With Machine Learning. https://ai.googleblog.com/2020/12/portrait-
light-enhancing-portrait.html, 2020. Accessed: 08/21/2021. 26

Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-View
Supervision for Single-View Reconstruction via Differentiable Ray Consistency. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2017. 170, 175, 178, 195, 197, 198, 207

Borom Tunwattanapong, Abhijeet Ghosh, and Paul Debevec. Practical Image-Based
Relighting and Editing With Spherical-Harmonics and Local Lights. In Conference
for Visual Media Production. IEEE, 2011. 114

Borom Tunwattanapong, Graham Fyffe, Paul Graham, Jay Busch, Xueming Yu, Ab-
hijeet Ghosh, and Paul Debevec. Acquiring Reflectance and Shape From Continu-
ous Spherical Harmonic Illumination. ACM Transactions on Graphics (TOG), 32
(4):1–12, 2013. 114

Greg Turk and Marc Levoy. Zippered Polygon Meshes From Range Images. In
SIGGRAPH, 1994. 29

Shimon Ullman. The Interpretation of Structure From Motion. Proceedings of the
Royal Society of London. Series B. Biological Sciences, 203(1153):405–426, 1979.
25, 42

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance Normalization:
The Missing Ingredient for Fast Stylization. arXiv, 2016. 125

USGS. USGS Releases First-Ever Comprehensive Geologic Map of the Moon.
https://www.usgs.gov/news/usgs-releases-first-ever-comprehensive-
geologic-map-moon, 2020. Accessed: 08/04/2021. 226

Bruce Walter, Stephen R Marschner, Hongsong Li, and Kenneth E Torrance. Micro-
facet Models for Refraction Through Rough Surfaces. In Eurographics Symposium
on Rendering Techniques (EGSR), 2007. 32, 63, 101, 249

Jiaping Wang, Yue Dong, Xin Tong, Zhouchen Lin, and Baining Guo. Kernel Nyström
Method for Light Transport. In SIGGRAPH, 2009. 114

Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and
Heng Tao Shen. Multi-Attention Network for One Shot Learning. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 177

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan
Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation With
Conditional GANs. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2018. 221, 224, 230, 233

Xiaolong Wang, David Fouhey, and Abhinav Gupta. Designing Deep Networks for
Surface Normal Estimation. In IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2015. 176

Yu-Xiong Wang and Martial Hebert. Learning to Learn: Model Regression Networks
for Easy Small Sample Learning. In European Conference on Computer Vision
(ECCV), 2016. 177

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image Quality
Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on
Image Processing (TIP), 13(4):600–612, 2004. 95, 142, 149, 235

Greg Ward and Rob Shakespeare. Rendering With Radiance: The Art and Science
of Lighting Visualization. Morgan Kaufmann Publishers, 1998. 75

Xin Wei, Guojun Chen, Yue Dong, Stephen Lin, and Xin Tong. Object-Based Illu-
mination Estimation With Rendering-Aware Neural Networks. In European Con-
ference on Computer Vision (ECCV), 2020. 57

Michael Weinmann and Reinhard Klein. Advances in Geometry and Reflectance
Acquisition. In SIGGRAPH Asia Courses, 2015. 57

Yair Weiss. Deriving Intrinsic Images From Image Sequences. In IEEE/CVF Inter-
national Conference on Computer Vision (ICCV), 2001. 40, 176

Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner,
Chien Tu, Janet McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, et al.
Analysis of Human Faces Using a Measurement-Based Skin Reflectance Model.
ACM Transactions on Graphics (TOG), 25(3):1013–1024, 2006. 109

Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. SynSin: End-
to-End View Synthesis From a Single Image. In IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2020. 116

Andrew P Witkin. Recovering Surface Shape and Orientation From Texture. Artificial
Intelligence, 17(1-3):17–45, 1981. 42

Robert J Woodham. Photometric Method for Determining Surface Orientation From
Multiple Images. Optical Engineering, 19(1):191139, 1980. 114, 117

Robert J Woodham. Analysing Images of Curved Surfaces. Artificial Intelligence,
17(1-3):117–140, 1981. 42

Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Frédo Durand, and
William T Freeman. Eulerian Video Magnification for Revealing Subtle Changes
in the World. ACM Transactions on Graphics (TOG), 31(4):1–8, 2012. 222

Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenen-
baum. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-
Adversarial Modeling. In Advances in Neural Information Processing Systems
(NeurIPS), 2016. 171, 175, 176, 181, 197

Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, William T Freeman, and
Joshua B Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches.
In Advances in Neural Information Processing Systems (NeurIPS), 2017. 172, 175,
176, 179, 184, 185, 198

Jiajun Wu, Chengkai Zhang, Xiuming Zhang, Zhoutong Zhang, William T Freeman,
and Joshua B Tenenbaum. Learning 3D Shape Priors for Shape Completion and
Reconstruction. In European Conference on Computer Vision (ECCV), 2018. 27,
30, 43, 46, 169, 170, 171, 179, 216, 245, 247

Yuxin Wu and Kaiming He. Group Normalization. In European Conference on Com-
puter Vision (ECCV), 2018. 125

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang,
and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes.
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2015. 174

Rui Xia, Yue Dong, Pieter Peers, and Xin Tong. Recovering Shape and Spatially-
Varying Surface Reflectance Under Unknown Illumination. ACM Transactions on
Graphics (TOG), 35(6):1–12, 2016. 58

Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-Shot Learning – the Good, the
Bad and the Ugly. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2017. 177

Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A Benchmark
for 3D Object Detection in the Wild. In IEEE/CVF Winter Conference on Appli-
cations of Computer Vision (WACV), 2014. 170, 178, 187, 195, 205

Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh
Mottaghi, Leonidas Guibas, and Silvio Savarese. ObjectNet3D: A Large Scale
Database for 3D Object Recognition. In European Conference on Computer Vision
(ECCV), 2016. 170, 178, 187, 192, 195

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba.
SUN Database: Large-Scale Scene Recognition From Abbey to Zoo. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2010. 194, 265

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang,
and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation With Atten-
tional Generative Adversarial Networks. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2018a. 224

Zexiang Xu, Kalyan Sunkavalli, Sunil Hadap, and Ravi Ramamoorthi. Deep Image-
Based Relighting From Optimal Sparse Samples. ACM Transactions on Graphics
(TOG), 37(4):1–13, 2018b. 114, 118, 119, 143, 146, 147, 148, 149

Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi.
Deep View Synthesis From Sparse Photometric Images. ACM Transactions on
Graphics (TOG), 38(4):1–13, 2019. 117

Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective
Transformer Nets: Learning Single-View 3D Object Reconstruction Without 3D
Supervision. In Advances in Neural Information Processing Systems (NeurIPS),
2016. 175

Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba,
William T Freeman, and Joshua B Tenenbaum. 3D-Aware Scene Manipulation
via Inverse Graphics. In Advances in Neural Information Processing Systems
(NeurIPS), 2018. 198

Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri,
and Yaron Lipman. Multiview Neural Surface Reconstruction by Disentangling
Geometry and Appearance. In Advances in Neural Information Processing Systems
(NeurIPS), 2020. 58

Li Yi, Hao Su, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Benjamin Gra-
ham, Martin Engelcke, Roman Klokov, Victor Lempitsky, et al. Large-Scale 3D
Shape Reconstruction and Segmentation From ShapeNet Core55. In IEEE/CVF
International Conference on Computer Vision (ICCV), 2017. 178, 197

Ye Yu and William A P Smith. InverseRenderNet: Learning Single Image Inverse
Rendering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 57

Yizhou Yu, Paul Debevec, Jitendra Malik, and Tim Hawkins. Inverse Global Illu-
mination: Recovering Reflectance Models of Real Scenes From Photographs. In
SIGGRAPH, 1999. 57

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang,
and Dimitris N Metaxas. StackGAN: Text to Photo-Realistic Image Synthesis With
Stacked Generative Adversarial Networks. In IEEE/CVF International Conference
on Computer Vision (ICCV), 2017. 224

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-Attention
Generative Adversarial Networks. In International Conference on Machine Learn-
ing (ICML), 2019. 233

Kai Zhang, Fujun Luan, Qianqian Wang, Kavita Bala, and Noah Snavely. PhySG: In-
verse Rendering With Spherical Gaussians for Physics-Based Material Editing and
Relighting. In IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2021a. 59

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The
Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2018a. 95, 137,
142, 149, 158, 235

Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape-From-
Shading: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence (TPAMI), 21(8):690–706, 1999. 176

Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T
Freeman, and Jiajun Wu. Learning to Reconstruct Shapes From Unseen Classes.
In Advances in Neural Information Processing Systems (NeurIPS), 2018b. 27, 29,
30, 36, 43, 46, 169, 170, 183, 216, 245, 247, 265

Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit
Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul De-
bevec, Jonathan T Barron, Ravi Ramamoorthi, and William T Freeman. Neural
Light Transport for Relighting and View Synthesis. ACM Transactions on Graph-
ics (TOG), 40(1):1–17, 2021b. 26, 27, 35, 36, 37, 38, 42, 45, 58, 95, 107, 111, 127,
165, 245, 246, 251

Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Free-
man, and Jonathan T Barron. NeRFactor: Neural Factorization of Shape and Re-
flectance Under an Unknown Illumination. ACM Transactions on Graphics (TOG),
2021c. 26, 31, 32, 33, 34, 35, 36, 37, 38, 41, 45, 50, 54, 104, 245, 246

Xuaner Zhang, Jonathan T Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren
Ng, and David E Jacobs. Portrait Shadow Manipulation. ACM Transactions on
Graphics (TOG), 39(4):78–1, 2020. 110, 116

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
Object Detectors Emerge in Deep Scene CNNs. In International Conference on
Learning Representations (ICLR), 2014. 212

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo
Magnification: Learning View Synthesis Using Multiplane Images. ACM Transac-
tions on Graphics (TOG), 37(4):1–12, 2018. 117

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative
Visual Manipulation on the Natural Image Manifold. In European Conference on
Computer Vision (ECCV), 2016. 176

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-
Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2017a. 224

Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver
Wang, and Eli Shechtman. Multimodal Image-to-Image Translation by Enforcing
Bi-Cycle Consistency. In IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2017b. 224

Todd Zickler, Ravi Ramamoorthi, Sebastian Enrique, and Peter N Belhumeur. Re-
flectance Sharing: Predicting Appearance From a Sparse Set of Images of a Known
Shape. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI),
28(8):1287–1302, 2006. 118
