Shape, Reflectance, and Illumination From Appearance

Xiuming Zhang
B.Eng., National University of Singapore (2015)
S.M., Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
September 2021
© Massachusetts Institute of Technology 2021. All rights reserved.

Author
Department of Electrical Engineering and Computer Science
August 27, 2021
Certified by
William T. Freeman
Thomas and Gerd Perkins Professor of Electrical Engineering and
Computer Science
Thesis Supervisor

Accepted by
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Shape, Reflectance, and Illumination From Appearance
Xiuming Zhang

Submitted to the Department of Electrical Engineering and Computer Science

on August 27, 2021, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

The image formation process describes how light interacts with the objects in a scene
and eventually reaches the camera, forming an image that we observe. Inverting this
process is a long-standing, ill-posed problem in computer vision, which involves estimating
shape, material properties, and/or illumination passively from the object’s
appearance. Such “inverse rendering” capabilities enable 3D understanding of our
world (as desired in autonomous driving, robotics, etc.) and computer graphics applications
such as relighting, view synthesis, and object capture (as desired in Extended
Reality [XR], etc.).
In this dissertation, we study inverse rendering by recovering three-dimensional
(3D) shape, reflectance, illumination, or everything jointly under different setups.
The input across different setups varies from single images to multi-view images lit
by multiple known lighting conditions, then to multi-view images under one unknown
illumination. Across the setups, we explore optimization-based recovery that exploits
multiple observations of the same object, learning-based reconstruction that heavily
relies on data-driven priors, and a mixture of both. Depending on the problem, we
perform inverse rendering at three different levels of abstraction: I) at a low level of
abstraction, we develop physically-based models that explicitly solve for every term
in the rendering equation, II) at a middle level, we utilize the light transport function
to abstract away intermediate light bounces and model only the final “net effect,”
and III) at a high level, we treat rendering as a black box and directly invert it
with learned data-driven priors. We also demonstrate how higher-level abstraction
leads to models that are simple and applicable to single images but also possess fewer
capabilities.
This dissertation discusses four instances of inverse rendering, gradually ascending
in the level of abstraction. In the first instance, we focus on the low-level abstraction
where we decompose appearance explicitly into shape, reflectance, and illumination.
To this end, we present a physically-based model capable of such full factorization
under one unknown illumination and another that handles one-bounce indirect illumination.
In the second instance, we ascend to the middle level of abstraction, at which
we model appearance with the light transport function, demonstrating how this level
of modeling easily supports relighting with global illumination, view synthesis, and
both tasks simultaneously. Finally, at the high level of abstraction, we employ deep
learning to directly invert the rendering black box in a data-driven fashion. Specifically,
in the third instance, we recover 3D shapes from single images by learning
data-driven shape priors and further make our reconstruction generalizable to novel
shape classes unseen during training. Also relying on data-driven priors, the fourth
instance concerns how to recover lighting from the appearance of the illuminated
object, without explicitly modeling the image formation process.

Thesis Supervisor: William T. Freeman

Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science


Acknowledgments

These five years at MIT have been truly an amazing journey: I learned “like taking
a drink from a fire hose” (former MIT President Wiesner) and made lifelong friends
with whom I can share the ups and downs in taking that drink.

First, I would like to express my heartfelt gratitude to my advisor, Bill Freeman,
for his endless support and advice. Since Day 1, Bill has been giving me full freedom
to pursue research that I am excited about. In every project, he steered me in
the right direction and provided invaluable feedback every time we chatted. Bill is a
creative scientist-cum-artist who is always trying to image X from Y where X and Y
are crazy pairs like the Earth and the Moon, Boston and rainbow, etc. He is also an
elegant academic noble who teaches me not only computer vision but also how to be
a better person. The question I always ask myself is “What would Bill do?” I had a
slow start in the early phase of my Ph.D., and it was Bill’s “slow down to speed up”
that kept me hanging in there. Bill’s wisdom such as making toy models will continue
to guide me throughout my career and life. I could not ask for a better advisor than
Bill. Thank you, Bill.

I would also like to thank my dissertation committee members: Antonio Torralba
and Jon Barron. Although I did not interact directly with Antonio much, his impact
reaches every corner of our office. His famous quote “Bugs are good because that
means your algorithm is not hopeless” cheered me up every time I found a bug. Jon
is one of the pioneers in the theme of this dissertation, and I started learning about
this field by reading his (inspiring) papers. I was very fortunate to have interned
with him twice at Google, and several papers constituting this dissertation originated
from those internships, so it is redundant to say how much impact Jon had on my
research. He is basically my advisor in industry. Jon is incredibly knowledgeable
about everything (in depth), and his explanations of things are always crystal clear.
I hope that, through years of learning and practicing, I can gain just one inch of Jon’s
breadth and depth of knowledge. It is my honor and pleasure to have both Antonio
and Jon on my dissertation committee. Thank you both for your service.

I owe a debt of gratitude to my advisors who got me started in research during
my undergraduate time: Thomas Yeo, Mert Sabuncu, and Beth Mormino. Thomas
was my Bachelor’s thesis advisor, with whom I worked together on a daily basis for
around two years. Technical yet attentive to detail, he showed me what
top-notch research was when I did not even know much about machine learning. Since
my graduation, he has continued to support me in many aspects, from graduate
school applications to recommendation letter writing. Even though the collaboration
with Mert was mostly online, he generously offered many helpful suggestions on
graduate research in our first (and probably only) in-person interaction back in 2016.
Without the rigorous research training from them, I would not be here writing this
dissertation today.

Besides those already mentioned, I was fortunate to have worked with many intelligent
collaborators during my Ph.D. (in approximately chronological order): Jiajun
Wu*, Zhoutong Zhang*, Chengkai Zhang*, Josh Tenenbaum*, Tianfan Xue*,
Xingyuan Sun*, Charles He, Tali Dekel, Stefanie Mueller, Andrew Owens, Yikai
Li, Jiayuan Mao, Noah Snavely, Cecilia Zhang, Ren Ng, David E. Jacobs, Sergio
Orts-Escolano*, Rohit Pandey*, Christoph Rhemann*, Sean Fanello*, Yun-Ta Tsai*,
Tiancheng Sun*, Zexiang Xu*, Ravi Ramamoorthi*, Paul Debevec*, Boyang Deng*,
Pratul Srinivasan*, Matt Tancik*, Ben Mildenhall*, Steven Liu, Richard Zhang, Jun-Yan
Zhu, and Bryan Russell. This dissertation would not have been possible without
the input from the co-authors marked with an asterisk. I want to particularly thank
two labmates from this list: Jiajun and Zhoutong. As a senior student in the Lab, Jiajun
provided valuable advice and help in bootstrapping my computer vision research;
the knowledge I gained from exploring 3D vision with Jiajun laid the foundation for
this dissertation. Zhoutong, despite being my peer, constantly amazes me with his
breadth of knowledge in vision and graphics; a “walking Visionpedia” is what I call
him. It was my privilege to have learned so much from everyone listed above.

I would not have been able to get through these challenging years without the
support from the staff in EECS and CSAIL (in no particular order): Janet Fischer,
Alicia Duarte, Kathy McCoy, Katrina LaCurts, Roger White, Sheila Sharbetian, Fern
Keniston, Rachel Gordon, Adam Conner-Simons, Garrett Wollman, Steve Ruggiero,
Jay Sekora, Tom Buehler, Jason Dorfman, Jon Proulx, Mark Pearrow, etc. Janet
and Katrina provided so much helpful advice as I navigated to today. I worked
with Rachel, Adam, Jason, and Tom on the MoSculp news article. They were
a supportive and strong team that made MoSculp a hit. I would also like to thank
everyone in The Infrastructure Group, without whose solid technical skills and prompt
help, I would not have been able to do any of the research presented in this dissertation.

I am thankful to everyone in the Vision Graphics Neighborhood and beyond at
MIT. I enjoyed every conversation we had in our (tiny) kitchen. We went hiking,
watched several musicals and plays, and witnessed the total solar eclipse together.
Thanks to all of you for making my MIT life colorful. To all my friends scattered
around the world, thank you, too, for the support and friendship.

I want to thank my entire family for their unwavering love and support, especially
my parents, Yanbin Sun and Chunmin Zhang, who have always supported my
decisions unconditionally (even though some came with great financial costs). I hope
we all agree that we have made the right calls. Being far away from home (and now
trapped by COVID), I was not able to go back home as often as I wanted during my
Ph.D.; as an only child, I wish I had done more. If only Nai-Nai and Lao-Ye were
still around with us today, they would be so proud to see their grandson graduating
with a Ph.D. Thank you, and I love you all.

Lastly, I thank my girlfriend, Hanzheng Li, for always being there for me. I owe a
lot of my success to her. Despite being a classical pianist, she knows all about nonsense
like Reviewer #2, weak reject, etc. She always manages to cheer me up when bad
things happen and to calm me down before overexcitement becomes sorrow – just the
perfect other half for me. My quality of life has risen significantly since I met her,
and I look forward to the next chapter of life together with her.

To Yanbin, Chunmin, and Hanzheng

Brief Contents

1 Introduction
1.1 Image Formation
1.2 Inverting the Image Formation Process
1.3 Dissertation Structure

2 Low-Level Abstraction: Physically-Based Appearance Factorization
2.1 Introduction
2.2 Related Work
2.3 Method: Multiple Known Illuminations
2.4 Method: One Unknown Illumination
2.5 Results
2.6 Discussion
2.7 Conclusion

3 Mid-Level Abstraction: The Light Transport Function
3.1 Introduction
3.2 Related Work
3.3 Method: Precise, High-Frequency Relighting
3.4 Method: Free-Viewpoint Relighting
3.5 Results
3.6 Discussion
3.7 Conclusion

4 High-Level Abstraction: Data-Driven Shape Reconstruction
4.1 Introduction
4.2 Related Work
4.3 Method: Learning & Using Shape Priors
4.4 Method: Generalizing to Unseen Classes
4.5 Method: Building a Real-World Dataset
4.6 Results
4.7 Discussion
4.8 Conclusion

5 High-Level Abstraction: Data-Driven Lighting Recovery
5.1 Introduction
5.2 Related Work
5.3 Method
5.4 Results
5.5 Conclusion

6 Conclusion & Discussion

A Supplement: Neural Reflectance and Visibility Fields (NeRV)
A.1 BRDF Parameterization
A.2 Additional Qualitative Results
A.3 Limitations

B Supplement: Light Stage Super-Resolution (LSSR)
B.1 Progressive Training
B.2 Baseline Comparisons

C Supplement: Pix3D
C.1 Evaluation Procedure
C.2 Nearest Neighbors With Different Metrics
C.3 Sample Data in Pix3D

D Supplement: Generalizable Reconstruction (GenRe)
D.1 Data Preparation
D.2 Model Details



Contents

1 Introduction
1.1 Image Formation
1.1.1 Shape
1.1.2 Materials
1.1.3 Lighting
1.1.4 Cameras
1.1.5 Rendering
1.2 Inverting the Image Formation Process
1.2.1 Joint Estimation of Shape, Reflectance, & Illumination
1.2.2 Interpolating the Light Transport Function
1.2.3 Shape Reconstruction
1.2.4 Lighting Recovery
1.3 Dissertation Structure

2 Low-Level Abstraction: Physically-Based Appearance Factorization
2.1 Introduction
2.2 Related Work
2.2.1 Inverse Rendering
2.2.2 Coordinate-Based Neural Representations
2.2.3 Precomputation in Computer Graphics
2.2.4 Material Acquisition
2.3 Method: Multiple Known Illuminations
2.3.1 Neural Radiance Fields (NeRF)
2.3.2 Neural Reflectance Fields
2.3.3 Light Transport via Neural Visibility Fields
2.3.4 Rendering
2.3.5 Training & Implementation Details
2.4 Method: One Unknown Illumination
2.4.1 Shape
2.4.2 Reflectance
2.4.3 Illumination
2.4.4 Rendering
2.4.5 Implementation Details
2.5 Results
2.5.1 Data
2.5.2 Shape Estimation
2.5.3 Joint Estimation of Shape, Reflectance, & Illumination
2.5.4 Free-Viewpoint Relighting
2.5.5 Material Editing
2.6 Discussion
2.6.1 Baseline Comparisons: Multiple Known Illuminations
2.6.2 Baseline Comparisons: One Unknown Illumination
2.6.3 Ablation Studies: Multiple Known Illuminations
2.6.4 Ablation Studies: One Unknown Illumination
2.6.5 Estimation Consistency Across Different Illuminations
2.7 Conclusion

3 Mid-Level Abstraction: The Light Transport Function
3.1 Introduction
3.2 Related Work
3.2.1 Single Observation
3.2.2 Multiple Views
3.2.3 Multiple Illuminants
3.2.4 Multiple Views & Illuminants
3.3 Method: Precise, High-Frequency Relighting
3.3.1 Active Set Construction
3.3.2 Alias-Free Pooling
3.3.3 Network Architecture
3.3.4 Loss Functions & Training Strategy
3.4 Method: Free-Viewpoint Relighting
3.4.1 Texture-Space Inputs
3.4.2 Query & Observation Networks
3.4.3 Residual Learning of High-Order Effects
3.4.4 Simultaneous Relighting & View Synthesis
3.4.5 Network Architecture, Losses, & Other Details
3.5 Results
3.5.1 Hardware Setup & Data Acquisition
3.5.2 Evaluation Metrics
3.5.3 Precise Directional Relighting
3.5.4 High-Frequency Image-Based Relighting
3.5.5 Lighting Softness Control
3.5.6 Geometry-Free Relighting
3.5.7 Geometry-Based Relighting
3.5.8 Changing the Viewpoint
3.6 Discussion
3.6.1 Ablation Studies
3.6.2 Image-Based Relighting Under Varying Light Frequency
3.6.3 Subsampling the Light Stage
3.6.4 Degrading the Input Geometry Proxy
3.7 Conclusion

4 High-Level Abstraction: Data-Driven Shape Reconstruction
4.1 Introduction
4.2 Related Work
4.2.1 3D Shape Completion
4.2.2 Single-Image 3D Reconstruction
4.2.3 2.5D Sketch Recovery
4.2.4 Perceptual Losses & Adversarial Learning
4.2.5 Spherical Projections
4.2.6 Zero- & Few-Shot Recognition
4.2.7 3D Shape Datasets
4.3 Method: Learning & Using Shape Priors
4.3.1 Shape Naturalness Network
4.3.2 Training Paradigm
4.4 Method: Generalizing to Unseen Classes
4.4.1 Single-View Depth Estimator
4.4.2 Spherical Map Inpainting Network
4.4.3 Voxel Refinement Network
4.4.4 Technical Details
4.5 Method: Building a Real-World Dataset
4.5.1 Collecting Image-Shape Pairs
4.5.2 Image-Shape Alignment
4.6 Results
4.6.1 Data
4.6.2 Metrics
4.6.3 Baselines
4.6.4 Single-View Shape Completion
4.6.5 Single-View Shape Reconstruction
4.6.6 Estimating Depth for Novel Shape Classes
4.6.7 Reconstructing Novel Objects From Training Classes
4.6.8 Reconstructing Objects From Unseen Classes
4.7 Discussion
4.7.1 Network Visualization
4.7.2 Training With the Naturalness Loss Over Time
4.7.3 Failure Cases
4.7.4 Effects of Viewpoints on Generalization
4.7.5 Generalizing to Non-Rigid Shapes
4.7.6 Generalizing to Highly Regular Shapes
4.8 Conclusion

5 High-Level Abstraction: Data-Driven Lighting Recovery
5.1 Introduction
5.2 Related Work
5.2.1 Non-Line-of-Sight Imaging
5.2.2 Lighting Recovery
5.2.3 Generative Adversarial Networks
5.3 Method
5.3.1 Data & Simulation
5.3.2 Nearest Neighbor-Based Recovery
5.3.3 Generative Adversarial Network-Based Recovery
5.4 Results
5.4.1 Test Data & Evaluation Metrics
5.4.2 Earth Recovery Given the Moon
5.4.3 Learning the Continuous Earth Rotation
5.4.4 Multi-Modal Generation & the Clouds
5.4.5 Ablation Studies
5.5 Conclusion

6 Conclusion & Discussion

A Supplement: Neural Reflectance and Visibility Fields (NeRV)
A.1 BRDF Parameterization
A.2 Additional Qualitative Results
A.3 Limitations

B Supplement: Light Stage Super-Resolution (LSSR)
B.1 Progressive Training
B.2 Baseline Comparisons

C Supplement: Pix3D
C.1 Evaluation Procedure
C.2 Nearest Neighbors With Different Metrics
C.3 Sample Data in Pix3D

D Supplement: Generalizable Reconstruction (GenRe)
D.1 Data Preparation
D.2 Model Details
D.2.1 Single-View Depth Estimator
D.2.2 Spherical Map Inpainting Network
D.2.3 Voxel Refinement Network

List of Figures

1-1 Relationships among the object, light, and camera.
1-2 Visualization of different shape representations.
1-3 Example BRDFs.
1-4 Direct vs. indirect illumination.
1-5 Example shadow map and ambient occlusion map.

2-1 How NeRV reduces the computational complexity.
2-2 Example input and output of NeRV.
2-3 NeRFactor overview.
2-4 Example decomposition of NeRV.
2-5 The geometry of an indirect illumination path in NeRV.
2-6 NeRFactor model and its example output.
2-7 High-quality geometry recovered by NeRFactor.
2-8 Joint estimation of shape, reflectance, and lighting by NeRFactor.
2-9 Free-viewpoint relighting by NeRFactor.
2-10 NeRFactor’s results on real-world captures.
2-11 Material editing and relighting by NeRFactor.
2-12 NeRV vs. Bi et al. [2020a].
2-13 NeRV vs. latent code models.
2-14 NeRFactor vs. SIRFS.
2-15 NeRFactor vs. Oxholm and Nishino [2014] (enhanced).
2-16 NeRFactor vs. Philip et al. [2019].
2-17 Indirect illumination in NeRV.
2-18 NeRV with analytic vs. MLP-predicted normals.
2-19 Qualitative ablation studies of NeRFactor.
2-20 Albedo estimation of NeRFactor across different illuminations.

3-1 LSSR overview.
3-2 NLT overview.
3-3 Visualization of the LSSR architecture.
3-4 Construction of the active sets in LSSR.
3-5 LSSR’s alias-free pooling.
3-6 Gap in photorealism that NLT attempts to close.
3-7 NLT Model.
3-8 Modeling non-diffuse BSSRDFs as residuals for relighting in NLT.
3-9 Modeling global illumination as residuals for relighting in NLT.
3-10 Sample images used for training NLT.
3-11 Interpolation by LSSR between two physical lights.
3-12 High-frequency image-based relighting by LSSR.
3-13 Controlling lighting softness with LSSR.
3-14 Relighting by LSSR and the baselines.
3-15 NLT relighting with a directional light.
3-16 HDRI relighting by NLT.
3-17 View synthesis by NLT.
3-18 Simultaneous relighting and view synthesis by NLT.
3-19 Comparing NLT against NeRF and NeRF+Light.
3-20 Continuous directional relighting by LSSR.
3-21 NLT and its ablated variants for relighting.
3-22 Quality gain by LSSR w.r.t. lighting frequency.
3-23 LSSR vs. linear blending: relighting errors w.r.t. light density.
3-24 LSSR vs. linear blending: relighting with sparser lights.
3-25 NLT relighting with sparser lights.
3-26 Performance of NLT w.r.t. quality of the geometry proxy.
3-27 A failure case of NLT’s view synthesis.

4-1 Two levels of ambiguity in single-view 3D shape perception.
4-2 GenRe overview.
4-3 ShapeHD model.
4-4 GenRe model.
4-5 GenRe’s spherical inpainting module generalizing to new classes.
4-6 Pix3D vs. existing datasets.
4-7 Building Pix3D.
4-8 Sample images and shapes in Pix3D.
4-9 Image and shape distributions across categories of Pix3D.
4-10 3D shape completion from single-view depth by ShapeHD.
4-11 3D shape completion by ShapeHD.
4-12 3D shape completion by ShapeHD using real depth data.
4-13 3D shape reconstruction by ShapeHD on ShapeNet.
4-14 3D shape reconstruction by ShapeHD on novel categories.
4-15 3D shape reconstruction by ShapeHD on PASCAL 3D+.
4-16 3D shape reconstruction by ShapeHD on Pix3D.
4-17 GenRe’s depth estimator generalizing to novel shape classes.
4-18 GenRe’s reconstruction within and beyond training classes.
4-19 GenRe’s reconstruction on real images from novel classes.
4-20 Visualization of how ShapeHD attends to details in the depth maps.
4-21 How ShapeHD improves over time with the naturalness loss.
4-22 Common failure modes of ShapeHD.
4-23 Reconstruction errors of GenRe across different input viewpoints.
4-24 Single-view completion of non-rigid shapes from depth by GenRe.
4-25 Completion of highly regular shapes (primitives) by GenRe.

5-1 Our simplification of the Sun-Earth-Moon system. . . . . . . . . . . 227

5-2 How the Moon responds differently to distinct Earth illuminations. . 228
5-3 Illustration of the nearest neighbor baselines for EarthGAN. . . . . . 229

5-4 EarthGAN model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5-5 Different Earth appearances at similar timestamps. . . . . . . . . . . 233
5-6 Earth recovery by EarthGAN. . . . . . . . . . . . . . . . . . . . . . 236
5-7 Continuous Earth rotation learned by EarthGAN. . . . . . . . . . . 238
5-8 Smooth evolution of the Earth appearance learned by EarthGAN. . . 239
5-9 How EarthGAN learns to model the clouds. . . . . . . . . . . . . . . 241
5-10 Ablation studies of EarthGAN’s design choices. . . . . . . . . . . . . 243

A-1 NeRV vs. NLT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

A-2 Additional results and baseline comparisons for NeRV. . . . . . . . . 252

B-1 Network architecture and progressive training scheme of LSSR. . . . 256

B-2 More comparisons between LSSR and the baselines. . . . . . . . . . 258

C-1 Retrieving nearest neighbors in Pix3D using different metrics. . . . . 260

C-2 Diverse images associated with the same shape in Pix3D.
C-3 Sample images and their corresponding shapes in Pix3D.
C-4 More sample images and their corresponding shapes in Pix3D.

List of Tables

2.1 Quantitative evaluation of NeRFactor.
2.2 Quantitative evaluation of NeRV.
2.3 Quantitative ablation studies of NeRV.
3.1 Neural network architecture of NLT.
3.2 Relighting errors of LSSR.
3.3 NLT Relighting.
3.4 View synthesis errors of NLT.
4.1 Dataset quality of Pix3D.
4.2 Correlation between different shape metrics and human judgments.
4.3 Average shape completion errors of ShapeHD on ShapeNet.
4.4 3D shape reconstruction by ShapeHD on PASCAL 3D+.
4.5 3D shape reconstruction by ShapeHD on Pix3D.
4.6 3D shape reconstruction by GenRe on training and novel classes.
4.7 Reconstruction errors of GenRe on Pix3D.
5.1 Quantitative evaluation of EarthGAN.


Chapter 1

Introduction

One way to view computer vision is to think of it as “inverse computer graphics.”

Computer graphics covers the whole procedure of building scene geometry, crafting
shaders for each part of that geometry, setting up lighting and a camera, and eventually
rendering everything into a photorealistic image. Computer vision, in
contrast, aims to recover all these intermediate factors from the observed image(s).
This definition of computer vision encompasses several classic problems: Shape From
Shading [Horn, 1970] and Structure From Motion [Ullman, 1979] recover geometry
from images, Intrinsic Image Decomposition [Barrow and Tenenbaum, 1978] recovers
reflectance, and Barron and Malik [2014] additionally recover illumination.
Following this definition of computer vision, we present several approaches, all
aimed at recovering different intermediate factors (such as shape, reflectance, and lighting)
from what they collectively lead to: images. These inverse problems are well
known to be ill-posed, so different priors (data-driven or predefined) are employed
to differentiate a plausible combination of the different factors from other possible
but less likely ones. Solving these inverse problems benefits many downstream vision
and graphics applications. To name just two examples: the development of Extended
Reality (XR) is hindered by the cost and difficulty of creating high-quality 3D assets
[Maughan, 2016], and the ability to automatically recover shape and material properties
from just images would circumvent the heavy manual labor of object scanning,
thereby greatly accelerating and democratizing XR content creation [Inc., 2021]; and an
algorithm capable of estimating facial geometry and reflectance would enable “magic”
portrait relighting features on consumer mobile phones, such as Google’s Portrait
Light [Tsai and Pandey, 2020].

Throughout this dissertation, we tackle these inverse rendering problems at three
levels of abstraction. At a low level of abstraction, we devise physically-based
methods that explicitly solve for every term in the (simplified) rendering equation: the
object’s shape, reflectance, and illumination that collectively explain the observed im-
ages. To this end, the first half of Chapter 2 studies whether one can jointly optimize
shape, reflectance, and indirect illumination from scratch given multi-view images of
an object lit by multiple arbitrary but known lighting conditions [Srinivasan et al.,
2021]. To relax the capture requirement of multiple known lighting conditions, the
second half of Chapter 2 is dedicated to achieving a similar decomposition of shape
and reflectance but under just one unknown lighting condition [Zhang et al., 2021c].
With these models, we achieve high-quality geometry estimation, free-viewpoint re-
lighting, and material editing. Both approaches model the actual image formation
process at a low level, relying more on physics than on data (which high-level approaches
usually depend on). Although challenging, this low-level abstraction enables
applications that a mid- or high-level one would not support, such as shape or material
editing and asset export (e.g., into a traditional graphics engine).

Ascending to a middle level of abstraction, we tackle two more inverse rendering
problems by abstracting away the complex object-light interaction with the
light transport (LT) function. Intuitively, the LT function “summarizes” the actual
LT by directly returning the resultant radiance given some convenient descriptions of
the camera and light (the simplest being light and view directions). In the first half
of Chapter 3, we take an entirely image-based approach and focus on interpolating
the LT function in just the light direction, which enables continuous, ghosting-free
directional relighting and high-frequency image-based relighting [Sun et al., 2020]. In
the second half, we interpolate the LT function additionally in the view direction,
thereby performing simultaneous relighting and view synthesis [Zhang et al., 2021b].
In both approaches, we estimate a mid-level representation (the LT function) from
the subject’s appearance observed under various lighting conditions (also from different
views for Zhang et al. [2021b]), without further factorizing the function into the
underlying shape and reflectance. This mid-level abstraction allows our models to
easily include global illumination effects, but it does not support shape or material
editing (which the low-level abstraction permits) and requires multiple images of the
object (in contrast to the high-level abstraction that is applicable to single images).
Finally, at a high level of abstraction, we aim to directly regress the intermediate
factors (e.g., shape and lighting) from their resultant appearance, without
modeling the actual image formation process. This level of abstraction treats render-
ing as a black box to be inverted and usually involves training end-to-end machine
learning models on large datasets to learn data-driven priors directly on the inter-
mediate factors. Specifically, in this dissertation, we explore two instances of such
methods: 3D shape reconstruction from single images (Chapter 4) and lighting recov-
ery from the appearance of the illuminated object (Chapter 5). In the first problem of
shape reconstruction, we train neural networks to directly regress 3D shapes from sin-
gle images thereof, leveraging the data-driven shape priors learned from a large-scale
shape dataset [Sun et al., 2018b, Wu et al., 2018]. We further make such networks
generalizable to novel shape classes unseen during training, by wiring geometric pro-
jections (which we understand well and can specify exactly) as inductive bias into
our model [Zhang et al., 2018b]. In the second problem of lighting recovery, we train
a conditional generative model to learn regularities in our lighting conditions, such
that when given the appearance of the illuminated object, the model generates a plausible
lighting condition responsible for the observation. With this high-level abstraction,
we ignore the physics of the image formation process and take data-driven approaches
that accept single-image input, leveraging the power of machine learning.

1.1 Image Formation

In this section, we briefly introduce the image formation process in nature or computer
graphics. Figure 1-1 shows a cartoon visualization of the relationships among the

object, light, and camera, an example real photo of the scene, and a computer graphics
render aiming to reproduce that real photo. We present only a simplified process that
is sufficient for what this dissertation concerns. In this simplified framework, there
are four key scene elements—shape, materials, lighting, and the camera—and the
rendering process that combines these elements into an RGB image of the scene. In
the following subsections, we elaborate on each of these four scene elements and the
rendering process.

[Figure 1-1 panels. Top: diagram labeling the camera, light, and shadow catcher. Bottom: a real photo (left) and a synthetic render (right).]

Figure 1-1: Relationships among the object, light, and camera. Top: Light travels
from the source to the scene, interacts with the objects therein, and reaches the
camera. Bottom (left): A real photo contains complex light transport effects such as
specular highlights and soft shadows. Bottom (right): With careful scene modeling
and physically-based rendering, one can reproduce the real photo with a synthetic
render, thanks to computer graphics.

1.1.1 Shape

Shape is arguably the most important aspect of a 3D scene because it provides the
“foundation” on which material properties are defined. As such, it is difficult, if
possible at all, to estimate other scene aspects such as material properties without
knowledge of geometry. Though important, geometry is not easy to represent since
the representation must be powerful enough to represent high-frequency structures,
descriptive enough to extract information from, and fast enough to perform operations
on. Unsurprisingly, no single shape representation is optimal in all of these respects.
We visualize the popular shape representations in Figure 1-2.

(A) Mesh (B) Point Clouds (C) Voxels (D) SDF (0 level set) (E) 2.5D Maps

Figure 1-2: Visualization of different shape representations. These example images
are taken from (A) Turk and Levoy [1994], (B) PointNet++ [Qi et al., 2017b], (C)
OGN [Tatarchenko et al., 2017], (D) DeepSDF [Park et al., 2019a], and (E) GenRe
[Zhang et al., 2018b].

Mesh The computer graphics community has been using meshes as its shape representation
for decades. Briefly, a mesh is a compact representation that describes shape
as a list of vertices and faces (i.e., connectivity among the vertices). See Figure 1-2
(A) for an example. Besides being compact, it is powerful enough to represent complex
geometry of any topology (by simply adding more vertices and faces) and efficient to
compute ray intersections on [Möller and Trumbore, 1997], which is a particularly
important feature as millions of ray-mesh intersection computations are common in
ray tracing. Although universal and efficient, meshes are less amenable to neural networks
than the other representations discussed below.
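As a rough illustration of why meshes suit ray tracing, the sketch below stores a mesh as the vertex and face lists described above and intersects a ray with each triangle using the Möller-Trumbore test. The function names, the brute-force loop over faces, and the toy two-triangle mesh are illustrative assumptions rather than code from this dissertation.

```python
# Minimal sketch: a mesh as vertex/face lists plus Moller-Trumbore
# ray-triangle intersection. Names and the toy mesh are illustrative.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the ray parameter t of the hit, or None if the ray misses."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:           # ray is parallel to the triangle plane
        return None
    s = sub(origin, v0)
    u = dot(s, p) / det          # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) / det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) / det
    return t if t > eps else None

def ray_mesh(origin, direction, vertices, faces):
    """Closest hit over all faces (brute force, with no acceleration structure)."""
    hits = []
    for i, j, k in faces:
        t = ray_triangle(origin, direction, vertices[i], vertices[j], vertices[k])
        if t is not None:
            hits.append(t)
    return min(hits) if hits else None

# A unit square in the z = 0 plane made of two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
t_hit = ray_mesh((0.25, 0.5, 1.0), (0.0, 0.0, -1.0), vertices, faces)
t_miss = ray_mesh((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), vertices, faces)
```

Each intersection test costs only a handful of dot and cross products, which is what makes millions of ray-mesh queries affordable.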

Point Clouds If we remove the mesh faces (or equivalently, vertex connectivity), we
arrive at the point cloud representation, where a collection of 3D points describes the
surface geometry. See Figure 1-2 (B) for an example. A point cloud of size 𝑁 is just
an 𝑁 × 3 array of unordered 3D coordinates and therefore can be easily processed
by network architectures such as PointNet [Qi et al., 2017a]. The major drawback,
though, is its lack of surface semantics since there is no face information. As such,
a ray that should have hit the surface would travel through the unconnected points,
and the concept of being inside or outside of the shape is undefined.

Voxels Another way to represent shape is with voxels: a 3D grid of occupancy
values. Intuitively, this “pixelated” representation looks like a LEGO® approximation
of the actual (smooth) shape. See Figure 1-2 (C) for an example. To convert
voxels into a mesh, one can run the Marching Cubes algorithm [Lorensen and Cline,
1987]. Like pixels, voxels are friendly to Convolutional Neural Networks (CNNs), and
extending a 2D CNN to a 3D one is straightforward. As such, our shape representations
used in Chapter 4 are mostly voxels [Sun et al., 2018b, Wu et al., 2018, Zhang
et al., 2018b].
The disadvantage of using voxels, though, is their poor scalability and high memory
consumption. Indeed, as we see in Chapter 4, nearly all voxel-based approaches have
limited reconstruction resolution because memory demand grows cubically with resolution
(although there are data structures such as octrees that can alleviate this problem,
as discussed in Section 4.2). Note that voxels often waste resources since shape,
especially surface, is often sparse in the 3D space (a random point in the 3D space is
less likely to fall on the surface or inside of the shape than to land in free space).
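To make the scalability argument concrete, the following sketch voxelizes a sphere into an occupancy grid and compares cell counts across resolutions. The sphere, its radius, and the nested-list grid layout are assumptions chosen for illustration; they are not the data structures used in Chapter 4.

```python
# Minimal sketch: voxelize a sphere and look at how cell count scales.
# The grid layout and sphere radius are illustrative assumptions.

def voxelize_sphere(res, radius=0.4):
    """Occupancy grid (nested lists of 0/1) of a sphere centered in the unit cube."""
    def coord(i):
        return (i + 0.5) / res - 0.5   # voxel center along one axis
    return [[[1 if coord(i) ** 2 + coord(j) ** 2 + coord(k) ** 2 <= radius ** 2 else 0
              for k in range(res)]
             for j in range(res)]
            for i in range(res)]

def occupied_fraction(grid):
    cells = [v for plane in grid for row in plane for v in row]
    return sum(cells) / len(cells)

def num_cells(res):
    return res ** 3   # cubic growth in resolution

grid = voxelize_sphere(32)
frac = occupied_fraction(grid)              # approximates (4/3) * pi * 0.4**3
growth = num_cells(256) / num_cells(128)    # doubling the resolution costs 8x the cells
```

Even for this fairly large sphere, most cells are empty, and doubling the resolution multiplies the storage by eight.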

Implicit Representations Voxels can be thought of as a discrete 3D field of
scalars. If we use a continuous field of scalars to represent shape, we will be able
to represent smooth geometry better. The Signed Distance Function (SDF) is a realization
of this idea: It is a continuous function that returns the shortest distance from
the query point to the surface, with the sign of the distance indicating whether the
query point is inside or outside of the shape. Since shape is implicitly represented by
the zero level set of the SDF, these functions are sometimes referred to as “implicit
shape representations.” See Figure 1-2 (D) for an example.
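A small example may help here: the sketch below defines the analytic SDF of a sphere and locates its zero level set along a ray by sphere tracing. The sign convention (negative inside), the sphere-tracing routine, and all names are assumptions made for illustration; the learned SDFs cited in this section replace the analytic function with a network.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere; negative inside (a convention assumed here)."""
    return math.dist(p, center) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, max_t=100.0, eps=1e-6):
    """March along a unit-length ray until the SDF reaches (near) zero."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t        # reached the zero level set
        t += dist           # safe step: no surface is closer than dist
        if t > max_t:
            break
    return None

inside = sphere_sdf((0.0, 0.0, 0.5))     # negative: the query point is inside
outside = sphere_sdf((0.0, 0.0, 3.0))    # positive: the query point is outside
t_hit = sphere_trace(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
t_miss = sphere_trace(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 1.0, 0.0))
```

Unlike a point cloud, this representation answers both the inside/outside question and the ray-surface intersection query directly.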
Since neural networks are universal function approximators [Hornik et al., 1989]
that are compact and tend to produce smooth output, researchers recently proposed
to parameterize these implicit representations using Multi-Layer Perceptrons (MLPs)
[Park et al., 2019a, Mildenhall et al., 2020]. Following this line of work, Chapter 2 of

this dissertation explores using MLPs to represent geometry in two ways: The first
half of the chapter maintains a volumetric representation using MLPs [Srinivasan
et al., 2021], while the second half opts for a surface representation but also using
MLPs [Zhang et al., 2021c].

2.5D Maps Besides 3D representations, there are also 2D representations that can
describe 3D shape. With 3D semantics such as depth or normals, these 2D images are
often referred to as “2.5D” maps or buffers [Marr, 1982]. More specifically, a depth
map has its pixel values indicating how far the camera rays travel before hitting the
objects in the scene; a normal map has its pixel values specifying the 3D orientations
of the surface points visible from this view. See Figure 1-2 (E) for an example. Unlike
3D representations, these 2.5D maps are dependent on the view: Different views of
the same scene lead to different 2.5D maps since different 3D points fall onto the
image plane.
Because these maps are essentially 2D images exploiting the sparseness of 3D surfaces,
they are amenable to image CNNs and other network architectures designed for im-
ages. We use depth maps and other custom 2.5D maps (such as spherical maps in
Section 4.4) in recovering 3D shape in Chapter 4. In addition, Chapter 2 also vi-
sualizes many geometric properties such as surface normals and light visibility using
these 2.5D representations.
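To illustrate what depth and normal maps store, the sketch below renders both buffers for a sphere under a toy orthographic camera. The camera model, the sphere, and the use of `None` for background pixels are assumptions for this example only.

```python
import math

def render_sphere_buffers(res, radius=1.0, cam_z=-2.0):
    """Depth and normal maps of an origin-centered sphere seen by an
    orthographic camera at z = cam_z looking along +z (an assumed toy setup)."""
    depth = [[None] * res for _ in range(res)]
    normal = [[None] * res for _ in range(res)]
    for r in range(res):
        for c in range(res):
            # Pixel center mapped to [-1, 1] x [-1, 1] on the image plane.
            x = (c + 0.5) / res * 2.0 - 1.0
            y = (r + 0.5) / res * 2.0 - 1.0
            d2 = radius ** 2 - x * x - y * y
            if d2 < 0.0:
                continue                      # ray misses: background pixel
            z = -math.sqrt(d2)                # first (camera-facing) hit
            depth[r][c] = z - cam_z           # distance traveled along the ray
            normal[r][c] = (x / radius, y / radius, z / radius)
    return depth, normal

depth, normal = render_sphere_buffers(5)
center_depth = depth[2][2]     # ray through the image center
center_normal = normal[2][2]   # points back toward the camera
```

Moving the camera changes which surface points are visible, so both buffers would change, which is the view dependence noted above.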

1.1.2 Materials

With the shape defined, one next specifies the material properties for the object,
possibly in a spatially-varying way. The simplest material description is reflectance,
concerning only a local surface point where the light ray lands. Because this type of
reflectance depends on only the incoming and outgoing directions (𝜔i and 𝜔o ) w.r.t.
the local surface normal 𝑛 at that point (i.e., no non-local information required), it can
be conveniently expressed using a Bidirectional Reflectance Distribution Function
(BRDF): 𝑓 (𝜔i , 𝜔o ). Intuitively, 𝑓 (·) describes how the outgoing energy is distributed
over all possible 𝜔o ’s for each 𝜔i , as visualized in Figure 1-3. The fact that 𝜔i
and 𝜔o are often defined in the local frame with 𝑛 as the 𝑧-axis demonstrates why we
often require that the shape be defined before considering materials (not to mention that
we need geometry to find the ray-surface intersection too).

[Figure 1-3 panels, each drawn around the surface normal 𝑛: Mirror, Glossy, Diffuse, General.]

Figure 1-3: Example BRDFs. A perfectly reflective BRDF reflects the incoming light
to the mirrored direction. A glossy BRDF reflects light to a lobe of directions centered
around the mirror direction. A diffuse BRDF reflects light equally to all directions.
A general BRDF reflects light into all directions non-uniformly.

With this formulation, one can describe a diffuse material using the Lambertian
BRDF. Because a perfectly Lambertian material reflects the incoming light to all
outgoing directions equally, the Lambertian BRDF simply returns the same constant
for all 𝜔o ’s given any 𝜔i . Other commonly used BRDFs include the Blinn-Phong
reflection model [Blinn, 1977] and the microfacet BRDF by Walter et al. [2007],
both of which are capable of describing glossy materials with specular highlights (like
those shown in Figure 1-1). If an object has different BRDFs at different surface
locations, Spatially-Varying BRDFs (SVBRDFs) are necessary to specify its material
properties. In Chapter 2, we use the microfacet BRDF by Walter et al. [2007] as the
main reflection model [Srinivasan et al., 2021] and as an analytic alternative to our
learned BRDF [Zhang et al., 2021c].
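The contrast between diffuse and glossy BRDFs can be sketched in a few lines. Below, directions live in the local frame with the normal as the 𝑧-axis; the Lambertian BRDF is the constant albedo/π, and the glossy lobe is a simplified, unnormalized Blinn-Phong-style term. The parameter names (`albedo`, `k_s`, `shininess`) and the lobe itself are illustrative assumptions, not the microfacet model used in Chapter 2.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def lambertian_brdf(w_i, w_o, albedo=0.8):
    """Same constant for every direction pair; albedo / pi conserves energy."""
    return albedo / math.pi

def blinn_phong_brdf(w_i, w_o, k_s=0.5, shininess=50.0):
    """Unnormalized Blinn-Phong-style specular lobe (a simplified variant)."""
    h = normalize(tuple(a + b for a, b in zip(w_i, w_o)))  # half vector
    return k_s * max(h[2], 0.0) ** shininess               # h . n with n = +z

theta = math.radians(30.0)
w_i = (math.sin(theta), 0.0, math.cos(theta))
w_mirror = (-math.sin(theta), 0.0, math.cos(theta))  # mirror direction of w_i
w_other = (0.0, 0.0, 1.0)

diffuse_a = lambertian_brdf(w_i, w_mirror)
diffuse_b = lambertian_brdf(w_i, w_other)
glossy_peak = blinn_phong_brdf(w_i, w_mirror)
glossy_off = blinn_phong_brdf(w_i, w_other)
```

The diffuse value is identical for both outgoing directions, whereas the glossy lobe peaks at the mirror direction and falls off away from it.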
Although easy to use, these surface reflectance models deal with only local light
transport happening right at the ray-surface contact points. Therefore, they are unable
to express non-local light transport such as subsurface scattering (SSS), as commonly
observed on human skin [Hanrahan and Krueger, 1993], or light
transmission, as observed in translucent materials. As such, researchers have developed
more general material-describing functions such as the Bidirectional Scattering
Distribution Function (BSDF) by Jensen et al. [2001]. The first half of Chapter 2
computes local radiance values with BRDFs only (i.e., no scattering or transmittance)
but then employs volume rendering to alpha composite the resultant radiance values

along a camera ray [Srinivasan et al., 2021]. On the other hand, the second half opts
for an entirely surface-based treatment: Radiance is computed locally with BRDFs
only, and that local radiance directly arrives at the camera, with no volume rendering
or attenuation along the path.
Besides what has been discussed, there are other important BXDF (“X” being
a wildcard for “R,” “S,” etc.) topics that this dissertation does not touch on. For
instance, while many BXDF models are designed to look realistic, others are carefully
crafted to be physically correct, with properties that a naturally existing material
would possess, such as energy conservation and Helmholtz reciprocity. Our learned
BRDF in Chapter 2 [Zhang et al., 2021c] falls into the former category, with no
guarantee to be physically accurate.
Another essential BXDF topic is importance sampling: the technique that
samples Monte Carlo paths according to the BXDF to enable efficient, low-variance
rendering [Lawrence et al., 2004]. Incorporating such techniques into the
BRDFs in Chapter 2 could be interesting but challenging because the
BRDFs there are unknown and being estimated jointly [Srinivasan et al., 2021, Zhang
et al., 2021c].

1.1.3 Lighting

Shape and materials are intrinsic properties of an object. Extrinsic to the object are
lighting and the camera.
Broadly, lighting can be categorized into direct and indirect illumination. Direct
illumination is the light arriving at the object directly from the light source,
while indirect illumination is the light bounced to the object from another object
in the scene rather than the light source, as illustrated in Figure 1-4. Accounting
for indirect illumination as well is crucial to photorealism: Figure 1-4 shows a
comparison between a scene rendered with direct illumination only vs. with global
illumination. It is clear that simulating many light transport phenomena, such
as the green tint cast by the right green wall (pictured only indirectly via the mirror
ball) onto the other wall and the diffuse ball, requires modeling the indirect
illumination. In Chapter 2, we study inverse rendering for both setups: considering
one-bounce indirect illumination [Srinivasan et al., 2021] and considering just direct
illumination [Zhang et al., 2021c].

Figure 1-4: Direct vs. indirect illumination. A render made with direct illumination
only (A) misses global illumination effects present in the full render (B).
Panels: (A) Direct Illumination Only; (B) With Indirect Illumination.

How do we represent lighting? The two common representations are latitude-longitude
maps (or light probe images) and coefficients for some predefined basis
functions such as spherical harmonics or Gaussians. The former method directly
stores the High-Dynamic-Range (HDR) values for all possible latitude-longitude combinations
in the image grid, while the latter projects the spherical signal onto a set
of basis functions and stores just the coefficients. Clearly, the basis function
method has the advantage of being more compact (e.g., only 25 scalar coefficients for
five levels of spherical harmonics) than the latitude-longitude representation. However,
it struggles to represent arbitrary, high-frequency lighting that is easy to represent
with a latitude-longitude map (e.g., one with alternating pixel colors along both image
dimensions) unless an excessive number of coefficients is used. Chapter 2 uses exclusively
the latitude-longitude light probe representation [Srinivasan et al., 2021, Zhang
et al., 2021c]. Chapter 5 also uses an image representation but without the latitude-longitude
semantics to the pixel locations.
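As a concrete reading of the latitude-longitude representation, the sketch below maps each pixel of a small light probe to a direction on the sphere and to the solid angle it covers. The axis convention (polar angle measured from +z, rows indexing latitude) and the 16 × 32 resolution are assumptions for illustration; actual light probe formats differ in their conventions.

```python
import math

def latlong_direction(row, col, height, width):
    """Unit direction for a pixel center; theta is measured from +z (assumed)."""
    theta = math.pi * (row + 0.5) / height        # polar angle in (0, pi)
    phi = 2.0 * math.pi * (col + 0.5) / width     # azimuth in (0, 2*pi)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def latlong_solid_angle(row, height, width):
    """Approximate solid angle covered by any pixel in the given row."""
    theta = math.pi * (row + 0.5) / height
    return math.sin(theta) * (math.pi / height) * (2.0 * math.pi / width)

height, width = 16, 32
total = sum(latlong_solid_angle(r, height, width) * width for r in range(height))
d = latlong_direction(0, 0, height, width)
norm = math.sqrt(sum(c * c for c in d))
```

The per-pixel solid angles sum to roughly 4π, the area of the unit sphere, and they shrink toward the poles, which is why pixels near the top and bottom rows of such a map carry less weight.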
For each latitude-longitude direction of the light probe, we can compute light vis-
ibility at each scene point by casting a ray from that point to the latitude-longitude
direction and checking whether that ray gets blocked by other geometry. In a scene
without (semi-)transparent objects, every 3D point has a binary visibility to a given
light direction (blocked or not), although values in [0, 1] may also arise when we “rasterize”
the 3D visibility into visibility maps associated with different views, or when the
scene representation is volumetric as in Chapter 2 [Srinivasan et al., 2021]. Given a
light direction, the visibility map (as visualized in Figure 1-5) can be thought of as
a “shadow map,” informing us which pixels in this particular view are in shadow. If
we average these per-light visibility maps over all incoming light directions, we get
the ambient occlusion map (as visualized in Figure 1-5) that encodes how “exposed”
each point is to all light directions. We use these maps extensively in Chapter 2 for

Figure 1-5: Example shadow map and ambient occlusion map. Per-direction visibilities
can be thought of as shadow maps. Averaging visibilities over all light directions leads
to the ambient occlusion map quantifying how “exposed” each point is to all lights.
Panels: Shadow Map; Ambient Occlusion.
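The relationship between per-direction visibility and ambient occlusion can be sketched for a single surface point. In the toy scene below, a point on a ground plane sits beneath a spherical occluder; visibility is binary per direction, and ambient occlusion is their plain average. The scene, the restriction to the hemisphere above the surface, and the unweighted average are assumptions made for this example.

```python
import math

def hits_sphere(origin, direction, center, radius):
    """True if the ray origin + t * direction (t > 0, unit direction) hits the
    sphere; assumes the origin lies outside the sphere."""
    oc = tuple(c - o for c, o in zip(center, origin))
    b = sum(a * d for a, d in zip(oc, direction))          # projection onto the ray
    disc = b * b - (sum(a * a for a in oc) - radius ** 2)
    return b > 0.0 and disc >= 0.0

def hemisphere_directions(n_z=50, n_phi=64):
    """Directions over the +z hemisphere, uniform in solid angle."""
    dirs = []
    for i in range(n_z):
        z = (i + 0.5) / n_z
        s = math.sqrt(1.0 - z * z)
        for j in range(n_phi):
            phi = 2.0 * math.pi * (j + 0.5) / n_phi
            dirs.append((s * math.cos(phi), s * math.sin(phi), z))
    return dirs

def ambient_occlusion(point, occluders):
    """Average the binary per-direction visibilities at one point."""
    dirs = hemisphere_directions()
    visible = [0.0 if any(hits_sphere(point, d, c, r) for c, r in occluders) else 1.0
               for d in dirs]
    return sum(visible) / len(visible)

ao_open = ambient_occlusion((0.0, 0.0, 0.0), [])
ao_blocked = ambient_occlusion((0.0, 0.0, 0.0), [((0.0, 0.0, 2.0), 1.0)])
```

Without occluders the point is fully exposed; with the sphere overhead, the blocked cone of directions lowers the average.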

Many of the image features that we observe depend on the incoming light direction:
When this direction varies, image features such as shadows and specular
highlights change even though the viewpoint stays fixed. Consider the apple scene in Figure 1-1.
When the light bulb moves around, we will see the shadows and specular highlights
moving accordingly. Other light-dependent effects that are more subtle include
shadow softness and specularity spread: Still in the apple scene, if the light bulb
shrinks in size, approaching a point light, the cast shadows will become harder with
the penumbra gradually disappearing, and the specular highlights will become more
concentrated. Relighting is the problem of synthesizing such light-dependent effects
under novel lighting, addressed by Chapter 3 and Chapter 2 of this dissertation [Sun
et al., 2020, Zhang et al., 2021b, Srinivasan et al., 2021, Zhang et al., 2021c].

1.1.4 Cameras

Cameras record a 2D projection of the 3D world onto the image plane. The projection
is governed by camera extrinsics and intrinsics.
Camera extrinsics describes the rigid-body transformation from the world co-
ordinate system to the camera’s local coordinate system, usually in the form of a 3D
rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and a 3D translation vector $t \in \mathbb{R}^3$. Camera extrinsics
can then be expressed with a $3 \times 4$ matrix $E = [R \mid t]$. Therefore, given a 3D point
(in homogeneous coordinates) $x_\text{w}^\text{homo}$ in the world space, $x_\text{c} = E x_\text{w}^\text{homo}$ produces the
non-homogeneous coordinates of the same point in the camera’s local frame.
Camera intrinsics, on the other hand, specifies how the 3D-to-2D projection is
performed in the camera’s local space. In this dissertation (and in most computer
vision projects), we assume zero skew, square pixels, and an optical center located
at the image center. These assumptions lead to the $3 \times 3$ intrinsics matrix
$$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix},$$
where $f$ is the focal length in pixels$^1$, and $(h, w)$ is the image resolution (height and width).
With both extrinsics and intrinsics specified, the “one-stop” projection matrix $KE$ maps
$x_\text{w}^\text{homo}$ to its 2D homogeneous coordinates in the image space:
$x_\text{i}^\text{homo} = K E x_\text{w}^\text{homo}$. We refer the
reader to Szeliski [2010] for more on camera models and to Hartley and Zisserman
[2004] for in-depth mathematics on projective geometry.
We use these 3D-to-2D projections, their inversions (as simple as matrix inversion),
and their extensions heavily throughout this dissertation. Specifically, in Chapter 2,
we cast camera rays to the scene by inverting the aforementioned 3D-to-2D camera
projection [Srinivasan et al., 2021, Zhang et al., 2021c]. In Chapter 3, we resample
pixels from the camera space to the UV texture space and back [Zhang et al., 2021b].
Finally, we estimate the extrinsic and intrinsic parameters [Sun et al., 2018b] and
backproject 2.5D depth maps to the 3D space (and to “spherical maps”) [Zhang et al.,
2018b] in Chapter 4.
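The projection pipeline above can be sketched end to end in a few lines. The code below builds the intrinsics matrix under the stated assumptions, projects world points through the extrinsics and intrinsics, and inverts the projection to obtain a camera ray for a pixel. The function names, the 36 mm sensor width, and the identity camera pose are illustrative assumptions.

```python
def focal_mm_to_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Scale a focal length in mm by the number of pixels per mm of sensor."""
    return focal_mm * image_width_px / sensor_width_mm

def intrinsics(f, h, w):
    """K under zero skew, square pixels, and a centered optical center."""
    return [[f, 0.0, w / 2.0],
            [0.0, f, h / 2.0],
            [0.0, 0.0, 1.0]]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def project(K, R, t, x_w):
    """World point -> pixel coordinates via x_i = K E x_w with E = [R | t]."""
    E = [list(R[i]) + [t[i]] for i in range(3)]   # 3 x 4 extrinsics
    x_c = matvec(E, list(x_w) + [1.0])            # camera-frame coordinates
    x_i = matvec(K, x_c)                          # homogeneous image coordinates
    return (x_i[0] / x_i[2], x_i[1] / x_i[2])

def pixel_to_ray(K, u, v):
    """Camera-frame ray direction through pixel (u, v), i.e., K^-1 [u, v, 1]."""
    f, cx, cy = K[0][0], K[0][2], K[1][2]
    return ((u - cx) / f, (v - cy) / f, 1.0)

f = focal_mm_to_pixels(50.0, 36.0, 720)          # focal length in pixels
K = intrinsics(f, h=480, w=720)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
center_px = project(K, R, t, (0.0, 0.0, 5.0))    # point on the optical axis
side_px = project(K, R, t, (0.5, 0.0, 5.0))      # point shifted along x
ray = pixel_to_ray(K, 360.0, 240.0)              # ray through the image center
```

A point on the optical axis lands at the image center, and the ray cast back through that pixel points straight along the camera's 𝑧-axis.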
There is a surprising (and perhaps unintuitive) duality between cameras and
lights as shown by the Dual Photography work of Sen et al. [2005], where they
successfully synthesize the scene appearance from the projector’s perspective and
also relight the scene as if the camera were the projector (light). Similar to the
light-dependent effects discussed above, view-dependent effects are the appearance
variations due to viewpoint changes. Unsurprisingly, specularity moves as you view
it from different viewpoints, e.g., by swaying your head left and right. Shadows,
however, are seldom view-dependent: Shadows do not move w.r.t. the rest of the
3D scene as the viewpoint varies. This is a distinction between cameras and lights
despite their similarities in other aspects. The task of view synthesis is about
synthesizing the view-dependent effects for a novel viewpoint, and we address this
task in Chapter 3 and Chapter 2 [Zhang et al., 2021b, Srinivasan et al., 2021, Zhang
et al., 2021c].

$^1$To convert a mm focal length to pixels, one needs to compare the image resolution (which is
in pixels) with the effective sensor size (which is in mm), then compute how many pixels 1 mm
translates to, and finally scale the mm focal length accordingly.

1.1.5 Rendering

We have defined the four essential scene aspects—shape, materials, lighting, and
cameras—and introduced their representations commonly used. The final missing
piece of the puzzle is rendering, the process of “combining” the four elements into an
RGB image.
To figure out the appearance for a 3D point 𝑥, one solves the rendering equa-
tion [Kajiya, 1986, Immel et al., 1986] often using Monte Carlo methods. In this
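dissertation where no object emits light, we simplify the full equation to:

$$L_\text{o}(x, \omega_\text{o}) = \sum_{\omega_\text{i}} R(x, \omega_\text{i}, \omega_\text{o}) \, L_\text{i}(x, \omega_\text{i}) \, (\omega_\text{i} \cdot n) \, \Delta\omega_\text{i}, \tag{1.1}$$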

where 𝐿o (𝑥, 𝜔o ) is the outgoing radiance at 𝑥 as viewed from 𝜔o , 𝑅(𝑥, 𝜔i , 𝜔o ) is the

SVBRDF at 𝑥 with directions 𝜔i and 𝜔o , 𝐿i (𝑥, 𝜔i ) is the incoming radiance (masked
by the visibility) arriving at 𝑥 along 𝜔i , 𝑛 is the surface normal at 𝑥, and ∆𝜔i is the
solid angle corresponding to the lighting sample at 𝜔i .
Note the recursive nature of Equation 1.1: 𝐿i (𝑥, 𝜔i ) in this iteration may equal
𝐿o (𝑥, 𝜔o ) from the previous iteration, e.g., when computing indirect illumination. In

the first half of Chapter 2, there is such recursion: 𝐿i is the sum of a light probe
pixel and the one-bounce indirect illumination from a nearby point [Srinivasan et al.,
2021], whereas in the second half, 𝐿i directly takes values from the light probe pixels
since we consider only direct illumination [Zhang et al., 2021c].
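To see Equation 1.1 in action, the sketch below evaluates it as a sum over sampled light directions for a single point with a Lambertian BRDF under uniform lighting. The latitude-longitude sampling of 𝜔i, the constant lighting, and the function names are assumptions for illustration, and only direct illumination is considered (no recursion).

```python
import math

def render_point(brdf, light, normal, w_o, n_theta=32, n_phi=64):
    """Sum R * L_i * (w_i . n) * d_omega over a latitude-longitude grid of w_i."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0.0
    for a in range(n_theta):
        theta = (a + 0.5) * d_theta
        d_omega = math.sin(theta) * d_theta * d_phi     # solid angle of the sample
        for b in range(n_phi):
            phi = (b + 0.5) * d_phi
            w_i = (math.sin(theta) * math.cos(phi),
                   math.sin(theta) * math.sin(phi),
                   math.cos(theta))
            cos_term = sum(p * q for p, q in zip(w_i, normal))
            if cos_term <= 0.0:
                continue                                # light from below the surface
            total += brdf(w_i, w_o) * light(w_i) * cos_term * d_omega
    return total

albedo = 0.8
lambertian = lambda w_i, w_o: albedo / math.pi
uniform_light = lambda w_i: 1.0      # visibility assumed to be 1 everywhere
radiance = render_point(lambertian, uniform_light, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```

For this setup, the sum approximates the closed-form answer: a Lambertian surface under unit uniform lighting returns radiance equal to its albedo. Shadowing would enter through the `light` callable, which plays the role of the visibility-masked incoming radiance.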

Although the rendering equation is expressive and general, one may not be able
to or need to fully decompose Equation 1.1 into every term. For instance, it is
error-prone, if possible at all, to explicitly find 𝑅 from samples of 𝐿o in the setup
of Chapter 3. Moreover, it is unnecessary to solve for every term in Equation 1.1
just for relighting and view synthesis in that setup since we do not plan to edit the
materials 𝑅. In such cases, a middle level of abstraction such as the light transport
function proves useful. Formally, we reparameterize Equation 1.1 at a higher level
of abstraction:
$$L_\text{o}(x, \omega_\text{o}) = \sum_{\omega_\text{i}} T(x, \omega_\text{i}, \omega_\text{o}) \, L'_\text{i}(\omega_\text{i}) \, \Delta\omega_\text{i}, \tag{1.2}$$

where 𝑇 (𝑥, 𝜔i , 𝜔o ) is the light transport function that encompasses the BRDF, cosine
term, light visibility, and the recursive nature of 𝐿i , and 𝐿′i (𝜔i ) is the light intensity
from 𝜔i . Crucially, unlike 𝐿i , 𝐿′i (𝜔i ) bears no dependency on 𝑥, thereby eliminating
the recursive nature of Equation 1.1. Intuitively, 𝑇 directly returns the “net radi-
ance” at 𝑥 when lit from 𝜔i and viewed from 𝜔o , concealing the actual recursion of
intermediate light bounces.
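One practical consequence of Equation 1.2 is that the outgoing radiance is linear in the lighting once 𝑇 is known. The sketch below stores 𝑇 for a fixed view as a small table indexed by pixel and light direction and relights by a weighted sum. The table values are made up, and the names are hypothetical; Chapter 3 learns to interpolate 𝑇 rather than tabulating it.

```python
def relight(transport, light, solid_angle):
    """Per-pixel radiance: sum over k of T[p][k] * L'[k] * d_omega[k].

    transport[p][k] is T for pixel p and light direction k at a fixed view.
    """
    return [sum(t_pk * l_k * dw_k for t_pk, l_k, dw_k in zip(row, light, solid_angle))
            for row in transport]

# Toy transport for 2 pixels and 2 light directions (made-up numbers).
T = [[1.0, 0.5],
     [0.0, 2.0]]
d_omega = [0.5, 0.5]

image_light0 = relight(T, [1.0, 0.0], d_omega)   # only light 0 on
image_light1 = relight(T, [0.0, 1.0], d_omega)   # only light 1 on
image_both = relight(T, [1.0, 1.0], d_omega)     # both lights on
```

The image under both lights equals the sum of the two single-light images, so any new lighting can be synthesized without revisiting the underlying shape or reflectance.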

Chapter 3 demonstrates the usefulness of Equation 1.2, a level of abstraction
higher than the full decomposition of Equation 1.1. Instead of solving for the complex
reflectance of human skin, we opt to learn to directly interpolate 𝑇 , thereby
supporting relighting [Sun et al., 2020] or simultaneous relighting and view synthesis
[Zhang et al., 2021b]. That said, a decomposition shallower than Equation 1.1 has
its own disadvantages: Such an approach is unable to export the underlying geometry
or edit the materials. In contrast, the low-level abstraction that explicitly solves
for every term in Equation 1.1 (Chapter 2) further supports geometry estimation and
material editing besides free-viewpoint relighting [Srinivasan et al., 2021, Zhang et al., 2021c].

1.2 Inverting the Image Formation Process

Although the image formation process is so well understood that we can render im-
ages indistinguishable from real photographs, inverting this process—recovering from
images the scene properties that we discussed—is still highly challenging because in-
formation loss is huge during the forward process: 3D shape gets projected to the 2D
image plane; reflectance and lighting get convolved together and then observed. In
other words, inverting the image formation process is ill-posed: There are multiple
sets of scene elements that could have caused the images that we observe.
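
The ambiguity can be illustrated with a toy shading model. The sketch below assumes a single diffuse surface point whose pixel value is the product of albedo, light intensity, and a cosine term (normalization constants are ignored); the function name `shade` and all values are hypothetical.

```python
def shade(albedo, light_intensity, cosine):
    """Toy diffuse shading: pixel value as a product of three scene factors."""
    return albedo * light_intensity * cosine


# A dark surface under bright light and a bright surface under dim light
# produce the same observation, so one pixel cannot tell them apart.
dark_surface_bright_light = shade(albedo=0.25, light_intensity=2.0, cosine=1.0)
bright_surface_dim_light = shade(albedo=0.5, light_intensity=1.0, cosine=1.0)
```

Both calls yield the same pixel value, which is why additional observations or priors are needed to pick one explanation over the other.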

In this section, we introduce four subproblems under the overarching theme of
this dissertation: shape, reflectance, and illumination from appearance. Each sub-
problem has its own dedicated chapter, and the following subsections correspond to
the upcoming four chapters. I) Corresponding to Chapter 2 (“shape, reflectance, and
illumination from appearance”), Section 1.2.1 introduces the task of jointly estimat-
ing shape, reflectance, and illumination from multi-view images. II) Section 1.2.2
proposes the problem of interpolating the light transport function, to which Chap-
ter 3 (“light transport function from appearance”) is dedicated. III) Section 1.2.3
introduces the problem of reconstructing 3D shapes from single images, corresponding
to Chapter 4 (“shape from appearance”). IV) Preparing the reader for Chapter 5
(“lighting from appearance”), Section 1.2.4 defines the task of lighting recovery from
the appearance of the illuminated object.

The four subproblems also represent three different levels of abstraction for the
inverse rendering problem. At a low level of abstraction, Chapter 2 attempts to solve
for every term in our (simplified) rendering equation (Equation 1.1) by re-rendering
all the estimated elements back to RGB images, which then get compared against the
observed images for loss computation. Although challenging, this low-level approach
allows us to export the estimated shape and edit the estimated reflectance in ad-
dition to what a mid-level abstraction would support. Ascending to a middle level
of abstraction, Chapter 3 explores interpolating the light transport function (as in
Equation 1.2) given sparse samples thereof. Our models based on this mid-level
abstraction enable relighting, view synthesis, and both tasks simultaneously while easily
including global illumination effects. Finally, at a high level of abstraction, Chapter 4
and Chapter 5 recover shape or lighting from single images, without modeling the
other scene elements or the rendering process. Relying on large datasets of shapes
or lighting patterns, these two chapters train deep learning models that directly map
the appearance observations to the underlying shape or lighting.

1.2.1 Joint Estimation of Shape, Reflectance, & Illumination

In this subsection, we introduce the problem that Chapter 2 attempts to solve: es-
timating shape, reflectance, and illumination from the object’s appearance. This
amounts to explicitly solving for every term in Equation 1.1 and then re-rendering
these estimated factors into RGB images in a physically-based manner. As such, this
low level of abstraction supports operations on the estimated factors, such as lighting
editing (i.e., relighting), reflectance editing (i.e., material change), and shape export
(e.g., into a graphics engine).
Note that the well-known problem of Intrinsic Image Decomposition (IID) [Barrow
and Tenenbaum, 1978] solves only part of this factorization problem. In terms of
shape, the IID methods recover depth or surface normal maps only for the input
view, rather than a full 3D shape [Weiss, 2001, Tappen et al., 2003, Bell et al.,
2014, Barron and Malik, 2014, Janner et al., 2017]. This makes view synthesis with
these approaches impossible. Material-wise, these IID methods mostly assume
Lambertian reflectance and tend to fail on more complicated materials. Finally,
lighting recovered by the IID approaches is also in the space of the input view (e.g., a
“lighting image”), making relighting with arbitrary lighting difficult. The appearance
factorization approaches that we propose in Chapter 2 address all of these issues that
the IID methods suffer from.
In Chapter 2, we study full appearance decomposition under two setups. In the
first setup, we assume that we observe the object under multiple arbitrary but known
lighting conditions [Srinivasan et al., 2021]. Note that “arbitrary” means that the
lighting does not have to be of a certain form such as one point light in the dark.

We also model first-bounce indirect illumination in this setup. In the second setup,
we relax the requirement for input lighting: We observe the object under only one
unknown lighting condition [Zhang et al., 2021c]. This relaxation allows us to apply
our method to a user capture under a natural, unknown lighting condition, such as
a capture of a car on the street.

1.2.2 Interpolating the Light Transport Function

As discussed in Section 1.1.5, the light transport function 𝑇 is a convenient
abstraction of complex BXDFs and ray bounces. Having access to 𝑇 enables relighting and
view synthesis: When we query 𝑇 at novel light directions, we are relighting the
scene, without actually knowing the underlying shape or BXDF; when we query 𝑇 at
novel viewing directions, we are synthesizing novel views of the scene, again without
having to evaluate Equation 1.1. Relighting and view synthesis, as applications of
light transport function interpolation, have their own more downstream applications
in Extended Reality (XR), as already discussed.
Recall that at the low level of abstraction, we can already perform relighting, view
synthesis, and both tasks simultaneously. Why do we need this mid-level abstraction
using 𝑇 , especially given that it would not support material editing or shape export?
It is still preferable to perform relighting and view synthesis with 𝑇 because 𝑇 en-
compasses the convoluted interactions between BXDFs and illuminations (which by
themselves may be already complex), making no simplifying assumption as needed
by the low-level abstraction. Therefore, this mid-level abstraction can deliver high-
quality relighting with global illumination effects, without requiring the underlying
geometry and BXDFs be estimated or multiple bounces be simulated. In contrast,
at the low level of abstraction, even the state-of-the-art model of ours supports only
one-bounce indirect illumination [Srinivasan et al., 2021], and most physically-based
models including Zhang et al. [2021c] consider only direct illumination.
In Chapter 3, we study the middle level of abstraction using the light transport
function 𝑇 . We first learn to interpolate 𝑇 in only the light direction, by observing
sparse samples of 𝑇 [Sun et al., 2020]. Although such interpolation supports only
relighting (i.e., not view synthesis), this approach has the advantages of being purely
image-based and requiring no 3D modeling. With the additional input of a geometry
proxy, we continue exploring the interpolation of 𝑇 in both light and view directions,
thereby enabling simultaneous relighting and view synthesis [Zhang et al., 2021b].

1.2.3 Shape Reconstruction

Reconstructing 3D shapes from images is an important subproblem within inverse
rendering. It has wide applications in robotics, autonomous driving, Virtual/Augmented
Reality (VR/AR), etc. To name a few example applications, a robotic system often
needs to understand the 3D shape of an object before being able to manipulate (e.g.,
grasp) it; driver-less cars need 3D understanding to avoid obstacles; an AR system
has to know the shape of a desk before allowing a user to place a virtual object on it.
Computer vision, graphics, and cognitive science researchers have been working
on shape reconstruction for decades, with a series of notable “Shape From X” works.
Shape From Shading attempts to recover the shape of a surface from the shading vari-
ation [Horn, 1975]. When multiple images lit by different light sources are available,
Photometric Stereo² performs shape from shading more robustly [Woodham, 1981].
Shape From Texture aims to recover surface geometry from an image of texture often
using the prior that texture should be roughly regular [Witkin, 1981]. Depth From
Defocus infers depth from the strong depth cue of blur [Pentland, 1987]. Multi-View
Stereo reconstructs 3D shape from multi-view images of the object [Seitz et al., 2006].
Structure From Motion aims to recover both the 3D geometry and camera poses from
a series of images [Ullman, 1979].

² The word “stereo” again implies the duality between lights and cameras [Szeliski, 2010] as discussed in Section 1.1.4.

Made possible by recent advances in deep learning, single-image 3D shape recon-
struction concerns estimating the 3D shape from just a single generic image (i.e.,
not necessarily an image of just texture) by learning category-specific (e.g., chairs)
priors from a large-scale dataset of shapes. Chapter 4 studies two such problems.
The first problem that Chapter 4 tackles is that when trained with a supervised loss,
the reconstruction network tends to produce blurry “mean shapes” that satisfy the
ℓ2 loss but do not look realistic [Wu et al., 2018]. The second problem addressed by
Chapter 4 is the generalizability of these reconstruction networks: They work well
only on the shape categories seen during training but generalize poorly to novel shape
classes, still “retrieving” shapes from the training classes [Zhang et al., 2018b].
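
One common explanation for the “mean shape” behavior is that a single image can be consistent with several plausible shapes, and the prediction that minimizes the expected ℓ2 loss over them is their average. The sketch below illustrates this with two hypothetical, flattened occupancy grids; the helper names and the data are assumptions made for illustration only.

```python
def l2_loss(pred, target):
    """Sum of squared differences between two flattened occupancy grids."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def expected_l2(pred, plausible_shapes):
    """Average l2 loss of one prediction against equally likely ground truths."""
    return sum(l2_loss(pred, s) for s in plausible_shapes) / len(plausible_shapes)


# Two equally plausible shapes that explain the same (hypothetical) image.
shape_a = [1.0, 1.0, 0.0, 0.0]
shape_b = [1.0, 0.0, 1.0, 0.0]
plausible = [shape_a, shape_b]

# The voxel-wise average has fractional occupancy, i.e., it looks blurry.
mean_shape = [(a + b) / 2 for a, b in zip(shape_a, shape_b)]

loss_of_mean = expected_l2(mean_shape, plausible)
loss_of_sharp = expected_l2(shape_a, plausible)
```

Here the blurry average attains a lower expected loss than either sharp shape, even though its half-occupied voxels do not correspond to a realistic object.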
Operating at a high level of abstraction, all solutions proposed in Chapter 4 treat
rendering as a black box and invert it directly with deep learning models that learn
data-driven priors. These models based on the high level of abstraction rely on data
rather than physics and have the advantage of being applicable to single images (cf.
multiple images as required by the mid- and low-level abstractions).

1.2.4 Lighting Recovery

Recovering lighting from the scene or object appearance is a challenging subproblem
of inverse rendering that has wide applications. The most relevant applications of
lighting recovery are arguably in AR. For instance, when an AR user wants to insert
a virtual object into their scene, it is crucial to have the target lighting recovered
so that the virtual object can be lit properly by the same lighting and thus appear
consistent with the real scene. Similarly, in future AR communication systems
(developed hopefully not due to another pandemic), Alice needs to relight the teleported
Bob using Alice’s lighting, which needs recovering, for a photorealistic face-to-face
In Chapter 5 of this dissertation, we aim to recover lighting that is responsible for
the appearance of the illuminated object as captured in a single image. Specifically,
we study this problem in a special Moon-Earth setup where the Earth serves as the
light source that we aim to recover, and the Moon is the Earth-lit object that we
observe. Note that in reality, the Sun is the light source emitting light that travels
to the Earth and bounces off to the dark side of the Moon (whose bright side gets
directly lit by the Sun), and we are simplifying the setup by removing the Sun and
making the Earth emissive. At the current stage of this work in progress, we perform
all of our experiments on simulated data, and testing the model on real captured data
remains future work.
As alluded to previously, Chapter 5 continues to stay at the high level of abstrac-
tion. Specifically, we train a conditional generative model to directly “regress” the
Earth image from the Moon observation and the timestamp. Our data-driven solu-
tion circumvents the need to model the image formation process for this extreme case
and enables lighting recovery from single images.

1.3 Dissertation Structure

The overarching theme of this dissertation is recovering shape, reflectance, and illumi-
nation from appearance. We study four instances of inverse rendering: I) “shape,
reflectance, and illumination from appearance” in Chapter 2, II) “light transport func-
tion from appearance” in Chapter 3, III) “shape from appearance” in Chapter 4, and
IV) “lighting from appearance” in Chapter 5.
These four subtopics represent three levels of abstraction to tackle inverse
rendering: I) the low level of abstraction where we explicitly solve for every term—
shape, reflectance, and illumination—in the rendering equation (Equation 1.1) in a
physically-based manner, achieving full editability and exportability that a mid- or
high-level solution is incapable of, II) the middle level where we utilize the light
transport function (𝑇 in Equation 1.2) to abstract away intermediate light transport
and focus on just the final “net effect,” delivering high-quality relighting results with
global illumination effects for challenging reflectance (such as that of human skin),
and III) the high level where we treat rendering as a black box and invert it with
data-driven priors, supporting single-image input at test time.
In Chapter 1 (this chapter), we have introduced the image formation process by
explaining the four main scene elements (i.e., shape, materials, lighting, and cameras),
their representations in computer vision and graphics, and the rendering process that
“combines” these elements into images that we see. We then defined the problem of in-
verting the image formation process, where we aim to recover the scene elements from
image observations passively. Specifically, we have provided the problem statements
for the aforementioned four instances of inverse rendering.

At the low level of abstraction, Chapter 2 presents physically-based models
that solve for every term in the rendering equation (Equation 1.1), recovering shape,
reflectance, and illumination from multi-view images of the object and their camera
poses. This is the right level of abstraction when we need to further edit the estimated
shape or material and export them into common graphics pipelines. The two setups
that we consider are multiple arbitrary but known lighting conditions and a single
unknown lighting condition. For the former case, we develop Neural Reflectance
and Visibility Fields (NeRV) that estimates the shape and reflectance of an object
while modeling one-bounce indirect illumination [Srinivasan et al., 2021]. Under the
latter setup, we present Neural Factorization of Shape and Reflectance (NeRFactor)
capable of jointly estimating shape, reflectance, and the unknown lighting [Zhang
et al., 2021c].

At the middle level of abstraction, Chapter 3 utilizes the light transport func-
tion, 𝑇 in Equation 1.2, to abstract away intermediate light bounces and model
directly the “net effect” radiance. Specifically, we attempt to interpolate 𝑇 from the
sparse samples thereof. This is the right abstraction level for our problem, at which we
can perform high-quality relighting and view synthesis including global illumination
effects, without having to explicitly solve for geometry and reflectance or simulate all
light bounces. We first interpolate the light transport function in just the light direc-
tion, achieving precise, high-frequency portrait relighting with a model that we call
Light Stage Super-Resolution (LSSR) [Sun et al., 2020]. With the additional input of
a geometry proxy, we then develop Neural Light Transport (NLT) that interpolates
𝑇 in both the light and view directions, enabling simultaneous relighting and view
synthesis of humans with complex geometry and reflectance [Zhang et al., 2021b].

At the high level of abstraction, Chapter 4 studies shape from appearance by
treating rendering as a black box and inverting it with a data-driven machine learn-
ing approach. Specifically, we consider single-image 3D shape reconstruction, where
neural networks learn a direct mapping from a single image to the 3D shape therein
using data-driven shape priors. We first present ShapeHD, a model that achieves
high-quality reconstruction with an adversarially learned perceptual loss [Wu et al.,
2018]. Tackling the generalization problem of ShapeHD and similar learning mod-
els, we then propose Generalizable Reconstruction (GenRe) capable of generalizing
to novel shape categories unseen during training [Zhang et al., 2018b]. Finally, we
briefly discuss how Pix3D—our own real-world dataset of image-shape pairs with
pixel-level alignment—is constructed [Sun et al., 2018b] and facilitates the evaluation
of ShapeHD and GenRe.
Staying at the high level of abstraction, Chapter 5 presents the current progress
of our work on data-driven lighting recovery from appearance, where we train a con-
ditional generative model of possible lighting patterns given various appearances of
the object illuminated. We frame this problem in a special Moon-Earth setup where
the Earth, as the light source, illuminates the dark side of the Moon. Our model,
Generative Adversarial Networks for the Earth (EarthGANs), aims to recover the
Earth appearance given a single-pixel Moon appearance and the corresponding
timestamp. This is the proper level of abstraction that circumvents the need to model this
extreme image formation process and makes EarthGANs applicable to a “backyard
image” taken by a mobile phone camera.
Finally, Chapter 6 concludes the dissertation and discusses future directions.

This dissertation draws on multiple collaborative projects that I either led or
participated in. Here we list all the publications involved and organize them by level
of abstraction and chapter (the asterisk indicates equal contribution):

Low Level of Abstraction (Chapter 2)

• Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben
Mildenhall, and Jonathan T. Barron. NeRV: Neural Reflectance and Visibil-
ity Fields for Relighting and View Synthesis. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2021.

• Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William
T. Freeman, and Jonathan T. Barron. NeRFactor: Neural Factorization of
Shape and Reflectance Under an Unknown Illumination. ACM Transactions
on Graphics (TOG), TBA, 2021.

Middle Level of Abstraction (Chapter 3)

• Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhe-
mann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, and Ravi Ramamoor-
thi. Light Stage Super-Resolution: Continuous High-Frequency Relighting.
ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.

• Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Ro-
hit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul
Debevec, Jonathan T. Barron, Ravi Ramamoorthi, and William T. Freeman.
Neural Light Transport for Relighting and View Synthesis. ACM Transactions
on Graphics (TOG), 40(1):1–17, 2021.

High Level of Abstraction (Chapter 4 & Chapter 5)

• Xingyuan Sun*, Jiajun Wu*, Xiuming Zhang, Zhoutong Zhang, Chengkai Zh-
ang, Tianfan Xue, Joshua B. Tenenbaum, and William T. Freeman. Pix3D:
Dataset and Methods for Single-Image 3D Shape Modeling. In IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

• Jiajun Wu*, Chengkai Zhang*, Xiuming Zhang, Zhoutong Zhang, William T.
Freeman, and Joshua B. Tenenbaum. Learning 3D Shape Priors for Shape
Completion and Reconstruction. In European Conference on Computer Vision
(ECCV), 2018.

• Xiuming Zhang*, Zhoutong Zhang*, Chengkai Zhang, Joshua B. Tenenbaum,
William T. Freeman, and Jiajun Wu. Learning to Reconstruct Shapes From Un-
seen Classes. In Advances in Neural Information Processing Systems (NeurIPS),
2018.

• Xiuming Zhang and William T. Freeman. Data-Driven Lighting Recovery: A
Moon-Earth Case Study. Work in Progress, 2021.


Chapter 2

Low-Level Abstraction:
Physically-Based Appearance

In this chapter, we model appearance at a low level of abstraction, explicitly solving
for every term in the rendering equation (Equation 1.1). Specifically, we address the
problem of estimating shape, Spatially-Varying Bidirectional Reflectance Distribution
Functions (SVBRDFs), and direct or indirect illumination from multi-view images of
an object lit by a single unknown or multiple arbitrary but known lighting conditions.
With our appearance factorization, one is able to synthesize the object appearance
from a novel viewpoint under any arbitrary lighting. Crucially, our approaches explic-
itly model visibility and are therefore able to not only remove shadows from albedo
during training but also render soft and hard shadows under novel test lighting.
We start with an introduction of inverse rendering (Section 2.1) and then review
the related work in Section 2.2. Next, we present Neural Reflectance and Vis-
ibility Fields (NeRV) that is capable of jointly estimating from scratch shape,
SVBRDFs, and indirect illumination from multi-view images of an object lit by mul-
tiple arbitrary (but known) lighting conditions (Section 2.3) [Srinivasan et al., 2021].
To relax the capture requirement of multiple known illuminations in NeRV, we further
devise Neural Factorization of Shape and Reflectance (NeRFactor) that
factorizes the object appearance into shape, SVBRDFs, and direct illumination from
multi-view images of an object lit by just one unknown lighting condition (Section 2.4)
[Zhang et al., 2021c].
In Section 2.5, we describe our experiments that evaluate how well NeRV and
NeRFactor perform appearance decomposition (and subsequently free-viewpoint re-
lighting), and how they compare with the existing solutions to our tasks, under two
setups: multiple arbitrary but known lighting conditions (for NeRV) and one un-
known lighting condition (for NeRFactor). We also perform additional analyses, in
Section 2.6, to study the importance of each major component of the NeRV and NeR-
Factor models and analyze whether NeRFactor predicts albedo consistently for the
same object when lit by different lighting conditions.

2.1 Introduction

Recovering an object’s geometry and material properties from captured images, such
that it can be rendered from arbitrary viewpoints under novel lighting conditions,
is a longstanding problem within computer vision and graphics. In addition to its
importance for recognition and robotics, a solution to this could democratize 3D
content creation and allow anyone to use real-world objects in Extended Reality (XR)
applications, film-making, and game development. The difficulty of this problem
stems from its fundamentally underconstrained nature, and prior work has typically
addressed this either by using additional observations such as scanned geometry or
images of the object under controlled laboratory lighting conditions, or by making
restrictive assumptions such as assuming a single material for the entire object or
ignoring self-shadowing.
The vision and graphics communities have recently made substantial progress
towards the novel view synthesis portion of this goal. Neural Radiance Fields (NeRF)
has shown that it is possible to synthesize photorealistic images of scenes by training
a simple neural network to map 3D locations in the scene to a continuous field of
volume density and color [Mildenhall et al., 2020]. Volume rendering is trivially
differentiable, so the parameters of a NeRF can be optimized for a single scene by
using gradient descent to minimize the difference between renderings of the NeRF
and a set of observed images. Although NeRF produces compelling results for view
synthesis, it does not provide a solution for relighting. This is because NeRF models
just the amount of outgoing light from a location – the fact that this outgoing light
is the result of interactions between incoming light and the material properties of an
underlying surface is ignored.
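
The following is a minimal sketch of the standard volume rendering quadrature used by NeRF-style models, assuming that the volume density and color have already been queried at discrete samples along one camera ray. The function name `composite` and the sample values are illustrative; a real implementation would run in a framework with automatic differentiation so that the same arithmetic can be optimized by gradient descent.

```python
import math


def composite(densities, colors, deltas):
    """Alpha-composite per-sample colors along one ray, front to back.

    densities[i] -- volume density at the i-th sample
    colors[i]    -- RGB color at the i-th sample
    deltas[i]    -- distance between the i-th sample and the next one
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Toy ray: empty space (red, but zero density) in front of an opaque green sample.
toy_pixel, toy_weights = composite(
    densities=[0.0, 1e9],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

Every step is an exponential, a product, or a sum, which is why the rendering is differentiable with respect to the densities and colors. Note that the colors enter only as outgoing radiance: nothing in this computation refers to incoming light, which is the limitation discussed above.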

At first glance, extending NeRF to enable relighting appears to require only chang-
ing the image formation model: Instead of modeling scenes as fields of density and
view-dependent color, we can model surface normals and material properties (e.g.,
the parameters of a Bidirectional Reflectance Distribution Function [BRDF]), and
simulate the transport of the scene’s light sources (which we first assume are known)
according to the rules of physically-based rendering [Pharr et al., 2016]. However,
simulating the attenuation and reflection of light by particles is fundamentally chal-
lenging in NeRF’s neural volumetric representation because content can exist any-
where within the scene, and determining the density at any location requires querying
a neural network.

Consider the naïve procedure for computing the radiance along a single camera
ray due to direct illumination, as illustrated in Figure 2-1: First, we query NeRF’s
Multi-Layer Perceptron (MLP) for the volume density at samples along the cam-
era ray to determine the amount of light reflected by particles at each location that
reaches the camera. For each location along the camera ray, we then query the MLP
for the volume density at densely sampled points between the location and every light
source to estimate the attenuation of light before it reaches that location. This proce-
dure quickly becomes prohibitively expensive if we want to model environment light
sources or global illumination, in which case scene points may be illuminated from
all directions. Prior methods for estimating relightable volumetric representations
from images have not overcome this challenge and can only simulate direct illumina-
tion from a single point light source when training. This is what we refer to as the
“computational complexity problem” of extending NeRF for relighting.

[Figure 2-1 panels: Naïve (left), Ours (right)]

Figure 2-1: How NeRV reduces the computational complexity. Brute-force light
transport simulation through NeRF’s volumetric representation with naïve raymarching
(left) is intractable. By approximating visibility with a neural visibility field (right)
that is optimized alongside the shape MLP, we are able to make optimization with
complex illumination tractable. 𝑛 is the number of samples along each ray, ℓ is the
number of light sources, and 𝑑 is the number of indirect illumination directions
sampled. Black dots represent evaluating a shape MLP for volume density at a position,
red arrows represent evaluating the visibility MLP at a position along a direction,
and the blue arrow represents evaluating the visibility MLP for the expected
termination depth of a ray. Output visibility multipliers and termination depths from
the visibility MLP are displayed as text.

The problem of efficiently computing visibility is well explored in the graphics
literature. In standard raytracing graphics pipelines, where the scene geometry is
fixed and known ahead of time, a common solution is to precompute a data structure
that can be efficiently queried to obtain the visibility between pairs of scene points or
between scene points and light sources. This can be accomplished with approaches
including octrees [Samet, 1989], distance transforms [Cohen and Sheffer, 1994], or
bounding volume hierarchies [Pharr et al., 2016]. However, these existing approaches
do not provide a solution to our task: Our geometry is unknown, and our model’s
estimate of geometry changes constantly as it is optimized. Although conventional
data structures could perhaps be used to accelerate rendering after optimization is
complete, we need to efficiently query the visibility between points during optimiza-
tion, and existing solutions are prohibitively expensive to rebuild after each training
iteration (of which there may be millions).

In the first half of this chapter, we present Neural Reflectance and Visibility
Fields (NeRV), an approach for estimating a volumetric 3D representation from
images of a scene under multiple arbitrary but known lighting conditions [Srinivasan
et al., 2021], such that novel images can be rendered from arbitrary unseen viewpoints
and under novel unobserved lighting conditions, as shown in Figure 2-2.

(a) Input images of the scene under unconstrained varying (known) lighting conditions

(b) Output renderings from novel viewpoints and lighting conditions

Figure 2-2: Example input and output of NeRV. We optimize a NeRV 3D represen-
tation from multi-view images of a scene illuminated by known but unconstrained
lighting. Our NeRV representation can be rendered from novel views under arbitrary
novel lighting conditions. Here we visualize example input data and renderings for
two scenes. The first two output rendered images for each scene are from the same
viewpoint, each illuminated by a point light at a different location, and the last image
is from a different viewpoint under a random colored illumination.

NeRV can simulate realistic environment lighting and global illumination. Our
key insight is to train an MLP to act as a lookup table into a visibility field during
rendering. Instead of estimating light or surface visibility at a given 3D position
along a given direction by densely evaluating an MLP for the volume density along
the corresponding ray (which would be prohibitively expensive), we simply query
this visibility MLP to estimate visibility and expected termination depth in any
direction (see Figure 2-1). This visibility MLP is optimized alongside the MLP that
represents volume density and supervised to be consistent with the volume density
samples observed during optimization. Using this neural approximation of the true
visibility field significantly eases the computational burden of estimating volume ren-
dering integrals while training. NeRV enables the recovery of a NeRF-like model
that supports relighting in addition to view synthesis. While previous solutions for
relightable NeRFs are limited to controlled settings that require the input images be
illuminated by a single point light [Bi et al., 2020a], NeRV supports training with
arbitrary environment lighting and one-bounce indirect illumination.
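
To make the savings concrete, the sketch below counts network queries per camera ray for direct illumination only. It assumes that light rays are raymarched with the same number of samples 𝑛 as the camera ray in the naïve procedure, and that the visibility MLP needs one query per pair of sample location and light source. The function names and the example values of 𝑛 and ℓ are hypothetical.

```python
def naive_direct_queries(n, num_lights):
    """Density queries per camera ray when every light ray is raymarched."""
    camera_ray_queries = n
    light_ray_queries = n * num_lights * n  # n samples toward each light, per location
    return camera_ray_queries + light_ray_queries


def visibility_field_direct_queries(n, num_lights):
    """Queries per camera ray when visibility comes from a single MLP lookup."""
    camera_ray_queries = n
    visibility_queries = n * num_lights  # one lookup per location and light
    return camera_ray_queries + visibility_queries


# Hypothetical setting: 128 samples per ray and 100 light sources.
naive_cost = naive_direct_queries(n=128, num_lights=100)
field_cost = visibility_field_direct_queries(n=128, num_lights=100)
```

Under these assumptions, the cost per camera ray drops from quadratic to linear in 𝑛, and the gap widens further once indirect illumination directions are added.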
In the second half of this chapter, we continue to investigate whether we can
achieve what NeRV accomplishes but with just one unknown illumination, a setup
often encountered when the user wants to capture daily in-the-wild objects. To this
end, we develop Neural Factorization of Shape and Reflectance (NeRFactor),
a model capable of recovering convincing relightable representations from images of
an object captured under one unknown natural illumination condition [Zhang et al.,
2021c], as shown in Figure 2-3.

[Figure 2-3 panels. Left: Posed Multi-View Images under an Unknown Illumination (Real-World Capture). Center (NeRFactor): Normals, Visibility, Albedo, BRDF. Right (Applications): Free-Viewpoint Relighting, Relighting, Material Editing.]

Figure 2-3: NeRFactor overview. Given a set of posed images of an object captured
from multiple views under just one unknown illumination condition (left), NeRFactor
is able to factorize the scene into 3D neural fields of surface normals, light visibil-
ity, albedo, and material (center), which enables applications such as free-viewpoint
relighting and material editing (right).

Our key insight is that we can first optimize a NeRF [Mildenhall et al., 2020] from
the input images to initialize our model’s surface normals and light visibility, and then
jointly optimize these initial estimates along with the spatially-varying reflectance and
the lighting condition, such that these estimates, when re-rendered, match the
observed images. The use of NeRF to produce a high-quality geometry initialization
helps break the inherent ambiguities among shape, reflectance, and lighting, thereby
allowing us to recover a full 3D model for convincing view synthesis and relight-
ing using just a re-rendering loss, simple spatial smoothness priors for each of these
components, and a novel data-driven BRDF prior. Because NeRFactor models light
visibility explicitly and efficiently, it is capable of removing shadows from albedo
estimation and synthesizing realistic soft or hard shadows under arbitrary novel lighting conditions.

Different than NeRV, NeRFactor addresses the computational complexity problem

by using a “hard surface” approximation of the NeRF geometry, where we only perform
shading calculations at a single point along each ray, corresponding to the expected
termination depth along the ray. Besides the computational complexity problem,
there is also the “noisy geometry problem” in extending NeRF for relighting: The
geometry estimated by NeRF contains extraneous high-frequency content that, while
unnoticeable in view synthesis results, introduces high-frequency artifacts into the
surface normals and light visibility computed from NeRF’s geometry. This issue per-
sists in many NeRF-based models including NeRV. NeRFactor addresses this noisy
geometry problem by representing the surface normal and light visibility at any 3D
location on this surface as continuous functions parameterized by MLPs, and encouraging
these functions to produce values that are spatially smooth and stay close to
those derived from the pretrained NeRF.

Thus, NeRFactor decomposes the observed images into estimated environment
lighting and a 3D surface representation of the object including surface normals, light
visibility, albedo, and spatially-varying BRDFs. This enables us to render novel views
of the object under arbitrary novel environment lighting. In summary, NeRFactor
makes the following technical contributions:

• a method for factorizing images of an object under an unknown lighting condition
into shape, reflectance, and direct illumination, thereby supporting free-viewpoint
relighting with shadows and material editing,
• a strategy to distill the NeRF-estimated volume density into surface geometry
(with normals and visibility) to use as an initialization when improving the
geometry and recovering reflectance, and
• novel data-driven BRDF priors based on training a latent code model on real
measured BRDFs.

Input & Output The input to NeRV is a set of multi-view images of an object
illuminated under multiple arbitrary but known lighting conditions, while NeRFactor
requires only one unknown lighting condition. Both methods require the camera poses
of these images, which can be obtained with an off-the-shelf Structure From Motion
(SFM) package, such as COLMAP [Schönberger and Frahm, 2016]. Both methods
jointly estimate a plausible collection of surface normals, light visibility, albedo, and
spatially-varying BRDFs, which together explain the observed views. NeRFactor
additionally estimates the environment lighting. We then use the recovered geome-
try and reflectance to synthesize images of the object from novel viewpoints under
arbitrary lighting. Modeling visibility explicitly, both methods are able to remove
shadows from albedo and synthesize soft or hard shadows under arbitrary lighting.

Assumptions NeRFactor considers objects to be composed of hard surfaces with
a single intersection point per ray, so volumetric light transport effects such as scattering,
transparency, and translucency are not modeled. NeRV, however, utilizes
this “hard surface” assumption only sparingly, to speed up the modeling of one-bounce
indirect illumination. In contrast, NeRFactor models only direct illumination since
doing so simplifies computation, and under unknown lighting, we expect most of the
usable signals to be from direct illumination. Finally, our reflectance models con-
sider materials with achromatic specular reflectance (dielectrics), so we do not model
metallic materials (though one can easily extend our models for metallic materials by
additionally predicting a specular color for each surface point).

2.2 Related Work

NeRV and NeRFactor both tackle the problem of inverse rendering, whose literature is
reviewed in Section 2.2.1. We also review the coordinate-based neural object or scene
representation, which is fundamental to both works, in Section 2.2.2. Section 2.2.3
surveys precomputation in computer graphics, which motivates the fast “visibility
lookup” in both NeRV and NeRFactor. Finally, because our models can be applied
to perform object capture for downstream graphics applications, we also review prior
art on material capture in Section 2.2.4.

2.2.1 Inverse Rendering

Intrinsic image decomposition aims to attribute what aspects of an image are due to
material, lighting, or geometric variation [Horn, 1970, Land and McCann, 1971, Horn,
1974, Barrow and Tenenbaum, 1978]. The more general problem that additionally
involves non-Lambertian reflectance, global illumination, etc. is often referred to as
inverse rendering [Sato et al., 1997, Marschner, 1998, Yu et al., 1999, Ramamoorthi
and Hanrahan, 2001]. In other words, the goal of inverse rendering is to factorize the
appearance of an object in observed images into the underlying geometry, material
properties, and lighting conditions. It is a longstanding problem in computer vision
and graphics, the difficulty of which (a consequence of its underconstrained nature) is
typically addressed using one of the following strategies: I) learning priors on shape,
illumination, and reflectance, II) assuming known geometry, or III) using multiple
input images of the scene under one or multiple lighting conditions.
Most recent single-image inverse rendering methods [Barron and Malik, 2014, Li
et al., 2018, Yu and Smith, 2019, Sengupta et al., 2019, Li et al., 2020c, Wei et al.,
2020, Sang and Chandraker, 2020] belong to the first category and use large datasets
of images with labeled geometry and materials to train machine learning models to
predict these properties. Most prior works in inverse rendering that recover full 3D
models for graphics applications [Weinmann and Klein, 2015] fall under the second
category and use 3D geometry obtained from active scanning [Park et al., 2020,
Schmitt et al., 2020, Zhang et al., 2021b], proxy models [Dong et al., 2014, Chen
et al., 2020, Gao et al., 2020], silhouette masks [Oxholm and Nishino, 2014, Xia et al.,
2016], or multi-view stereo [Nam et al., 2018] as a starting point before recovering
reflectance and refining geometry.
Both NeRV and NeRFactor belong to the third category: We only require as input
posed images of an object under one unknown or multiple known lighting conditions.
The most relevant prior works are Deep Reflectance Volumes (DRV), which estimates
voxel geometry and BRDF parameters [Bi et al., 2020b], and the follow-up work
Neural Reflectance Fields that replaces DRV’s voxel grid with a continuous volume
represented by a Multi-Layer Perceptron (MLP) [Bi et al., 2020a]. NeRV extends
Neural Reflectance Fields, which requires scenes be illuminated by only a single point
light at a time due to their brute-force visibility computation strategy visualized in
Figure 2-1 and models only direct illumination, to work for arbitrary lighting and
global illumination.

2.2.2 Coordinate-Based Neural Representations

We build upon a recent trend within the computer vision and graphics communities
that replaces traditional shape representations such as polygon meshes or discretized
voxel grids with MLPs that represent geometry as parametric functions. These MLPs
are optimized to approximate continuous 3D geometry by mapping 3D coordinates
to properties of an object or scene (such as volume density, occupancy, or signed dis-
tance) at that location. This strategy has been explored for the tasks of representing
shape [Genova et al., 2019, Mescheder et al., 2019, Park et al., 2019a, Deng et al.,
2020, Sitzmann et al., 2020, Tancik et al., 2020] and scenes under fixed lighting for
view synthesis [Niemeyer et al., 2019, Sitzmann et al., 2019b, Mildenhall et al., 2020,
Liu et al., 2020, Yariv et al., 2020].
As one such coordinate-based representation, Neural Radiance Fields (NeRF) has
been particularly successful for optimizing volumetric geometry and appearance from
observed images for the purpose of rendering photorealistic novel views [Mildenhall
et al., 2020]. It can be thought of as a modern neural reformulation of the classic
problem of scene reconstruction: given multiple images of a scene, inferring the
underlying geometry and appearance that best explain those images. While classic
approaches have largely relied on discrete representations such as textured meshes
[Hartley and Zisserman, 2004, Snavely et al., 2006] and voxel grids [Seitz and Dyer,
1999], NeRF has demonstrated that a continuous volumetric function, parameterized
as an MLP, is able to represent complex scenes and render photorealistic novel views.
NeRF works well for view synthesis, but it does not enable relighting because it has
no mechanism to disentangle the outgoing radiance of a surface into an incoming
radiance and an underlying surface material.

One technique that has been used for extending NeRF to support relighting is
conditioning the MLP’s output appearance on a latent code that encodes a per-image
lighting, as in NeRF in the Wild [Martin-Brualla et al., 2021] (and previously with
discretized scene representations [Meshry et al., 2019, Li et al., 2020b]). Although
this strategy can effectively explain the appearance variation of training images, it
cannot be used to render the same scene under new lighting conditions not observed
during training (Figure 2-13), since it does not utilize the physics of light transport.

Very recently, several physically-based approaches extend NeRF’s neural representation
to enable relighting [Bi et al., 2020a, Boss et al., 2021, Zhang et al., 2021a].
NeRV and NeRFactor differ from Bi et al. [2020a] in that we do not require images
be captured under multiple known One-Light-at-A-Time (OLAT) lighting conditions:
NeRV handles arbitrary, non-OLAT environment lighting, and NeRFactor deals with
one unknown arbitrary lighting condition. The methods of Boss et al. [2021] and
Zhang et al. [2021a] can work with the same casual capture setup as in NeRFactor
(i.e., one arbitrary unknown lighting condition), but both crucially do not consider
light visibility and are thus unable to simulate lighting occlusion or shadowing effects.
NeRV and NeRFactor use the estimated geometry to model accurate high-frequency
shadowing and lighting occlusion.

NeRV uses the same volumetric shape representation as NeRF. On the other hand,
NeRFactor continues with the coordinate-based neural representation, but shows that
starting with the NeRF volume and then optimizing a surface representation enables
us to recover a fully-factorized and high-quality 3D model using just images captured
under one unknown illumination. Crucially, using a neural volumetric representation
to estimate the initial geometry enables us to recover factored models for objects that
have proven to be challenging for traditional geometry estimation methods.

2.2.3 Precomputation in Computer Graphics

NeRV is inspired by a long line of work in graphics that explores precomputation
[Sloan et al., 2002, Ritschel et al., 2007] and approximation [Bunnell, 2004, Green
et al., 2007, Ritschel et al., 2008, 2009] strategies to efficiently compute global illumination
in physically-based rendering. Our neural visibility fields can be thought
of as a neural analogue to visibility precomputation techniques and are specifically designed
for use in our neural inverse rendering setting where geometry is dynamically
changing during optimization.

2.2.4 Material Acquisition

A large body of work within the computer graphics community has focused on the
specific subproblem of material acquisition, where the goal is to estimate BRDF
properties from images of materials with known (typically planar) geometry. These
methods have traditionally utilized a signal processing reconstruction strategy, and
used complex controlled camera and lighting setups to adequately sample the BRDF
[Foo, 2015, Matusik et al., 2003, Nielsen et al., 2015]. More recent methods have en-
abled material acquisition from more casual smartphone setups [Aittala et al., 2015,
Hui et al., 2017]. However, this line of work generally requires the geometry be simple
and fully known, while we focus on a more general problem where our only observa-
tions are images of an object with complex shape and spatially-varying reflectance
(plus the environment lighting for NeRV).

2.3 Method: Multiple Known Illuminations

We extend Neural Radiance Fields (NeRF) [Mildenhall et al., 2020] to include the
simulation of light transport, which allows NeRFs to be rendered under arbitrary
novel illumination conditions. Instead of modeling a scene as a continuous 3D field of
particles that absorb and emit light as in NeRF, we represent a scene as a 3D field of
oriented particles that absorb and reflect the light emitted by external light sources
(Section 2.3.2). Naïvely simulating light transport through this model is inefficient
and unable to scale to simulate realistic lighting conditions or global illumination. We
remedy this by introducing a neural visibility field representation (optimized alongside
NeRF’s volumetric representation) that allows us to efficiently query the point-to-light
and point-to-point visibilities needed to simulate light transport (Section 2.3.3). The
resulting Neural Reflectance and Visibility Fields (NeRV) [Srinivasan et al., 2021] are
visualized in Figure 2-4.

[Figure 2-4 panels. Top row, evaluated at the point x: (a) Our Rendered Image (Novel View and Lighting) = ∫ ((b) Light Visibility × (c) Incident Direct Illum. + (d) Incident Indirect Illum.) × (e) BRDF dωi. Bottom row: (f) Normals, (g) Albedo, (h) Roughness, (i) Shadow Map, (j) Direct, (k) Indirect.]

Figure 2-4: Example decomposition of NeRV. Given any continuous 3D location as
input, such as the point at the cyan “x” (a), NeRV outputs the volume density and
the visibility (b) to a spherical environment map surrounding the scene, which is
multiplied by the direct illumination (c) at that point and added to the estimated
indirect illumination (d) at that point to determine the full incident illumination.
This is then multiplied by the predicted BRDF (e) and integrated over all incoming
directions to determine the outgoing radiance at that point. In the bottom row, we
visualize these outputs for the full rendered image: surface normals (f) and BRDF
parameters for diffuse albedo (g) as well as specular roughness (h). We can use the
predicted visibilities to compute the fraction of the total illumination that is actually
incident at any location, visualized as a shadow map (i). We also show the same
rendered viewpoint if it were lit by only direct (j) and indirect illumination (k).

2.3.1 Neural Radiance Fields (NeRF)

NeRF represents a scene as a continuous function, parameterized by a “radiance”
Multi-Layer Perceptron (MLP) whose input is a 3D position and viewing direction,
and whose output is the volume density 𝜎 and radiance 𝐿𝑒 (RGB color) emitted by
particles at that location along that viewing direction. NeRF uses standard emission-
absorption volume rendering [Kajiya and Herzen, 1984] to compute the observed
radiance 𝐿(c, 𝜔𝑜 ) (the rendered pixel color) at camera location c along direction 𝜔𝑜
as the integral of the product of three quantities at any point x(𝑡) = c − 𝑡𝜔𝑜 along
the ray: the visibility 𝑉 (x(𝑡), c), which indicates the fraction of emitted light from
position x(𝑡) that reaches the camera at c, the density 𝜎(x(𝑡)), and the emitted
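radiance 𝐿𝑒 (x(𝑡), 𝜔𝑜 ) along the viewing direction 𝜔𝑜 :

\[
L(\mathbf{c}, \omega_o) = \int_0^\infty V(\mathbf{x}(t), \mathbf{c})\, \sigma(\mathbf{x}(t))\, L_e(\mathbf{x}(t), \omega_o)\, dt , \tag{2.1}
\]
\[
V(\mathbf{x}(t), \mathbf{c}) = \exp\left( -\int_0^t \sigma(\mathbf{x}(s))\, ds \right) . \tag{2.2}
\]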

A NeRF is recovered from observed input images of a scene by sampling a batch
of observed pixels, sampling the corresponding camera rays of those pixels at stratified
random points to approximate the above integral using numerical quadrature
[Max, 1995], and optimizing the weights of the radiance MLP via gradient descent to
minimize the error between the estimated and observed pixel colors.
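
As a quick sanity check on Equations 2.1 and 2.2, consider a hypothetical homogeneous medium, which is an illustrative assumption and not a case treated in this chapter: the density is a constant $\sigma_0$ and the emitted radiance is a constant $L_0$ on a ray segment of length $T$, with empty space beyond it. Both integrals then have closed forms:

```latex
\begin{aligned}
V(\mathbf{x}(t), \mathbf{c}) &= \exp\left( -\int_0^t \sigma_0 \, ds \right) = e^{-\sigma_0 t}, \qquad 0 \le t \le T, \\
L(\mathbf{c}, \omega_o) &= \int_0^T e^{-\sigma_0 t}\, \sigma_0\, L_0 \, dt = L_0 \left( 1 - e^{-\sigma_0 T} \right).
\end{aligned}
```

The pixel color approaches $L_0$ as $\sigma_0 T$ grows (an effectively opaque segment) and vanishes as $\sigma_0 T \to 0$ (empty space), which matches the intended behavior of the emission-absorption model.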

2.3.2 Neural Reflectance Fields

A NeRF representation does not separate the effect of incident light from the material
properties of surfaces. This means that NeRF is only able to render views of a
scene under the fixed lighting condition presented in the input images – a NeRF
cannot be relit. Modifying NeRF to enable relighting is straightforward, as initially
demonstrated by the Neural Reflectance Fields work of Bi et al. [2020a]. Instead
of representing a scene as a field of particles that emit light, it is represented as a
field of particles that reflect incoming light. With this, given an arbitrary lighting
condition, we can simulate the transport of light through the volume as it is reflected
by particles until it reaches the camera with a standard volume rendering integral
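[Kajiya and Herzen, 1984]:

\[
L(\mathbf{c}, \omega_o) = \int_0^\infty V(\mathbf{x}(t), \mathbf{c})\, \sigma(\mathbf{x}(t))\, L_r(\mathbf{x}(t), \omega_o)\, dt , \tag{2.3}
\]
\[
L_r(\mathbf{x}, \omega_o) = \int_{\mathcal{S}} L_i(\mathbf{x}, \omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i , \tag{2.4}
\]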

where the view-dependent emission term 𝐿𝑒 (x, 𝜔𝑜 ) in Equation 2.1 is replaced with
an integral, over the sphere 𝒮 of incoming directions, of the product of the incoming
radiance 𝐿𝑖 from any direction and a reflectance function 𝑅 (often called a phase
function in volume rendering) that describes how much light arriving from direction
𝜔𝑖 is reflected towards direction 𝜔𝑜 .
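
To build intuition for Equation 2.4, consider a worked special case under two assumptions that are ours and not part of the model described here: the reflectance function is purely Lambertian with the cosine foreshortening term folded in, $R(\mathbf{x}, \omega_i, \omega_o) = \tfrac{a}{\pi} \max(0, \mathbf{n} \cdot \omega_i)$, and the incoming radiance is a constant $L_0$ from every direction. Writing $\theta$ for the angle between $\omega_i$ and the normal $\mathbf{n}$:

```latex
\begin{aligned}
L_r(\mathbf{x}, \omega_o)
  &= \int_{\mathcal{S}} L_0 \, \frac{a}{\pi} \max(0, \mathbf{n} \cdot \omega_i) \, d\omega_i
   = \frac{a L_0}{\pi} \int_0^{2\pi} \!\! \int_0^{\pi/2} \cos\theta \, \sin\theta \, d\theta \, d\phi \\
  &= \frac{a L_0}{\pi} \cdot 2\pi \cdot \frac{1}{2}
   = a \, L_0 .
\end{aligned}
```

In this idealized setting the reflected radiance is simply the albedo times the incoming radiance, independent of the viewing direction.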
We follow Bi et al. [2020a] and use the standard microfacet Bidirectional Re-
flectance Distribution Function (BRDF) described by Walter et al. [2007] as the re-
flectance function, so 𝑅 at any 3D location is parameterized by a diffuse RGB albedo,
a scalar specular roughness, and a surface normal. We replace NeRF’s radiance MLP
with two MLPs: a “shape MLP” that outputs volume density 𝜎 and a “reflectance
MLP” that outputs BRDF parameters (3D diffuse albedo a and scalar roughness 𝛾)
for any input 3D point: MLP𝜃 : x → 𝜎, MLP𝜓 : x → (a, 𝛾).
Instead of parameterizing the 3D surface normal n as a normalized output of
the shape MLP, as in Bi et al. [2020a], we compute n as the negative normalized
gradient vector of the shape MLP’s output 𝜎 w.r.t. x, computed using automatic
differentiation. We further discuss this choice in Section 2.6.3.
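
The normal computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our implementation: `density` is a hypothetical analytic stand-in for the shape MLP, and central finite differences stand in for automatic differentiation.

```python
import math

def density(p):
    # Hypothetical stand-in for the shape MLP: a soft ball of radius 1
    # whose density drops smoothly from about 10 inside to 0 outside.
    x, y, z = p
    r = math.sqrt(x * x + y * y + z * z)
    return 10.0 / (1.0 + math.exp(20.0 * (r - 1.0)))

def normal_from_density(sigma, p, eps=1e-4):
    # Central finite differences approximate the gradient of sigma w.r.t. p
    # (an assumption for this sketch; the text uses automatic differentiation).
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sigma(hi) - sigma(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0, 0.0, 0.0]  # gradient vanishes; no normal is defined here
    # Density increases towards the interior, so the outward-facing normal
    # is the negative normalized gradient.
    return [-g / norm for g in grad]

n = normal_from_density(density, (1.0, 0.0, 0.0))
```

For the point on the +x side of the ball, the recovered normal points along +x, away from the dense interior.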

2.3.3 Light Transport via Neural Visibility Fields

Although modifying NeRF to enable relighting is straightforward, estimating the volume
rendering integral for general lighting scenarios is computationally challenging
with a continuous volumetric representation such as NeRF. Figure 2-1 visualizes the
scaling properties that make simulating volumetric light transport particularly difficult.
Even if we only consider direct illumination from light sources to a scene
point, a brute-force solution is already challenging for more than a single point light
source as it requires repeatedly querying the shape MLP for volume density along
paths from each scene point to each light source. Moreover, general scenes can be
illuminated by light arriving from all directions, and addressing this is imperative
to recovering relightable representations in unconstrained scenarios. Simulating even
simple global illumination in a brute-force manner is intractable: Rendering a single
ray in our scenes under one-bounce indirect illumination with brute-force sampling
would require a petaflop of computation, and we need to render roughly a billion rays
over the course of training.

[Figure 2-5 labels: camera center; light source; 𝑥, expected termination depth along camera ray; 𝑥′, estimated termination depth along indirect bounce ray.]

Figure 2-5: The geometry of an indirect illumination path in NeRV. The light ray
departs its source, hits 𝑥′ first, gets reflected to 𝑥, and eventually reaches the camera.

We ameliorate this issue by replacing several brute-force volume density integrals
with learned approximations. We introduce a “visibility MLP” that emits an approximation
of the environment lighting visibility at any input location along any input direction
and an approximation of the expected termination depth of the corresponding
ray: MLP𝜑 : (x, 𝜔) → (𝑉˜𝜑, 𝐷˜𝜑). When rendering, we use these MLP-approximated
quantities in place of their actual values:

\[
V(\mathbf{x}, \omega) = \exp\left( -\int_0^\infty \sigma(\mathbf{x} + s\omega)\, ds \right) , \tag{2.5}
\]
\[
D(\mathbf{x}, \omega) = \int_0^\infty \exp\left( -\int_0^t \sigma(\mathbf{x} + s\omega)\, ds \right) t\, \sigma(\mathbf{x} + t\omega)\, dt . \tag{2.6}
\]

In Section 2.3.5 we place losses on the visibility MLP outputs (𝑉˜𝜑, 𝐷˜𝜑) to encourage
them to resemble the (𝑉, 𝐷) corresponding to the current state of the shape MLP.
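
For concreteness, the two target quantities in Equations 2.5 and 2.6 can be estimated by a simple Riemann sum along a ray. The sketch below is illustrative only: the `slab` density is a hypothetical stand-in for the shape MLP, and the infinite integration range is truncated at `t_far` under the assumption that the density vanishes beyond it.

```python
import math

def visibility_and_depth(sigma, x, w, t_far=4.0, n=4000):
    """Riemann-sum estimates of V (Eq. 2.5) and D (Eq. 2.6) along x + t * w."""
    dt = t_far / n
    optical_depth = 0.0  # running integral of sigma over [0, start of this step]
    depth = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        p = [x[i] + t * w[i] for i in range(3)]
        s = sigma(p)
        # Transmittance up to the midpoint of the current step.
        trans = math.exp(-(optical_depth + 0.5 * s * dt))
        depth += trans * t * s * dt
        optical_depth += s * dt
    return math.exp(-optical_depth), depth

def slab(p):
    # Hypothetical density: a dense slab occupying 1 <= z <= 2.
    return 50.0 if 1.0 <= p[2] <= 2.0 else 0.0

V, D = visibility_and_depth(slab, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

For this slab, the ray is almost fully blocked, so the visibility is close to zero, and the expected termination depth lies just behind the front face of the slab at z = 1.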

Below, we provide a detailed walkthrough of how our Neural Visibility Field ap-
proximations simplify the volume rendering integral computation. Figure 2-5 is pro-
vided for reference. We first decompose the reflected radiance 𝐿𝑟 (x, 𝜔𝑜 ) into its direct
and indirect illumination components. Let us define 𝐿𝑒 (x, 𝜔𝑖 ) as radiance due to a
light source arriving at point x from direction 𝜔𝑖 . As defined in Equation 2.3, 𝐿(x, 𝜔𝑖 )
is the estimated incoming radiance at location x from direction 𝜔𝑖 . This means the
incident illumination 𝐿𝑖 decomposes into 𝐿𝑒 + 𝐿 (direct plus indirect light). The
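shading calculation for 𝐿𝑟 then becomes:

\[
\begin{aligned}
L_r(\mathbf{x}, \omega_o) &= \int_{\mathcal{S}} \big( L_e(\mathbf{x}, \omega_i) + L(\mathbf{x}, -\omega_i) \big)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i \\
&= \underbrace{\int_{\mathcal{S}} L_e(\mathbf{x}, \omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i}_{\text{component due to direct lighting}} + \underbrace{\int_{\mathcal{S}} L(\mathbf{x}, -\omega_i)\, R(\mathbf{x}, \omega_i, \omega_o)\, d\omega_i}_{\text{component due to indirect lighting}} .
\end{aligned} \tag{2.7}
\]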

To calculate incident direct lighting 𝐿𝑒 we must account for the attenuation of the
(known) environment map 𝐸 due to the volume density along the incident illumina-
tion ray 𝜔𝑖 :

𝐿𝑒 (x, 𝜔𝑖 ) = 𝑉 (x, 𝜔𝑖 )𝐸(x, −𝜔𝑖 ) . (2.8)

Instead of evaluating 𝑉 as another line integral through the volume, we use the
visibility MLP’s approximation 𝑉˜𝜑 . With this, our full calculation for the direct
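lighting component of camera ray radiance 𝐿(c, 𝜔𝑜 ) simplifies to:

\[
\int_0^\infty V\big(\mathbf{x}(t), \mathbf{c}\big)\, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_\phi\big(\mathbf{x}(t), \omega_i\big)\, E\big(\mathbf{x}(t), -\omega_i\big)\, R\big(\mathbf{x}(t), \omega_i, \omega_o\big)\, d\omega_i\, dt . \tag{2.9}
\]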

By approximating the integrals along rays from each point on the camera ray toward
each environment direction when computing the color of a pixel due to direct lighting,
we have reduced the complexity of rendering with direct lighting from quadratic in
the number of samples per ray to linear.
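
The inner spherical integral of Equation 2.9 can be estimated on a latitude-longitude grid of directions (12 × 24 is the resolution quoted in Section 2.3.4). The following sketch makes several simplifying assumptions that are ours: a Lambertian stand-in replaces the microfacet BRDF, the environment map and visibility are passed as plain callables indexed directly by the sampled direction, and radiance is a single scalar channel.

```python
import math

def sphere_grid(n_lat=12, n_lon=24):
    """Unit directions and solid-angle weights on a latitude-longitude grid."""
    dirs, weights = [], []
    d_theta = math.pi / n_lat
    d_phi = 2.0 * math.pi / n_lon
    for i in range(n_lat):
        theta = (i + 0.5) * d_theta  # polar angle at the cell centre
        for j in range(n_lon):
            phi = (j + 0.5) * d_phi
            dirs.append((math.sin(theta) * math.cos(phi),
                         math.sin(theta) * math.sin(phi),
                         math.cos(theta)))
            weights.append(math.sin(theta) * d_theta * d_phi)
    return dirs, weights

def shade_direct(normal, albedo, env, vis):
    """Inner integral of Eq. 2.9 with a Lambertian stand-in for R."""
    dirs, weights = sphere_grid()
    total = 0.0
    for w_i, dw in zip(dirs, weights):
        cos_term = max(0.0, sum(n * w for n, w in zip(normal, w_i)))
        brdf = albedo / math.pi * cos_term  # hypothetical diffuse-only BRDF
        total += vis(w_i) * env(w_i) * brdf * dw
    return total

dirs, weights = sphere_grid()
radiance = shade_direct((0.0, 0.0, 1.0), 0.5,
                        env=lambda w: 1.0, vis=lambda w: 1.0)
```

With a constant, unoccluded environment the result should be close to the albedo; the small residual reflects the coarseness of the grid. Each (point, direction) pair costs one visibility lookup instead of a full ray march, which is the source of the linear scaling noted above.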

Next, we focus on the more difficult task of accelerating the computation of ren-
dering with indirect lighting, for which a brute force approach would scale cubically
with the number of samples per ray. We make two approximations to reduce this
intractable computation. Our first approximation is to replace the outermost integral
(the accumulated radiance reflected towards the camera at each point along the ray)
with a single point evaluation by treating the volume as a hard surface located at the
expected termination depth 𝑡′ = 𝐷(c, −𝜔𝑜 ). Note that we do not use the visibility
MLP’s approximation of 𝑡′ here, since we are already sampling 𝜎(x) along the camera
ray. This reduces the indirect contribution of 𝐿(c, 𝜔𝑜 ) to a spherical integral at a
single point x(𝑡′ ):

\[
\int_{\mathcal{S}} L\big(\mathbf{x}(t'), -\omega_i\big)\, R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i . \tag{2.10}
\]

To simplify the recursive evaluation of 𝐿 inside this integral, we limit the indirect
contribution to a single bounce, and use the hard surface approximation a second
time to replace the integral along a ray for each incoming direction:

\[
L(\mathbf{x}(t'), -\omega_i) \approx \int_{\mathcal{S}} L_e(\mathbf{x}'(t''), \omega_i')\, R(\mathbf{x}'(t''), \omega_i', -\omega_i)\, d\omega_i' , \tag{2.11}
\]

where 𝑡′′ = 𝐷˜𝜑(x(𝑡′), 𝜔𝑖) is the expected intersection depth along the ray x′(𝑡′′) =
x(𝑡′) + 𝑡′′𝜔𝑖 as approximated by the visibility MLP. Thus the expression for the
component of camera ray radiance 𝐿(c, 𝜔𝑜 ) due to indirect lighting is:

\[
\int_{\mathcal{S}} \int_{\mathcal{S}} L_e(\mathbf{x}'(t''), \omega_i')\, R(\mathbf{x}'(t''), \omega_i', -\omega_i)\, d\omega_i'\, R(\mathbf{x}(t'), \omega_i, \omega_o)\, d\omega_i , \tag{2.12}
\]

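and fully expanding the direct radiance 𝐿𝑒 (x′ (𝑡′′ ), 𝜔𝑖′ ) incident at each secondary intersection
point gives us:

\[
\int_{\mathcal{S}} \int_{\mathcal{S}} \tilde{V}_\phi\big(\mathbf{x}'(t''), \omega_i'\big)\, E\big(\mathbf{x}'(t''), -\omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i'\, R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i . \tag{2.13}
\]

Finally, we can write out the complete volume rendering equation used by NeRV as
the sum of Equations 2.9 and 2.13:

\[
\begin{aligned}
L(\mathbf{c}, \omega_o) = {} & \int_0^\infty V\big(\mathbf{x}(t), \mathbf{c}\big)\, \sigma\big(\mathbf{x}(t)\big) \int_{\mathcal{S}} \tilde{V}_\phi\big(\mathbf{x}(t), \omega_i\big)\, E\big(\mathbf{x}(t), -\omega_i\big)\, R\big(\mathbf{x}(t), \omega_i, \omega_o\big)\, d\omega_i\, dt \\
& + \int_{\mathcal{S}} \int_{\mathcal{S}} \tilde{V}_\phi\big(\mathbf{x}'(t''), \omega_i'\big)\, E\big(\mathbf{x}'(t''), -\omega_i'\big)\, R\big(\mathbf{x}'(t''), \omega_i', -\omega_i\big)\, d\omega_i'\, R\big(\mathbf{x}(t'), \omega_i, \omega_o\big)\, d\omega_i .
\end{aligned} \tag{2.14}
\]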

Figure 2-1 illustrates how the approximations made by NeRV reduce the computa-
tional complexity of computing direct and indirect illumination from quadratic and
cubic (respectively) to linear. This enables the simulation of direct illumination from
environment lighting and one-bounce indirect illumination within the training loop
of optimizing a continuous relightable volumetric scene representation.
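
A rough query count makes the scaling argument tangible. The cost model below is our own simplification for illustration and is not a measurement: it assumes `n` samples on every ray and `d` sampled directions at every shading point, counts only density or visibility queries, and ignores the reflectance MLP.

```python
def brute_force_queries(n, d):
    """Shape-MLP density queries per camera ray without learned visibility."""
    # Direct: every camera-ray sample marches an n-sample ray towards each
    # of the d light directions.
    direct = n * d * n
    # Indirect (one bounce): every camera-ray sample marches an n-sample
    # secondary ray per direction, and each secondary sample again marches
    # an n-sample ray towards each of d light directions.
    indirect = n * d * n * d * n
    return direct, indirect

def nerv_queries(n, d):
    """Queries per camera ray with the approximations described above."""
    # Direct: n density queries on the camera ray plus one visibility-MLP
    # lookup per (sample, direction) pair.
    direct = n + n * d
    # Indirect: a single hard-surface point, d termination-depth lookups,
    # then d visibility lookups at each secondary intersection point.
    indirect = d + d * d
    return direct, indirect

brute = brute_force_queries(256, 128)
ours = nerv_queries(256, 128)
```

Under this model, doubling `n` quadruples the brute-force direct cost and multiplies the brute-force indirect cost by eight, whereas the approximated direct cost only doubles and the approximated indirect cost does not grow at all beyond the camera-ray samples already counted in the direct term.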

2.3.4 Rendering

To render a camera ray x(𝑡) = c − 𝑡𝜔𝑜 passing through a NeRV, we estimate the
volume rendering integral in Equation 2.14 using the following procedure:
1. We draw 256 stratified samples along the ray and query the shape and reflectance
MLPs for the volume densities, surface normals, and BRDF parameters
at each point: 𝜎 = MLP𝜃(x(𝑡)), n = −∇x MLP𝜃(x(𝑡)) / ‖∇x MLP𝜃(x(𝑡))‖, (a, 𝛾) = MLP𝜓(x(𝑡)).
2. We shade each point along the ray with direct illumination by estimating the
integral in Equation 2.9. First, we generate 𝐸(x(𝑡), −𝜔𝑖 ) by sampling the known
environment lighting on a 12 × 24 grid of directions 𝜔𝑖 on the sphere around
each point. We then multiply this by the predicted visibility 𝑉˜𝜑 (x(𝑡), 𝜔𝑖 ) and
microfacet BRDF values 𝑅(x(𝑡), 𝜔𝑖 , 𝜔𝑜 ) at each sampled 𝜔𝑖 , and integrate this
product over the sphere to produce the direct illumination contribution.
3. We shade each point along the ray with indirect illumination by estimating the
integral in Equation 2.13. First, we compute the expected camera ray termi-
nation depth 𝑡′ = 𝐷(c, −𝜔𝑜 ) using the density samples from Step 1. Next,
we sample 128 random directions on the upper hemisphere at x(𝑡′ ) and query
the visibility MLP for the expected termination depths along each of these
rays 𝑡′′ = 𝐷˜𝜑(x(𝑡′), 𝜔𝑖) to compute the secondary surface intersection points
x′(𝑡′′) = x(𝑡′) + 𝑡′′𝜔𝑖. We then shade each of these points with direct illumination
by following the procedure in Step 2. This estimates the indirect
illumination incident at x(𝑡′ ), which we then multiply by the microfacet BRDF
values 𝑅(x(𝑡′ ), 𝜔𝑖 , 𝜔𝑜 ) and integrate over the sphere to produce the indirect
illumination contribution.
4. The total reflected radiance at each point along the camera ray 𝐿𝑟 (x(𝑡), 𝜔𝑜 ) is
the sum of the quantities from Steps 2 and 3. We composite these along the ray
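to compute the pixel color using the same quadrature rule [Max, 1995] used in
NeRF:

\[
L(\mathbf{c}, \omega_o) = \sum_t V(\mathbf{x}(t), \mathbf{c})\, \alpha\big(\sigma(\mathbf{x}(t))\, \delta\big)\, L_r(\mathbf{x}(t), \omega_o) , \tag{2.15}
\]
\[
V(\mathbf{x}(t), \mathbf{c}) = \exp\left( -\sum_{s < t} \sigma(\mathbf{x}(s))\, \delta \right) , \qquad \alpha(z) = 1 - \exp(-z) , \tag{2.16}
\]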

where 𝛿 is the distance between samples along the ray.
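
The compositing in Step 4 (Equations 2.15 and 2.16) reduces to a short accumulation loop. The sketch below is a minimal single-channel illustration with made-up sample values; it assumes a uniform spacing `delta` and that the visibility of a sample is accumulated from the samples strictly in front of it.

```python
import math

def composite(sigmas, radiances, delta):
    """Discrete compositing of per-sample reflected radiance along one ray."""
    pixel = 0.0
    optical_depth = 0.0  # sum of sigma * delta over samples closer to the camera
    for sigma, l_r in zip(sigmas, radiances):
        visibility = math.exp(-optical_depth)    # V(x(t), c)
        alpha = 1.0 - math.exp(-sigma * delta)   # alpha(sigma * delta)
        pixel += visibility * alpha * l_r
        optical_depth += sigma * delta
    return pixel

# Empty space, then a nearly opaque sample, then a sample hidden behind it.
color = composite([0.0, 1e3, 5.0], [0.9, 0.2, 0.7], delta=0.1)
```

In this toy ray the first sample has zero density and contributes nothing, the second is effectively opaque and determines the pixel value, and the third is occluded.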

2.3.5 Training & Implementation Details

Instead of directly passing 3D coordinates x and direction vectors 𝜔 to the MLPs, we
map these inputs using NeRF’s positional encoding [Mildenhall et al., 2020, Tancik
et al., 2020], with a maximum frequency of 2⁷ for 3D coordinates and 2⁴ for 3D
direction vectors. The shape and reflectance MLPs each use eight fully-connected
Rectified Linear Unit (ReLU) layers with 256 channels. The visibility MLP uses
eight fully-connected ReLU layers with 256 channels each to map the encoded 3D
coordinates x to an 8-dimensional feature vector, which is concatenated with the
encoded 3D direction vector 𝜔 and processed by four fully-connected ReLU layers
with 128 channels each.
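
A positional encoding with maximum frequencies of 2⁷ and 2⁴ can be sketched as follows. The exact variant is an assumption on our part: whether the frequencies are additionally scaled by π and whether the raw input is concatenated to the encoding are not specified above, and this sketch does neither.

```python
import math

def positional_encoding(v, max_freq_log2):
    """Sines and cosines of each coordinate at frequencies 2^0 ... 2^max_freq_log2."""
    out = []
    for k in range(max_freq_log2 + 1):
        freq = 2.0 ** k
        for c in v:
            out.append(math.sin(freq * c))
            out.append(math.cos(freq * c))
    return out

enc_x = positional_encoding((0.1, -0.3, 0.5), max_freq_log2=7)  # 3D coordinate
enc_w = positional_encoding((0.0, 0.0, 1.0), max_freq_log2=4)   # 3D direction
```

Under these assumptions a 3D coordinate maps to 48 features (8 frequency bands × 3 axes × 2 functions) and a 3D direction maps to 30 features.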
We train a separate NeRV representation from scratch for each scene, which re-
quires a set of RGB images as well as their camera poses and corresponding lighting
environments. At each training iteration, we randomly sample a batch of 512 pixel
rays ℛ from the input images and use the procedure previously described to render
these pixels from the current NeRV model. We additionally sample 256 random rays
ℛ′ per training iteration that intersect the volume, and we compute the visibility and
expected termination depth at each location, in either direction along each ray, for
use as supervision for the visibility MLP. We minimize the sum of three losses:

\[
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \tau(\tilde{L}(\mathbf{r})) - \tau(L(\mathbf{r})) \big\|_2^2 + \lambda \sum_{\mathbf{r}' \in \mathcal{R}' \cup \mathcal{R},\, t} \Big( \big\| \tilde{V}_\phi(\mathbf{r}'(t)) - V_\theta(\mathbf{r}'(t)) \big\|_2^2 + \big\| \tilde{D}_\phi(\mathbf{r}'(t)) - D_\theta(\mathbf{r}'(t)) \big\|_2^2 \Big) , \tag{2.17}
\]

where 𝜏(𝑥) = 𝑥/(1 + 𝑥) is a tone mapping operator [Gharbi et al., 2019], 𝐿(r) and 𝐿˜(r) are
the ground truth and predicted camera ray radiance values (ground-truth values are
simply the colors of input image pixels), 𝑉˜𝜑(r) and 𝐷˜𝜑(r) are the predicted visibility
and expected termination depth from our visibility MLP given its current weights 𝜑,
𝑉𝜃 (r) and 𝐷𝜃 (r) are the estimates of visibility and termination depth implied by the
shape MLP given its current weights 𝜃, and 𝜆 = 20 is the weight of the loss terms
encouraging the visibility MLP to be consistent with the shape MLP.
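
The structure of Equation 2.17 can be illustrated with a small scalar sketch. It is deliberately simplified and the function names are hypothetical: radiance, visibility, and depth samples are plain lists of floats, and the shape-MLP targets are passed in as fixed numbers, which plays the role of treating them as constants during optimization.

```python
def tonemap(x):
    """tau(x) = x / (1 + x), mapping non-negative radiance into [0, 1)."""
    return x / (1.0 + x)

def nerv_loss(pred_rgb, gt_rgb, vis_pred, vis_target,
              depth_pred, depth_target, lam=20.0):
    """Sum of the three loss terms for flat lists of scalar samples."""
    # Tone-mapped re-rendering loss over camera rays.
    render = sum((tonemap(p) - tonemap(g)) ** 2
                 for p, g in zip(pred_rgb, gt_rgb))
    # Consistency of the visibility MLP with the shape MLP's visibility.
    vis = sum((p - t) ** 2 for p, t in zip(vis_pred, vis_target))
    # Consistency of the visibility MLP with the shape MLP's termination depth.
    depth = sum((p - t) ** 2 for p, t in zip(depth_pred, depth_target))
    return render + lam * (vis + depth)

loss = nerv_loss([1.0], [0.0], [1.0], [0.0], [0.0], [0.0])
```

The tone mapping compresses bright values so that a few very bright pixels do not dominate the re-rendering term.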

Note that the visibility MLP is not supervised using any “ground truth” visibility
or termination depth: It is only optimized to be consistent with the NeRV’s current
estimate of scene geometry, by evaluating Equation 2.5 and Equation 2.6 using the
densities 𝜎 emitted by the shape MLP𝜃 . We apply a stop_gradient to 𝑉𝜃 and 𝐷𝜃
in the last two terms of the loss, so the shape MLP is not encouraged to degrade its
own performance to better match the output from the visibility MLP. We implement
our model in JAX [Bradbury et al., 2018] and optimize it using Adam [Kingma
and Ba, 2015] with a learning rate that begins at 10⁻⁵ and decays exponentially to
10⁻⁶ over the course of optimization (the other Adam hyperparameters are default
values: 𝛽1 = 0.9, 𝛽2 = 0.999, and 𝜖 = 10⁻⁸). Each model is trained for a million
iterations using 128 Tensor Processing Unit (TPU) cores, which takes around one
day to converge.
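
One plausible form of the learning rate schedule described above is a log-linear interpolation between the two endpoints. The continuous (non-staircase) form below is an assumption made for illustration; the text only fixes the start value, the end value, and the exponential character of the decay.

```python
def learning_rate(step, total_steps=1_000_000, lr_start=1e-5, lr_end=1e-6):
    """Exponential decay from lr_start at step 0 to lr_end at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** frac

lr_mid = learning_rate(500_000)
```

Halfway through training this schedule yields the geometric mean of the two endpoints, roughly 3.16 × 10⁻⁶.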

2.4 Method: One Unknown Illumination

The input to Neural Factorization of Shape and Reflectance (NeRFactor) is assumed
to be only multi-view images of an object (and the corresponding camera poses)
lit by one unknown illumination condition. NeRFactor represents the shape and
spatially-varying reflectance of an object as a set of 3D fields, each parameterized
by Multi-Layer Perceptrons (MLPs) whose weights are optimized so as to “explain”
the set of observed input images. After optimization, NeRFactor outputs the surface
normal 𝑛, light visibility in any direction 𝑣(𝜔i ), albedo 𝑎, and reflectance 𝑧BRDF
that together explain the observed appearance at any 3D location 𝑥 on the object’s
surface¹. By recovering the object’s geometry and reflectance, NeRFactor enables
applications such as free-viewpoint relighting with shadows and material editing.

2.4.1 Shape

The input to NeRFactor is the same as what is used by Neural Radiance Fields (NeRF)
[Mildenhall et al., 2020], so we can apply NeRF to our input images to compute
initial geometry. NeRF optimizes a neural radiance field: an MLP that maps from
any 3D spatial coordinate and 2D viewing direction to the volume density at that
3D location and color emitted by particles at that location along the 2D viewing
direction. NeRFactor leverages NeRF’s estimated geometry by “distilling” it into a
continuous surface representation that we use to initialize NeRFactor’s geometry. In
particular, we use the optimized NeRF to compute the expected surface location
along any camera ray, the surface normal at each point on the object’s surface, and
the visibility of light arriving from any direction at each point on the object’s surface.
This subsection describes how we derive these functions from the optimized NeRF
and how we re-parameterize them with MLPs so that they can be finetuned after this
initialization step to improve the full re-rendering loss (Figure 2-7).

¹ In this section, vectors and matrices (as well as functions that return them) are in bold; scalars
and scalar functions are not.

[Figure 2-6, top: model diagram. NeRFactor is an all-MLP surface model that predicts the
surface normal 𝑛, light visibility 𝑣, albedo 𝑎, and BRDF latent code 𝑧BRDF for each surface
location 𝑥surf, as well as the lighting condition. 𝑥 denotes 3D locations, 𝜔i represents the
light direction, 𝜔o denotes the viewing direction, and (𝜑d, 𝜃h, 𝜃d) are the Rusinkiewicz
coordinates. NeRFactor does not require supervision on any of the intermediate factors but
rather relies only on priors and a reconstruction loss. Legend: pre-trained and frozen;
pre-trained and jointly finetuned; trained from scratch; pseudo-GT as supervision.]

[Figure 2-6, bottom: example factorization by NeRFactor. Panels: Normals, Visibility, Albedo,
BRDF, and Rendering (novel view; original lighting). Visibility is visualized as the average
light visibility over all incoming directions (i.e., ambient occlusion) and 𝑧BRDF as an RGB
image (same colors mean same materials).]

Figure 2-6: NeRFactor model and its example output. NeRFactor is a surface model
that factorizes, in an unsupervised manner, the appearance of a scene observed under
one unknown lighting condition. It tackles this severely ill-posed problem by using a
reconstruction loss, simple smoothness regularization, and data-driven BRDF priors.
Modeling visibility explicitly, NeRFactor is a physically-based model that supports
hard and soft shadows under arbitrary lighting.

Surface Points Given a camera and a trained NeRF, we compute the location
at which a ray 𝑟(𝑡) = 𝑜 + 𝑡𝑑 cast from the camera center 𝑜 along direction 𝑑 is expected to
terminate according to NeRF’s optimized volume density 𝜎:
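$$
\mathbf{x}_{\text{surf}} = \mathbf{o} + \left( \int_0^\infty T(t)\, \sigma\big(\mathbf{r}(t)\big)\, t \, dt \right) \mathbf{d}, \qquad
T(t) = \exp\left( -\int_0^t \sigma\big(\mathbf{r}(s)\big)\, ds \right),
$$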

where 𝑇 (𝑡) is the probability that the ray travels distance 𝑡 without being blocked.
Instead of maintaining a full volumetric representation, we fix the geometry to lie
on this surface distilled from the optimized NeRF. This enables much more efficient
relighting during both training and inference because we can compute the outgoing
radiance just at each camera ray’s expected termination location instead of every
point along each camera ray.
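
As an illustration of the expected-termination integral, the following is a minimal sketch that approximates it numerically along a single ray. It assumes the piecewise-constant quadrature commonly used for NeRF volume rendering (weights of the form 𝑇 · (1 − exp(−𝜎 Δ𝑡))); the text does not state that NeRFactor discretizes the integral in exactly this way. The names `expected_termination`, `slab_density`, `t_near`, `t_far`, and `n_samples` are hypothetical, and the density function is a toy stand-in for an optimized NeRF.

```python
import math


def expected_termination(origin, direction, sigma,
                         t_near=0.0, t_far=6.0, n_samples=256):
    """Approximate x_surf = o + (integral of T(t) * sigma(r(t)) * t dt) * d.

    `sigma` is any callable mapping a 3D point (list of 3 floats) to a
    non-negative volume density. The integration bounds and sample count
    are illustrative assumptions, not values from the thesis.
    """
    dt = (t_far - t_near) / n_samples
    optical_depth = 0.0  # running integral of sigma along the ray so far
    expected_t = 0.0     # accumulates sum of weight * t
    opacity = 0.0        # sum of weights; near 1 if the ray hits something

    for i in range(n_samples):
        t = t_near + (i + 0.5) * dt  # midpoint of the i-th interval
        point = [o + t * d for o, d in zip(origin, direction)]
        density = sigma(point)
        transmittance = math.exp(-optical_depth)  # T(t)
        # Probability that the ray terminates inside this interval.
        weight = transmittance * (1.0 - math.exp(-density * dt))
        expected_t += weight * t
        opacity += weight
        optical_depth += density * dt

    x_surf = [o + expected_t * d for o, d in zip(origin, direction)]
    return x_surf, expected_t, opacity


def slab_density(point):
    """Toy density: an opaque slab occupying the half-space z >= 1."""
    return 50.0 if point[2] >= 1.0 else 0.0


x_surf, depth, opacity = expected_termination(
    [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], slab_density)
print(x_surf, depth, opacity)
```

For this toy slab, the expected termination point lands slightly behind the front face at z = 1, because the ray penetrates a short distance before being absorbed. The returned `opacity` is only a diagnostic added for this sketch: when it is well below 1, the ray mostly passes through empty space and the expected depth is not a meaningful surface location.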

Surface Normals We compute analytic surface normals 𝑛a(𝑥) at any 3D location
as the negative normalized gradient of NeRF’s 𝜎-volume w.r.t. 𝑥. Unfortunately, the
normals derived from a trained NeRF tend to be noisy (Figure 2-7) and therefore
