Learning Generative Models of 3D Structures
Siddhartha Chaudhuri1,2 Daniel Ritchie3 Jiajun Wu4 Kai Xu5† Hao Zhang6
1 Adobe Research 2 IIT Bombay 3 Brown University 4 Stanford University
5 National University of Defense Technology 6 Simon Fraser University
Figure 1: A sampler of representative results from generative models of 3D shape and scene structures that were learned from data [HKM15, FRS∗12, WLW∗19, LXC∗17].
Abstract
3D models of objects and scenes are critical to many academic disciplines and industrial applications. Of particular interest is the emerging opportunity for 3D graphics to serve artificial intelligence: computer vision systems can benefit from synthetically-generated training data rendered from virtual 3D scenes, and robots can be trained to navigate in and interact with real-world environments by first acquiring skills in simulated ones. One of the most promising ways to achieve this is by learning and applying generative models of 3D content: computer programs that can synthesize new 3D shapes and scenes. To allow users to edit and manipulate the synthesized 3D content to achieve their goals, the generative model should also be structure-aware: it should express 3D shapes and scenes using abstractions that allow manipulation of their high-level structure. This state-of-the-art report surveys historical work and recent progress on learning structure-aware generative models of 3D shapes and scenes. We present fundamental representations of 3D shape and scene geometry and structures, describe prominent methodologies including probabilistic models, deep generative models, program synthesis, and neural networks for structured data, and cover many recent methods for structure-aware synthesis of 3D shapes and indoor scenes.
CCS Concepts
• Computing methodologies → Structure-aware generative models; Representation of structured data; Deep learning; Neural
networks; Shape and scene synthesis; Hierarchical models;
© 2020 The Author(s)
Computer Graphics Forum © 2020 The Eurographics Association and John Wiley & Sons Ltd.
S. Chaudhuri, D. Ritchie, J. Wu, K. Xu, & H. Zhang / Learning Generative Models of 3D Structures
Figure 2: Different representations for the low-level geometry of individual objects / parts of objects [MPJ∗19].
GAN was able to first perform unconditioned synthesis of high-resolution 3D shapes. Liu et al. [LYF17] extended the model to allow interactive shape synthesis and editing. As discussed later in Section 3.1, GANs have since been adapted to other shape representations, achieving great successes with point clouds, octrees, and surface meshes.

Variational autoencoders (VAEs) [KW14] have also demonstrated their potential in unconditional 3D shape synthesis [BLRW16, SHW∗17], especially in domain-specific areas such as face and human body modeling with meshes [RBSB18, TGLX18]. With these modern deep learning tools, researchers have been able to obtain very impressive results on shape synthesis and reconstruction.

These papers mostly ignore the structure of 3D shapes and scenes, and instead aim to synthesize their entire geometry "from scratch." In this survey, we focus on works which learn to synthesize 3D content while explicitly incorporating its internal structure as priors.

3. Structure-Aware Representations

A structure-aware representation of a 3D entity (i.e. a shape or scene) must contain at least two subcomponents. First, it must have some way of representing the geometry of the atomic structural elements of the entity (e.g. the low-level parts of a 3D object). Second, it must have some way of representing the structural patterns by which these atoms are combined to produce a complete shape or scene. In this section, we survey the most commonly-used representations for both of these components.

3.1. Representations of Part/Object Geometry

In computer graphics, 3D geometric objects are mainly represented with voxel grids, point clouds, implicit functions (level sets), and surface meshes (see Figure 2). Other alternatives include parametric (e.g., tensor-product) surfaces, multi-view images [SMKL15], geometry images [GGH02], and constructive solid geometry. For the purpose of shape generation, voxel grids are the simplest representation to work with: a 3D voxel grid is a straightforward extension of a 2D pixel grid, making it easy to transfer deep convolution-based models developed for 2D image synthesis to the 3D domain [WSK∗15, WZX∗16]. However, it is difficult to generate high-detail shapes with such volumetric representations, as the memory cost of the representation scales cubically with the resolution of the voxel grid. This problem can be partially alleviated by using efficient and adaptive volumetric data structures such as octrees [ROUG17, WLG∗17, KL17, TDB17].

With the proliferation of neural networks for point cloud processing [QSMG17, QYSG17], deep generative models for direct synthesis of 3D point clouds have become popular. A point cloud represents only the surface of a 3D object and thus sidesteps the resolution restriction of volumetric representations [FSG17, ADMG18]. However, point clouds are only discrete samples of a surface and do not provide information about the continuous behavior of the surface between point samples.

A compelling alternative to voxel- and point-based representations are implicit representations: functions f(x, y, z) whose output determines whether a point is inside or outside of a surface. Such a function captures the smooth, continuous behavior of a surface and enables sampling the implied surface at arbitrary resolution. Recent efforts have shown that it is possible to use neural networks to learn such functions from data [CZ19, PFS∗19, MPJ∗19, MON∗19]. At present, these methods are the state-of-the-art approaches for generating unstructured, detailed geometry.

Another possible approach is to directly generate a triangle mesh representation of a surface. This is an appealing idea, as meshes are the most commonly used geometry representation for most graphics tasks. An early step in this direction generated "papier-mâché" objects by learning to deform and place multiple regular mesh patches [GFK∗18, YFST18]. Alternatively, one can define a convolution operator on meshes [MBBV15, HHF∗19] and attempt to generate meshes by iterative subdivision + convolution starting from some simple initial mesh. This approach has been used for 3D reconstruction from images [WZL∗18, GG19] and from 3D scans [DN19].

3.2. Representations of Structure

The previous section outlined the most frequently used representations for the individual atomic parts of objects (or the objects within larger scenes). In this section, we survey a range of ways to represent 3D structures composed of atomic parts (or objects), in increasing order of structural complexity.

Segmented geometry. Perhaps the simplest way to add structure to a representation of a 3D entity is to simply associate structural
tion is simply p(T) = ∏_i^n p(r_i). Figure 9 shows an example of a derivation from a toy PCFG that generates branching structures. A more general variant is a parametric PCFG, where each symbol has associated parameters (for example, the size of the corresponding component), and the application of a rule involves sampling the parameters of the RHS conditioned on the parameters of the LHS. Thus, if step i replaces symbol s_i with symbols c_1 … c_k, then

p(T) = ∏_i^n p(c_1 … c_k | φ_i(r_i, s_i)),

where φ_i determines the distribution from which the parameters of the output symbols are drawn. This expression explicitly highlights the Markovian nature of the derivation: each step in the derivation is conditioned only on the previous one, and more specifically on just one symbol – the "parent" – in the intermediate pattern. PCFGs can be thought of as "dynamic" Bayesian networks, whose output is variable-dimensional. This gives them richer expressive power than the "static" graphical models described above.

The dynamic, recursive nature of PCFGs makes them well-suited for modeling structures that are defined by recursive or hierarchical processes. Trees and other types of vegetation are the most common examples, but buildings and other shape structures that feature hierarchies of spatial patterns are also good candidates. Typically, PCFGs model shapes by treating pieces of atomic geometry (i.e. shape parts) as terminal symbols. Since their derivation process is tree-structured by nature, PCFGs can be difficult to adapt to shapes that do not exhibit tree-like structures (e.g. part-based shapes whose connectivity graphs include cycles).

While PCFG rules are typically designed by hand, and probabilities and other parameters estimated from data, an appropriate set of rules can also be learned from examples. This is known as grammar induction, and is a distinctly non-trivial exercise. Examples from shape generation are rare; some are described below.

Probabilistic programs. In the study of the theory of computation, context-free grammars (CFGs) are capable of recognizing any context-free language, whereas a Turing machine can recognize any language (and thus a language which requires a Turing machine to be recognized is said to be Turing-complete). Probabilistic programs are to probabilistic context-free grammars (PCFGs) what Turing machines are to deterministic CFGs: where PCFGs can represent only a certain class of probability distributions over tree structures, probabilistic programs can represent any computable probability distribution [GMR∗08]. In particular, any PCFG can be represented by a probabilistic program.

Practically, probabilistic programs are usually implemented by extending existing deterministic Turing-complete programming languages (e.g. Lisp [GMR∗08], Javascript [GS14], C [PW14], Matlab [WSG11]) with primitive operators for sampling from elementary probability distributions (e.g. Gaussian, Bernoulli). Executing a probabilistic program produces a trace through the program, which is a sequence of all the random sampling decisions made by the program (analogous to a derivation from a PCFG). The probability of a particular trace T is

p(T) = ∏_{x_i ∈ T} p(x_i | φ_i(x_1 … x_{i−1})),

where x_i is the i-th random choice in the trace and φ_i(x_1 … x_{i−1}) encapsulates whatever computation the program performs to compute the parameters of the i-th random choice (e.g. mean and variance, if it is a sample from a Gaussian distribution). Expressed in this form, it is clear how probabilistic programs subsume PCFGs: if the φ function retrieves data from a parent in a tree structure, then this equation can be made to exactly express the probability distribution defined by a PCFG.

Probabilistic programs show their true power as a representation for generative models with their capabilities for inference: most probabilistic programming languages come equipped with built-in operators for sampling conditional probability distributions of the form

p(T | c(T) = true).

That is, they provide the ability to compute how the distribution of execution traces changes when some condition c on the trace is required to be true (e.g. that certain random choices take on specific values). Computing these kinds of conditional distributions is intractable in general, so real-world probabilistic programming languages make sacrifices, either by restricting the types of conditions c that are expressible and/or by only computing approximations to the true conditional distribution.

Whereas PCFGs are restricted to generating tree structures, probabilistic programs can in theory generate any shape structure (e.g. cyclically-connected arches [RLGH15]). This generality comes at a cost in terms of learning, though. While it is difficult to induce a PCFG from data, some approaches do exist. However, there is no known general-purpose algorithm for inducing probabilistic programs from examples. In the domain of shape synthesis, some domain-specific approaches have been proposed, though the kinds of programs they induce are not much more expressive than grammars [HSG11, RJT18].

4.2. Deep Generative Models

The generative models surveyed in the last section were all developed before the deep learning "revolution" which began in approximately 2012. Since that time, new types of generative models with deep neural networks at their core have been developed—so-called deep generative models. In some cases, these models actually have close connections to the "classical" models we have already discussed; we will highlight these connections where they exist.

Autoregressive models. An autoregressive model is an iterative generative model that consumes its own output from one iteration as the input to its next iteration. That is, an autoregressive model examines the output it has generated thus far in order to decide what to generate next. For example, an autoregressive model for generating 3D indoor scenes could synthesize one object at a time, where the decisions about what type of object to add and where to place it are based on the partial scene that has been generated so far (Figure 10). Formally, an autoregressive model defines a probability distribution over vectors x as

p(x) = ∏_{i=1}^{|x|} p(x_i | NN_i(x_1 … x_{i−1})),
Figure 10: An autoregressive model for generating 3D indoor scenes [RWaL19]. The model generates scenes one object at a time (and one
attribute of each object at a time), conditioned on the partial scene that has been synthesized thus far.
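As a toy illustration of this object-at-a-time factorization (the conditional table below is a hypothetical stand-in for a trained network such as the one in Figure 10, and the object categories are purely illustrative), a scene can be sampled and scored one object at a time:

```python
import random

# Hypothetical conditional distribution standing in for the neural network:
# the probability of each next object depends on the partial scene so far.
def next_distribution(partial_scene):
    if "table" in partial_scene:
        return {"chair": 0.6, "lamp": 0.2, "STOP": 0.2}
    return {"table": 0.5, "lamp": 0.3, "STOP": 0.2}

def sample_scene(rng, max_objects=10):
    scene, prob = [], 1.0
    for _ in range(max_objects):
        dist = next_distribution(scene)
        obj = rng.choices(list(dist), weights=list(dist.values()))[0]
        prob *= dist[obj]  # accumulate p(x) = prod_i p(x_i | x_1 ... x_{i-1})
        if obj == "STOP":
            break
        scene.append(obj)
    return scene, prob

rng = random.Random(0)
scene, prob = sample_scene(rng)
print(scene, prob)
```

Each sampled object is fed back into the conditioning context for the next step, which is exactly what makes the model autoregressive (and what makes it susceptible to the drift discussed below).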
where NN_i denotes a neural network. In other words, the model represents a distribution over (potentially high-dimensional) vectors via a product of distributions over each of its entries. In the process of sampling from this distribution, the value for x_{i−1} must be fed into the neural network to sample a value for x_i, which earns this type of model its autoregressive name.

If NN_i is allowed to be a different neural network at each step, this representation is very similar to a probabilistic program, where the parameter function φ is a learnable neural network instead of a fixed function. More commonly, the same neural network is used at every step (e.g. recurrent neural networks are a form of autoregressive generative model that fits this description, as are pixel-by-pixel image generative models [VDOKK16]).

Autoregressive models are well-suited to generating linear structures (i.e. chains), as well as to mimicking structure design processes that involve making a sequence of decisions (e.g. placing the objects in a scene one after another). Their primary weakness is their susceptibility to "drift": if an output of one step differs from the training data in some statistically noticeable way, then, when fed back into the model as input for the next step, it can cause the subsequent output to diverge further. Such errors can compound over multiple prediction steps.

Deep latent variable models. Recently, two generative frameworks have become extremely popular in deep neural network research: variational autoencoders (VAEs) and generative adversarial networks (GANs). Both these methods allow us to efficiently sample complex, high-dimensional probability distributions, such as that specified by a plausibility prior over shapes, images or sentences. This is a classic hard problem, traditionally addressed with slow, iterative methods like Markov Chain Monte Carlo. Under the right conditions, if the distribution is represented by a set of landmarks (i.e. an unsupervised training set), sampling can be reduced to just a forward pass through a neural network, conditioned on random noise from a known distribution.

The core idea behind both VAEs and GANs is the following: let p : X → R be the probability function for some hard-to-sample distribution over (say) 3D shapes X. We will sample instead from some "easy" distribution (typically the standard Gaussian) over a low-dimensional latent space Z, and learn a generator function G : Z → X that maps latent vectors to actual objects. Training tries to make p(z) = p(G(z)) for all z ∈ Z, so that sampling from the latent space, followed by decoding with the generator, is equivalent to sampling from the target distribution. VAEs and GANs are different methods to learn a good generator G.

A VAE [KW14] can be thought of as maximizing a variational Bayes objective p(x) = ∫ p(x | z) p(z) dz over all training examples x ∈ X, where p(x | z) is modeled by the generator. We desire p(x | z) to be smooth and compact (a single latent should generate similar outputs): this is achieved indirectly by making p(z | x) smooth and compact (a single output can be traced back to a small range of similar latents). A VAE enforces this with an encoder neural network that maps a training input x to a small Gaussian N(µ(x), σ(x)) in latent space. Samples from this Gaussian are fed to the generator, also modeled with a neural network, and the final output is constrained to be identical to the input. Thus, a VAE resembles an autoencoder where the bottleneck produces a compact distribution rather than a single latent vector. Since the prior distribution of latents is assumed to be (typically) the standard normal, we also constrain the mixture of Gaussians ∑_{x∈X} N(µ(x), σ(x)) to match the prior. At test time, we simply sample the standard normal in latent space and decode the sampled vector to an actual object using the generator. Figure 11 illustrates the framework.

A GAN [GPAM∗14] dispenses with the encoder network and instead uses the training exemplars to develop a "learned loss function", modeled by a discriminator network D. The generator tries to map latent vectors sampled from the standard normal prior to plausible objects. The discriminator tries to distinguish "fake" objects produced by the generator from "real" objects from the training set. This constitutes the adversarial nature of the framework, with each network trying to outdo the other. As the generator gets better at fooling the discriminator, the discriminator also gets better at distinguishing fake from real. Training minimizes over G, and maximizes over D, the loss

E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

Both VAEs and GANs can be conditioned on additional input, to enable synthesis guided by user actions or data in other modalities. They are suitable for modeling a wide variety of shape structures, depending upon the specific architectures chosen for their decoder/generator networks (see the next section for more on this). The use of a global latent variable to control the generation of the structure encourages global coherence in the output, in contrast to the tendency of autoregressive models to drift (i.e. to generate locally-plausible but globally-incoherent structures). It is also possible to combine latent variable and autoregressive models, as an autoregressive model such as a recurrent neural network can be used as the decoder/generator of a VAE/GAN.

4.3. Neural Networks for Structured Data

The deep generative models defined in the last section used neural networks, which we denoted as "NN," as critical building blocks. The precise form (or architecture, in neural network parlance) that these networks take depends upon the type of data being modeled. In this section, we survey the types of neural network architectures that are most appropriate for the structured data representations discussed in Section 3.2.

NNs on trees: recursive neural networks. The layout of parts of a shape inherently induces a non-Euclidean, graph-based topology defined by inter-part relations. The main challenge in handling such graph-based data with deep NNs is the non-uniform data format: the number of parts is arbitrary even for shapes belonging to the same category. In the field of natural language processing, researchers adopt a special kind of NN, named Recursive Neural Networks (RvNN), to encode and decode sentences of arbitrary length [SLNM11]. The premise is that a sentence can be represented with a parse tree, following which the word vectors can be aggregated and expanded in a recursive manner, up and down the tree. The general form of an RvNN is recursive bottom-up merging of tree nodes for encoding:

x_p = tanh(W_e · [x_l x_r] + b_e),

and top-down splitting for decoding:

[x_l x_r] = tanh(W_d · x_p + b_d),

where x_p, x_l, x_r represent the feature vectors of a parent node and its left and right children, respectively, and W_e ∈ R^{n×2n}, b_e ∈ R^n, W_d ∈ R^{2n×n}, and b_d ∈ R^{2n} are network parameters to be learned.

NNs on graphs: graph convolutional networks. In recent years, much research effort has been devoted to the application of neural network models to arbitrarily structured graphs. The most prevalent approach has been to generalize the concept of convolution from regular lattices to irregular graphs, which allows the construction of graph convolutional networks (GCNs). While some approaches to defining convolution on graphs are domain-specific [DMI∗15, HHF∗19], the most popular approaches are general-purpose and fall into one of two camps: spectral graph convolution or message passing neural networks.

Spectral graph convolution approaches define convolutional networks on graphs by approximating the equations for local spectral filters on graphs (i.e. multiplying a per-node signal by a filter in the frequency domain) [HBL15, KW17]. Just as image-based CNNs compute a sequence of per-pixel learnable feature maps, spectral GCNs compute a sequence of per-node learnable feature maps H^(l) via update rules like the following:

H^(l+1) = σ(A H^(l) W^(l)),

where W^(l) is a weight matrix for the l-th neural network layer, σ(·) is a non-linear activation function, and A is related to the graph's Laplacian matrix [KW17].

Recently, message passing approaches (sometimes called message passing neural networks or MPNNs) have become more widespread than spectral approaches. Rather than operating in the frequency domain, MPNNs define convolution in the spatial domain by direct analogy to spatial convolution on images: the output of a convolution f(h_v) on some signal h_v at graph vertex v is some learned combination of the signal values at neighboring vertices. For a more detailed explanation, see Wang et al. [WLW∗19], which adopts MPNNs plus an autoregressive generative modeling approach for synthesizing indoor scene relationship graphs.
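As a minimal sketch of a spectral-style graph convolution layer (using the renormalized adjacency Â = D^{−1/2}(A + I)D^{−1/2} popularized by [KW17]; the graph, feature sizes, and random weights below are arbitrary stand-ins for learned parameters):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One spectral-style graph convolution: H' = relu(A_hat @ H @ W).

    A: (n, n) binary adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) weight matrix. Uses the renormalized adjacency
    A_hat = D^{-1/2} (A + I) D^{-1/2}.
    """
    A_self = A + np.eye(A.shape[0])              # add self-loops
    d = A_self.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_self @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)        # ReLU non-linearity

# A 4-node path graph, 2-d input features, 3-d output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 2))
W = np.random.default_rng(1).normal(size=(2, 3))
print(gcn_layer(A, H, W).shape)  # (4, 3)
```

Each output row mixes a node's own features with those of its neighbors, which is the graph analogue of a local image filter; stacking such layers grows each node's receptive field over the graph.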
4.4. Program Synthesis

Finally, we highlight techniques for generating the program-based representations discussed at the end of Section 3.2. These techniques belong to the field of program synthesis, which is a much broader field of study that originated with and includes applications to domains outside of computer graphics (e.g. synthesizing spreadsheet programs from examples [Gul11]). As mentioned in Section 3.2, within the domain of shape synthesis, programs are well-suited to modeling any shape structure, as a program can be written to output many types of structure (linear chains, trees, graphs, etc.).

Constraint-based program synthesis. Program synthesis is a field with a lengthy history, with many different techniques proposed based on search, enumeration, random sampling, etc. [GPS17]. Of late, the most prominent of these methods is constraint-based synthesis. In this formulation, the user first specifies the grammar of a domain-specific language, or DSL, which is the language in which synthesized programs will be expressed (for 3D shapes, this might be constructive solid geometry or some related CAD language). This DSL defines the space of possible programs that can be synthesized. Constraint-based program synthesis then attempts to solve the following optimization problem:

argmin_{P ∈ DSL} cost(P),  s.t.  c(P) = 0,

where P is a program expressible in the DSL and c is some constraint function. In other words, the synthesizer tries to find the minimum-cost program which is consistent with some user-provided constraints. The constraint function c can take many forms; it is most commonly used to enforce consistency with a set of user-provided (input, output) example pairs (i.e. programming by demonstration). Cost is often defined to be proportional to program length / complexity, so as to encourage simpler programs that are more likely to generalize to unseen inputs (via an appeal to Occam's razor).

The optimization problem defined above is an intractable combinatorial problem, and the most challenging part of the problem is merely finding any program which satisfies the constraints. One approach is to convert this search problem into a Boolean satisfiability (SAT) problem, and then apply a state-of-the-art SAT solver [DMB08]. The Sketch system is perhaps the most well-known instance of this kind of constraint-based program synthesizer [Lez08]. While such solvers cannot always return a solution in a reasonable amount of time (in the worst case, searching for a satisfying program has exponential time complexity), they often perform well in practice, and they guarantee that the program they return is in fact the globally-optimal (i.e. minimum-cost) program.

Neural program synthesis. Beyond traditional program synthesis algorithms that rely mostly on hand-crafted constraints, neural nets can also be incorporated for more effective and efficient synthesis. There are at least two ways that neural nets can be helpful. First, they can be used to speed up search-based synthesis by suggesting which search sub-trees to explore first, or whether a sub-tree should be explored at all [KMP∗18, LMTW19]. Similarly, they may also be used to suggest parameters/attributes of function calls or programs [BGB∗17, ERSLT18], to suggest program sketches [Lez08, MQCJ18], to include guidance from program execution [CLS19, TLS∗19, ZW18, WTB∗18], or to induce subroutines from existing programs [EMSM∗18]. The main intuition behind these approaches is that deep networks can learn the distribution of possible programs from data and use that to guide the search.

Neural nets can also completely replace search-based synthesis by learning to generate program text given input requirements. Menon et al. [MTG∗13] explored the use of data-driven methods for program induction. The idea has been extended in various ways to incorporate deep learning for program synthesis [DUB∗17, PMS∗17, RDF16]. For example, RobustFill [DUB∗17] is an end-to-end neural program synthesizer designed for text editing. These approaches rely purely on neural nets, leading to a huge increase in speed, but at the expense of flexibility, generality, and completeness: while neural nets are good at capturing complex patterns in data, they face challenges when generalizing to novel, more complex scenarios. Such generalization can be critical for successful program synthesis itself, as well as for downstream tasks such as 3D shape and scene synthesis.

As we will see later in Section 7, the graphics and vision communities have applied and extended neural program synthesis approaches for procedural shape and scene synthesis.

4.5. Summary

To summarize our discussion of different synthesis methods for structured shapes and scenes, we provide a flowchart detailing the circumstances under which each should be used (Figure 12). When only a few training examples are available, or the input provided to the system is some other form of user-specified constraint, constraint-based program synthesizers are the best fit (though they require some effort up-front in order to author the DSL in which output programs will be synthesized). For small datasets of examples, deep learning methods can't be used, so more "classic" probabilistic models are the best bet. If the type of content to be modeled has a fixed structure (e.g. a fixed grid structure, or some maximum number of parts which can only connect in so many ways), probabilistic graphical models such as Bayesian networks are applicable. If the structure can vary dynamically across the dataset, a PCFG or probabilistic program is a better fit (though learning algorithms are only known for tree-structured content). Lastly, if a large amount of training data is available, deep neural networks are the best choice. If the structures to be generated feature strong global coherence, then a latent variable model such as a VAE or GAN is desirable. For more flexible, locally-coherent structures, an autoregressive model may suffice. Regardless, the particular neural network architecture used to generate the structure depends on the type of structure: RNNs are the natural choice for linear chains, RvNNs are specialized for trees, and GCNs handle general graphs. Finally, if the application at hand demands a readable and editable representation of the output structure, neural program synthesis methods should be investigated.

5. Application: Part-based Shape Synthesis

Structural representations of individual 3D objects typically rely on compact encodings of discrete components of a shape. These com-
Figure 12: A flowchart detailing the conditions to which each structure synthesis method discussed in Section 4 is best suited.
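The decision logic of this flowchart can also be written down as a small selection function; the string labels below are ours and purely illustrative, not a formal taxonomy:

```python
def choose_method(data_size, structure=None, editable_output=False):
    """Select a structure-synthesis method following Figure 12.

    data_size: "examples" (a handful of examples / user constraints),
               "small" (small dataset), or "big" (big data).
    structure: with small data, "fixed" vs. "dynamic" topology;
               with big data, "chain", "tree", or "graph".
    """
    if data_size == "examples":
        return "constraint-based program synthesis"
    if data_size == "small":
        if structure == "fixed":
            return "probabilistic graphical model (e.g. Bayesian network)"
        return "PCFG / probabilistic program"
    # Big data: deep networks, specialized by output structure.
    if editable_output:
        return "neural program synthesis"
    return {"chain": "RNN", "tree": "RvNN", "graph": "GCN"}[structure]

print(choose_method("big", "tree"))  # RvNN
```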
[Figure: grammar induction from exemplars. Candidate grammars range from most general to most specific; the Bayes-optimal grammar maximizes P(X | G) P(G).]
The above methods rely on shapes having a fixed structural topology, or a set of topologies represented as deformable templates. Hence, they typically require all parts to conform to a small set of labels. However, shapes in the real world can have (a) extensive structural variation, (b) difficult-to-label components, and (c) fine-grained details that themselves have compositional structure.

To address these challenges, Li et al. [LXC∗ 17] propose GRASS, a recursive VAE that models layouts of arbitrary numbers of unlabeled parts, with relationships between them. Based on the observation that shape parts can be hierarchically organized [WXL∗ 11], the method utilizes an RvNN for encoding and decoding shape structures. The hierarchical organization follows two common perceptual/cognitive criteria for recursive merging: a mergeable subset of parts is either an adjacent pair, or part of a symmetry group. Three different symmetries are each represented by a generator part plus further parameters: (1) pairwise reflectional symmetry; (2) k-fold rotational symmetry; and (3) k-fold translational symmetry. Separate adjacency and symmetry encoders/decoders handle the two criteria, and a classifier is trained in parallel to predict which should be applied during decoding.

Training data is created by randomly sampling multiple hierarchies that respect the merging criteria [WXL∗ 11] for each segmented (but not labeled) training shape. Despite the absence of consistent, ground-truth training hierarchies, the low-capacity RvNN modules converge to predict relatively consistent hierarchies for any input shape. Figure 17 shows an illustration of the symmetry hierarchy of a chair model and its RvNN encoder. Since the encoder and decoder are linked by a generative VAE, the method can sample feature vectors from the latent space and decode them to synthesize novel shapes, complete with part hierarchies. A GAN loss improves the plausibility of the output, and a separate network generates fine-grained part geometry for the synthesized layout. Figure 17 also shows the RvNN pipeline, and topology-varying interpolation via the latent space.

Figure 17: Top: The symmetry hierarchy of a chair model and the corresponding RvNN encoder. Structure decoding reverses the process. Middle: The recursive VAE-GAN architecture of GRASS [LXC∗ 17]. Bottom: An example of topology-varying shape interpolation with GRASS. Note the changes in spoke, slat and arm counts.

Recent work by Zhu et al. [ZXC∗ 18] uses an RvNN to iteratively refine a rough part assembly into a plausible shape, optionally adding or removing parts as needed. The method learns a substructure prior to handle novel structures that only locally resemble training shapes. Mo et al. [MGY∗ 19a] extended the RvNN approach with graph neural networks as the encoder/decoder modules, instead of traditional single- or multi-layer perceptrons as in GRASS. This allows direct encoding of n-ary adjacencies, leading to better reconstruction and more cognitively plausible hierarchies. In very recent work, the same group proposed structure-aware shape editing by encoding and decoding shape differences (deltas) using the hierarchical graph representation [MGY∗ 19b]. Dubrovina et al. [DXA∗ 19] also proposed structure-aware editing, by factorizing the latent space of an autoencoder into independent subspaces for different part types. However, their network is not, strictly speaking, a generative model.

Structure-aware models as priors for cross-modal synthesis. The learned models described above can not only be sampled directly for unconstrained shape synthesis, they can also be used to generate shapes from cues in other modalities. For instance, linguistic predicates can be used to interactively create new 3D shapes guided by discrete [CKGF13] or continuous [YCHK15] generative priors. Similarly, structure-aware priors have been used to reconstruct 3D shapes from incomplete, noisy and partially occluded scans [SKAG15, ZXC∗ 18], as well as from 2D RGB images [NLX18]. A full discussion of existing and anticipated work in this space is beyond the scope of the current report.

6. Application: Indoor Scene Synthesis

With the resurgence of computer surveillance, robotics, virtual and augmented reality (VR/AR), as well as home safety and intelligence applications, there has been an increasing demand for virtual models of indoor scenes. At the same time, the recent rapid developments of machine learning techniques have motivated modern approaches to scene analysis and synthesis to adopt various data-driven and learning-based paradigms. In this section, we survey notable works on indoor scene synthesis with an emphasis on learning and generating scene structures. We direct the reader to [Ma17] for a more comprehensive, albeit less up-to-date, coverage of this topic, as well as scene analysis and understanding.

There are different ways to examine or classify scene synthesis methods. In terms of input, earlier works on probabilistic models, e.g., [FRS∗ 12], generate a new scene by taking a random sample from a learned distribution, while recent works on deep generative neural networks, e.g., [LPX∗ 19], can produce a novel scene from a random noise vector. The input can also be a hand sketch [XCF∗ 13], a photograph [ISS17, LZW∗ 15], natural language commands [MGPF∗ 18], or human actions/activities [FLS∗ 15, MLZ∗ 16]. In terms of output, while most methods have been designed to generate room layouts with 3D furniture objects, some methods learn to produce floor or building plans [MSK10, WFT∗ 19]. Our coverage will mainly organize the methods based on the methodologies applied, e.g., probabilistic models vs. deep neural networks, interactive vs. fully automated modeling, and holistic approaches vs. progressive synthesis.

Shape parts vs. scene objects. The basic building blocks of an indoor scene are 3D objects whose semantic and functional composition and spatial arrangements follow learnable patterns, but still exhibit rich variations in object geometry, style, arrangements, and other relations, even within the same scene category (e.g., think of all the possible layouts of kitchens, living rooms, and bedrooms). It is natural to regard parts in a 3D object in the same vein as objects in a 3D scene. However, while object-level part-part relations and scene-level object-object relations are both highly structured, there are several important distinctions between them which have dictated how the various scene generation methods are designed:

• As first observed by Wang et al. [WXL∗ 11] and then reinforced by various follow-up works, the constituent parts of a 3D object are strongly governed by their connectivity and symmetry relations. In contrast, relations between objects in a scene are much looser. For example, many objects which are clearly related, e.g., sofas and sofa tables, or desks and chairs, are rarely physically connected. Also, due to perturbations arising from human usage, symmetries (e.g., the rotational symmetry of dining chairs around a dining table) between objects are often not realized.

• In a 3D scene, closely related objects may not even be in close proximity. For example, TVs and sofas are often found together in living rooms as they are related by the human action “watching TV”, but that very action dictates that TVs are almost never close to sofas geometrically. Indeed, the object-object relations in a scene are often more about semantics or functionality than geometric proximity. Unlike symmetry and connectivity, which are central to part-part relations in objects, functional and semantic relations are more difficult to model and learn.

• Alignment between objects is often a prerequisite to further analysis. Despite the rich geometric and structural variations that may exist within a category of man-made objects, e.g., chairs or airplanes, a good alignment is typically achievable. Such an alignment would bring out more predictability in the positioning of object parts, so as to facilitate subsequent learning tasks. On the other hand, it is significantly more difficult, or even impossible, to find reasonable alignments between scene layouts, even if they are all of kitchens, bedrooms, or living rooms.

Scene retrieve-n-adjust. A straightforward way to generate new scenes when given a large dataset of available scenes is to retrieve “pieces” of sub-scenes from the dataset, based on some search criteria, and then adjust the retrieved pieces when composing them into the final scene. While there may not be a formal training process in these approaches, some rules/criteria often need to be learned from the data to guide the scene adjustment step.

In Sketch2Scene, Xu et al. [XCF∗ 13] propose a framework that turns a 2D hand drawing of a scene into a 3D scene composed of semantically valid and well-arranged objects. Their retrieval-based approach centers around the notion of structure groups, which are extracted from an existing 3D scene database; this can be regarded as a learning process. A structure group is a compact summarization of re-occurring relationships among objects in a large database of 3D indoor scenes. Given an input sketch, their method performs co-retrieval and co-placement of relevant 3D models based on the “learned” structure groups to produce a 3D scene.

Given a single photograph depicting an indoor scene, Liu et al. [LZW∗ 15] reconstruct a corresponding 3D scene by first segmenting the photo into objects and estimating their 3D locations. This is followed by 3D model retrieval from an existing database, by matching the line drawings extracted from the 2D objects to line features of the 3D models in the database.

Ma et al. [MGPF∗ 18] use natural language commands as a prior to guide a classical retrieve-n-adjust paradigm for 3D scene generation. Natural language commands from the user are first parsed and then transformed into a semantic scene graph that is used to retrieve corresponding sub-scenes from an existing scene database. Known scene layouts from the database also allow the method to augment retrieved sub-scenes with other objects that may be implied by the scene context. For example, when the input command only speaks of a TV being in the living room, a TV stand is naturally inserted to support the TV object. The 3D scene is generated progressively by processing the input commands one by one. At each step, a 3D scene is synthesized by “splicing” the retrieved sub-scene, with augmented objects, into the current scene.

Probabilistic synthesis. Example-based approaches to scene synthesis [MSK10, FRS∗ 12, JLS12, MSSH14, ZHG∗ 16, SCH∗ 16] rely on probabilistic reasoning over exemplars from scene databases. Fisher et al. [FRS∗ 12] develop Bayes networks and Gaussian Mixture Models (GMMs) to learn two probabilistic models for modeling sub-scenes (e.g., an office corner or a dining area): (a) object occurrence, which indicates what objects should be placed in the scene, and (b) layout optimization, which controls where the objects ought to be placed. Given an example scene, new variants can be synthesized based on the learned priors. For floor plan generation, Merrell et al. [MSK10] learn structured relations between features of architectural elements using a probabilistic graphical model.
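The two learned components in an example-based model of this kind (object occurrence, then placement) can be sketched as follows. This is a schematic toy, not the actual implementation of Fisher et al. [FRS∗ 12]: the categories, mixture parameters, and function names below are purely illustrative, and the Bayes network is reduced to independent occurrence probabilities.

```python
import random

def sample_scene(occurrence, placement_gmm, rng):
    """Toy example-based synthesis: (a) sample which objects occur,
    (b) sample where each occurring object goes from a per-category
    2D Gaussian mixture (axis-aligned, for simplicity)."""
    scene = []
    for category, prob in occurrence.items():
        if rng.random() < prob:                      # (a) object occurrence
            weights, means, stds = placement_gmm[category]
            k = rng.choices(range(len(weights)), weights=weights)[0]
            x = rng.gauss(means[k][0], stds[k][0])   # (b) layout sample
            y = rng.gauss(means[k][1], stds[k][1])
            scene.append((category, x, y))
    return scene

rng = random.Random(0)
# Hypothetical "office corner" sub-scene priors, hand-set for illustration.
occurrence = {"desk": 1.0, "chair": 1.0, "lamp": 0.5}
placement_gmm = {
    "desk":  ([1.0], [(0.0, 0.0)], [(0.1, 0.1)]),
    "chair": ([0.5, 0.5], [(0.0, -0.8), (0.0, 0.8)], [(0.1, 0.1), (0.1, 0.1)]),
    "lamp":  ([1.0], [(0.5, 0.0)], [(0.05, 0.05)]),
}
scene = sample_scene(occurrence, placement_gmm, rng)
```

In a learned system, the occurrence probabilities and mixture parameters would of course be fit to exemplar scenes rather than hand-set.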
Figure 21: Scene synthesis pipeline of PlanIT [WLW∗ 19]. Relation graphs are automatically extracted from scenes (left), which are used to
train a deep generative model of such graphs (middle). In the figure, the graph nodes are labeled with an icon indicating their object category.
Image-based reasoning is then employed to instantiate a graph into a concrete scene by iterative insertion of 3D models for each node (right).
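A single round of the kind of message passing used in such graph generative models can be sketched as a toy mean-aggregation step over a relation graph. The graph, feature vectors, and linear update rule below are illustrative only; they are not PlanIT's actual architecture, which uses learned message and update networks.

```python
def message_passing_round(node_feats, edges):
    """One round of message passing: each node averages its neighbors'
    feature vectors and mixes the result into its own embedding."""
    neighbors = {v: [] for v in node_feats}
    for u, v in edges:                # undirected relation edges
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for v, feat in node_feats.items():
        if neighbors[v]:
            agg = [sum(node_feats[u][i] for u in neighbors[v]) / len(neighbors[v])
                   for i in range(len(feat))]
        else:
            agg = [0.0] * len(feat)
        # Toy update: equal mix of own feature and aggregated message.
        updated[v] = [0.5 * f + 0.5 * a for f, a in zip(feat, agg)]
    return updated

# Toy relation graph: a bed adjacent to two nightstands.
feats = {"bed": [1.0, 0.0], "nightstand1": [0.0, 1.0], "nightstand2": [0.0, 1.0]}
edges = [("bed", "nightstand1"), ("bed", "nightstand2")]
out = message_passing_round(feats, edges)
```

Stacking several such rounds lets each node's embedding reflect progressively larger graph neighborhoods before node or edge predictions are made.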
with respect to human activities and they would guide the scene synthesis based on the predicted activities in the template.

Ma et al. [MLZ∗ 16] introduce action-driven 3D scene evolution, which is motivated by the observation that in real life, indoor environments evolve due to human actions or activities which affect object placements and movements; see Figure 20. The action model in their work defines one or more human poses, objects, and their spatial configurations that are associated with specific human actions, e.g., “eat dinner at table while sitting”. Importantly, the action models are learned from annotated photographs in the Microsoft COCO dataset, not from 3D data, which is relatively scarce, especially in terms of action labels and object/pose designations. Partial orders between the actions are analyzed to construct an action graph, whose nodes correspond to actions and whose edges encode transition probabilities between actions. Then, to synthesize an indoor scene, an action sequence is sampled from the action graph, where each sampled action would imply plausible object insertions and placements based on the learned action models.

Deep generative models. Highly structured models, including indoor scenes and many man-made objects, could be represented as a volume or using multi-view images and undergo conventional convolutional operations. However, such operations are oblivious to the underlying structures in the data, which often play an essential role in scene understanding. This could explain in part why deep convolutional neural networks, which have been so successful in processing natural images, had not been as widely adopted for the analysis or synthesis of 3D indoor scenes until recently.

Wang et al. [WSCR18] learn deep convolutional priors for indoor scene synthesis. Specifically, the deep convolutional neural network they develop generates top-down views of indoor scene layouts automatically and progressively, from an input consisting of a room architecture specification, i.e., walls, floor, and ceiling. Their method encodes scene composition and layout using multi-channel top-view depth images. An autoregressive neural network is then trained with these image channels to output object placement priors as a 2D distribution map. Scene synthesis is performed via a sequential placement of new objects, guided by the learned convolutional object placement priors. In one of the follow-ups, Ritchie et al. [RWaL19] develop a deep convolutional generative model which allows faster and more flexible scene synthesis.

More recently, Wang et al. [WLW∗ 19] present PlanIT, a “plan-and-instantiate” framework for layout generation, which is applicable to indoor scene synthesis. Specifically, their method plans the structure of a scene by generating a relationship graph, and then instantiates a concrete 3D scene which conforms to that plan; see Figure 21. Technically, they develop a deep generative graph model based on message-passing graph convolution, and a specific model architecture for generating indoor scene relationship graphs. In addition, they devise a neurally-guided search procedure for instantiating scene relationship graphs, which utilizes image-based scene synthesis models that are constrained by the input graph.

Like Wang et al. [WLW∗ 19], Zhou et al. [ZWK19] also develop a graph neural network to model object-object relations in 3D indoor scenes, and a message-passing scheme for scene augmentation. Specifically, their network is trained to predict a probability distribution over object types that provide a contextual fit at a scene location. Rather than focusing on relations between objects in close proximity, the relation graphs learned by Zhou et al. [ZWK19] encode both short- and long-range relations. Their message passing relies on an attention mechanism to focus on the most relevant surrounding scene context when predicting object types.

Li et al. [LPX∗ 19] introduce GRAINS, a generative neural network which can generate plausible 3D indoor scenes in large quantities and varieties. Using their approach, a novel 3D scene can be generated from a random vector drawn from a Gaussian in a fraction of a second, following the pipeline shown in Figure 22. Their key observation is that indoor scene structures are inherently hierarchical. Hence, their network is not convolutional; it is a recursive neural network [SLNM11] (RvNN), which learns to generate hierarchical scene structures consisting of oriented bounding boxes for scene objects that are endowed with semantic labels.

Using a dataset of annotated scene hierarchies, they train a variational recursive autoencoder, or RvNN-VAE, which performs scene object grouping during its encoding phase and scene generation during decoding. Specifically, a set of encoders is recursively applied to group 3D objects in a scene, bottom up, and encodes information about the objects and their relations, where the resulting
fixed-length codes roughly follow a Gaussian distribution. A new scene can then be generated top-down, i.e., hierarchically, by decoding from a randomly generated code.

Figure 22: Overall pipeline of scene generation using GRAINS [LPX∗ 19]. The decoder of the trained RvNN-VAE (variational autoencoder) turns a randomly sampled code from the learned distribution into a plausible indoor scene hierarchy composed of oriented bounding boxes (OBBs) with semantic labels. The labeled OBBs are used to retrieve 3D objects to form the final 3D scene.

Similar to GRAINS, Zhang et al. [ZYM∗ 18] train a generative model which maps a normal distribution to the distribution of primary objects in indoor scenes. They introduce a 3D object arrangement representation that models the locations and orientations of objects, based on their size and shape attributes. Moreover, their scene representation is applicable to 3D objects with different multiplicities or repetition counts, selected from a database. Their generative network is trained using a hybrid representation, by combining discriminator losses for both a 3D object arrangement representation and a 2D image-based representation.

Most recently, Wu et al. [WFT∗ 19] develop a data-driven technique for automated floor plan generation when given the boundary of a residential building. A two-stage approach is taken to imitate the human design process: room locations are estimated first, and then walls are added while conforming to the input building boundary. They follow a heuristic generative process which begins with the living room and then continues to generate other rooms progressively. Specifically, an encoder-decoder network is trained to generate walls first, where the results can be rough; this is followed by refining the generated walls to vector representations using dedicated rules. A key contribution of their work is RPLAN, a large-scale, densely annotated dataset of floor plans from real residential buildings, which is used to train their neural network.

7. Application: Visual Program Induction

In addition to synthesizing novel 3D content, structure-aware generative models can also be used to infer the underlying structure of existing 3D content (to facilitate editing and other downstream processing tasks). Here, the visual input can be 3D shapes and scenes (voxels, point clouds, etc.), or even 2D images. In this section, we focus on one particularly active research area along this theme: visual program induction, i.e., synthesizing a plausible program that recreates an existing piece of 3D content. While there are non-learning-based instances of this paradigm [DIP∗ 18, NAW∗ 19], we focus on learning-based methods in this section.

One line of work attempts to recover shape-generating programs from an existing 3D shape. The earliest instances of this paradigm are works which learn to reconstruct 3D shapes via a set or sequence of simple geometric primitives, which can be interpreted as very simple shape-generating programs [TSG∗ 17, ZYY∗ 17]. As shown in Figure 23, the model from Tulsiani et al. [TSG∗ 17] learns to decompose shapes into primitives and uses Chamfer distance, which computes the difference between the reconstructed shape and the input shape, as the loss function during training. A more complex class of program is constructive solid geometry (CSG); Sharma et al. [SGL∗ 18] presented an autoregressive generative model which synthesizes CSG programs that represent a given input shape (in both 2D and 3D domains).

Figure 23: Overall pipeline of reconstructing 3D shapes via a set of geometric primitives [TSG∗ 17]. The model learns to decompose shapes into primitives and uses Chamfer distance, which computes the difference between the reconstructed shape and the input shape, as the loss function to train the inference network.

Very recently, Tian et al. [TLS∗ 19] presented an autoregressive model that takes an input 3D shape (represented as a volumetric grid) and outputs an imperative 3D shape program consisting of loops and other high-level structures (Figure 24). Their system also includes a neural network for executing these programs in a differentiable manner.
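The symmetric Chamfer distance used as the reconstruction loss in the primitive-fitting approaches above can be written directly. Below is a minimal point-set version using squared Euclidean distances, averaged in each direction; real systems compute it over sampled surface points, typically on the GPU.

```python
def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two point sets: the mean over A
    of the squared distance to the nearest point in B, plus the same
    term with A and B swapped."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    a_to_b = sum(min(sq_dist(p, q) for q in B) for p in A) / len(A)
    b_to_a = sum(min(sq_dist(q, p) for p in A) for q in B) / len(B)
    return a_to_b + b_to_a
```

Because the minimum over nearest neighbors is differentiable almost everywhere with respect to the point coordinates, this loss can directly supervise a network that predicts primitive parameters.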
Figure 24: The shape program representation aims to explain an input shape with high-level programs that encode basic parts, as well as their structure, such as repetition [TLS∗ 19].
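A shape program in this spirit is simply imperative code over geometric primitives. The little language below (a `cuboid` statement plus a `for` loop that repeats its body with an offset) is our own illustrative stand-in, not the actual grammar of Tian et al.

```python
def execute(program):
    """Toy executor: interpret ('cuboid', x, y, z, size) statements and
    ('for', n, dx, body) loops, returning the list of placed cuboids."""
    cuboids = []
    for stmt in program:
        if stmt[0] == "cuboid":
            cuboids.append(stmt[1:])
        elif stmt[0] == "for":              # repeat body n times, shifted in x
            _, n, dx, body = stmt
            for i in range(n):
                for sub in execute(body):
                    x, y, z, size = sub
                    cuboids.append((x + i * dx, y, z, size))
    return cuboids

# "Draw a table": one top slab plus four legs expressed as a repeated pair.
table = [
    ("cuboid", 0.0, 0.0, 1.0, 2.0),                       # table top
    ("for", 2, 1.5, [("cuboid", 0.0, 0.0, 0.0, 0.2),      # two legs...
                     ("cuboid", 0.0, 1.0, 0.0, 0.2)]),    # ...per iteration
]
parts = execute(table)
```

The loop makes the repetition structure (here, the four legs) explicit, which is exactly the kind of regularity a program representation captures and a flat primitive list does not.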
Figure 25: The model for inferring and executing 3D shape programs [TLS∗ 19]. At each step, it infers the next program block (e.g. a set of cuboids as legs), executes the program block, and compares the current shape with the input to suggest what to synthesize next.
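The inner loop of Figure 25 (infer a block, execute it, compare with the target) can be sketched generically. The greedy proposer and voxel-set shape representation below are illustrative stand-ins for the learned inference network and neural executor, purely to show the control flow.

```python
def infer_program(target_voxels, candidate_blocks, max_steps=10):
    """Greedy infer-execute-compare sketch: at each step, add the candidate
    block whose voxels most reduce the mismatch with the target shape;
    stop when no candidate improves the reconstruction."""
    program, current = [], set()
    def mismatch(shape):
        return len(shape ^ target_voxels)   # symmetric set difference
    for _ in range(max_steps):
        best = min(candidate_blocks,
                   key=lambda b: mismatch(current | b[1]))
        if mismatch(current | best[1]) >= mismatch(current):
            break                            # no block improves the match
        program.append(best[0])
        current |= best[1]
    return program, current

# Target: an L-shape made of two bars; candidates include a distractor.
blocks = [
    ("vertical_bar",   {(0, 0), (0, 1), (0, 2)}),
    ("horizontal_bar", {(0, 0), (1, 0), (2, 0)}),
    ("distractor",     {(5, 5)}),
]
target = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)}
program, shape = infer_program(target, blocks)
```

In the learned system, the proposer is an autoregressive network conditioned on both the input shape and the shape produced so far, rather than an exhaustive greedy search.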
produce; heavy reliance on them limits the domains in which these methods can be applied. It would be better if structure-aware generative models could be learned directly from raw 3D content (e.g. meshes or point clouds) without having to manually annotate the structure of each training item.

One potential way to do this is self-supervised learning (sometimes also called predictive learning). The idea is to use a training procedure in which the model generates a structure-aware representation, converts that representation into a raw, unstructured representation, and then compares this to the ground-truth unstructured geometry. Training to minimize this comparison loss allows the model to learn without ever having access to the ground-truth structure; instead, it is forced to discover a structure which explains the geometry well.

Self-supervision has been applied successfully to train models which can reconstruct a 3D shape from a 2D image without having access to ground-truth 3D shapes at training time [YYY∗ 16]. As far as we are aware, the only instance of self-supervised learning for structure-aware generative models is the work of Tian et al. [TLS∗ 19], which still required supervised pre-training. The challenge with applying self-supervised learning to structure-aware generative models is that the entire training pipeline needs to be end-to-end differentiable in order to train via gradient-based methods, but generating structures inherently requires non-differentiable, discrete decisions. Sometimes it is possible to learn a smooth, differentiable approximation to this discrete process (this is the method employed by Tian et al.). It is also possible that advances in reinforcement learning will allow self-supervised structure generation with non-differentiable training pipelines.

Improving the coupling of structure and geometry. As mentioned in Section 3, there are two parts to a representation: the geometry of individual structural atoms and the structure that combines them. We surveyed the different possible representations for each, and different generative models for them. But we have treated them as separate components, when they are in actuality coupled: structure influences the geometry and vice versa. Most prior work largely ignores this coupling and trains separate generative models for the structure vs. the fine-grained geometry of individual atoms. One recent work recognizes the necessity of this coupling and defines a structure-aware generative model that attempts to better couple the two [WWL∗ 19]. Another line of research tries to impose structural constraints (symmetries) on implicit surface representations by learning a structured set of local deep implicit functions [GCV∗ 19, GCS∗ 19]. Such structured implicit functions ensure both accurate surface reconstruction and structural correctness. Overall, this direction of research is still under-explored.

Modeling the coupling of geometry and appearance. The generative models we have surveyed generate geometric models without materials or textures, but these are critical components of actual objects and scenes, and they are strongly coupled to the structure and geometry (i.e. structure and geometry provide strong cues to material and texture appearance). There is some work on inferring materials for parts of objects [Arj12, LAK∗ 18] and for objects in scenes [CXY∗ 15], as well as more recent explorations on generating colored and textured 3D objects [CCS∗ 18, ZZZ∗ 18]. In general, however, this area is also under-explored.

Modeling object and scene style. Material and texture appearance contributes strongly to an object or scene's perceived “style.” State-of-the-art generative models of 2D images have developed powerful facilities for controlling the style of their synthesized output [KLA19]; in comparison, 3D generative models lag behind. Style can be a function of both geometry (e.g. a sleek modern chair vs. an ornate antique chair) as well as structure (e.g. a compact, cozy living space vs. an airy, open one).

One form of style that is worth mentioning especially is “messiness.” Computer graphics has historically suffered from producing imagery that looks ‘too perfect’ and thus appears obviously fake. In the realm of artist-created 3D content, this problem has largely been solved via more photorealistic rendering algorithms and higher-detail modeling and texturing tools. However, it is still a problem with content created by learned generative models. It is particularly problematic for 3D scenes: generating scenes which appear ‘cluttered’ in a naturalistic way is extremely difficult and still an open problem.

Modeling object and scene functionality. Existing generative models focus on visual plausibility: that is, they attempt to generate 3D content that looks convincingly like a particular type of entity (e.g. a chair). However, the content being generated by these models increasingly needs to do more than just look good: it needs to be interacted with in a virtual environment, or even physically fabricated and brought into the real world. Thus, we need generative models that create functionally plausible 3D content, too. Functional plausibility can have a variety of meanings, ranging from ensuring that generated objects are equipped with expected affordances (e.g. the drawers of a generated desk can open), to ensuring that generated scenes support desired activities (e.g. that the living room of a five-bedroom house contains enough seating, arranged appropriately, to allow the whole family to watch television), to ensuring that synthesized content is physically realizable (i.e. it will not collapse under gravity or under typical external loads placed upon it).

9. Author Bios

Siddhartha Chaudhuri is a Senior Research Scientist in the Creative Intelligence Lab at Adobe Research, and Assistant Professor (on leave) of Computer Science and Engineering at IIT Bombay. He obtained his Ph.D. from Stanford University, and his undergraduate degree from IIT Kanpur. He subsequently did postdoctoral research at Stanford and Princeton, and taught for a year at Cornell. Siddhartha’s work combines geometric analysis, machine learning, and UI innovation to make sophisticated 3D geometric modeling accessible even to non-expert users. He also studies foundational problems in geometry processing (retrieval, segmentation, correspondences) that arise from this pursuit. His research themes include probabilistic assembly-based modeling, semantic attributes for design, generative neural networks for 3D structures, and other applications of deep learning to 3D geometry processing. He is the original author of the commercial 3D modeling tool Adobe Fuse, and has taught tutorials on data-driven 3D design
(SIGGRAPH Asia 2014), shape “semantics” (ICVGIP 2016), and structure-aware 3D generation (Eurographics 2019).

Daniel Ritchie is an Assistant Professor of Computer Science at Brown University. He received his PhD from Stanford University, advised by Pat Hanrahan and Noah Goodman. His research sits at the intersection of computer graphics and artificial intelligence, where he is particularly interested in data-driven methods for designing, synthesizing, and manipulating visual content. In the area of generative models for structured 3D content, he co-authored the first data-driven method for synthesizing 3D scenes, as well as the first method applying deep learning to scene synthesis. He has also worked extensively on applying techniques from probabilistic programming to procedural modeling problems, including learning procedural modeling programs from examples. In related work, he has developed systems for inferring generative graphics programs from unstructured visual inputs such as hand-drawn sketches.

Jiajun Wu is an Acting Assistant Professor at the Department of Computer Science, Stanford University, and will join the department as an Assistant Professor in Summer 2020. He received his Ph.D. and S.M. in Electrical Engineering and Computer Science, both from the Massachusetts Institute of Technology, and his B.Eng. from Tsinghua University. His research interests lie in the intersection of computer vision (in particular 3D vision), computer graphics, machine learning, and computational cognitive science. On topics related to this STAR, he has worked extensively on generating 3D shapes using modern deep learning methods such as generative adversarial nets. He has also developed novel neural program synthesis algorithms for explaining structures within 3D shapes and scenes, and applied them in downstream tasks such as shape and scene editing.

Kai (Kevin) Xu is an Associate Professor at the School of Computer Science, National University of Defense Technology, where he received his Ph.D. in 2011. He conducted visiting research at Simon Fraser University (2008-2010) and Princeton University (2017-2018). His research interests include geometry processing and geometric modeling, especially data-driven approaches to the problems in those directions, as well as 3D vision and its robotic applications. He has published 70+ research papers, including 20+ SIGGRAPH/TOG papers. He organized two SIGGRAPH Asia courses and a Eurographics STAR tutorial, both on data-driven shape analysis and processing. He is currently serving on the editorial board of ACM Transactions on Graphics, Computer Graphics Forum, Computers & Graphics, and The Visual Computer. Kai has made several major contributions to structure-aware 3D shape analysis and modeling with data-driven approaches, and recently with deep learning methods.

Hao (Richard) Zhang is a professor in the School of Computing Science at Simon Fraser University, Canada. He obtained his Ph.D. from the Dynamic Graphics Project (DGP), University of Toronto. His research interests include computational design, fabrication, and creativity. He has published more than 120 papers on these topics. Most relevant to the STAR topic, Richard was one of the co-authors of the first Eurographics STAR on structure-aware shape processing and taught SIGGRAPH courses on the topic. With his collaborators, he has made original and impactful contributions to structural analysis and synthesis of 3D shapes and environments, including co-analysis, hierarchical modeling, semi-supervised learning, topology-varying shape correspondence and modeling, and deep generative models.

References

[ADMG18] Achlioptas P., Diamanti O., Mitliagkas I., Guibas L.: Learning Representations and Generative Models for 3D Point Clouds. In International Conference on Machine Learning (ICML) (2018). 3

[Arj12] Jain A., Thormählen T., Ritschel T., Seidel H.-P.: Material Memex: Automatic material suggestions for 3D objects. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2012). 19

[ASK∗05] Anguelov D., Srinivasan P., Koller D., Thrun S., Rodgers J., Davis J.: SCAPE: Shape Completion and Animation of People. ACM Transactions on Graphics (TOG) 24, 3 (2005), 408–416. 11

[BGB∗17] Balog M., Gaunt A. L., Brockschmidt M., Nowozin S., Tarlow D.: DeepCoder: Learning to Write Programs. In International Conference on Learning Representations (ICLR) (2017). 10

[BLRW16] Brock A., Lim T., Ritchie J. M., Weston N.: Generative and discriminative voxel modeling with convolutional neural networks. In Advances in Neural Information Processing Systems Workshops (NeurIPS Workshop) (2016). 3

[BSW∗18] Balashova E., Singh V. K., Wang J., Teixeira B., Chen T., Funkhouser T.: Structure-aware shape synthesis. In International Conference on 3D Vision (3DV) (2018). 12

[BWS10] Bokeloh M., Wand M., Seidel H.-P.: A connection between partial symmetry and inverse procedural modeling. ACM Transactions on Graphics (TOG) 29, 4 (2010). 12

[CCS∗18] Chen K., Choy C. B., Savva M., Chang A. X., Funkhouser T., Savarese S.: Text2Shape: Generating shapes from natural language by learning joint embeddings. In Asian Conference on Computer Vision (ACCV) (2018). 19

[CFG∗15] Chang A. X., Funkhouser T., Guibas L., Hanrahan P., Huang Q., Li Z., Savarese S., Savva M., Song S., Su H., Xiao J., Yi L., Yu F.: ShapeNet: An Information-Rich 3D Model Repository. arXiv:1512.03012 (2015). 2

[CKGF13] Chaudhuri S., Kalogerakis E., Giguere S., Funkhouser T.: AttribIt: Content creation with semantic attributes. In ACM Symposium on User Interface Software and Technology (UIST) (2013), pp. 193–202. 13

[CKGK11] Chaudhuri S., Kalogerakis E., Guibas L., Koltun V.: Probabilistic Reasoning for Assembly-Based 3D Modeling. ACM Transactions on Graphics (TOG) 30, 4 (2011), 35. 11, 12

[CLS19] Chen X., Liu C., Song D.: Execution-Guided Neural Program Synthesis. In International Conference on Learning Representations (ICLR) (2019). 10

[CXY∗15] Chen K., Xu K., Yu Y., Wang T.-Y., Hu S.-M.: Magic Decorator: Automatic Material Suggestion for Indoor Digital Scenes. In
Toronto, and M.Math. and B.Math degrees from the University of Annual Conference on Computer Graphics and Interactive Techniques
Waterloo, all in computer science. Richard’s research is in com- Asia (SIGGRAPH Asia) (2015). 19
puter graphics with special interests in geometric modeling, analy- [CZ19] C HEN Z., Z HANG H.: Learning Implicit Fields for Generative
sis and synthesis of 3D contents (e.g., shapes and indoor scenes), Shape Modeling. In IEEE Conference on Computer Vision and Pattern
machine learning (e.g., generative models for 3D shapes), as well as Recognition (CVPR) (2019). 3
© 2020 The Author(s)
Computer Graphics Forum © 2020 The Eurographics Association and John Wiley & Sons Ltd.
S. Chaudhuri, D. Ritchie, J. Wu, K. Xu, & H. Zhang / Learning Generative Models of 3D Structures
[DGF∗19] Deprelle T., Groueix T., Fisher M., Kim V., Russell B., Aubry M.: Learning elementary structures for 3D shape generation and matching. In Advances in Neural Information Processing Systems (NeurIPS) (2019).

[DIP∗18] Du T., Inala J. P., Pu Y., Spielberg A., Schulz A., Rus D., Solar-Lezama A., Matusik W.: InverseCSG: Automatic conversion of 3D models to CSG trees. ACM Transactions on Graphics (TOG) 37, 6 (2018).

[DMB08] De Moura L., Bjørner N.: Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (2008), Springer, pp. 337–340.

[DMI∗15] Duvenaud D. K., Maclaurin D., Iparraguirre J., Bombarell R., Hirzel T., Aspuru-Guzik A., Adams R. P.: Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS) (2015).

[DN19] Dai A., Niessner M.: Scan2Mesh: From unstructured range scans to 3D meshes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[DUB∗17] Devlin J., Uesato J., Bhupatiraju S., Singh R., Mohamed A.-r., Kohli P.: RobustFill: Neural Program Learning under Noisy I/O. In International Conference on Machine Learning (ICML) (2017).

[DXA∗19] Dubrovina A., Xia F., Achlioptas P., Shalah M., Groscot R., Guibas L.: Composite shape modeling via latent space factorization. In IEEE International Conference on Computer Vision (ICCV) (2019).

[EMSM∗18] Ellis K., Morales L., Sablé-Meyer M., Solar-Lezama A., Tenenbaum J. B.: Library Learning for Neurally-Guided Bayesian Program Induction. In Advances in Neural Information Processing Systems (NeurIPS) (2018).

[ENP∗19] Ellis K., Nye M., Pu Y., Sosa F., Tenenbaum J., Solar-Lezama A.: Write, execute, assess: Program synthesis with a REPL. In Advances in Neural Information Processing Systems (NeurIPS) (2019).

[ERSLT18] Ellis K., Ritchie D., Solar-Lezama A., Tenenbaum J.: Learning to Infer Graphics Programs from Hand-Drawn Images. In Advances in Neural Information Processing Systems (NeurIPS) (2018).

[FAvK∗14] Fish N., Averkiou M., van Kaick O., Sorkine-Hornung O., Cohen-Or D., Mitra N. J.: Meta-representation of shape families. ACM Transactions on Graphics (TOG) 33, 4 (2014).

[FKS∗04] Funkhouser T., Kazhdan M., Shilane P., Min P., Kiefer W., Tal A., Rusinkiewicz S., Dobkin D.: Modeling by example. ACM Transactions on Graphics (TOG) 23, 3 (2004), 652–663.

[FLS∗15] Fisher M., Li Y., Savva M., Hanrahan P., Niessner M.: Activity-centric scene synthesis for functional 3D scene modeling. ACM Transactions on Graphics (TOG) 34, 6 (2015), 212:1–10.

[FRS∗12] Fisher M., Ritchie D., Savva M., Funkhouser T., Hanrahan P.: Example-based synthesis of 3D object arrangements. ACM Transactions on Graphics (TOG) 31, 6 (2012), 135:1–11.

[FSG17] Fan H., Su H., Guibas L. J.: A point set generation network for 3D object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[FSH11] Fisher M., Savva M., Hanrahan P.: Characterizing structural relationships in scenes using graph kernels. ACM Transactions on Graphics (TOG) 30, 4 (2011), 34.

[FW16] Fan L., Wonka P.: A probabilistic model for exteriors of residential buildings. ACM Transactions on Graphics (TOG) 35, 5 (2016), 155:1–155:13.

[GCS∗19] Genova K., Cole F., Sud A., Sarna A., Funkhouser T.: Deep structured implicit functions. In IEEE International Conference on Computer Vision (ICCV) (2019).

[GCV∗19] Genova K., Cole F., Vlasic D., Sarna A., Freeman W. T., Funkhouser T.: Learning shape templates with structured implicit functions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[GFK∗18] Groueix T., Fisher M., Kim V. G., Russell B. C., Aubry M.: AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

[GFRG16] Girdhar R., Fouhey D. F., Rodriguez M., Gupta A.: Learning a Predictable and Generative Vector Representation for Objects. In European Conference on Computer Vision (ECCV) (2016).

[GG19] Gkioxari G., Malik J., Johnson J.: Mesh R-CNN. In IEEE International Conference on Computer Vision (ICCV) (2019).

[GGH02] Gu X., Gortler S., Hoppe H.: Geometry images. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2002).

[GMR∗08] Goodman N. D., Mansinghka V. K., Roy D. M., Bonawitz K., Tenenbaum J. B.: Church: A language for generative models. In Conference on Uncertainty in Artificial Intelligence (UAI) (2008).

[GPAM∗14] Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.: Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS) (2014).

[GPS17] Gulwani S., Polozov A., Singh R.: Program synthesis. Foundations and Trends in Programming Languages 4, 1-2 (2017), 1–119.

[GS14] Goodman N. D., Stuhlmüller A.: The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2015-12-23.

[Gul11] Gulwani S.: Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL) (2011).

[GYW∗19] Gao L., Yang J., Wu T., Yuan Y.-J., Fu H., Lai Y.-K., Zhang H. R.: SDM-NET: Deep generative network for structured deformable mesh. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2019).

[GZE19] Grover A., Zweig A., Ermon S.: Graphite: Iterative generative modeling of graphs. In International Conference on Machine Learning (ICML) (2019).

[HBL15] Henaff M., Bruna J., LeCun Y.: Deep convolutional networks on graph-structured data. arXiv:1506.05163 (2015).

[HHF∗19] Hanocka R., Hertz A., Fish N., Giryes R., Fleishman S., Cohen-Or D.: MeshCNN: A network with an edge. ACM Transactions on Graphics (TOG) 38, 4 (2019), 90.

[HKM15] Huang H., Kalogerakis E., Marlin B.: Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Computer Graphics Forum (CGF) 34, 5 (2015).

[HSG11] Hwang I., Stuhlmüller A., Goodman N. D.: Inducing Probabilistic Programs by Bayesian Program Merging. arXiv:1110.5667 (2011).

[ISS17] Izadinia H., Shan Q., Seitz S. M.: IM2CAD. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[JHR16] Jaiswal P., Huang J., Rai R.: Assembly-based conceptual 3D modeling with unlabeled components using probabilistic factor graph. Computer-Aided Design 74 (2016), 45–54.

[JLS12] Jiang Y., Lim M., Saxena A.: Learning object arrangements in 3D scenes using human context. In International Conference on Machine Learning (ICML) (2012).

[JTRS12] Jain A., Thormählen T., Ritschel T., Seidel H.-P.: Exploring shape variations by 3D-model decomposition and part-based recombination. Computer Graphics Forum (CGF) 31, 2 (2012).

[KCKK12] Kalogerakis E., Chaudhuri S., Koller D., Koltun V.: A probabilistic model of component-based shape synthesis. ACM Transactions on Graphics (TOG) 31, 4 (2012).

[KF09] Koller D., Friedman N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[KJS07] Kraevoy V., Julius D., Sheffer A.: Model composition from interchangeable components. In Pacific Conference on Computer Graphics and Applications (Pacific Graphics) (2007).

[KKTM15] Kulkarni T. D., Kohli P., Tenenbaum J. B., Mansinghka V.: Picture: A Probabilistic Programming Language for Scene Perception. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

[KL17] Klokov R., Lempitsky V.: Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In IEEE International Conference on Computer Vision (ICCV) (2017).

[KLA19] Karras T., Laine S., Aila T.: A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[KLTZ16] Kermani Z. S., Liao Z., Tan P., Zhang H.: Learning 3D scene synthesis from annotated RGB-D images. Computer Graphics Forum (CGF) 35, 5 (2016), 197–206.

[KMP∗18] Kalyan A., Mohta A., Polozov O., Batra D., Jain P., Gulwani S.: Neural-Guided Deductive Search for Real-Time Program Synthesis from Examples. In International Conference on Learning Representations (ICLR) (2018).

[KW14] Kingma D. P., Welling M.: Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR) (2014).

[KW17] Kipf T. N., Welling M.: Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR) (2017).

[LAK∗18] Lin H., Averkiou M., Kalogerakis E., Kovacs B., Ranade S., Kim V., Chaudhuri S., Bala K.: Learning material-aware local descriptors for 3D shapes. In International Conference on 3D Vision (3DV) (2018).

[Lez08] Lezama A. S.: Program synthesis by sketching. PhD thesis, University of California, Berkeley, 2008.

[LMTW19] Lu S., Mao J., Tenenbaum J. B., Wu J.: Neurally-Guided Structure Inference. In International Conference on Machine Learning (ICML) (2019).

[LNX19] Li J., Niu C., Xu K.: Learning part generation and assembly for structure-aware shape synthesis. arXiv:1906.06693 (2019).

[LPX∗19] Li M., Patil A. G., Xu K., Chaudhuri S., Khan O., Shamir A., Tu C., Chen B., Cohen-Or D., Zhang H.: GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG) 38, 2 (2019), 12:1–12:16.

[LVD∗18] Li Y., Vinyals O., Dyer C., Pascanu R., Battaglia P. W.: Learning deep generative models of graphs. arXiv:1803.03324 (2018).

[LWR∗19] Liu Y., Wu Z., Ritchie D., Freeman W. T., Tenenbaum J. B., Wu J.: Learning to Describe Scenes with Programs. In International Conference on Learning Representations (ICLR) (2019).

[LXC∗17] Li J., Xu K., Chaudhuri S., Yumer E., Zhang H., Guibas L.: GRASS: Generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG) 36, 4 (2017), 52.

[LYF17] Liu J., Yu F., Funkhouser T.: Interactive 3D Modeling with a Generative Adversarial Network. In International Conference on 3D Vision (3DV) (2017).

[LZW∗15] Liu Z., Zhang Y., Wu W., Liu K., Sun Z.: Model-driven indoor scenes modeling from a single image. In Graphics Interface (2015).

[Ma17] Ma R.: Analysis and modeling of 3D indoor scenes. arXiv:1706.09577 (2017).

[MBBV15] Masci J., Boscaini D., Bronstein M., Vandergheynst P.: Geodesic convolutional neural networks on Riemannian manifolds. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshop) (2015).

[MG13] Martinović A., Van Gool L.: Bayesian grammar learning for inverse procedural modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013).

[MGPF∗18] Ma R., Gadi Patil A., Fisher M., Li M., Pirk S., Hua B.-S., Yeung S.-K., Tong X., Guibas L., Zhang H.: Language-driven synthesis of 3D scenes from scene databases. ACM Transactions on Graphics (TOG) 37, 6 (2018), 212:1–212:16.

[MGY∗19a] Mo K., Guerrero P., Yi L., Su H., Wonka P., Mitra N., Guibas L.: StructureNet: Hierarchical graph networks for 3D shape generation. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2019).

[MGY∗19b] Mo K., Guerrero P., Yi L., Su H., Wonka P., Mitra N., Guibas L. J.: StructEdit: Learning structural shape variations. arXiv:1911.11098 (2019).

[MLZ∗16] Ma R., Li H., Zou C., Liao Z., Tong X., Zhang H.: Action-driven 3D indoor scene evolution. ACM Transactions on Graphics (TOG) 35, 6 (2016).

[MON∗19] Mescheder L., Oechsle M., Niemeyer M., Nowozin S., Geiger A.: Occupancy Networks: Learning 3D Reconstruction in Function Space. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[MPJ∗19] Michalkiewicz M., Pontes J. K., Jack D., Baktashmotlagh M., Eriksson A. P.: Deep level sets: Implicit surface representations for 3D shape inference. arXiv:1901.06802 (2019).

[MQCJ18] Murali V., Qi L., Chaudhuri S., Jermaine C.: Neural Sketch Learning for Conditional Program Generation. In International Conference on Learning Representations (ICLR) (2018).

[MSK10] Merrell P., Schkufza E., Koltun V.: Computer-generated residential building layouts. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2010).

[MSL∗11] Merrell P., Schkufza E., Li Z., Agrawala M., Koltun V.: Interactive Furniture Layout Using Interior Design Guidelines. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2011).

[MSSH14] Majerowicz L., Shamir A., Sheffer A., Hoos H. H.: Filling your shelves: Synthesizing diverse style-preserving artifact arrangements. IEEE Transactions on Visualization and Computer Graphics (TVCG) 20, 11 (2014), 1507–1518.

[MTG∗13] Menon A., Tamuz O., Gulwani S., Lampson B., Kalai A.: A Machine Learning Framework for Programming by Example. In International Conference on Machine Learning (ICML) (2013).

[MWH∗06] Müller P., Wonka P., Haegler S., Ulmer A., Van Gool L.: Procedural modeling of buildings. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2006).

[MZC∗19] Mo K., Zhu S., Chang A. X., Yi L., Tripathi S., Guibas L. J., Su H.: PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[MZWVG07] Müller P., Zeng G., Wonka P., Van Gool L.: Image-based procedural modeling of facades. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2007).

[NAW∗19] Nandi C., Anderson A., Willsey M., Wilcox J. R., Darulova E., Grossman D., Tatlock Z.: Using e-graphs for CAD parameter inference. arXiv:1909.12252 (2019).
[NLX18] Niu C., Li J., Xu K.: Im2Struct: Recovering 3D shape structure from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

[PFS∗19] Park J. J., Florence P., Straub J., Newcombe R., Lovegrove S.: DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[PL96] Prusinkiewicz P., Lindenmayer A.: The Algorithmic Beauty of Plants. Springer-Verlag, Berlin, Heidelberg, 1996.

[PM01] Parish Y. I. H., Müller P.: Procedural modeling of cities. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2001).

[PMS∗17] Parisotto E., Mohamed A.-r., Singh R., Li L., Zhou D., Kohli P.: Neuro-Symbolic Program Synthesis. In International Conference on Learning Representations (ICLR) (2017).

[PW14] Paige B., Wood F.: A Compilation Target for Probabilistic Programming Languages. In International Conference on Machine Learning (ICML) (2014).

[QSMG17] Qi C. R., Su H., Mo K., Guibas L. J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[QYSG17] Qi C. R., Yi L., Su H., Guibas L. J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NeurIPS) (2017).

[RBSB18] Ranjan A., Bolkart T., Sanyal S., Black M. J.: Generating 3D Faces Using Convolutional Mesh Autoencoders. In European Conference on Computer Vision (ECCV) (2018).

[RDF16] Reed S., De Freitas N.: Neural Programmer-Interpreters. In International Conference on Learning Representations (ICLR) (2016).

[RJT18] Ritchie D., Jobalia S., Thomas A.: Example-based authoring of procedural modeling programs with structural and continuous variability. In Annual Conference of the European Association for Computer Graphics (EuroGraphics) (2018).

[RLGH15] Ritchie D., Lin S., Goodman N. D., Hanrahan P.: Generating Design Suggestions under Tight Constraints with Gradient-based Probabilistic Programming. In Annual Conference of the European Association for Computer Graphics (EuroGraphics) (2015).

[RMGH15] Ritchie D., Mildenhall B., Goodman N. D., Hanrahan P.: Controlling procedural modeling programs with stochastically-ordered sequential Monte Carlo. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–11.

[ROUG17] Riegler G., Osman Ulusoy A., Geiger A.: OctNet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[RWaL19] Ritchie D., Wang K., An Lin Y.: Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019).

[SCH∗16] Savva M., Chang A. X., Hanrahan P., Fisher M., Nießner M.: PiGraphs: Learning interaction snapshots from observations. ACM Transactions on Graphics (TOG) 35, 4 (2016).

[SGF16] Sharma A., Grau O., Fritz M.: VConv-DAE: Deep Volumetric Shape Learning without Object Labels. In European Conference on Computer Vision Workshops (ECCV Workshop) (2016).

[SGL∗18] Sharma G., Goyal R., Liu D., Kalogerakis E., Maji S.: CSGNet: Neural Shape Parser for Constructive Solid Geometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

[SHW∗17] Soltani A. A., Huang H., Wu J., Kulkarni T. D., Tenenbaum J. B.: Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[SKAG15] Sung M., Kim V. G., Angst R., Guibas L.: Data-driven structural priors for shape completion. ACM Transactions on Graphics (TOG) 34, 6 (2015), 175:1–175:11.

[SLNM11] Socher R., Lin C. C., Ng A. Y., Manning C. D.: Parsing natural scenes and natural language with recursive neural networks. In International Conference on Machine Learning (ICML) (2011).

[SMKL15] Su H., Maji S., Kalogerakis E., Learned-Miller E. G.: Multi-view convolutional neural networks for 3D shape recognition. In IEEE International Conference on Computer Vision (ICCV) (2015).

[SPW∗13] Socher R., Perelygin A., Wu J., Chuang J., Manning C., Ng A., Potts C.: Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP) (2013).

[SSK∗17] Sung M., Su H., Kim V. G., Chaudhuri S., Guibas L.: ComplementMe: Weakly-supervised component suggestions for 3D modeling. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2017).

[TDB17] Tatarchenko M., Dosovitskiy A., Brox T.: Octree Generating Networks: Efficient Convolutional Architectures for High-Resolution 3D Outputs. In IEEE International Conference on Computer Vision (ICCV) (2017).

[TGLX18] Tan Q., Gao L., Lai Y.-K., Xia S.: Variational Autoencoders for Deforming 3D Mesh Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

[TLL∗11] Talton J. O., Lou Y., Lesser S., Duke J., Měch R., Koltun V.: Metropolis Procedural Modeling. ACM Transactions on Graphics (TOG) 30, 2 (2011).

[TLS∗19] Tian Y., Luo A., Sun X., Ellis K., Freeman W. T., Tenenbaum J. B., Wu J.: Learning to Infer and Execute 3D Shape Programs. In International Conference on Learning Representations (ICLR) (2019).

[TSG∗17] Tulsiani S., Su H., Guibas L. J., Efros A. A., Malik J.: Learning Shape Abstractions by Assembling Volumetric Primitives. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[TYK∗12] Talton J. O., Yang L., Kumar R., Lim M., Goodman N. D., Mech R.: Learning design patterns with Bayesian grammar induction. In ACM Symposium on User Interface Software and Technology (UIST) (2012).

[VDOKK16] Van Den Oord A., Kalchbrenner N., Kavukcuoglu K.: Pixel recurrent neural networks. In International Conference on Machine Learning (ICML) (2016).

[WFT∗19] Wu W., Fu X.-M., Tang R., Wang Y., Qi Y.-H., Liu L.: Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG) 38, 6 (2019).

[WLG∗17] Wang P.-S., Liu Y., Guo Y.-X., Sun C.-Y., Tong X.: O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics (TOG) 36, 4 (2017), 72.

[WLW∗19] Wang K., Lin Y.-A., Weissmann B., Savva M., Chang A. X., Ritchie D.: PlanIt: Planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG) 38, 4 (2019), 132.

[WSCR18] Wang K., Savva M., Chang A. X., Ritchie D.: Deep Convolutional Priors for Indoor Scene Synthesis. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2018).

[WSG11] Wingate D., Stuhlmüller A., Goodman N. D.: Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In International Conference on Artificial Intelligence and Statistics (AISTATS) (2011).

[WSH∗18] Wang H., Schor N., Hu R., Huang H., Cohen-Or D., Huang H.: Global-to-local generative model for 3D shapes. ACM Transactions on Graphics (TOG) 37, 6 (2018), 214:1–214:10.

[WSK∗15] Wu Z., Song S., Khosla A., Yu F., Zhang L., Tang X., Xiao J.: 3D ShapeNets: A Deep Representation for Volumetric Shapes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

[WTB∗18] Wang C., Tatwawadi K., Brockschmidt M., Huang P.-S., Mao Y., Polozov O., Singh R.: Robust Text-to-SQL Generation with Execution-Guided Decoding. arXiv:1807.03100 (2018).

[WWL∗19] Wu Z., Wang X., Lin D., Lischinski D., Cohen-Or D., Huang H.: SAGNet: Structure-aware generative network for 3D-shape modeling. ACM Transactions on Graphics (TOG) 38, 4 (2019), 91:1–91:14.

[WXL∗11] Wang Y., Xu K., Li J., Zhang H., Shamir A., Liu L., Cheng Z., Xiong Y.: Symmetry hierarchy of man-made objects. Computer Graphics Forum (CGF) (2011).

[WZL∗18] Wang N., Zhang Y., Li Z., Fu Y., Liu W., Jiang Y.-G.: Pixel2Mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision (ECCV) (2018).

[WZX∗16] Wu J., Zhang C., Xue T., Freeman W. T., Tenenbaum J. B.: Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In Advances in Neural Information Processing Systems (NeurIPS) (2016).

[XCF∗13] Xu K., Chen K., Fu H., Sun W.-L., Hu S.-M.: Sketch2Scene: Sketch-based co-retrieval and co-placement of 3D models. ACM Transactions on Graphics (TOG) 32, 4 (2013), 123.

[XMZ∗14] Xu K., Ma R., Zhang H., Zhu C., Shamir A., Cohen-Or D., Huang H.: Organizing heterogeneous scene collection through contextual focal points. ACM Transactions on Graphics (TOG) 33, 4 (2014), Article 35.

[YCHK15] Yumer M. E., Chaudhuri S., Hodgins J. K., Kara L. B.: Semantic shape editing using deformation handles. ACM Transactions on Graphics (TOG) 34 (2015).

[YFST18] Yang Y., Feng C., Shen Y., Tian D.: FoldingNet: Point Cloud Auto-encoder via Deep Grid Deformation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

[YKC∗16] Yi L., Kim V. G., Ceylan D., Shen I.-C., Yan M., Su H., Lu C., Huang Q., Sheffer A., Guibas L.: A scalable active framework for region annotation in 3D shape collections. In Annual Conference on Computer Graphics and Interactive Techniques Asia (SIGGRAPH Asia) (2016).

[YSS∗17] Yi L., Shao L., Savva M., Huang H., Zhou Y., Wang Q., Graham B., Engelcke M., Klokov R., Lempitsky V. S., Gan Y., Wang P., Liu K., Yu F., Shui P., Hu B., Zhang Y., Li Y., Bu R., Sun M., Wu W., Jeong M., Choi J., Kim C., Geetchandra A., Murthy N., Ramu B., Manda B., Ramanathan M., Kumar G., Preetham P., Srivastava S., Bhugra S., Lall B., Häne C., Tulsiani S., Malik J., Lafer J., Jones R., Li S., Lu J., Jin S., Yu J., Huang Q., Kalogerakis E., Savarese S., Hanrahan P., Funkhouser T. A., Su H., Guibas L. J.: Large-scale 3D shape reconstruction and segmentation from ShapeNet Core55. arXiv:1710.06104 (2017).

[YYR∗18] You J., Ying R., Ren X., Hamilton W. L., Leskovec J.: GraphRNN: A deep generative model for graphs. In International Conference on Machine Learning (ICML) (2018).

[YYT∗11] Yu L.-F., Yeung S. K., Tang C.-K., Terzopoulos D., Chan T. F., Osher S.: Make it home: Automatic optimization of furniture arrangement. ACM Transactions on Graphics (TOG) 30, 4 (2011), 86:1–12.

[YYT16] Yu L.-F., Yeung S. K., Terzopoulos D.: The Clutterpalette: An interactive tool for detailing indoor scenes. IEEE Transactions on Visualization and Computer Graphics (TVCG) 22, 2 (2016), 1138–1148.

[YYW∗12] Yeh Y.-T., Yang L., Watson M., Goodman N. D., Hanrahan P.: Synthesizing Open Worlds with Constraints Using Locally Annealed Reversible Jump MCMC. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH) (2012).

[YYY∗16] Yan X., Yang J., Yumer E., Guo Y., Lee H.: Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems (NeurIPS) (2016).

[ZHG∗16] Zhao X., Hu R., Guerrero P., Mitra N. J., Komura T.: Relationship templates for creating scene variations. ACM Transactions on Graphics (TOG) 35, 6 (2016), 1–13.

[ZW18] Zohar A., Wolf L.: Automatic Program Synthesis of Long Programs with a Learned Garbage Collector. In Advances in Neural Information Processing Systems (NeurIPS) (2018).

[ZWK19] Zhou Y., While Z., Kalogerakis E.: SceneGraphNet: Neural message passing for 3D indoor scene augmentation. In IEEE International Conference on Computer Vision (ICCV) (2019).

[ZXC∗18] Zhu C., Xu K., Chaudhuri S., Yi R., Zhang H.: SCORES: Shape composition with recursive substructure priors. ACM Transactions on Graphics (TOG) 37, 6 (2018), 211:1–211:14.

[ZYM∗18] Zhang Z., Yang Z., Ma C., Luo L., Huth A., Vouga E., Huang Q.: Deep generative modeling for scene synthesis via hybrid representations. arXiv:1808.02084 (2018).

[ZYY∗17] Zou C., Yumer E., Yang J., Ceylan D., Hoiem D.: 3D-PRNN: Generating Shape Primitives with Recurrent Neural Networks. In IEEE International Conference on Computer Vision (ICCV) (2017).

[ZZZ∗18] Zhu J.-Y., Zhang Z., Zhang C., Wu J., Torralba A., Tenenbaum J., Freeman B.: Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems (NeurIPS) (2018).