
Research Collection

Master Thesis

Sketch-Based 4D Prototyping for Smoke Simulations

Author(s):
Huang, Xingchang

Publication Date:
2019

Permanent Link:
https://doi.org/10.3929/ethz-b-000392356

Rights / License:
In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more
information please consult the Terms of use.

ETH Library
Sketch-Based 4D Prototyping for
Smoke Simulations

Xingchang Huang

Master Thesis
November 2019

Supervisors: Byungsoo Kim


CGL Simulation & Animation Group
Prof. Dr. Markus Gross
Abstract
In this thesis, we address the problem of sketch-based 4D prototyping for smoke simulations,
aiming to provide a more intuitive and interactive tool for artists and users to animate 3D smoke
sequences in time based on key-framed sketches. As sketch-based 3D modelling is considered
an ill-posed and ambiguous problem, we focus on data-driven and machine learning approaches
to solve it. We propose several deep neural networks to encode sketches, reconstruct single-frame
density fields, and estimate velocity fields for multi-frame interpolation consistent with fluid
behavior. We further introduce a differentiable sketch renderer to improve both the consistency
between reconstructed density fields and input sketches and the quality of the reconstruction.
Experiments show that our proposed model can synthesize plausible density fields from
multi-view sketches as well as temporally coherent interpolated densities from sparse key-framed
sketches. We also evaluate our model on various smoke simulation scenarios and on real human
sketches. The results show that our model works in several smoke scenes with different
parameterizations and generalizes well to line drawing styles that differ from the training sketches.

Zusammenfassung
In this thesis, we deal with sketch-based 4D prototyping for smoke simulations. The goal is to
create a more user-friendly tool for artists and other users to animate 3D smoke sequences whose
timing is controlled by key-framed sketches.
Because sketch-based 3D modelling is regarded as a mathematically ill-posed and ambiguous
problem, we concentrate on finding a solution with data-driven and machine learning approaches.
Our idea is to use several deep neural networks to encode the sketches and to reconstruct
single-frame density fields and velocity fields for multi-frame interpolation, consistent with fluid
behavior.
Furthermore, we introduce a differentiable sketch renderer to improve the consistency between
the reconstructed density fields and the input, as well as the quality of the reconstruction.
The experiments show that our proposed model can synthesize plausible density fields both from
multi-view sketches and as temporally coherent interpolated densities from a few key-framed
sketches.
During our experiments we evaluated our system on various smoke simulation scenarios as well
as on hand-drawn sketches.
The results show that our model works in many different smoke scenes with different parameters
and, moreover, generalizes well to other drawing styles that deviate from the training sketches.

Acknowledgement
First, I would like to express my sincere thanks to my supervisors: Byungsoo Kim, Dr. Guillaume
Cordonnier, Dr. Vinicius Da Costa De Azevedo, Jingwei Tang, Dr. Barbara Solenthaler
and Prof. Dr. Markus Gross, for your support and advice over the past year. You led me
into the world of computer graphics and machine learning. Although this project involved many
largely unexplored areas and uncertainties, you gave me inspiring thoughts during our meetings
as well as practical engineering guidance. This thesis would have been an impossible task
without the support of all of you.
I would also like to thank the artist Maurizio Nitti from Disney Research Studio, for sharing
his ideas from an artistic viewpoint and providing examples for evaluating and improving our
work.
I would like to express my thanks to all my dear friends, for their help and support that led me
through my hardest times at ETH. I would like to give a huge thanks to my muimui. You always
give me unconditional support and patience, and believe that I can make it.
Last but not least, I am grateful to my parents for your unconditional love and support. I would
never have become the person I am today without you. You supported my studies at ETH and
give me the warmest harbour whenever I need it.

Master Thesis

Sketch-Based Proto-Typing for Smoke Simulations

Project Description
Designing artistic fluid simulations is a time-consuming process in which iterative parameter
tuning is required to obtain the desired output from the input concept art. This motivates the
need for a handy prototyping tool for smoke simulations, for instance sketch-based 4D
prototyping.

In the thesis, we will improve an existing sketch-based 3D density field reconstruction algorithm
and propose a method for predicting the next few frames of physically plausible smoke motion
from guidance arrows in an input sketch.

Tasks
1) Our smoke density reconstruction should be general enough to handle various types of
synthetic sketches as well as real sketches from artists. We will come up with 2-3
general scenarios of smoke simulations. Our method should be based on deep-learning
techniques, and we will train the network on an augmented sketch dataset resulting from
several methods for synthetic line generation. This should be validated on human
sketches.
2) The second step is to predict the next few frames based on input control primitives in
sketches. We call them guidance arrows, and we will propose a physically plausible
approach for the predictions.

Skills
Knowledge of Machine Learning and physically-based simulation

Remarks
A written report and an oral presentation conclude the work. The thesis will be supervised by
Byungsoo Kim, CGL-Simulation & Animation Group and Prof. Markus Gross. Start: asap.

Contact
Thesis coordinator: cgl-thesis@inf.ethz.ch
Declaration of originality
The signed declaration of originality is a component of every semester paper, Bachelor’s thesis,
Master’s thesis and any other degree paper undertaken during the course of studies, including the
respective electronic versions.

Lecturers may also require a declaration of originality for other written papers compiled for their
courses.
__________________________________________________________________________

I hereby confirm that I am the sole author of the written work here enclosed and that I have compiled it
in my own words. Parts excepted are corrections of form and content by the supervisor.

Title of work (in block letters):

Sketch-Based 4D Prototyping for Smoke Simulations

Authored by (in block letters):


For papers written by groups the names of all authors are required.

Name(s): Huang
First name(s): Xingchang

With my signature I confirm that


− I have committed none of the forms of plagiarism described in the ‘Citation etiquette’ information
sheet.
− I have documented all methods, data and processes truthfully.
− I have not manipulated any data.
− I have mentioned all persons who were significant facilitators of the work.

I am aware that the work may be screened electronically for plagiarism.

Place, date: Zurich, 29.10.2019

Signature(s)

For papers written by groups the names of all authors are required. Their signatures collectively
guarantee the entire content of the written paper.
Contents

List of Figures

List of Tables

1. Introduction

2. Related Work
2.1. Fluid Simulation
2.1.1. Navier-Stokes Equations
2.1.2. Smoke Simulation & Control
2.2. Sketch-Based Modelling
2.3. Machine Learning
2.3.1. Basics
2.3.2. Machine Learning for Sketch-Based Modelling
2.3.3. Machine Learning for Fluids

3. Overview
3.1. Problem Formulation
3.2. Single-frame Density Reconstruction
3.3. Multi-frame Density Interpolation
3.4. Notations

4. Single-Frame Density Reconstruction
4.1. Baseline Model
4.1.1. Variational AutoEncoder
4.1.2. Sketch Encoder Network
4.1.3. Density Decoder Network
4.2. Improving the Baseline
4.2.1. Generative Adversarial Nets
4.2.2. Differentiable Renderer
4.3. Loss Functions
4.4. Self-supervised Fine-tuning
4.5. Summary

5. Multi-Frame Density Interpolation
5.1. Baseline Model
5.1.1. Latent Space Interpolation Network
5.2. Improving the Baseline
5.2.1. Velocity Decoder Network
5.3. Loss Functions
5.4. Smoke Simulations via Sketches
5.4.1. Linear Interpolation
5.4.2. MLP-based Interpolation
5.4.3. Re-Simulation
5.5. Summary

6. Experiments
6.1. Data Generation
6.1.1. 3D Smoke Simulations
6.1.2. Sketch Generation
6.1.3. Smoke Rendering
6.2. Evaluation Metrics
6.3. Implementation & Training
6.4. Single-Frame Density Reconstruction
6.4.1. Input Sketches
6.4.2. Network Architecture
6.4.3. Loss Functions
6.5. Multi-Frame Density Interpolation
6.5.1. Models
6.5.2. Loss Functions
6.5.3. Comparisons
6.6. Performance Analysis
6.7. User Study
6.7.1. Human Sketches Acquisition
6.7.2. Results

7. Conclusions and Future Work
7.1. Conclusions
7.2. Limitations and Discussion
7.3. Future Work

A. Supplemental Material
A.1. Software
A.2. More Results

Bibliography
List of Figures

2.1. Screenshot from [TMPS03] showing how they control the smoke shapes via key-frames.

2.2. From left to right: examples of image-based artistic control for smoke simulations and sketch-based control for fluid flow design, proposed in [KAGS19] and [HXF+19], respectively.

2.3. Examples of sketch-based 3D modelling from [OSSJ09].

2.4. An example of a Multi-Layer Perceptron with one hidden layer, from scikit-learn [BLB+13].

2.5. Different 3D representations related to our work, reconstructed from 2D sketches/images.

2.6. Examples of machine learning applied to 3D fluid flow super-resolution, proposed in [XFCT18][WXCT19].

3.1. Overview of our method. We begin with a differentiable renderer that renders two-view sketches from each density frame. For the single-frame density reconstruction (Section 4), the multi-view sketch encoder Fs2z first generates a latent code for the two-view sketches, and the latent code is decoded into the corresponding density field by a density decoder Fz2d. For multi-frame density interpolation (Section 5), we use a mapping network Fz2ẑ and a velocity decoder Fẑ2v to estimate the velocity fields between two sketch key-frames. Our networks are trained separately for density and velocity reconstruction. At inference time, we reconstruct an initial density frame based on the multi-view sketches at the first key-frame and re-simulate (Section 5.4.3) the densities based on the reconstructed velocity fields between key-frames.

4.1. The architecture for reconstructing a single-frame density field from multi-view sketches. Specifically, we use a two-view setting where the weights of the sketch encoder [Fs2f, Ff2z] are shared among the single-view sketches, as is the pre-trained feature extractor DenseNet-121. Fz2d represents the density decoder that decodes the multi-view sketch latent code (in green) into the reconstructed density field.

4.2. Architecture of the residual blocks (ResBlock) and decoder blocks (DecBlock) used in our 3D voxel decoder. We use two convolutional layers (considering the memory) before the element-wise skip addition, followed by nearest-neighbour interpolation for upsampling.

4.3. Architecture of the DenseNet components used in our sketch encoder and 3D voxel decoder, where k represents the growth rate in the dense blocks and a dense block applies batch normalization, then the activation function, then convolution. We find that DenseNet-121 [HLVDMW17] gives slightly better sketch feature representations with even fewer parameters than ResNet-18 [HZRS16]. Image courtesy of the original DenseNet paper [HLVDMW17].

4.4. Architecture of the sketch encoder and voxel decoder.

4.5. Examples of generated sketches and intermediate results (e.g., shading, normal and physics-based rendered images) from two datasets.

4.6. Our renderer can also simulate the progression from simple contours to more detailed, shaded sketches by simply adjusting the threshold values, without any pre-processing step.

4.7. Different thickness values c for the equation (c · x + 1) · e^{−c·x}.

5.1. Learnt manifold in single-frame density reconstruction and latent codes of St, St+T generated from Fs2z. There are three paths for estimating intermediate latent codes from two key-frames. Path 1 directly estimates the latent vector, while paths 2 and 3 learn the residual function between the linearly-interpolated latent code, the previous latent code and the learnt manifold, respectively.

5.2. The architecture of our velocity field prediction model, demonstrating latent space interpolation and velocity field estimation based on two key-frames and consecutive frames. Specifically, multi-view sketches of two key-frames are shown in this graph, and similarly the weights of the sketch encoder [Fs2f, Ff2z] are shared among the sketches of all frames. Fẑ2v represents the velocity decoder, which implicitly estimates the stream function with a curl operator after the final output. We introduce an additional mapping network Fz2ẑ to map the concatenated latent codes of the two key-frames (in green, consistent with Figure 4.1) to all intermediate latent codes and to estimate the velocity fields between consecutive latent codes (ẑα, ẑα+1) together with Fẑ2v.

6.1. Training and test samples from the smoke plume with obstacle dataset and its higher-resolution version.

6.2. Training and test samples from the smoke inflow and buoyancy dataset. The inflow velocity and buoyancy vary across scenes.

6.3. Comparisons between a suggestive contour (left), our rendered sketch with toon shading (middle) and physics-based rendered images by Mitsuba (right) from two scenes. The difference is clear: the left one looks like a digital sketch with sharper edges, while ours looks more like a pencil sketch. Both types of sketches accurately describe the contours compared with the physically-based rendered references.

6.4. Convergence plot for single-frame 3D density reconstruction over 185 epochs. The average L1 loss, the PSNR of the 3D reconstructed density field and the PSNR of sketches generated from the reconstructed density field on the test set are plotted using TensorBoard [ABC+16].

6.5. Training curves of the L1 loss for velocity field reconstruction.

6.6. Comparison of two different encoder architectures, ResNet-18 and DenseNet-18, on the test scene of the smoke plume with obstacle dataset.

6.7. Comparison between ResNet- and DenseNet-like decoders for density reconstruction in terms of the MAE metric on the smoke plume with obstacle test set.

6.8. Qualitative results of slice views evolving in time. The top row shows the reconstructed frames with the lower λKL = 0.001 and the bottom row shows the results with the higher value 0.1.

6.9. Linear interpolation in the latent space between two key-frames taken from the last frames of two consecutive simulations.

6.10. Qualitative comparisons of training with and without LSSIM3D on the smoke plume and obstacle test set.

6.11. Comparisons of training with and without Lsketch on the smoke plume and obstacle test set, in terms of density PSNR. The sketch loss does not degrade the density reconstruction and helps to improve the reconstructed sketch quality.

6.12. Qualitative comparisons of reconstructed densities and the corresponding sketches.

6.13. From left to right: ground truth density, and reconstructed densities trained with a high sketch loss weight λsketch = 10 and a low weight λsketch = 0.1, respectively.

6.14. Quantitative comparisons of training with and without LMS-SSIM2D; only the density MAE decreases while the other metrics are almost the same.

6.15. An example of a reconstructed density field with (middle) and without (left-most) self-supervised fine-tuning. The right-most one is the ground truth. The zoomed-in view in the second row shows that fine-tuning can correct the output structure and bring the resulting density field closer to the ground truth.

6.16. From the front to the side view by rotating 0, 30, 60 and 90 degrees, respectively. No artifacts are observed from viewpoints between the front and side views.

6.17. An example of training with and without a GAN for an additional 175 epochs. From left to right: reconstructed density trained without and with the GAN, and the ground truth.

6.18. Examples of abrupt changes using MLP-based interpolation between two consecutive density frames.

6.19. From left to right: the full objective with λz = 0 and λz = 1, respectively. The density field of the 140th frame advected from the 110th frame is shown using our approach.

6.20. From left to right: ground truth density, and advected density using the velocity-based model trained with the advection loss Ladvect and without it, respectively.

6.21. Advected density from the 10th to the 150th frame under the 20-frame interpolation setting. Every 10th density frame is displayed.

6.22. Multi-view evaluation of the advected density using our velocity-based approach. No artifacts are observed in the reconstructed density.

6.23. Qualitative comparisons of the density-based and velocity-based methods.

6.24. Qualitative comparisons of the reconstructed density field at the 140th frame between the baseline model, the velocity-based model and the ground truth.

6.25. Comparisons of the rendered 190th frame in the smoke inflow and buoyancy scene between the density-based and velocity-based models. From left to right: the density-based baseline model, our velocity-based model and the ground truth.

6.26. Comparisons of sketches rendered via our renderer and real sketches from a novice user.

6.27. Reconstructed density field of the rendered 180th frame. From left to right: user sketch, Redge and Rblend as input, respectively.

6.28. The artist's input sketches and the reconstructed density fields using our model. As the artist only sketched the front viewpoint, we copy the sketches to the side viewpoint to evaluate our model.

A.1. Results from the higher-resolution smoke plume and obstacle dataset and the vertical smoke plume dataset.

A.2. Reconstructed simulation of the 110th to 150th frames of the smoke inflow and buoyancy dataset.
List of Tables
3.1. Notations of hyper-parameters, symbols and operators used in this thesis.

4.1. Our 2D encoder after extracting features from pre-trained networks, where Layer is the layer name, Ops the operations used in that layer, K the kernel size, S the stride for convolution and P the amount of padding, consistent with [PGC+17]. We further show the number of feature maps and the corresponding output size in the Output Size column and the input layer name for each layer in Input.

4.2. Our 3D voxel decoder with residual blocks incorporated into our baseline network architecture, where Layer is the layer name, Ops the operations used in that layer, K the kernel size, S the stride for convolution and P the amount of padding, consistent with the deep-learning framework we use [PGC+17]. We further show the number of feature maps and the corresponding output size in the Output Size column and the input layer name for each layer in Input.

6.1. Quantitative results for different multi-frame interpolation settings by linear interpolation using different λKL.

6.2. Quantitative results in terms of PSNR on sketches and density fields with a high (10) and a low (0.1) weight for the sketch loss Lsketch.

6.3. Quantitative results of 20-frame interpolation using the baseline model with and without MLP-based interpolation.

6.4. Quantitative results of the reconstructed density with λz = 0 and 1, for the 110th to 150th frames.

6.5. Quantitative results of the reconstructed density with λSSIM3D = 0 and 1.

6.6. Quantitative comparisons of the baseline model and the velocity-based model, for the 110th to 150th frames under the 20-frame interpolation setting.
1. Introduction

Fluid simulation is becoming increasingly important for visual effects in movies and games, and
there is wide interest in how to artistically control and animate smoke in an interactive way.
However, artistic control of fluid simulations is usually difficult with optimization techniques.
In this work, we propose to generate 3D smoke animations controlled by key-framed sketches.
Given multi-view sketches at every few key-frames, our goal is to reconstruct the smoke at each
key-frame and, at the same time, animate the interpolated smoke in a realistic way. Compared
with flow stylization using images [KAGS19], sketch-based smoke simulation systems provide
more freedom and interaction for users to design smoke shape and dynamics.
However, building such a sketch-based 3D smoke simulation system is intrinsically challenging:
it is highly ambiguous and considered an ill-posed problem, because sketches usually contain
only partial information such as the overall shape and contour of the smoke. At the same time,
since sketches carry no density values, it is hard to reconstruct realistic smoke with many
high-frequency details. A further challenge is that we aim to reconstruct a sequence of 3D
density fields from only a few key-framed sketches. This 4D setting is more difficult than
single-frame density estimation, as we need to take temporal consistency into account while
only a few sketch frames are provided. Fluid flow super-resolution also needs to model temporal
consistency, as shown in [XFCT18], but there every low-resolution frame is available, so sparse
inputs are not an issue.
Overall, to solve these problems, we mainly need to handle the following two tasks:
• For single-frame density reconstruction from sketches drawn from multiple viewpoints, we
need not only to synthesize high-quality 3D density fields, but also to keep the reconstructed
fields consistent with what the artists or users draw.
• Second, with only sparse input sketches, we need a way to interpolate the 3D density fields
between them, rather than considering only single-frame 3D reconstruction, and the
interpolated density fields should be temporally consistent across neighbouring frames.
Inspired by [KAT+19] and [DAI+18], we build a network for single-frame density reconstruction,
with a multi-view sketch encoder that encodes sketches into latent features and a density
decoder that decodes the latent features into 3D densities. However, it is not straightforward to
animate the transition of the intermediate 3D densities between reconstructed density key-frames.
One naive approach is to decode all in-between 3D densities from two sketch key-frames at
once, but this quickly hits memory limits when predicting many high-dimensional 3D density
fields. It is therefore better to output one density field at a time, controlled by a low-dimensional
representation, which keeps memory consumption low thanks to such compact representations.
In this thesis, we address the above problems by animating the intermediate 3D density fields
via velocity field estimation in the latent space.
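To make this intended inference flow concrete, the following pseudocode-style Python sketch outlines one possible loop under our latent-space formulation. All callables (encode, decode_density, decode_velocity, advect) are hypothetical placeholders standing in for the networks and the re-simulation step introduced in later chapters, not the actual implementation, and the linear latent interpolation is only one of the interpolation choices discussed in Chapter 5.

# Hypothetical sketch of key-frame-to-key-frame interpolation in latent space.
def interpolate_between_keyframes(sketches_t, sketches_tT, T,
                                  encode, decode_density, decode_velocity, advect):
    z_t = encode(sketches_t)         # latent code of the first key-frame sketches
    z_tT = encode(sketches_tT)       # latent code of the second key-frame sketches
    density = decode_density(z_t)    # initial 3D density field
    frames = [density]
    for alpha in range(1, T + 1):
        # A low-dimensional latent code per frame keeps memory usage small
        # compared with predicting all T intermediate 3D volumes at once.
        z_alpha = ((T - alpha) * z_t + alpha * z_tT) / T
        velocity = decode_velocity(z_alpha)
        density = advect(density, velocity)   # re-simulation / advection step
        frames.append(density)
    return frames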
A third challenge in training the neural networks is how to acquire training data with paired
sketches, density fields and velocity fields. Collecting thousands of hand drawings together with
corresponding density fields from artists would be time-consuming and expensive. Data-driven
approaches such as [LGK+17][WCPM18] therefore use non-photorealistic rendering to generate
synthetic line drawings for 3D shapes. We apply a similar approach to smoke simulation data to
generate a large synthetic dataset. Further, we propose a differentiable way to render sketches
on the fly from density fields only, which allows users to evaluate the difference between inputs
and reconstructions more interactively and to fine-tune the output in a self-supervised way.
To summarize, we introduce a sketch-based modelling system for smoke simulations with
sparse input sketches. The technical contributions of this thesis include:
• A sketch-to-density network architecture that directly converts multi-view sketches into
3D density fields.
• A velocity-based model that interpolates intermediate density fields from only sparse
key-framed sketches as input, yielding temporally coherent interpolated density fields.
• A differentiable method to render sketches on the fly, which can be used to optimize density
field reconstruction and to fine-tune the density decoder at test time to fit unseen input
sketches.
We evaluate our proposed method on different smoke scenes for single-frame 3D density
reconstruction and multi-frame density interpolation. Our method is also tested with human
sketches drawn in a line drawing style different from the training sketches, demonstrating its
potential to generalize to real human sketches.

2. Related Work

2.1. Fluid Simulation


Fluid simulation is one of the important research topics in physically-based simulation and is
pervasively used for visual effects in movies. Here we cover some basics of how fluid simulation
data is generated, as well as smoke simulation and control, which are most related to our work.

2.1.1. Navier-Stokes Equations

Fluid simulation is modeled with the famous incompressible Navier-Stokes (NS) equations:

\[
\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} + \frac{1}{\rho} \nabla p = \mathbf{g} + \nu \, \nabla \cdot \nabla \mathbf{u} \tag{2.1}
\]

\[
\nabla \cdot \mathbf{u} = 0 \tag{2.2}
\]

We use the same notation as [BMF07], where u represents the velocity, g the gravitational
acceleration, ρ the density, p the pressure, and ν the viscosity coefficient. Many numerical
methods exist for solving these equations; they fall into two main categories, Eulerian and
Lagrangian approaches.
Specifically, gravity is used as an external force in our simulations and is exerted on the whole
body of fluid. Due to incompressibility, we solve a pressure equation with a conjugate gradient
(CG) solver to obtain a divergence-free velocity field. Given the known velocity field and a
pre-defined time step, we can then use, e.g., semi-Lagrangian advection, which performs an
Euler-step lookup of source positions followed by linear interpolation, to compute the solution.
We use no viscosity for smoke simulations; for more details, please refer to the fluid simulation
course notes [BMF07].
To solve the complex NS equations above, we employ the so-called splitting method and solve
separate, simpler equations for advection, body forces and pressure/incompressibility, as shown
in [BMF07]. After splitting, we obtain three simpler equations corresponding to advection, body
forces and the pressure solve enforcing incompressibility. To generate fluid simulation data, we
can then vary the strength of the applied forces (e.g., gravity, buoyancy) as well as the inflow
parameters, producing simulations for different parameter sets. In this work, we mainly run the
fluid simulation pipeline in the data generation phase. Advection is also used during the training
of our neural networks, which is discussed in detail in Chapter 5.
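As an illustration of the advection step described above, the following is a minimal 2D semi-Lagrangian advection sketch in NumPy, assuming unit grid spacing and cell-centered velocity samples. The actual simulation data in this thesis is generated with a standard fluid solver; this toy version only shows the Euler-step backtrace and linear interpolation.

import numpy as np

def advect_semi_lagrangian(q, u, v, dt):
    """Advect scalar field q (H, W) by velocities u, v (H, W) over time step dt."""
    h, w = q.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Euler-step backtrace to the source position of each cell
    src_x = np.clip(xs - dt * u, 0, w - 1)
    src_y = np.clip(ys - dt * v, 0, h - 1)
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, w - 1), np.clip(y0 + 1, 0, h - 1)
    fx, fy = src_x - x0, src_y - y0
    # bilinear interpolation of the source values
    top = (1 - fx) * q[y0, x0] + fx * q[y0, x1]
    bot = (1 - fx) * q[y1, x0] + fx * q[y1, x1]
    return (1 - fy) * top + fy * bot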

2.1.2. Smoke Simulation & Control

The general NS equations apply to smoke simulations as well. In this thesis, we focus on smoke
simulations and do not consider scenes with liquids. A simple semi-Lagrangian advection
scheme (i.e., second-order advection) is used to generate our smoke simulation data.
Smoke animation control is a particularly relevant topic for our work, as our goal is to generate
smoke simulations via key-framed sketches. [TMPS03] propose a way to control smoke
simulations with key-frames and external forces in an optimization-based framework. One
limitation of this work is that it becomes computationally prohibitive for large problems with
fine-grained control, making it difficult to generalize to longer simulations and highly detailed
key-frames. [FL04] propose a different control paradigm called target-driven smoke animation,
using a sequence of smoke states as targets instead of key-frames. More recently, [IEGT17] use
primal-dual optimization to increase the detail of smoke simulations. These works show
promising results for controlling smoke simulations; however, it is neither easy nor intuitive for
users to interactively define how to control the smoke via such optimization parameters.

Figure 2.1.: Screenshot from [TMPS03] showing how smoke shapes can be controlled via key-frames.

Some existing works show how to use images as input for smoke reconstruction and for
controlling the style of the output in an artistic way, such as [ODAO15][KAGS19][JFA+15]. For
artists, these image-based approaches are indeed more intuitive for stylizing smoke simulations
than the earlier smoke control methods [FL04]. There are also works that use sketches instead
of images for fluid flow control, which is more interactive and more closely related to this thesis.
[SCK10] and [ZIH+11] propose sketch-based systems for visualizing 2D vector fields and for
illustrating dynamic fluid systems, respectively. Unlike these two works, [HXF+19] propose a
learning-based approach that encodes 2D sketches drawn by artists into 2D velocity fields,
enabling 2D fluid animations generated from hand-drawn sketches that interact with existing
images. However, this is limited to 2D, and sketch-based systems for 3D fluids remain largely
unexplored. Since we aim to use sketch-based modelling techniques for 3D smoke simulations,
we combine existing methods from sketch-based 3D modelling with smoke simulation control
in this thesis.

Figure 2.2.: From left to right: examples of image-based artistic control for smoke simulations and
sketch-based control for fluid flow design, proposed in [KAGS19] and [HXF+19], respectively.

2.2. Sketch-Based Modelling


Sketch-based interfaces for modeling (SBIM), considered to be an important topic in the area
of computer graphics, modeling and user interfaces, has been widely studied for 3D geometry
applications. As a prototyping stage for designs, the goal of SBIM is basically to reconstruct
complex 3D shapes from human sketches (e.g., pencil strokes) and allow users or artists to edit
the generated models progressively. Figure 2.3(a) shows the pipeline for artists to design a 3D
car model using human sketches and complex software. As this process is non-trivial, the goal
of SBIM is to simplify this process.
As shown in Figure 2.3(a) mentioned in [OSSJ09], a sketch-based system usually needs artist
to sketch contours of the target 3D model with shape cues and complex software is needed
to build the 3D model. So it is quite crucial that users’ sketches can be interpreted by such
a software system. At the same time, with the development of Non-Photorealistic Rendering
(NPR), people are given more references on how to sketch accurately for a target 3D model to
reveal its shape.
(a) Pipeline to create a 3D model from a sketch, which may require an expert and complex software [OSSJ09].

(b) An example of an extracted contour (left) and a suggestive contour (right), shown in [DFRS03].

Figure 2.3.: Examples of sketch-based 3D modelling from [OSSJ09].

There are traditional methods that provide users with an interactive approach to reconstruct 3D
models. For example, [BBS08] propose an approach to reconstruct 3D curve models from
multi-view sketches. One drawback of this kind of method is that users have to take certain
constraints into consideration, and some surfaces of the reconstructed curves must be pre-defined.
Data-driven approaches are also available for sketch-based modelling; they aim to alleviate the
constraints of traditional methods and to reconstruct free-form shapes from human sketches. For
example, [EHA12] learn human sketching styles to improve sketch-based retrieval of 3D objects
from a database. Other approaches, such as [DAI+18], reconstruct 3D models and voxel grids
directly and do not require an explicit database for new input sketches. Such direct 3D
reconstruction systems show increasingly promising results, as users do not need domain
knowledge for sketching and can progressively refine the reconstructed 3D models in an
interactive way. Hence, in this work, we mainly focus on a data-driven approach using deep
neural networks, as in [DAI+18]. The goal is similar, since we aim to reconstruct 3D smoke
simulations directly from multi-view sketches. Moreover, 3D smoke simulations can use
representations similar to 3D objects on voxel grids; in the following section we cover 3D
machine learning for 3D reconstruction, multi-view modelling, etc.

2.3. Machine Learning

2.3.1. Basics

First, we cover some basics of machine learning (ML) and deep learning (DL), since this work
explores two kinds of learning for sketch-based smoke simulations: supervised and unsupervised
learning. We refer to the recent deep learning book [GBC16] for more details and a better
understanding of deep neural networks. Learning algorithms are usually divided into three
mainstreams: supervised learning, unsupervised learning and reinforcement learning. Here we
cover the basics of the first two, as we do not use reinforcement learning in this work.
Supervised learning can be further categorized into classification and regression. For
classification, an ML model is trained on input data labeled with different classes; such a model
learns to map inputs to one or more classes and generalizes well to unseen inputs. The other type
of problem is regression, which is what we explore in this thesis; its goal is to map inputs to
continuous values instead of discrete ones.

Supervised Learning

Recently, machine learning has received more and more attention due to the success of deep
learning, in which supervised learning plays an important role. As a large amount of labeled
data is available in the ML community, supervised learning is an effective way to learn directly
and efficiently from data how to map inputs to ground-truth labels. Moreover, many current
state-of-the-art models are trained in a supervised way, such as the image classification models
[KSH12][HZRS16][HLVDMW17] trained on ImageNet [DDS+09].
Similarly, many works rely on images and corresponding labels as inputs, such as a class label
in image classification or real-valued bounding boxes in object detection [GDDM14]. Generally,
training a machine learning model amounts to approximating some function f given inputs and
outputs, which can be achieved by well-known models such as the perceptron, multi-layer
perceptrons (MLPs) and convolutional neural networks (CNNs).

Neural Networks

With neural networks (NNs), the non-linear function f we aim to learn is normally modeled as
multiple layers, each consisting of multiple neurons. The output of each layer is generally a
non-linear activation function applied after a linear combination of weights and input features.
The activation function is what enables the network to capture non-linearities while
approximating the function f. NNs are considered general models for data-driven applications,
as they are powerful at learning mappings between inputs and outputs. Figure 2.4 illustrates
how an MLP with one hidden layer works. Suppose X are the input features, which can be an
arbitrary representation such as images or text. The hidden layer consists of k neurons, each
denoted a_k. The hidden state is obtained by matrix multiplication with learnable weights,
represented as lines in the graph. The output is then another matrix multiplication between the
hidden state and weights, followed by an activation function. We can stack such blocks to build
a deeper network with multiple layers, the so-called MLP.
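As a concrete illustration of the MLP just described, the following PyTorch snippet builds a one-hidden-layer network; the layer sizes are arbitrary examples and unrelated to the architectures used later in this thesis.

import torch
import torch.nn as nn

# Minimal one-hidden-layer MLP: linear weights, non-linear activation, linear output.
mlp = nn.Sequential(
    nn.Linear(64, 128),   # W1: input features X -> hidden neurons a_k
    nn.ReLU(),            # non-linear activation function
    nn.Linear(128, 10),   # W2: hidden state -> output
)

x = torch.randn(32, 64)   # a batch of 32 input feature vectors
y = mlp(x)                # forward pass, output shape (32, 10)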
There are many choices for the activation function, such as ReLU [GBB11], Leaky ReLU
[MHN13] and ELU [CUH15]. The most common is the ReLU function, which performs well in
image classification tasks. More recently, it is often replaced by others (e.g., PReLU [HZRS15],
Leaky ReLU) in super-resolution [LTH+17] and 3D reconstruction [KAT+19] to obtain sharper
reconstructed details.
CNNs are an even more popular architecture for various tasks [LBB+98][Kim14], as weight
sharing reduces the problem of over-fitting. Like an MLP, a CNN consists of linear operations
followed by the activation functions mentioned above; the difference is that the learnable
weights are shared across pixel locations in an image, so a CNN can be seen as a regularized
version of an MLP. CNNs can also model certain invariance properties with deeper networks,
such as translation invariance.

Unsupervised Learning

When ground-truth labels are not available, we may need so-called unsupervised learning
techniques. Clustering (e.g., k-means) is a popular way to extract intrinsic properties directly
from data by grouping data points into clusters in an unsupervised way. Related algorithms,
such as Gaussian Mixture Models (GMMs), achieve this by softly assigning data points to
clusters with probabilities.
Figure 2.4.: An example of a Multi-Layer Perceptron with one hidden layer, from scikit-learn [BLB+13].

In the field of deep neural networks, generative models are designed to estimate the data
distribution under some underlying priors or assumptions (e.g., Gaussian mixtures). With the
development of deep neural networks, they have become increasingly powerful at modeling
arbitrary distributions, even starting from simple Gaussian distributions. Among them, one of
the most popular approaches is Generative Adversarial Networks (GANs), introduced by
[GPAM+14]. They have been shown to be particularly powerful at re-creating the distributions
of complex data such as images of different categories, human faces, and even 3D volumes.
There are also other generative models, such as the variational autoencoder (VAE), which model
an explicit density. VAEs and GANs are both explored in this thesis, while other types of
generative models (e.g., PixelCNN [VdOKE+16]) are not discussed.
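For reference, the snippet below sketches the generic VAE reparameterization trick and the KL-divergence term against a standard Gaussian prior; it is a textbook formulation rather than the exact loss configuration used later in this thesis.

import torch

def sample_latent(mu, logvar):
    # Reparameterization: z ~ N(mu, sigma^2), differentiable in mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def kl_divergence(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))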

2.3.2. Machine Learning for Sketch-Based Modelling

As mentioned above, deep neural networks are nowadays popular and widely used. Compared
with traditional shallow machine learning models, deep learning methods can capture more
non-linearities with deeper models and solve more complex tasks. They have also achieved
great success in computer graphics and vision; for example, they have been used for rendering
noise-free images [BVM+17] and volumes [KMM+17], character control [PALvdP18],
image-to-image translation [ZPIE17][IZZE17], etc.
Meanwhile, besides these image-related tasks, 3D tasks such as shape retrieval and reconstruction
have also been widely explored. Large-scale shape datasets, including ModelNet [WSK+15] and
ShapeNet [CFG+15], have been proposed for research in this area, making it possible to train
deep neural networks to reconstruct 3D shapes directly from a single image or from noise
[WZX+16]. As sketches and images have similar properties, we refer to related works on both
topics.


Non-Photorealistic Rendering

Neural network-based methods depend heavily on training data, while collecting a large number
of human sketches together with corresponding density/velocity fields from artists would be
time-consuming and expensive. Fortunately, we can rely on so-called Non-Photorealistic
Rendering (NPR) techniques to solve this problem. Previous data-driven approaches use
non-photorealistic rendering to generate synthetic sketches for 3D shapes. Existing NPR methods
like suggestive contours [DFRS03] are used in many sketch-based 3D modelling tasks
[LGK+17][WCPM18]; this method produces sketches with more internal details and curves
based on zero-crossings of the radial curvature. The image-space contour rendering method
proposed in [ST90] is used in [DAI+18]. In addition, much existing modeling software (e.g.,
Maya) has built-in shading methods (e.g., toon shading) for NPR applications.
Although we have many tools and algorithms to generate sketches from 3D meshes, this step is
not reversible, and reconstructing 3D models from pure sketches and line drawings is far less
straightforward. Thanks to the development of machine learning, especially deep learning as
mentioned above, we can train a deep neural network for this kind of task on available datasets.
With the many existing sketch acquisition methods, we do not have to build a large dataset
manually with the help of artists; instead, we can run these algorithms on existing 3D models
and construct sketch-model pairs to build a large training dataset, which is the main pipeline of
deep learning-based methods for reconstructing 3D data from 2D sketches.
In our case, we apply a similar approach to smoke simulation data to generate a large set of
synthetic data. Further, we propose a differentiable way to render sketches with accurate
contours from density fields only, enabling us to build a sketch-to-density dataset for training.

3D Representations

Once we have 2D sketches as input, we need to specify the 3D representation of the smoke
simulations, which determines how we design our model. 3D objects are usually represented as
voxel grids, point clouds, surfaces, etc. Accordingly, 3D reconstruction is normally divided into
volumetric grid reconstruction and point cloud reconstruction.
Figure 2.5 displays these two main representations: point clouds/surfaces and 3D volumes.
Figure 2.5(a) shows the pipelines from [LGK+17] and [LPL+18], respectively; we chose two
related works that, like our task, use sketches as input. The first step of a 3D surface
reconstruction pipeline is to estimate depth and normal maps from the observations. The second
step is to fuse those estimated maps to reconstruct 3D point clouds and the corresponding surface
representation via an existing surface reconstruction method [KH13]. Both works benefit from a
U-Net architecture together with Convolutional Neural Networks (CNNs) to obtain good quality
output [DAI+18][LSS+19]. For the volumetric reconstruction tasks in Figure 2.5(b), the usual
approach is to use a 2D image encoder to encode images into a compact representation and then
apply a 3D convolutional decoder to obtain the output. Unlike the first representation, volumetric
reconstruction tends to suffer from limited GPU memory for high-resolution data. This problem
also arises in deep learning-based fluid flow super-resolution [XFCT18] as well as in our work.
One way to alleviate it is to model the 3D volume as slices and use a 2D convolutional neural
network to achieve similar reconstruction quality. The left work in Figure 2.5(b) shows the
possibility of using a 2D U-Net for 3D reconstruction. A similar idea appears in Figure 2.6
[WXCT19], where 2D slices of a whole 3D volume are reconstructed over multiple steps with
the popular progressively growing GANs [KALL17]. This alleviates the memory problem and
achieves much higher-resolution data (e.g., 8× upsampling) from low-resolution inputs.
In this work, we use a volumetric representation for smoke, as the simulation data is generated
on volumetric grids, and our target resolution can be trained on a single modern GPU.
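To make the 2D-encoder/3D-decoder pattern concrete, the following PyTorch sketch shows a generic latent-code-to-voxel decoder built from upsample-and-convolve blocks. The channel counts, latent size and 64³ output resolution are illustrative placeholders and do not correspond to the architecture detailed in Chapter 4.

import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, z_dim=256):
        super().__init__()
        self.fc = nn.Linear(z_dim, 128 * 4 * 4 * 4)   # latent code -> coarse 4^3 feature volume
        def up(cin, cout):
            # nearest-neighbour upsampling followed by a 3D convolution block
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm3d(cout),
                nn.LeakyReLU(0.2),
            )
        self.blocks = nn.Sequential(up(128, 64), up(64, 32), up(32, 16), up(16, 8))
        self.out = nn.Conv3d(8, 1, kernel_size=3, padding=1)

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 4, 4, 4)
        return self.out(self.blocks(x))   # (N, 1, 64, 64, 64) density volume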

Multi-view Systems

In both tasks, 3D volumes or point clouds are usually reconstructed or classified from multi-view
2D images captured by a multi-camera system from different viewpoints. Neural networks
should therefore be designed to incorporate information from different views of the same object.
[KHM17] and [LGK+17] show two different ways to feed in multi-view images. [KHM17] first
uses an image encoder to combine features from all collected views of the object and then
applies an unprojection operation to lift the 2D feature maps into 3D voxels for later depth
estimation and volumetric reconstruction. [LGK+17], on the other hand, stacks multiple views
as channels and first estimates depth and normals with a U-Net architecture; the estimated depth
and normal information is then used to reconstruct a point cloud within an optimization
framework. Multi-view images give strong priors and reduce ambiguity for 3D reconstruction,
which is why much work builds multi-view capture systems for better performance. For
example, [DAI+18] shows that sketches from multiple viewpoints can be used to refine printable
3D shapes, even with 2D CNNs.
However, when sketches are the input, we also need to take human effort into consideration, as
sketching is normally harder than taking photographs. For a sketch-based 3D reconstruction
system, a user might prefer to sketch only a few views (e.g., fewer than 3) and still obtain a good
3D reconstruction. As shown in Figure 2.5(a), both works require merely two sketches, from
the front and side views, which is user-friendly. Of course, sketching progressively, as in
[DAI+18], is also a good way to refine the reconstructed 3D shapes. In this work, we follow the
two-view input framework to reduce the required human input while still reducing the
ambiguity of 3D reconstruction compared to a single view.
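A two-view setting can, for instance, be implemented by applying one weight-shared 2D encoder to both views and concatenating the per-view features, as in the hypothetical sketch below; the view encoder itself and all dimensions are placeholders, not the encoder described in Chapter 4.

import torch
import torch.nn as nn

class TwoViewEncoder(nn.Module):
    def __init__(self, view_encoder, feat_dim=512, z_dim=256):
        super().__init__()
        self.view_encoder = view_encoder          # any CNN mapping (N, 1, H, W) -> (N, feat_dim)
        self.to_z = nn.Linear(2 * feat_dim, z_dim)

    def forward(self, sketch_front, sketch_side):
        f_front = self.view_encoder(sketch_front)  # shared weights for both views
        f_side = self.view_encoder(sketch_side)
        return self.to_z(torch.cat([f_front, f_side], dim=1))  # joint latent code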

Single-view Modeling

Meanwhile, there are many works on single-view 3D reconstruction via learning from data.
[WWX+17] propose a disentangled formulation by introducing 2.5D sketches for single-image
3D reconstruction, and claim that the 2.5D sketches benefit both sketch transfer and 3D shape
reconstruction. For this kind of ill-posed problem, estimating intermediate representations is
usually a more robust way to achieve good results. For example, [LPL+18] propose to estimate
a flow field in a first stage, before depth and normal map estimation, to obtain more robust
reconstructions. The success of these works shows that an additional mid-level representation
for reconstructing 3D fields may give better quality output even with a single view as input. On
the other hand, the amount and quality of data is also important for learning sufficient priors for
such ill-posed problems. Since in smoke simulations a single-view input usually leads to an
ill-posed problem when the smoke inflow source moves in a 2D/3D plane, we start from two
views as input, as mentioned above, to alleviate this problem.

(a) Surface-based 3D representation reconstructed from multi-view sketches [LGK+17][LPL+18].

(b) Volumetric 3D representation reconstructed from multi-view sketches or images [DAI+18][LSS+19].

Figure 2.5.: Different 3D representations related to our work, reconstructed from 2D sketches/images.

Generative Models

Besides supervised learning with deep neural networks on paired data, it is also possible to infer
3D representations directly from compact vector representations (e.g., noise, a distribution)
without observations. For example, [WSK+15] provides a 3D reconstruction framework that
infers 3D shapes from a latent space with an unconditional GAN, showing that GANs
[GPAM+14] also have the capability to generate complex 3D data from samples of a distribution
such as a Gaussian, similar to 2D image generation [RMC15] or translation [ZPIE17]. This
provides an unsupervised way to learn a data distribution and generate new, unseen data without
any paired data. Another kind of GAN, the conditional GAN, is used for supervised cases, where
it is known to preserve more high-frequency details in the final output images [IZZE17].
Moreover, [ZPIE17] propose a cycle-consistency loss to translate images between different
classes without any paired images, which has become a powerful tool in domain adaptation.

Novel View Synthesis

Further, in some cases it is possible to reconstruct 3D representations without any supervision
in 3D space. Such works are usually categorized as novel view synthesis [ZTF+18]. The goal of
novel view synthesis is to produce RGB images of novel views given a set of RGB input images,
and multi-view stereo algorithms are normally applied in these cases. Usually, one finds a way
to render the 3D shape into a 2D representation, such as normals or silhouettes, which serves as
2D supervision. [KUH18] achieves this with a differentiable renderer, the so-called Neural 3D
Mesh Renderer, which can also be applied to neural style transfer. [YYY+16] share a similar
idea, using projected silhouettes as 2D supervision when no 3D ground truth is available.
Most recently, [LSS+19] propose a method that generates a volumetric representation from
captured multi-view images with 2D supervision, using raymarching and a hybrid rendering
algorithm. Their model is used in their Virtual Reality (VR) environment and can generate
dynamic content such as image sequences and novel views from user inputs. This work is
closely related to ours, as we also use a differentiable rendering approach during training to
reconstruct 3D volumes.
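As a loose illustration of such 2D supervision from a 3D volume, the snippet below projects a density volume along one axis and turns the projected gradient magnitude into soft, sketch-like strokes using only differentiable operations. The projection axis and the sharpness parameter c are assumptions for this toy example; it is a stand-in to convey the idea, not the renderer actually used in this thesis, which is described in Chapter 4.

import torch

def soft_sketch(density, dim=0, c=10.0):
    # density: 3D tensor; project orthographically along one axis to a 2D image
    proj = density.sum(dim=dim)
    gy = proj[1:, :] - proj[:-1, :]   # finite-difference gradients of the projection
    gx = proj[:, 1:] - proj[:, :-1]
    mag = torch.sqrt(gx[:-1, :] ** 2 + gy[:, :-1] ** 2 + 1e-8)
    return torch.exp(-c * mag)        # bright "paper" with dark strokes along density edges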

Sketch-Based Smoke Animation

Although many existing works address sketch-based 2D/3D synthesis, how to use sketches to
interact with or animate smoke simulations is still largely unexplored. One reference from the
artists' side is the use of hand drawings for visual effects animation in industry. The book
Elemental Magic by Joseph Gilland describes the art of hand-drawn special effects animation
and some basic rules that artists or animators should follow. For example, the book demonstrates
that sketching smoke mainly involves three steps: first sketch simple volume shapes, then add
shading and details on top, and finally clean up the strokes. In addition, arrows are used to show
how the smoke should move, considering the physics around the smoke plume, to obtain a
nicely animated smoke. Similarly, clean lines and shadows can be sketched progressively from
an initial rough drawing.
In this thesis, we mainly follow this kind of drawing style for animating smoke when generating
our datasets for sketch-based smoke simulations, and we then apply machine learning models
for 3D reconstruction and for reconstructing smoke simulations.

2.3.3. Machine Learning for Fluids

While machine learning techniques have achieved breakthroughs in many different areas, including
computer vision, graphics and 3D shape modeling, they have also made significant progress in
numerical estimation and fluid simulation. As a large amount of data can be generated by simply
running many fluid simulations with different parameters, data-driven approaches are attracting
more attention, and some works have already demonstrated the accuracy and efficiency of
data-driven methods.
Specifically, the Regression Forests method [JSP+ 15] is the first to combine fluid solvers with
machine learning techniques, achieving real-time simulation. Further, with the progress of
convolutional neural networks (CNNs) and deep generative models, reconstructing Eulerian fluid
simulation velocities from a set of reduced parameters becomes possible in a more efficient way
in terms of speed. [TSSP17] propose a CNN-based approach to learn the physics of fluid and
smoke, achieving realistic simulation results with less execution time than traditional simulation
methods. Both works show that neural network-based approaches have the capability to solve
physics-based problems not only accurately, but also faster. To preserve high fidelity in fluid
simulation, a neural network-based approach was first introduced to model liquid splashing in
[UHT18].
One disadvantage of training neural networks with standard loss functions (e.g., MSE and L1
loss) is that they might lead to blurry results for reconstructed fluid and smoke [CT17]. Hence,
[CT17] propose to use a fluid repository and match fluid patches via a CNN-based descriptor,
instead of reconstructing smoke directly. However, this predates the wide use of generative
adversarial networks (GANs). In tasks such as fluid flow super-resolution, shown in Figure 2.6,
[XFCT18] successfully use a temporally coherent, volumetric GAN, showing that deep neural
networks can approximate the numerical methods well and reconstruct fluid with good quality
and temporal consistency. A follow-up work for higher-resolution upsampling is proposed in
[WXCT19], which alleviates the memory consumption problem of high-resolution data and
makes 8× super-resolution achievable with good quality. Here GANs show their power to
reconstruct fluid flow with high quality and sharp details.
Besides applications in super-resolution, it is also possible to generate fluid simulations from a
few parameters [KAT+ 19] or to artistically control smoke simulations via images [KAGS19] as


mentioned in Section 2.1.2.


Overall, works combining machine learning and fluid flow tend to receive more and more
attention, as they explore how to leverage neural network-based approaches to make fluid
simulation easier and faster while still preserving fluid behaviour with good quality. However,
as their settings are quite different from ours, it is interesting to apply deep learning algorithms
to 3D smoke simulations driven by sparse, multi-view human sketches.

Figure 2.6.: Examples of machine learning applied to 3D fluid flow super-resolution, as proposed in
[XFCT18][WXCT19].

3. Overview
In this thesis, we present a deep learning-based approach to generate single- and multi-frame
density fields from sparse key-framed, multi-view sketches. Our framework consists of two main
parts: a sketch encoder combined with a density decoder to reconstruct a single-frame 3D density
from multi-view sketches, and a velocity decoder to reconstruct temporally-coherent smoke
simulations between key-framed sketches.

3.1. Problem Formulation

Firstly, given a single frame of simulated smoke and a set of corresponding sketches captured
from multiple viewpoints as inputs, our goal is to train convolutional neural networks (CNNs)
that produce a detail-preserving solution for single-frame density estimation. In this case, the
inputs to our CNNs are a pair of multi-view sketches and 3D smoke data. Let S ∈ R^{P×H'×W'}
represent the multi-view input sketches for a single frame, where P, H', W' denote the number
of viewpoints, the height and the width of the sketches, respectively. The input to the network
also includes the 3D density, and we use d, d̂ ∈ R^{D×H×W} to represent the ground truth
density from the simulation as well as the one reconstructed by the network, which will be
formally defined after introducing our neural networks in Chapters 4 and 5. Ground truth 3D
smoke data can be obtained via a physically-based fluid simulation engine. Simulation data
usually consist of 3D scalar fields (density) and 3D vector fields (velocity). Multi-view sketches
for each simulated frame can be acquired via rendering techniques (e.g., NPR, Suggestive
Contours [DFRS03]).
The second task is multi-frame density interpolation between two given key-framed sketches.
Provided sparse sketches at different time steps, we should be able to generate a whole smoke
simulation using our trained CNN or additional networks. This problem can be re-formulated as
interpolating the 3D smoke between every two sets of multi-view sketches. To achieve this, we have to

model the temporal relations between the drawn sketches. In this case, we index key-framed
sketches with subscripts: for every two key-framed sketches, we write S_t and S_{t+T}, where t
is an arbitrary time step and T represents the distance between the two sketched key-frames.
For such a pair of multi-view sketches, our models need to infer all intermediate quantities
d̂_t, d̂_{t+1}, ..., d̂_{t+T−1}, d̂_{t+T}, including the left- and right-most key-frames. Our methods
rely on a compact latent space (e.g., learned by a variational autoencoder), denoted as z, for
interpolating the target quantities, to avoid computing directly on high-resolution data.
Additionally, we need to reconstruct velocity fields to model the temporal consistency between
consecutive density frames. Similarly, we define the ground truth velocity fields and the
reconstructed ones as v, v̂ ∈ R^{D×H×W×3}.
As T denotes the distance between two sketched key-frames in the multi-frame interpolation
problem, we can formally define several cases as T varies. In this thesis we explore the following
cases:
• T = 1: a special case equivalent to single-frame interpolation, where we sketch on every
frame.
• T = 10: 10-frame interpolation, meaning our input sketch pair S_t, S_{t+T} has a distance
of 10 and we need to interpolate T + 1 frames in total, including the key-framed ones.
• T = 20: similar to the case T = 10, but with sparser input.
We consider T = 20 to already provide a sparse input setting for practical applications. For
larger T, further experiments would be needed to see whether our models can still reconstruct
the whole sequences. In this thesis, we therefore mainly explore at most 20-frame interpolation,
i.e., the case T = 20.

3.2. Single-frame Density Reconstruction

The overall architecture for single-frame density reconstruction is shown in Figure 3.1 and in
more detail in Figure 4.1. We start with two-view (front & side) sketches S rendered from
each density field d via a differentiable renderer R. A multi-view sketch encoder and a density
decoder F_z2d follow to convert the input sketches into a reconstructed density d̂. More details
are given in Chapter 4. With our proposed renderer, we can generate synthetic sketches from a
large amount of smoke simulations as training data. Besides optimizing for consistency of the
3D densities, we incorporate a loss function that measures the difference between reconstructed
sketches and input sketches during training. This enables us to improve the quality of the
reconstructed sketches as well.

3.3. Multi-frame Density Interpolation

The overall architecture for multi-frame density interpolation is also shown in Figure 3.1. We
present a way to interpolate the density fields from d̂_t to d̂_{t+T} between two key-framed
sketches S_t and S_{t+T}, based on the learnt latent space as input.


Figure 3.1.: Overview of our method. We begin with a differentiable renderer that renders two-view
sketches from each density frame. For single-frame density reconstruction (Section 4), the
multi-view sketch encoder F_s2z first generates a latent code for the two-view sketches, and
the latent code is decoded into the corresponding density field by a density decoder F_z2d.
For multi-frame density interpolation (Section 5), we use a mapping network F_z2ẑ and a
velocity decoder F_ẑ2v to estimate the velocity fields between two sketched key-frames. Our
networks are trained separately for density and velocity reconstruction. At inference time,
we reconstruct an initial density frame based on the multi-view sketches at the first key-frame
and re-simulate (Section 5.4.3) the densities based on the reconstructed velocity fields
between key-frames.


This allows us to utilize the information learnt in the first, single-frame density reconstruction
phase. A mapping network F_z2ẑ and a velocity decoder F_ẑ2v are added to reconstruct the
intermediate velocities based only on the key-framed sketches. We then re-simulate the process
by advection to obtain d̂_t, ..., d̂_{t+T}, starting from the initially reconstructed density d̂_t and
the estimated velocities v̂_t to v̂_{t+T−1}.

3.4. Notations
Before introducing the models, we list the notations for hyper-parameters, data symbols and
operators (e.g., renderer, neural networks) in Table 3.1; these are used frequently in later sections.
To generate the sketch data, we need to render the density fields into sketches. The rendering
operator R generates multi-view sketches S from a simulated 3D density field d. In general, R
can represent any method that renders sketches from a 3D representation. In this work, we
further extend it with different subscripts, introduced later, to render different properties of the
3D smoke.
Briefly, we first apply our renderer to generate normals from a specific viewpoint. The
approximated normals allow us to render contour edges and shading by thresholding and lighting,
respectively. We finally generate the input sketches by blending the rendered edges and shading.
Alternatively, existing NPR methods can render the smoke after converting it to meshes, as is
commonly done in other works [WCPM18]. Thus, the output can be written as:
S = R(d) (3.1)

In our proposed renderer, R can be one of three types: R_edge, R_shade and R_blend. We choose
R_blend as the rendered sketches then contain both contour and shading information.
The advection operator A is used in the simulation to obtain the next quantities, such as the
density or velocity field, based on the current ones. For example, with ground truth fields d_t, v_t
at time step t we get d_{t+1} ← A(d_t, v_t) via the advection operator. This is required when
generating fluid data and can also be applied while training our neural networks, e.g.,
[XFCT18][KAGS19].
D, H, W, P are hyper-parameters for the resolutions and the number of viewpoints. In this work,
we set D, H, W to 112, or to 224 for higher resolution, and only use cubic 3D volumes as it is
more convenient to generate square sketches. P is fixed to 2 for minimal user input, and H', W'
are both fixed to 224 as this resolution already yields good rendered sketches. The dimensionality
of the latent representation is fixed to 256, the same as in [LSS+ 19].


Table 3.1.: Notations of hyper-parameters, symbols and operators used in this thesis.

Notation       | Meaning
-------------- | -------------------------------------------------------------------------
H', W'         | Height and width of the input sketches
D, H, W        | Depth, height and width of a 3D volumetric representation
d, d̂          | Ground truth density field and reconstructed density field
v, v̂          | Ground truth velocity field and reconstructed velocity field
Ψ̂             | Stream function output, where v̂ = ∇ × Ψ̂
R              | Differentiable rendering operator
A              | Differentiable advection operator
S              | Multi-view sketch input
T              | Distance between two key-frames
S_t, S_{t+T}   | Two sketched key-frames at time steps t and t + T
z, z_dim       | Latent code corresponding to multi-view sketches S and its dimensionality
P              | Number of viewpoints for multi-view sketches S

4. Single-Frame Density Reconstruction
In this section, we present our proposed framework, an encoder-decoder network that converts
multi-view sketches into 3D density fields. Further, we introduce a differentiable sketch renderer
to improve reconstruction quality by modelling 2D-3D consistency. In addition, we extend our
architecture into a GAN-based one in order to reconstruct 3D smoke with sharper details.

4.1. Baseline Model


Our baseline model consists of a multi-view sketch encoder and a density decoder that convert
multi-view sketches into 3D density fields. Inspired by [LSS+ 19], we directly incorporate a
variational autoencoder (VAE) architecture into our encoder-decoder network. VAEs combined
with CNNs are often used for encoding or decoding images and 3D volumetric data. The
convolutional VAE architecture encourages smoothness of the learnt latent space, compared with
naive autoencoder (AE)-based models that place no constraint on the latent space. Therefore, we
can apply arithmetic operations such as linear interpolation in the latent space to approximate
the in-between sketches on the manifold. Such a learnt latent space also enables us to animate the
transition between key-frames and reconstruct the intermediate 3D densities between key-framed
sketches.

4.1.1. Variational AutoEncoder

As mentioned above, one significant component of our models is the variational autoencoder
(VAE) [KW13]. This variational architecture encourages smoothness of the latent space, which
forms the bottleneck of the neural network. The latent code is sampled from a diagonal normal
distribution parameterized by the encoder.

Figure 4.1.: The architecture for reconstructing a single-frame density field from multi-view
sketches. Specifically, we use a two-view setting where the weights of the sketch encoder
[F_s2f, F_f2z] are shared among the single-view sketches, as is the pre-trained feature
extractor DenseNet-121. F_z2d represents the density decoder that decodes the multi-view
sketch latent code (in green) into the reconstructed density field.

Our goal is to minimize both the reconstruction error and the KL-divergence between a standard
normal distribution and our parameterized distribution, where the KL-divergence can be seen as
a regularization term. Reparameterization is used to make the sampling operation differentiable
so that the network is trainable. At test time, the latent code is taken to be the mean estimated
by the encoder.
Once the latent space is smooth enough, we are able to interpolate or apply arithmetic operations
between two latent vectors to generate new samples, which is extensively discussed for different
types of generative models such as GANs [WZX+ 16] and VAEs [KW13]. This property can be
plugged into our models in a natural way, since we want to interpolate between two frames to
generate the simulation in-between and thereby model the dynamics.
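To make the sampling step concrete, the following minimal PyTorch sketch illustrates the reparameterization and KL term described above; layer names such as fc_mu and fc_logvar and the feature dimension are illustrative placeholders rather than the exact implementation used in this thesis.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Maps flattened sketch features to a diagonal Gaussian and samples z.

    A minimal sketch; feature_dim and z_dim (256 in this thesis) are placeholders.
    """
    def __init__(self, feature_dim=8192, z_dim=256):
        super().__init__()
        self.fc_mu = nn.Linear(feature_dim, z_dim)
        self.fc_logvar = nn.Linear(feature_dim, z_dim)

    def forward(self, features):
        mu = self.fc_mu(features)
        logvar = self.fc_logvar(features)
        if self.training:
            # Reparameterization: z = mu + sigma * eps keeps the sampling differentiable.
            std = torch.exp(0.5 * logvar)
            z = mu + std * torch.randn_like(std)
        else:
            # At test time the latent code is taken to be the predicted mean.
            z = mu
        # KL divergence to the standard normal, used as a regularizer on the latent space.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```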

4.1.2. Sketch Encoder Network

To encode the sketches into the latent space, we build a 2D convolutional neural network called
the sketch encoder, F_s2z : R^{P×H'×W'} → R^{z_dim}. Specifically, this architecture consists
of 2D convolutional layers coupled with batch normalization [IS15] and the Leaky ReLU
activation function. It has been shown that Leaky Rectified Linear Unit (Leaky ReLU) activations
can yield sharp outputs while minimizing an L1 loss for fluid data [KAT+ 19], so we use the same
setting for all layers. Thus, one building block of our sketch encoder is conv2d-batchnorm2d-lrelu,
representing a 2D convolutional layer with a pre-defined number of channels, 2D batch
normalization and an element-wise Leaky ReLU activation. This is summarized in Figure 4.1.
Though our input for a single frame contains sketches from two viewpoints, the input to F_s2z
is a single-channel sketch from one specific viewpoint, meaning that the sketches from

different viewpoints (i.e., front & side views) are designed to share the same sketch encoder
weights. The output of each shared branch is a lower-dimensional feature map corresponding to
a specific viewpoint, which is flattened into a high-dimensional vector. These feature vectors are
then concatenated to form an aggregated representation of the multi-view sketches. Besides
training the sketch encoder from scratch, a further improvement is to use one of the networks
pre-trained on a large image database [DDS+ 09] as part of the sketch encoder to extract features
from the input sketches. Though we generate sketches through NPR methods, we can still obtain
useful features, such as edges and compact feature representations, from pre-trained networks
(e.g., VGG [SZ14], ResNet [HZRS16], DenseNet [HLVDMW17]) trained on natural images.
This is inspired by [WCPM18], where sketches of clothes are encoded into feature vectors during
training.
Therefore, we incorporate the shallow layers of such a pre-trained network into our sketch
encoder. In this way, the sketch encoder F_s2z partly contains pre-trained weights from a large
image database, which we freeze during training. The remaining weights of the deeper layers,
our original encoder blocks, are still learnable to fit our own sketch data. This improves the
generalization of our model and makes the sketch encoder easier to train on different datasets.
Specifically, if we denote the pre-trained part as F_s2f and the learnable part as F_f2z, where f
denotes feature maps, we can represent the whole encoder network as the concatenation of these
two sub-networks: F_s2z = [F_s2f, F_f2z]. In this thesis, we experiment with two options,
ResNet-18 and DenseNet-121, and find that even with fewer parameters, DenseNet-121 shows
better reconstruction results while all other components are kept the same. More experimental
details are given in Chapter 6.
After encoding the sketches into feature representations, we incorporate the VAE architecture,
containing two fully-connected layers, to estimate the parameters (i.e., µ and σ) of a diagonal
256-dimensional Gaussian, which forms the latent space z. In this case, an additional loss
function minimizes the KL-divergence between our estimated distribution and the standard
normal distribution [LSS+ 19]. To generate a sample from the encoder and VAE, we sample from
the learned distribution using reparameterization and afterwards decode the latent code into the
corresponding volumetric smoke prediction. By minimizing the KL loss, we enforce smoothness
on the latent space.
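The following is a minimal PyTorch sketch of such a shared-weight, two-view encoder built on top of frozen DenseNet-121 features; the exact slice point of the backbone (the first eight feature blocks, giving roughly 256 × 14 × 14 features for 224 × 224 inputs), the replication of the single-channel sketch to three channels and the layer hyper-parameters are assumptions for illustration, not the verbatim network.

```python
import torch
import torch.nn as nn
import torchvision

def conv_bn_lrelu(in_ch, out_ch, k=3, s=2, p=1):
    """One encoder block: conv2d + batchnorm2d + Leaky ReLU, as in Table 4.1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, p),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class SketchEncoder(nn.Module):
    """Two-view sketch encoder F_s2z = [F_s2f, F_f2z] with weights shared across views."""
    def __init__(self, z_dim=256):
        super().__init__()
        # Pre-trained feature extractor (newer torchvision versions use weights= instead).
        backbone = torchvision.models.densenet121(pretrained=True).features
        self.f_s2f = nn.Sequential(*list(backbone.children())[:8])  # frozen shallow layers
        for p in self.f_s2f.parameters():
            p.requires_grad = False
        self.f_f2z = nn.Sequential(conv_bn_lrelu(256, 256), conv_bn_lrelu(256, 256))
        self.fc_mu = nn.Linear(2 * 256 * 4 * 4, z_dim)
        self.fc_logvar = nn.Linear(2 * 256 * 4 * 4, z_dim)

    def encode_view(self, s):
        # Replicate the single-channel sketch to 3 channels for the pre-trained backbone.
        f = self.f_s2f(s.repeat(1, 3, 1, 1))
        return self.f_f2z(f).flatten(1)

    def forward(self, s_front, s_side):
        # The same weights are applied to both viewpoints, then the features are concatenated.
        feat = torch.cat([self.encode_view(s_front), self.encode_view(s_side)], dim=1)
        return self.fc_mu(feat), self.fc_logvar(feat)
```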

4.1.3. Density Decoder Network

The input to our proposed 3D voxel decoder is the reshaped latent code, i.e., a 4D tensor of size
256 × 1³. The voxel decoder decodes the latent code obtained from the input sketches into a
reconstructed 3D field: F_z2d : R^{z_dim} → R^{D×H×W}. Here we aim to reconstruct a 3D
density field d̂ that matches the corresponding ground truth, where d̂ = F_z2d(F_s2z(S)) given
input multi-view sketches S.
As suggested in [ODO16] and [KAT+ 19], we do not use transposed convolutions but convolutions
coupled with nearest-neighbour interpolation for upsampling, to avoid possible checkerboard
artifacts. We find that transposed convolutions indeed introduce artifacts, while naive upsampling
layers degrade the final results. So we do not simply stack several convolutional layers with
upsampling. Instead, we follow the architecture in [KAT+ 19] to reconstruct density or velocity
by incorporating residual blocks to increase the model capacity.

We find this helps to reconstruct more accurate fields, and we do not observe severe artifacts, as
residual learning is well suited to deep networks. More details are covered in Section 4.2, after
the residual blocks are introduced.
Specifically, we have seven such convolutional upsampling layers in total, in order to obtain a
volumetric output of size 112³ from the 256-dimensional latent vector. A simple decoder block
consists of a convolution, an activation function, batch normalization and upsampling, in that
order, denoted as decoder block in Table 4.2. However, we find it insufficient to reconstruct
high-quality density fields using decoder blocks (i.e., conv3d+batchnorm3d+lrelu) only, and the
architecture can be improved further. One intuitive way is to incorporate a more advanced
network architecture (i.e., ResNet, DenseNet) into our network, similar to the sketch encoder
part. Inspired by [KAT+ 19], which successfully applies residual connections in the decoder to
predict velocity fields from reduced parameters, we experiment with different decoder
architectures using residual blocks and dense blocks and compare them in Chapter 6.

ResNet

Residual networks (ResNets) were first proposed in [HZRS16], formulating layers to learn
residual functions with so-called skip connections. ResNets allow much deeper network designs
and have become one of the main backbones for many computer vision tasks such as image
classification, object detection and semantic segmentation. Our residual block architecture
follows [KAT+ 19] rather than the original paper, as we reconstruct 3D fields. As shown in
Figure 4.2, suppose our input feature map has n_D channels; the residual layer keeps the number
of channels the same in the output, with a skip connection realized as element-wise addition
after several standard convolutional layers. With a subsequent upsampling layer (e.g., nearest
interpolation), we obtain output feature maps with the same number of channels but 2× larger
resolution. In this way, the internal convolutional layers only need to learn a residual mapping
from input to output, making the network easier to train.
As residual blocks do not change the number of channels, we introduce a 3D decoder block,
analogous to the 2D one (i.e., conv3d+batchnorm3d+lrelu), to change the number of channels
with a 3D convolutional layer. As shown in Figure 4.2, the decoder block (DecBlock) outputs
feature maps with n_D channels. More details are listed in Table 4.2.
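A minimal PyTorch sketch of the decoder block and residual block of Figure 4.2 is given below; the channel counts, kernel sizes and the placement of normalization layers are illustrative placeholders, and the exact implementation may differ.

```python
import torch.nn as nn

class DecBlock(nn.Module):
    """Decoder block: 3D conv + batch norm + Leaky ReLU, used to change the channel count."""
    def __init__(self, in_ch, out_ch, k=3, s=1, p=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, k, s, p),
            nn.BatchNorm3d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ResBlock(nn.Module):
    """Residual block: two conv layers, an element-wise skip connection and a
    nearest-neighbour upsampling at the end (Figure 4.2). A minimal sketch."""
    def __init__(self, ch, k=3, p=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(ch, ch, k, 1, p),
            nn.BatchNorm3d(ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch, ch, k, 1, p),
            nn.BatchNorm3d(ch),
        )
        self.up = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, x):
        # Only a residual mapping has to be learnt; channels stay fixed, resolution doubles.
        return self.up(x + self.conv(x))
```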

DenseNet

DenseNet was first introduced in [HLVDMW17] and improves on ResNet through dense
connections, using so-called dense blocks. In addition, transition blocks are inserted between
dense blocks to downsample/upsample the feature maps and adjust the number of channels.
DenseNet is known to be efficient in terms of parameter count, runtime and accuracy, as it reuses
feature maps from previous layers and thereby greatly reduces the number of parameters.
Specifically, DenseNet introduces more hyper-parameters to build the dense blocks, as shown in
Figure 4.3. In each dense block, the user can define the number of convolutional layers.

Figure 4.2.: Architecture of the residual block (ResBlock) and decoder block (DecBlock) used in our 3D
voxel decoder. We use two convolutional layers (considering memory) before the element-wise
skip addition, followed by nearest-neighbour interpolation for upsampling.

Further, the growth rate k controls the number of output feature map channels in each
convolutional layer, which can be small (e.g., 12). For the transition block, we also need to
specify the number of output feature maps, which is mainly used to change the number of feature
channels; in our case it is half the number of input feature maps. For the encoder part, the
transition block contains an additional downsampling layer, but for the decoder part we replace
this downsampling layer with an upsampling layer to increase the resolution of the feature maps.
Overall, we find that for the sketch encoder part, DenseNet-121 gives slightly better results in
terms of reconstruction error with even fewer parameters than ResNet-18 [HZRS16]. However, a
DenseNet-like architecture used in the decoder noticeably degrades the reconstruction error of
the 3D fields. We also find that it introduces artifacts when reconstructing velocity fields. One
explanation is that the DenseNet architecture might be better suited for classification-like tasks,
e.g., image classification and semantic segmentation [JDV+ 17][LCQ+ 18]. How to use DenseNet
for regression tasks might still be an unexplored area.
When designing the voxel decoder, we also need to respect memory limits during training; we
cannot build an arbitrarily large network. Instead, in this thesis we experiment with different
network architectures, such as the number of layers in each block, to maximize GPU memory
utilization while supporting a batch size of 1-2 during training.
Finally, both the sketch encoder and density decoder networks are summarized in Tables 4.1 and
4.2. A visual representation of our network is shown in Figure 4.4, where the encoder block
(EncBlock) is the same as the DecBlock but with downsampling at the end. The final output of
the voxel decoder can be a density or velocity field with 1 or 3 channels, respectively.

Architecture for Higher Resolution

Table 4.2 shows the network architecture assuming an output resolution of 112³. The overall
architecture of both the sketch encoder and the density decoder is shown in Figure 4.4.

Figure 4.3.: Architecture of the DenseNet components used in our sketch encoder and 3D voxel decoder,
where k represents the growth rate in the dense blocks and the order of operations in a dense
block is batch normalization, activation function, convolution. We find that DenseNet-121
[HLVDMW17] gives slightly better sketch feature representations with even fewer parameters
than ResNet-18 [HZRS16]. Image courtesy of the original DenseNet paper [HLVDMW17].

Table 4.1.: Our 2D encoder after extracting features from the pre-trained networks, where Layer is the
layer name, Ops the operations used in that layer, K the kernel size, S the stride and P the
padding, consistent with [PGC+ 17]. We further show the number of feature maps and the
corresponding output size in the Output Size column, and the input of each layer in the
Input column.

Layer    | Ops                        | K | S | P | Input                     | Output Size
-------- | -------------------------- | - | - | - | ------------------------- | -------------
features | -                          | - | - | - | sketch                    | 256 × 14 × 14
enc_0    | conv2d+batchnorm2d+lrelu   | 3 | 2 | 1 | features                  | 256 × 7 × 7
enc_1    | conv2d+batchnorm2d+lrelu   | 3 | 2 | 1 | enc_0                     | 256 × 4 × 4
concat_1 | enc_1(front) + enc_1(side) | - | - | - | enc_1(front), enc_1(side) | 8192
fc_1     | fc256                      | - | - | - | concat_1                  | 256
fc_2     | fc256                      | - | - | - | concat_1                  | 256
z        | reparameterization         | - | - | - | fc_1, fc_2                | 256


Table 4.2.: Our 3D voxel decoder with residual blocks incorporated into the baseline network architecture,
where Layer is the layer name, Ops the operations used in that layer, K the kernel size, S the
stride and P the padding, consistent with our deep-learning framework [PGC+ 17]. We further
show the number of feature maps and the corresponding output size in the Output Size column,
and the input of each layer in the Input column.

Layer | Ops                | K | S | P | Input              | Output Size
----- | ------------------ | - | - | - | ------------------ | -------------------
input | -                  | - | - | - | sketch latent code | 256 × 1 × 1 × 1
dec_0 | Nearest Upsampling | - | - | - | input              | 256 × 2 × 2 × 2
dec_1 | DecoderBlock       | 3 | 1 | 1 | dec_0              | 128 × 4 × 4 × 4
dec_2 | ResidualBlock      | 3 | 1 | 1 | dec_1              | 128 × 8 × 8 × 8
dec_3 | DecoderBlock       | 4 | 1 | 1 | dec_2              | 64 × 14 × 14 × 14
dec_4 | ResidualBlock      | 3 | 1 | 1 | dec_3              | 64 × 28 × 28 × 28
dec_5 | ResidualBlock      | 3 | 1 | 1 | dec_4              | 64 × 56 × 56 × 56
dec_6 | DecoderBlock       | 3 | 1 | 1 | dec_5              | 8 × 112 × 112 × 112
dec_7 | DecoderBlock       | 3 | 1 | 1 | dec_6              | 1 × 112 × 112 × 112

Figure 4.4.: Architecture of sketch encoder and voxel decoder.


To train with higher-resolution data, we simply add one more residual block after the third one in
Table 4.2, i.e., between layers dec_5 and dec_6, while everything else stays fixed. For
higher-resolution data such as 224³, our network can still be trained with a batch size of 2,
showing that our model can be extended to reconstruct higher-resolution data. For even higher
resolutions such as 448³, we can only fit a single sample during training on a single Nvidia
GeForce GTX 1080 Ti GPU (11 GB RAM).

4.2. Improving the Baseline

Though we already have a working version of the network architecture, there are additional ways
to further improve the reconstruction quality. For direct density reconstruction, an intuitive idea
is to add a discriminator after the generator. In this way, we hope to reconstruct single-frame
density fields with sharper details.

4.2.1. Generative Adversarial Nets

Generative Adversarial Nets (GANs) have become a popular method, as shown in the related
work, and are known for capturing sharp details and more perceptual properties of generated
outputs (i.e., images) in computer vision and graphics. GAN-based methods add a second
network, a so-called discriminator, after the generator, which learns to tell how naturally and
closely the generated data match the ground truth. As we find in our experiments that training
with an L1 loss alone produces blurry results, we employ an adversarial loss on the smoke
density outputs to capture the inhomogeneity of the ground truth smoke density data.
Training with an adversarial loss tends to preserve more high-frequency details in the predicted
volumes. One successful case is described in [WZX+ 16], where a 3D-GAN reconstructs 3D
shapes from a noise vector and a 3D-VAE-GAN reconstructs 3D shapes from a single 2D image.
Also, [XFCT18] show that GANs work for volumetric reconstruction with continuous density
values in each voxel, such as fluid flow super-resolution. However, these tasks are not entirely
the same as ours.
Given sketches as input, we lack the density information available in super-resolution, but still
need to reconstruct a continuous fluid flow, which makes our task more difficult. We find that the
training strategy from [WZX+ 16] cannot be used directly in our case, as the discriminator tends
to overpower the generator too easily. Hence, we first train our generator until it produces
plausible results and then add the discriminator for additional training to recover high-frequency,
sharp details. We currently only test an unconditional GAN setting for single-frame density
reconstruction with a binary classifier. More specifically, we use a discriminator with six
convolutional layers followed by a sigmoid function to encode 112³ voxels into a single output
value. Results in Chapter 6 show that our model trained with an additional discriminator yields
sharper outputs. Further experiments with more advanced GAN architectures, such as [KALL17],
are left for future work.
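A possible form of such a discriminator is sketched below; the channel widths and kernel sizes are assumptions, and only the overall structure (six strided 3D convolutions followed by a sigmoid mapping a 112³ volume to a single score) follows the description above.

```python
import torch.nn as nn

class VoxelDiscriminator(nn.Module):
    """Unconditional 3D discriminator: six strided 3D convolutions followed by a sigmoid,
    mapping a 112^3 density volume to a single real/fake score. A minimal sketch."""
    def __init__(self, base=16):
        super().__init__()
        chs = [1, base, base * 2, base * 4, base * 8, base * 8, 1]
        layers = []
        for i in range(6):
            layers.append(nn.Conv3d(chs[i], chs[i + 1], kernel_size=4, stride=2, padding=1))
            if i < 5:
                layers.append(nn.LeakyReLU(0.2, inplace=True))
        # Spatial size: 112 -> 56 -> 28 -> 14 -> 7 -> 3 -> 1 after six stride-2 convolutions.
        self.net = nn.Sequential(*layers, nn.Sigmoid())

    def forward(self, d):
        # d: (N, 1, 112, 112, 112) density volume; returns an (N,) probability of being real.
        return self.net(d).flatten(1).squeeze(1)
```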


4.2.2. Differentiable Renderer

As shown in Figure 3.1, our pipeline starts by rendering 2D sketches from densities. One
improvement is to design a renderer that renders sketches directly from density fields in a
differentiable way. This allows us to train our networks in an end-to-end fashion, where the
reconstructed 3D density fields become more closely related to the corresponding input sketches.
It also provides data augmentation on the fly without any pre-processing step. In this section, we
explore how to render sketches in a differentiable way that reveals the contours of the 3D smoke
accurately and close to the suggestive contour reference. This rendering pipeline can be
formulated as S = R(d), where R is a differentiable sketch renderer. Given the reconstructed
density field d̂ and the ground truth d, we can then define a loss function that minimizes the
difference between R(d̂) and R(d).

Our initial attempt was to apply a commonly used method called suggestive contours, which is
designed to render lines extracted from meshes. With this, we can pre-process all density fields
into meshes and apply the algorithm directly to generate sketches from multiple viewpoints. One
example is shown in Figure 2.3(b), where accurate contours and suggestive contours are extracted
from a mesh. We can apply the same method to meshed 3D smoke, as shown in Figure 6.3. Here
we try to generate sketches similar to this algorithm, e.g., representative contours generated
directly from a 3D density field, and compare our results with the suggestive contour reference.
The definition of a contour is described in [DFRS03]:

n(p) · v(p) = 0 (4.1)

where p is a point on the surface of a mesh and n(p), v(p) are the unit surface normal and the view
vector. We therefore need these two quantities to render the contours of the 3D smoke. Though
Equation 4.1 is defined on a surface or mesh rather than a volumetric representation, we can still
approximate it. In our case, the 3D smoke is a volumetric representation where continuous
density values are stored in voxels. Based on this, we can easily obtain the view vector in an
orthographic camera setting, which is an axis-aligned direction along one of the x, y, z axes.
As normals are defined on a surface, we would normally first reconstruct a surface from the
smoke volume by selecting an iso-value and extracting the corresponding iso-surface. However,
for a smoke volume it is usually hard to define a proper iso-value, as there are many level sets
from thin to thick smoke. We therefore propose to average the normals of all level sets along the
view direction, based on a weighting function parameterized by the density values. Hence, the
average normal projected onto the view plane is represented as:

\hat{N}_{ij} = \frac{\int_0^{+\infty} q(d, s)\, w(d, s)\, ds}{\int_0^{+\infty} w(d, s)\, ds}    (4.2)

where N̂_ij is the estimated normal at pixel (i, j), s is the distance from the origin, d is the
corresponding density field, q(d, s) is the quantity we want to integrate and w(d, s) the weighting


function. The normalization term in the denominator ensures that we obtain a normalized
normal vector.
More specifically, to approximate the normals in the view direction for each level set, we
calculate the 3D gradient of the density values, so that q(d, s) = \nabla d(s). The weighting
function also depends on the densities and the distance from the viewpoint and is defined as
w(d, s) = w(\tau(s)), where \tau(s) = \int_0^{s} d(t)\, dt is the cumulative density sum starting
from the viewpoint.
We can further use an exponential weighting function so that the denominator equals 1
(i.e., \int_0^{+\infty} x e^{-x}\, dx = 1), as follows:

w(\tau(s)) = -c^2 \cdot \tau(s) \cdot e^{-c \cdot \tau(s)}    (4.3)

where c is a hyper-parameter representing the thickness of the densities. The intuition of this
weighting function is that we only care about gradients within a specific range and filter out
far-away gradients.
To evaluate the integral in Equation 4.2, we discretize it, in particular the distance s, based on our
volumetric smoke representation:

\hat{N}_{ij} = \sum_{i=0}^{n-1} \frac{q(i) + q(i+1)}{2} \int_i^{i+1} w(\tau(s))\, ds
             = \sum_{i=0}^{n-1} \frac{q(i) + q(i+1)}{2} \left[ (c \cdot x + 1) \cdot e^{-c \cdot x} \right] \Big|_i^{i+1}    (4.4)

where i runs from the first to the last voxel index of the smoke volume and n = 112 in our case.
The function (c · x + 1) · e^{−c·x} is plotted in Figure 4.7. In this way, we obtain the weighted
average gradients of the voxels intersecting the orthogonal viewing ray at each pixel (i, j).
Figure 4.5 shows the generated normals in the right-most sub-figure. Based on the normals and
the view vector, we compute their dot product and look for zero values. We use a simple
threshold to filter out non-contour pixels above a chosen value, since by definition the lower the
dot product, the closer the pixel is to a contour. We can also render shading by taking the dot
product between the normals and the light direction, which is quite similar to the dot product
between normals and view direction. The left-most column in Figure 4.5 shows the resulting
shading from two views.
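The following PyTorch sketch illustrates the core of this procedure, i.e., weighted-average normals along an axis-aligned orthographic view followed by contour thresholding. The function name, the default thickness c, the edge threshold and the axis conventions are illustrative assumptions, and the discretization is simplified compared to Equation 4.4.

```python
import torch
import torch.nn.functional as F

def render_contours(d, c=20.0, edge_thresh=0.1, view_axis=0):
    """Minimal sketch of the differentiable contour rendering.

    d: (1, 1, D, H, W) density in [0, 1]; view_axis selects the axis-aligned view direction.
    """
    # 3D density gradient as the quantity to average: q = grad d (central differences).
    grad = torch.stack(torch.gradient(d[0, 0]), dim=0)        # (3, D, H, W)
    # Cumulative density tau(s) along the viewing axis.
    tau = torch.cumsum(d[0, 0], dim=view_axis)
    # Magnitude of the weighting of Eq. (4.3); its sign cancels in the normalized average.
    w = c * c * tau * torch.exp(-c * tau)
    # Weighted average of gradients over the ray through each pixel (Eq. (4.2)).
    num = (grad * w.unsqueeze(0)).sum(dim=view_axis + 1)
    den = w.sum(dim=view_axis).clamp(min=1e-6)
    normals = F.normalize(num / den.unsqueeze(0), dim=0)
    # Contours: pixels whose averaged normal is nearly orthogonal to the view direction.
    view = torch.zeros(3, 1, 1, dtype=d.dtype, device=d.device)
    view[view_axis] = 1.0
    ndotv = (normals * view).sum(dim=0).abs()
    # A soft threshold (e.g., a sigmoid) would keep this step differentiable during training.
    edges = (ndotv < edge_thresh).float()
    return edges, normals
```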

4.3. Loss Functions


Loss in 3D Space

Here we define the objective that our network learns. We adopt an L1 loss, instead of an MSE
loss, between the density prediction d̂ and the ground truth density d, defined as the density
loss L_d:

L_d(d) = ||d̂ − d||_1.    (4.5)


(a) From left to right: blended edge and shading, edge only, toon shading only and approximated normals derived
from 3D density using our differentiable renderer.

(b) From left to right: shadings, sketches rendered by our differentiable renderer and the Mitsuba
rendering respectively, from front and side views of the smoke inflow and buoyancy dataset.

Figure 4.5.: Examples of generated sketches and intermediate results (e.g., shading, normal, physics-
based rendered images) from two datasets.

Figure 4.6.: Our renderer can also simulate the progression from simple contours to more detailed and
shaded sketches by simply adjusting threshold values, without any pre-processing step.


Figure 4.7.: Different thickness values c for the function (c · x + 1) · e^{−c·x}.

where d̂ = F_z2d(F_s2z(S)) represents the density field reconstructed from the multi-view
sketches. To improve the quality of the reconstructed density, we further incorporate the gradient
loss used in [KAT+ 19]:

L_dgrad(d) = ||∇d̂ − ∇d||_1.    (4.6)

[KAT+ 19] shows that a velocity gradient loss clearly helps to reduce artifacts in the generated
flow fields; analogously to velocity estimation, we apply this loss by default for density
reconstruction.
These two losses were used in [KAT+ 19] to reconstruct 3D fields. In this thesis, we additionally
try one more loss function in 3D space, the SSIM loss [WSB03][ZGFK16], which is widely used
in image restoration and shows better results than an L1 loss alone. Models trained with this loss
tend to reconstruct images with higher perceptual quality, as the loss considers neighbouring
pixels.
In the 2D case, the SSIM loss can be written as:
L_{SSIM2D} = \frac{1}{k} \sum_{i=1}^{k} SSIM(x_i, y_i)    (4.7)

where k is the number of sliding windows, whose size is set to 11 × 11 for each image, and

SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}    (4.8)


where µ and σ denote the mean and standard deviation of the pixel values in that window and
σ_xy their covariance. We use the default values c_1 = 0.0001 and c_2 = 0.0009. This
formulation extends naturally to the 3D case and to the density reconstruction task, where density
values lie in [0, 1], just like normalized gray-scale images. We directly use the pytorch_ssim
[WSB03] library to implement this loss function and incorporate L_{SSIM3Dd}(d) as part of the
3D density reconstruction loss.
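For reference, a minimal 3D SSIM sketch following Equations 4.7 and 4.8 is shown below; it uses a uniform box window for simplicity, whereas pytorch_ssim uses a Gaussian window, so the values will differ slightly.

```python
import torch
import torch.nn.functional as F

def ssim3d(x, y, window_size=11, c1=1e-4, c2=9e-4):
    """Mean 3D SSIM between two volumes x, y of shape (N, 1, D, H, W) with values in [0, 1]."""
    pad = window_size // 2
    # Uniform averaging window (pytorch_ssim uses a Gaussian window instead).
    w = torch.ones(1, 1, window_size, window_size, window_size,
                   dtype=x.dtype, device=x.device) / window_size ** 3
    mu_x = F.conv3d(x, w, padding=pad)
    mu_y = F.conv3d(y, w, padding=pad)
    # Local variances and covariance, as in Eq. (4.8).
    sigma_x = F.conv3d(x * x, w, padding=pad) - mu_x ** 2
    sigma_y = F.conv3d(y * y, w, padding=pad) - mu_y ** 2
    sigma_xy = F.conv3d(x * y, w, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return ssim_map.mean()
```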

Loss in Latent Space

In addition, we incorporate a KL loss to encourage smoothness of the learned latent space,
written as the KL-divergence between the latent distribution and the standard normal distribution.
Given multi-view sketches S as input for single-frame density reconstruction, we define L_KL as:

L_KL(d) = D_KL(z || N(0, 1))    (4.9)

where z is the latent code encoded via F_s2z(S) for input S and D_KL(z || N(0, 1)) is the KL
divergence between z and a standard normal distribution.

Loss for 2D-3D consistency

As we can now generate sketches in a differentiable way, it is natural to incorporate another loss
function, the sketch loss, which ensures that both the shading and the gradient/edge information
are consistent with the input sketches. This is crucial for artists and users to easily see what they
get after sketching, and it can be seen as a loss function that enforces 2D-3D consistency, which
is one of our goals.
Since our sketches are generated by combining normals, shading and edges, the information is
already mixed through non-linear operations, as described in the differentiable renderer section.
It is not clear whether it is best to compute the sketch loss on the final blended sketches, as shown
in the left-most image of Figure 4.5, or to decouple it into individual parts, such as edges and
shading. Therefore, we propose two types of sketch loss:
• Compute the loss on the blended edges and shading, denoted L_blend.
• Compute the losses on edges and shading separately, denoted L_edge and L_shade. This
allows us to use different weights for the two losses.
Considering the blended edges and shading, we define L_blend as:

L_{blend}(d) = \frac{1}{P} \sum_{p=1}^{P} || R_{blend}(\hat{d}) - R_{blend}(d) ||_1.    (4.10)

where R_blend is defined by our differentiable renderer to produce blended edges and shading
from a density field. P is the number of viewpoints used for sketching, and the loss averages over
all input viewpoints. In this work we always use a two-view setting, where P = 2, and we only
experiment with L_blend as it already contains both edge and shading information. Thus, the
inputs to L_blend are the ground truth sketches generated from d and the reconstructed sketches
generated from d̂ using R_blend.
Moreover, as we use an SSIM loss for 3D reconstruction, we can also incorporate an SSIM or
even MS-SSIM loss [WSB03] into the sketch loss. The MS-SSIM loss can be seen as a linear
combination of SSIM losses at different scales. The final sketch loss is then formulated as:

L_{sketch}(d) = \lambda_{blend} L_{blend}(d) + \lambda_{MSSSIM2D} L_{MSSSIM2D}(d)    (4.11)

Adversarial Loss

Let the discriminator be denoted as D. Following [XFCT18], we minimize the loss function for D:

L_D(G, D) = E[-\log(D(d))] + E[-\log(1 - D(\hat{d}))]    (4.12)

as well as the adversarial loss for the generator G:

L_{adv}(G, D) = L_G(G, D) = E[-\log(D(\hat{d}))]    (4.13)

where G combines the sketch encoder and density decoder, [F_s2z, F_z2d], and
d̂ = F_z2d(F_s2z(R_blend(d))).

Full Objective

We can define the full objective for reconstructing single- and multi-frame density fields using
our baseline model:

L_{den} = L_d + L_{dgrad} + L_{SSIM3Dd} + \lambda_{KL} L_{KL} + \lambda_{sketch} L_{sketch} + \lambda_{adv} L_{adv}    (4.14)
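Putting the terms together, a hedged sketch of the combined density objective (without the adversarial term) could look as follows; the weighting factors are placeholders, ssim3d refers to the 3D SSIM sketch given earlier in this section, and the rendered sketches s_hat = R_blend(d̂), s = R_blend(d) are assumed to come from the differentiable renderer.

```python
import torch

def density_loss(d_hat, d, s_hat, s, mu, logvar, lambda_kl=1e-2, lambda_sketch=1.0):
    """Combined single-frame objective in the spirit of Eq. (4.14), adversarial term omitted.

    d_hat, d: (N, 1, D, H, W) densities; s_hat, s: rendered and input sketches;
    mu, logvar: VAE parameters from the sketch encoder.
    """
    l1 = (d_hat - d).abs().mean()
    # Gradient loss compares spatial finite differences of predicted and true densities.
    grad_hat = torch.stack(torch.gradient(d_hat, dim=(2, 3, 4)), dim=0)
    grad = torch.stack(torch.gradient(d, dim=(2, 3, 4)), dim=0)
    l_grad = (grad_hat - grad).abs().mean()
    l_ssim = 1.0 - ssim3d(d_hat, d)        # one common choice: minimize 1 - SSIM
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    l_sketch = (s_hat - s).abs().mean()    # 2D-3D consistency via the renderer
    return l1 + l_grad + l_ssim + lambda_kl * l_kl + lambda_sketch * l_sketch
```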

4.4. Self-supervised Fine-tuning


[WWX+ 17] propose MarrNet, which introduces intermediate 2.5D sketch representations for
image-based 3D shape reconstruction. The 2.5D sketches contain depth and normal maps
rendered from the corresponding 3D shape, and one of the contributions is a loss on the
consistency between depth and normals reprojected from the reconstructed 3D shape and the
ground truth. This reprojection consistency loss allows the model to be fine-tuned on new input
images outside the training data.
Inspired by their work, we can likewise exploit the sketch loss L_sketch, since we can render
sketches using the proposed differentiable renderer. This is useful not only during training, but
also for fine-tuning on unseen input sketches without any


3D supervision. For each pair of input two-view sketches, we use our trained networks F_s2z and
F_z2d to reconstruct the 3D density d̂ and the differentiable renderer R_blend to render sketches
Ŝ from d̂. We then minimize the difference between the input sketches S and the reconstruction
Ŝ, e.g., ||S − Ŝ||_1.
At test time, instead of fixing the decoder and fine-tuning the encoder, we fix the encoder and
fine-tune the density decoder, as the decoder has many more degrees of freedom than the sketch
encoder. This strategy yields larger improvements than the one in [WWX+ 17] in our experiments.
Considering both performance and improvement, we fine-tune the density decoder for 20
iterations, which takes ∼ 7 seconds per frame on a single Nvidia GeForce GTX 1080 Ti GPU
(11 GB RAM).
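A minimal sketch of this test-time fine-tuning loop is given below; the optimizer, learning rate and the assumption that the encoder returns the latent code directly are illustrative choices rather than the exact procedure.

```python
import torch

def finetune_decoder(encoder, decoder, renderer, sketches, iters=20, lr=1e-4):
    """Self-supervised fine-tuning: freeze the sketch encoder, update only the density decoder
    so that re-rendered sketches match the user input. `renderer` stands for R_blend."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    with torch.no_grad():
        z = encoder(sketches)              # latent code stays fixed during fine-tuning
    for _ in range(iters):
        opt.zero_grad()
        d_hat = decoder(z)
        s_hat = renderer(d_hat)            # differentiable re-rendering of the sketches
        loss = (s_hat - sketches).abs().mean()   # ||S - S_hat||_1, no 3D supervision
        loss.backward()
        opt.step()
    return decoder
```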

4.5. Summary
In this section, we introduced the baseline model, including the sketch encoder, the density voxel
decoder, the interpolation network, and the full objective function for density reconstruction used
in this thesis. These are important building blocks, and the network architecture is widely re-used
in later experiments. The baseline also provides a benchmark against which further approaches
can be compared. In the next section, we introduce the details of our approach, where we re-use
similar architectures and loss functions to reconstruct velocity fields for better, temporally-coherent
interpolation.

5. Multi-Frame Density Interpolation

5.1. Baseline Model

5.1.1. Latent Space Interpolation Network

Based on Chapter 4, we can animate 3D densities between key-framed sketches using linear
interpolation, since we learn a smooth latent space under the VAE framework. However, this
method gives no guarantee of reconstructing a temporally-coherent simulation over time. Here,
we try to correct the discrepancy between linear interpolation and the actual manifold of
single-frame density reconstruction from sketches. One intuitive approach is to add a latent space
interpolation network that learns to infer the intermediate latent codes between two key-frames.
The latent space interpolation network is a Multi-Layer Perceptron (MLP),
F_z2ẑ : R^{2×z_dim} → R^{(T−1)×z_dim}, where z = F_s2z(S) and ẑ is the output of the MLP,
i.e., the input of the decoder. We unify the notation as follows:
• The density decoder network F_z2d can be rewritten as F_ẑ2d. For the single-frame
reconstruction setting, ẑ is identical to z, i.e., ẑ = z.
• Linear interpolation shares the same notation, with an additional superscript α representing
linear interpolation between two latent codes: F^α_z2ẑ(z_t, z_{t+T}) = (α/T) × z_t +
(1 − α/T) × z_{t+T}.
• Interpolation using the MLP is written as F_z2ẑ, where ẑ is the MLP output. The output
latent code at one position is also indexed by α in a similar way: F_z2ẑ(z_t, z_{t+T})_[α],
where α ∈ [t, t + T]. The left- and right-most latent codes are identical to the inputs,
ẑ_t = z_t and ẑ_{t+T} = z_{t+T}, while the intermediate latent codes are output by the MLP.


More specifically, given two z_dim-dimensional latent codes z_t, z_{t+T} as input, the MLP
outputs a long vector of dimension (T − 1) × z_dim. Hence, we obtain T − 1 latent codes with
the same dimension as the input. There are three interpretations of the generated latent codes, as
shown in Figure 5.1:
1. Directly output the intermediate latent codes. However, we find this introduces severe
artifacts in the reconstructed fields, especially the velocity fields introduced in a later section.
2. Learn a residual mapping from the linearly interpolated latent codes to the actual manifold.
3. Learn a residual mapping from the previously estimated latent code ẑ_{t−1} to the next
one ẑ_t.
Both the second and third paths (i.e., orange and blue) are residual functions from a pre-defined
position, which makes the MLP easier to train, but the two paths learn different residual functions.
For example, the third one describes how to move the first latent code to the green one, while the
second one describes how to move the linearly-interpolated vector to the green one. It would be
interesting to compare these two options; in this work, we apply the orange path in the later
sections. One intuitive advantage is that we can enforce left-right consistency for the learnt
residual functions of the intermediate latent codes.
The architecture of the MLP is similar to the one in [KAT+ 19]. We apply three linear blocks
with dimensions 512, 1024 and (T − 1) × z_dim, respectively, with a Leaky ReLU of slope 0.2
placed between linear blocks to introduce non-linearity.
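A minimal PyTorch sketch of this interpolation MLP, using the residual interpretation over linearly interpolated codes (path 2), is shown below; everything beyond the stated layer sizes, including the direction convention of the interpolation weights, is an assumption.

```python
import torch
import torch.nn as nn

class LatentInterpolator(nn.Module):
    """MLP mapping two key-framed latent codes to the T-1 intermediate latent codes."""
    def __init__(self, z_dim=256, T=10):
        super().__init__()
        self.T = T
        self.mlp = nn.Sequential(
            nn.Linear(2 * z_dim, 512), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 1024), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, (T - 1) * z_dim),
        )

    def forward(self, z_t, z_tT):
        B, z_dim = z_t.shape
        # Residual offsets from the linearly interpolated codes (orange path in Figure 5.1).
        residual = self.mlp(torch.cat([z_t, z_tT], dim=1)).view(B, self.T - 1, z_dim)
        alpha = torch.linspace(0, 1, self.T + 1, device=z_t.device)[1:-1].view(1, -1, 1)
        lerp = (1 - alpha) * z_t.unsqueeze(1) + alpha * z_tT.unsqueeze(1)
        return lerp + residual        # intermediate codes z_hat_{t+1} .. z_hat_{t+T-1}
```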

Figure 5.1.: Learnt manifold of single-frame density reconstruction and latent codes of S_t, S_{t+T}
generated by F_s2z. There are three paths for estimating intermediate latent codes from two
key-frames. Path 1 directly estimates the latent vector, while paths 2 and 3 learn the residual
function between the linearly-interpolated latent code, respectively the previous latent code,
and the learnt manifold.


5.2. Improving the Baseline


Inspired by [KAT+ 19][KAGS19], we try to take velocity fields into account for generating
smoke simulations with sharper details and higher temporal consistency as we find that our
baseline models can not reconstruct temporally-coherent density fields in time. As shown in
[KAT+ 19], reconstructing density field directly from a set of parameters leads to discontinuity
and less vivid. Our task shares some similarities, where we want to reconstruct smoke simula-
tions as well with different inputs though. Though our input sketch is more intuitive, we need
to parse more abstract sketch data into latent space and reconstruct the density or velocity fields
with additional networks.
Therefore, we propose our approach by not only considering the single-frame density recon-
struction, but also the temporal coherency of reconstructed between two sketched key-frames.
The baseline with variational autoencoder is able to generate a smoother latent space for linear
interpolation, but it does not have explicit guarantee on the temporal consistency on recon-
structed fields by naive interpolation. Here the goal becomes estimating the velocity fields
between two input key-frames and running advections based on initial reconstructed density
field and intermediate velocity fields, similar to [KAT+ 19]. Further, we hope that our recon-
structed density fields can be tightly related to what we sketch, which is crucial for users or
artists to edit their sketches provided consistent output. This needs us to find a way to model the
consistency between 2D sketches and 3D density fields. Based on these, we need to find ways
to achieve two main goals mentioned at Chapter 1.
1. Reconstruct a sequence of density fields based on two input sketches with much less
temporal jittering via re-simulation based on initial reconstructed density field and inter-
polated velocity fields.
2. The reconstructed density fields on the input key-framed positions should more or less
match the input sketches via capturing long-term consistency and consistency in terms
of rendering from the input viewpoints, which can be addressed by incorporating loss
functions on long-term transport/advection and 2D-3D consistency.

5.2.1. Velocity Decoder Network

Before going into the detailed components, we first give an overall picture of our approach. As
shown in Figure 4.1, we use a similar network for single-frame density reconstruction based on
two-view sketches. Here we include the whole pipeline in the figure, including the sketch
generation part using a renderer: S = R(d). As mentioned above, if we want to model the
consistency between the 3D density and the 2D sketches, the rendering operator R needs to be
differentiable. Briefly, we render sketches directly from the 3D density fields to generate the
input data in a differentiable way; this also provides data augmentation on the fly without
pre-processing the 3D density into sketches, and is discussed in detail in Section 4.2.2.
Our approach using velocity fields is shown in Figure 5.2. Besides the networks F_s2z and
F_ẑ2d, we introduce two additional networks: a mapping network F_z2ẑ, which learns a mapping
from key-framed latent codes to intermediate ones, and a velocity decoder F_ẑ2v, which estimates
the corresponding velocity fields

based on every two consecutive latent codes. With this method, we can re-simulate the
reconstructed density and velocity via the recursive advection operator A at execution time. The
key idea of using velocity fields for re-simulation is to reduce the temporal flickering caused by
naively interpolating latent codes linearly in the baseline model. Notice that we apply a curl
operator to the output of our velocity decoder, which forces the velocity decoder to implicitly
learn a stream function Ψ̂. This ensures that our reconstructed velocity field, v̂ = ∇ × Ψ̂, is fully
divergence-free. We nevertheless use v in the notation of our velocity decoder F_ẑ2v.
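A minimal finite-difference sketch of this curl operation is shown below; the axis conventions and the use of central differences are assumptions, and exact divergence-freeness holds up to the one-sided differences used at the boundaries.

```python
import torch

def curl(psi):
    """Computes v = curl(psi) from a predicted stream function psi of shape (N, 3, D, H, W).

    Channel ordering (psi_x, psi_y, psi_z) and grid axes (z=D, y=H, x=W) are assumptions.
    """
    dpz_dy = torch.gradient(psi[:, 2], dim=2)[0]
    dpy_dz = torch.gradient(psi[:, 1], dim=1)[0]
    dpx_dz = torch.gradient(psi[:, 0], dim=1)[0]
    dpz_dx = torch.gradient(psi[:, 2], dim=3)[0]
    dpy_dx = torch.gradient(psi[:, 1], dim=3)[0]
    dpx_dy = torch.gradient(psi[:, 0], dim=2)[0]
    # v = nabla x psi, component-wise.
    vx = dpz_dy - dpy_dz
    vy = dpx_dz - dpz_dx
    vz = dpy_dx - dpx_dy
    return torch.stack([vx, vy, vz], dim=1)
```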
More specifically, F_z2ẑ : R^{2×z_dim} → R^{(T−1)×z_dim} is realized as a multi-layer
perceptron (MLP) of fully-connected layers (FCs). Its definition and architecture are the same as
in the baseline model: three FCs (512, 1024, (T − 1) × z_dim) with a Leaky ReLU after each
linear block. The architecture of F_ẑ2v : R^{2×z_dim} → R^{D×H×W×3} is the same as F_z2d,
except that the input is a 2 × z_dim-dimensional vector and the output is a 3-channel volume
instead of 1. As before, d̂ = F_z2d(F_s2z(S)), and similarly v̂ = F_ẑ2v(ẑ_α, ẑ_{α+1}), where
ẑ_α is the output of F_z2ẑ at position α with α ∈ [t, t + T − 1], which differs slightly from the
baseline model. Again, at the key-framed positions we have ẑ_t = z_t and ẑ_{t+T} = z_{t+T}.
The intuition is that we obtain intermediate latent codes and velocities from two sparse input
key-frames and then re-simulate the density based on the reconstructed velocities to ensure
temporal coherency. However, as seen in Figure 5.2, each velocity is estimated only from two
consecutive latent codes ẑ_α and ẑ_{α+1}, which gives no guarantee of long-term temporal
consistency: errors accumulate if the gap between the estimated interpolated velocity fields and
the ground truth velocity fields does not shrink. Thus, we propose an additional loss function
based on advection, as used in [KAGS19], to make sure that the advected density stays consistent
with the one generated directly by single-frame reconstruction. With such an advection loss, we
can also easily plug in the density reconstruction losses on the last advected frame (i.e., d̂_{t+T})
to further correct the estimated velocity fields.

5.3. Loss Functions


We train our sketch encoder and density decoder for density reconstruction using the loss
functions L_d, L_KL, L_dgrad and L_{SSIM3Dd}, as shown in Figure 4.1. Here we focus on
training the velocity decoder F_ẑ2v and the MLP F_z2ẑ with the following loss functions for
density interpolation:

Loss for Latent Space Interpolation

In addition, we add a loss function for training the interpolation MLP F_z2ẑ while the other
networks are fixed. As the input to F_z2ẑ consists only of the two key-framed latent codes z_t
and z_{t+T} at the left- and right-most time steps, we encourage the network to output meaningful
intermediate latent codes with the following loss:

L_z(d) = \frac{1}{T-1} \sum_{\alpha=t+1}^{t+T-1} || z_\alpha - F_{z2\hat{z}}(z_t, z_{t+T})_{[\alpha]} ||_2^2    (5.1)


Figure 5.2.: The architecture of our velocity field prediction model, demonstrating latent space
interpolation and velocity field estimation based on two key-frames and consecutive frames.
Specifically, multi-view sketches of two key-frames are shown in this graph, and the weights
of the sketch encoder [F_s2f, F_f2z] are again shared among the sketches of all frames.
F_ẑ2v represents the velocity decoder, which implicitly estimates the stream function with
a curl operator after the final output. We introduce an additional mapping network F_z2ẑ to
map the concatenated latent codes of the two key-frames (in green, consistent with Figure 4.1)
to all intermediate latent codes, and estimate the velocity fields between consecutive latent
codes (ẑ_α, ẑ_{α+1}) together with F_ẑ2v.


Note that the MLP does not need to output the latent codes of the left- and right-most key-frames, only the intermediate ones, since the key-frame codes are available directly from the sketch encoder. Here we use an MSE loss to minimize the difference between the vectors. We explored this loss for our velocity reconstruction, but it shows only marginal effects on the final reconstruction quality, as shown in Figure 6.19.

Loss in 3D Space

The difference of our approach compared with the baseline model lies in the velocity estimation part, where we incorporate an additional decoder for decoding the velocity from two consecutive latent codes. Similarly, we define the loss function on the velocity decoder, including an L1 loss on the reconstructed field:

Lv (v) = ||v̂ − v||1 . (5.2)

and a gradient loss for the velocity as well:

Lvgrad (v) = ||∇v̂ − ∇v||1 . (5.3)

Additionally, analogous to the 3D SSIM loss for density reconstruction, we can apply a similar loss, LSSIM 3Dv (v), for velocity reconstruction.
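A minimal sketch of the L1 and gradient terms is given below; the forward-difference gradient and the mean (rather than sum) reduction are assumptions about the discretization and normalization, and the SSIM term is omitted.

```python
import torch

def velocity_losses(v_hat, v):
    """L1 loss (Eq. 5.2) and gradient loss (Eq. 5.3) on velocity fields.

    v_hat, v: tensors of shape (B, 3, D, H, W). The spatial gradient is
    approximated with forward differences along each axis (an assumption).
    """
    l1 = (v_hat - v).abs().mean()          # mean-normalized L1 term

    def grads(f):
        return (f[:, :, 1:] - f[:, :, :-1],              # difference along z
                f[:, :, :, 1:] - f[:, :, :, :-1],        # difference along y
                f[:, :, :, :, 1:] - f[:, :, :, :, :-1])  # difference along x

    l_grad = sum((gh - g).abs().mean() for gh, g in zip(grads(v_hat), grads(v)))
    return l1, l_grad
```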

Loss for Temporal Consistency

As mentioned before, to capture long-term temporal consistency, we first introduce a recursive advection operation as follows. We define d̄t as the advected density field at time step t, so the advected density field at time t + T can be represented as:

d̄t+T = A(..., A(A(d̄t , v̂t ), v̂t+1 ), ..., v̂t+T −1 ) (5.4)

where

d̄t+1 = A(d̄t , v̂t ) (5.5)

v̂α = Fẑ2v (Fz2ẑ (zt , zt+T )[α,α+1] ) (5.6)

where d̄t = dˆt = Fz2d (zt ) and α ∈ [t, t + T − 1] indexes the latent codes. To balance accuracy and training time, we use a first-order advection scheme here, which [XFCT18] has already shown to be efficient.
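To make the advection operator A concrete, the following is a minimal sketch of a first-order semi-Lagrangian step in PyTorch based on grid_sample; the (B, 1, D, H, W) density layout, the coordinate normalization and the boundary handling are assumptions and do not exactly reproduce the mantaflow scheme used for data generation.

```python
import torch
import torch.nn.functional as F

def advect(d, v, dt=1.0):
    """First-order semi-Lagrangian advection: d_{t+1}(x) = d_t(x - dt * v(x)).

    d: density (B, 1, D, H, W); v: velocity (B, 3, D, H, W) in grid units,
    channels assumed ordered (x, y, z).
    """
    B, _, D, H, W = d.shape
    # base grid of cell-center coordinates, normalized to [-1, 1]
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, D, device=d.device),
        torch.linspace(-1, 1, H, device=d.device),
        torch.linspace(-1, 1, W, device=d.device),
        indexing="ij")
    base = torch.stack([xs, ys, zs], dim=-1).unsqueeze(0).expand(B, -1, -1, -1, -1)
    # scale velocity from grid units to normalized coordinates
    scale = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1), 2.0 / max(D - 1, 1)],
                         device=d.device)
    disp = torch.stack([v[:, 0], v[:, 1], v[:, 2]], dim=-1) * scale
    grid = base - dt * disp                      # backtrace the characteristics
    return F.grid_sample(d, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)
```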
Additionally, we can apply the same loss functions to the advected density fields as in the single-frame density reconstruction part (Section 4.3). The advection loss can then be extended to:

Ladvect (dt+T ) = Ld (dt+T ) + Ldgrad (dt+T ) + LSSIM 3Dd (dt+T ) + λsketch Lsketch (dt+T ) (5.7)


Full Objective

Overall, our objective function can be divided into a density part and a velocity part. Lden has been defined in Equation 4.14, while the full objective for velocity reconstruction can be summarized as follows:

Lvel = Lv + Lvgrad + LSSIM 3Dv + λadvect Ladvect + λz Lz (5.8)

The full objective of the whole model would be the sum of these two parts. However, in this thesis we train our models separately, so the combined full objective is never optimized directly.

5.4. Smoke Simulations via Sketches


Training details are explained in Section 6.3. Once our baseline network is trained, we can use Algorithm 1 to reconstruct a sequence of 3D density fields, since we have a smoothed latent space for interpolation. Given 2 sketched key frames St and St+T with distance T , we first obtain the corresponding 2 density fields dˆt , dˆt+T and at the same time the latent codes of both sketches via the trained sketch encoder Fs2z , denoted zt , zt+T respectively. The main loop linearly interpolates the latent codes of the intermediate frames based on the target frame's position, where the weight rα = (t + T − α)/T ∈ [0, 1] represents the normalized distance to the right-most key-frame. The estimated latent code ẑα is then fed into our 3D voxel decoder Fz2d to get the final prediction for each intermediate frame.

5.4.1. Linear Interpolation

Algorithm 1 shows how we use the smoothed latent space to linearly interpolate intermediate density fields, given the latent codes zt and zt+T obtained from the corresponding sketches St , St+T with the trained sketch encoder Fs2z .

Algorithm 1 Smoke Reconstruction and Interpolation via 2 Sketched Keyframes


Require: Given latent codes zt , zt+T from 2 sketched keyframes, a voxel decoder Fz2d
Ensure: Generate the interpolated densities dˆα controlled by α
1: for α = t to t + T do
2: ẑα = rα × zt + (1 − rα ) × zt+T , with rα = (t + T − α)/T
3: dˆα = Fz2d (ẑα )
4: end for
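A minimal sketch of Algorithm 1 is shown below, assuming rα is the normalized distance to the right-most key-frame as defined above; the function and variable names are illustrative.

```python
def interpolate_linear(z_t, z_tT, T, decoder):
    """Linearly interpolate latent codes between two key-frames and decode.

    z_t, z_tT: key-frame latent codes; decoder: the voxel decoder F_z2d.
    Returns the list [d_hat_t, ..., d_hat_{t+T}].
    """
    densities = []
    for alpha in range(T + 1):
        r = (T - alpha) / T          # normalized distance to the right key-frame
        z_alpha = r * z_t + (1.0 - r) * z_tT
        densities.append(decoder(z_alpha))
    return densities
```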

5.4.2. MLP-based Interpolation

As we find that linear interpolation does not give smooth and physically-correct transitions between two key-frames, we therefore train an MLP aiming to produce more physically-plausible


interpolated density fields in time. Here we follow the same pipeline as Algorithm 1, replacing line 2 with ẑα = Fz2ẑ (zt , zt+T )[α] . However, in the experiment section we find that the MLP interpolation method gives more visually abrupt changes, meaning that it still cannot model the temporal consistency between interpolated frames.

5.4.3. Re-Simulation

The above two interpolation methods rely purely on the sketch encoder and density decoder, without considering temporal consistency for density interpolation. In this re-simulation approach, by contrast, we rely on our velocity decoder to predict velocity fields between two consecutive frames. This ensures that consecutive density frames are correlated through the estimated velocity fields and the advection operation.
Given the trained models of our approach, we can re-simulate the density fields based on the initial reconstructed density and velocity. Instead of naive linear interpolation, we utilize the estimated velocity fields to advect the initial density towards the target one. The following algorithm describes the procedure at inference time:

Algorithm 2 Recursive Advection


Require: Given the initial reconstructed density field dˆt at time step t, all estimated intermediate
latent codes ẑt , ..., ẑt+T between the 2 sketches St , St+T , and a velocity decoder Fẑ2v .
Ensure: Generate the advected density field d̄ at time step t + T .
1: d̄ ← dˆt
2: for α = t to t + T − 1 do
3: d̄ ← A(d̄, Fẑ2v (ẑα , ẑα+1 ))
4: end for
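A minimal sketch of Algorithm 2 is given below; it assumes an advection routine advect(d, v) such as the semi-Lagrangian sketch in Section 5.3, and the function and argument names are illustrative.

```python
def resimulate(d_t, z_hats, velocity_decoder, advect):
    """Recursive advection between two key-frames (Algorithm 2).

    d_t: initial reconstructed density; z_hats: list [z_hat_t, ..., z_hat_{t+T}]
    of estimated latent codes; velocity_decoder: the trained F_zhat2v.
    """
    d_bar = d_t
    for alpha in range(len(z_hats) - 1):
        v_alpha = velocity_decoder(z_hats[alpha], z_hats[alpha + 1])
        d_bar = advect(d_bar, v_alpha)
    return d_bar
```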

Note that the intermediate latent codes ẑα , ẑα+1 shown in Algorithm 2 can be estimated from the left- and right-most latent codes via Fz2ẑ . The difference compared with the baseline is that we no longer use linear interpolation; instead, we use our trained Fz2ẑ network to estimate latent codes that are more physically correct. These latent codes should lie on the manifold of reconstructed density fields. Thus, we can use Fẑ2v to estimate the velocity field from each pair of consecutive latent codes. Our experiments show that this velocity-based interpolation approach ensures temporal consistency for the reconstructed densities, e.g., in Figure 6.21.

5.5. Summary
In this chapter, we explore how to achieve temporally-coherent density interpolation using only key-framed sketches as input. We mainly propose our velocity-based model and a re-simulation approach that reconstructs density fields via advection from the initial estimated density field and the intermediate estimated velocity fields. Compared with the density-only approaches using linear interpolation or MLP-based interpolation, our method shows better temporal consistency and slightly sharper results in Chapter 6.

6. Experiments
In this chapter, we evaluate the various methods proposed in Chapters 4 and 5 for single-frame density reconstruction and multi-frame density interpolation from sketches. We first introduce how we generate the simulation data and sketches and how we implement and train our models on these data, before experimenting with loss functions and network architectures through an ablation study. Besides qualitative results for visual comparisons between different models and loss functions, we use several evaluation metrics to measure the reconstruction quality quantitatively, as described in Section 6.2.

6.1. Data Generation

In this section, we introduce our data generation pipeline, including 3D smoke simulations, surface reconstruction and sketch generation. The ideal case would be to ask users or artists to produce real human drawings of the simulated 3D smoke from multiple viewpoints. However, creating such a dataset would require a significant amount of time and money, e.g., we would need an interactive tool so that sketches can be drawn manually from what users see. Instead, we adopt an automatic way to generate the whole dataset via existing computer programs: we apply non-photorealistic rendering (NPR) methods to generate the synthetic data for our experiments. In the user study (Section 6.7), we evaluate our trained model on both types of sketches and compare them.
For the smoke simulation part, we mainly follow the steps proposed in [KAT+ 19] to generate multiple scenarios with different scenes and different sets of parameters. We use a standard fluid solver [Sta99] with MacCormack advection and a MiC-preconditioned CG solver, implemented in the mantaflow [TP18] framework. Simulations with very high resolution can of course capture more detail in both the generated smoke and the sketches. Considering the trade-off between memory, time and detail, we pick a reasonable value: the resolution of all

simulated smoke scenes is set to 112 × 112 × 112. This ensures that our data contain enough detail for smoke simulation and sketching. The simulation time step varies across scenes as described below.
After that, we explain how we use an existing sketch renderer, such as suggestive contours [DFRS03], or our differentiable one to render synthetic sketches as input. The final output sketches are cropped and resized to 224 × 224 and aligned with the smoke data. An orthographic camera is used for both the smoke simulation and the sketch rendering steps to keep them consistent. As our task, going from sketches to 3D smoke, is an ill-posed problem, we need to sketch as many details as possible to alleviate this.

6.1.1. 3D Smoke Simulations

In this work, we mainly focus on smoke plumes with and without obstacles. First, we propose 3 variations of 3D smoke plumes: smoke plume with obstacles, smoke inflow and buoyancy, and vertical smoke plume. The last two scenes (i.e., moving smoke and smoky chairs) are proposed but not fully tested, so evaluating our model on more general smoke scenes and shapes remains future work. In each dataset, we pick one single simulation as the test set while the remaining ones are used for training. The simulation parameters of the test scene can be seen as an interpolation of those of the training data.

Smoke Plume with Obstacles

One general problem is that the details of the simulated smoke might be restricted by the fixed resolution of 112³. In this dataset, we therefore generate a volcanic smoke effect using the methods proposed in [IEGT17] to increase the detail of the smoke simulations given the fixed grid resolution. We run simulations with varying smoke source radius and obstacle positions to ensure randomness and diversity. It corresponds to the Smoke & Sphere Obstacle scene in the [KAT+ 19] paper, but with volcanic smoke and varying radius instead of buoyancy to generate different smoke shapes. We generate a dataset with 3 different sphere obstacle positions in the x-direction on the left side and 3 different inflow source radii, where each setting runs for 150 frames. Altogether we have 3 × 3 × 150 = 1350 frames, and the middle simulation is used as the test scene, with 150 frames.
We also generate a high-resolution dataset using a similar setting by varying the smoke source radius and obstacle positions, but with 2× resolution in each direction, reaching 224³ per frame. Slice views of both the low- and high-resolution simulation data are visualized in Figure 6.1. We aim to show that our model is able to reconstruct higher-resolution density fields with more details; the result is shown in Figure A.1.

Smoke Inflow and Buoyancy

This dataset runs simulations with varying inflow speed and buoyancy. Again this scene is
inspired by the one in [KAT+ 19], where the source position is initialised to the left side and
inflow direction is to the right horizontally. Similar to the smoke plume with obstacle dataset,


(a) Training and test samples from smoke plume and obstacle dataset. It is shown that the inflow source size
and obstacle positions vary across scenes.

(b) Training and test samples from high-resolution smoke plume and obstacle
dataset. Similar to the lower-resolution one, the inflow source size and ob-
stacle positions vary across scenes.

Figure 6.1.: Training and test samples from smoke plume with obstacle dataset and its higher-resolution
version.


we use 3 different inflow speeds and 3 different buoyancy values to ensure diversity, and run 200 frames for each simulation. In this way we have 3 × 3 × 200 = 1800 frames in total, and the middle simulation is again used as the test scene.

Figure 6.2.: Training and test samples from smoke inflow and buoyancy dataset. It is shown that the
inflow velocity and buoyancy vary across scenes.

Vertical Smoke Plume

This is a modified version of the 2D smoke plume in [KAT+ 19], varying the inflow source radius and position along the x-axis. In this thesis, we generate a 3D version in order to render sketches as input from different viewpoints. There are 3 inflow radii and 3 inflow positions, while each simulation contains 200 frames. We use the same way to split the training and test sets, so that the middle scene is the test scene and the remaining data are used for training.

Moving Smoke

This scene is based on the same scene in the [KAT+ 19] paper, but with higher resolution and fewer frames per simulation for sketch-based applications. That is to say, we can generate fewer frames per scene compared with [KAT+ 19], as we only need to model the smoke inside the bounding box. Considering the size of the dataset, we finally run 50 scenes with 100 frames per scene, 5k frames in total. The smoke source moves around the XZ-plane along a path generated by Perlin noise to ensure randomness, and the generated positions are mainly distributed between 0.3 and 0.7 of the XZ-plane. This dataset is a good test scene for our multi-view sketching system, as we need to model smoke at different positions in a plane.


Smoky Chairs

Besides generating smoke from a sphere or cylinder source, we also build a dataset to capture more general shapes. Here we pick the chairs category from the ModelNet40 [WSK+ 15] database as an example of general shapes. We initialize the density with an existing voxelized model and run a simulation for each selected model. Buoyancy is added and each model initially starts from the bottom. As the chairs dataset is known to exhibit intra-class variations, we can test whether our model generalizes well to reconstructing smoky chairs. Specifically, we run 13 simulation steps and discard the first 3 frames, as they look like binary voxels instead of continuous density fields. Training and test splits are already set up in the original database; we randomly pick 500 models for training and 50 for testing among them, to build a dataset with 5500 frames in total.

6.1.2. Sketch Generation

There are two proposed ways to render sketches from 3D smoke, as shown in 6.1. For the suggestive contour method [DFRS03], existing iso-surface-based surface and mesh reconstruction methods are applied as an intermediate step to generate the input sketches. We compile OpenVDB [MLJ+ 13] to obtain the corresponding routines. The suggestive contour tool can then be applied to the polygon mesh data, with several sketching options including silhouette, contour, suggestive contour and shading. The output resolution of the sketches is set to 512 × 512 to preserve the details and the line width. Sketches are then cropped and resized to 224 × 224 and aligned with the smoke data. This method can accurately extract the contour from an existing smoke surface. However, there are two main drawbacks with this approach. The first is that we need to pick an iso-value to extract the surface from the density; due to the large discrepancies in the density, a constant iso-value cannot accurately capture all parts of the smoke. The second is that we want to incorporate the sketch loss (Section 4.3) in the training process by rendering sketches from densities. The suggestive contour approach, however, is not differentiable and cannot be plugged into our gradient-based optimization framework.
Therefore, we propose a novel differentiable sketch renderer as shown in 4.2.2. We can directly apply our renderer to the simulated density fields without any pre-processing, and this can be done on the fly during training. With an orthographic camera, the simulated smoke data are automatically aligned with the rendered sketches. We find that density fields with a grid resolution of 224³ are needed to render clean and clear sketches, which might consume extra memory and computation, since we need to upsample lower-resolution data and apply a Gaussian filter. Conversely, it also means that our rendering method needs no additional upsampling when the density resolution has already reached 224³. Figure 6.3 clearly shows the difference: the suggestive contour result (left) looks like a digital sketch with sharper edges and more lines, whereas ours looks more like a pencil sketch and is smoother. Both types of sketches can accurately describe the contours compared with the physically-based rendered references in Figure 6.3.


Figure 6.3.: Comparisons between suggestive contour (left), our rendered sketch with toon shading (middle) and physically-based rendered images by Mitsuba (right) from two scenes. The difference is clear: the left one looks like a digital sketch with sharper edges, while ours looks more like a pencil sketch. Both types of sketches can accurately describe the contours compared with the physically-based rendered references.


6.1.3. Smoke Rendering

In order to give qualitative results, we also need to visualize the reconstructed smoke data in a realistic way. The simplest way to visualize the 3D smoke simulation data is to view slices of the 3D volumes. Alternatively, we can use existing physically-based smoke rendering techniques, such as rendering a heterogeneous medium via volumetric path tracing, which gives more realistic smoke renderings. In this thesis, we directly use the Mitsuba renderer [Jak10] to visualize our results and make visual comparisons.

6.2. Evaluation Metrics

We use Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE) and Mean Absolute
Error (MAE) as evaluation metrics to show the results quantitatively. More specifically, MSE is
defined to be:
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \qquad (6.1)

where Yi , Ŷi , N are ground truth, reconstructed quantities and number of samples, respectively.
Similarly, MAE is defined to be:
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |Y_i - \hat{Y}_i| \qquad (6.2)

The PSNR metric is defined via the MSE:

\mathrm{PSNR} = 20 \cdot \log_{10}(\mathrm{MAX}_Y) - 10 \cdot \log_{10}(\mathrm{MSE}) \qquad (6.3)

where MAX_Y is the maximum possible value of the ground truth Y.


Based on the definitions of these metrics, lower MAE and MSE and higher PSNR mean better results.
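For reference, these metrics could be computed as in the following NumPy sketch (illustrative helper names):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def psnr(y, y_hat):
    # peak value taken as the maximum of the ground truth, as in Eq. (6.3)
    return 20.0 * np.log10(y.max()) - 10.0 * np.log10(mse(y, y_hat))
```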

6.3. Implementation & Training

Single-Frame Density Reconstruction

Our network architectures are implemented in the PyTorch [PGC+ 17] deep learning framework. For the density reconstruction models, the batch size is set to 2 and the Adam optimizer [KB14] with a fixed learning rate of 1e-4 is used. As illustrated in Table 4.1, the baseline network starts by encoding the images into the latent space with a pre-trained network and 4 additional layers, each followed by batch normalization and a Leaky ReLU activation. The output vector from the sketches has dimension 4096 after flattening. Two fully-connected layers then estimate the parameters of the latent distribution, whose dimension

is fixed to 256, from which the latent code is sampled as in [LSS+ 19]. For the pre-trained network, we reuse a DenseNet pre-trained on ImageNet [DDS+ 09], which is readily available in PyTorch. As for the voxel decoder, we employ 3D convolutions coupled with nearest-neighbour upsampling layers implemented in PyTorch to increase the resolution of the feature maps. Leaky ReLU is used for the decoder as well. Notice that we have to add a tanh activation for predicting the density field in the output layer of the voxel decoder; otherwise a lot of noise is generated in the background.
The networks are trained on normalized fluid flow data in the range [−1, 1], using the full objective Lden . The original simulated density data are in the range [0, 1], so they are easily normalized by multiplying by 2 and subtracting 1. To train the MLP network, we use the same loss functions as for training the density decoder, including Ld , Ldgrad , LKL and LSSIM 3Dd , while we do not explicitly add the latent space loss Lz , to stay consistent with the training of the velocity models in the next section. Figure 6.4 shows how our model converges on the smoke plume dataset with an obstacle on the left, as described in 6.1.1. We plot the training L1 loss, as well as the mean absolute error (MAE) and peak signal-to-noise ratio (PSNR) metrics of the reconstructed sketches and density fields. After 185 epochs of training, the test errors have more or less converged and the PSNR of the density field reaches 31 ∼ 32.
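A minimal sketch of the corresponding first-stage training setup is shown below; the function signature and module names are illustrative, and the loss is abbreviated to the L1 term of Lden .

```python
import torch

def train_density_stage(encoder, decoder, loader, epochs=185, lr=1e-4, device="cuda"):
    """Training sketch for the sketch encoder + voxel decoder (L1 term only)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for sketches, density in loader:            # density assumed in [0, 1]
            sketches, density = sketches.to(device), density.to(device)
            density = density * 2.0 - 1.0           # normalize to [-1, 1]
            d_hat = decoder(encoder(sketches))      # tanh output, also in [-1, 1]
            loss = (d_hat - density).abs().mean()   # other L_den terms omitted
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```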

Multi-Frame Density Interpolation

We normalize the velocity fields by dividing by the maximum absolute value over the entire velocity data for training. By adding a curl operator to the output of the velocity decoder, the generated fields are ensured to be divergence-free (cf. 2.2). Notice that we do not use an activation function after the last layer when predicting the stream function. For the advection loss, we use a first-order semi-Lagrangian advection scheme, which is not exactly the same as in the data generation process but speeds up training.
Also, similar to [KAT+ 19], we find that using batch normalization degrades the results considerably for velocity field reconstruction and introduces streak artifacts, which are undesirable, similar to checkerboard artifacts. We therefore remove all batch normalization layers for velocity reconstruction in Table 4.2.
We propose to train our models separately, as training the velocity decoder needs the trained Fs2z to obtain the latent codes. For separate training, we first train the density network without considering the velocity decoder. In the second step, we train the velocity network based on the sketch encoder trained in the first step, as shown in the following two steps (a minimal sketch of the corresponding parameter freezing follows the list):

1. Given samples of multi-view sketches as shown in Figure 4.1, train Fs2z and Fz2d based
on Lden for 185 epochs while fixing Fz2ẑ and Fẑ2v .
2. Given pairs of multi-view sketches as shown in Figure 5.2 with distance of T frames (e.g.,
T can be 10, 20, ... etc.), train Fz2ẑ and Fẑ2v based on Lvel for another 185 epochs, while
fixing Fs2z and Fz2d .
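The parameter freezing referenced above could look like the following sketch; the module names are illustrative placeholders, not the exact identifiers in our code base.

```python
def freeze(module, frozen=True):
    """Freeze (or unfreeze) a sub-network for the two-stage training schedule."""
    for p in module.parameters():
        p.requires_grad = not frozen

# stage 1: train F_s2z and F_z2d only (hypothetical module names)
#   freeze(F_z2zhat); freeze(F_zhat2v)
# stage 2: train F_z2zhat and F_zhat2v on pairs of key-frames
#   freeze(F_s2z); freeze(F_z2d); freeze(F_z2zhat, False); freeze(F_zhat2v, False)
```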

Another way is to jointly train both networks by back-propagating the gradients for velocity and density reconstruction at the same time. However, as we need the latent codes from the first stage as input to train our velocity decoder and MLP network, we choose to train them separately to avoid confusing the models.

(a) Training curves of L1 loss for density field reconstruction.

(b) From left to right: L1 loss, PSNR metric for reconstructed density field and sketches,
respectively, in the test set.

(c) An example of training progress for the smoke plume with obstacles dataset mentioned
in 6.1.1. Evolution of a single sample of reconstructed middle-sliced density field and
sketches from the front view during training.

Figure 6.4.: Convergence plot for single-frame 3D density reconstruction in 185 epochs. The average
L1 loss, PSNR of 3D reconstructed density field, PSNR of sketches generated from recon-
structed density field in the test set are plotted on the graph using TensorBoard [ABC+ 16].




We find that training with the advection loss at 112³ resolution takes around 1 hour per epoch, which makes the training take much longer to converge. As the full objective contains the advection loss Ladvect , we activate this loss only after epoch 150 (training runs until epoch 185) to speed up training. Also, based on Figure 6.5, the model starts to converge around half of the iterations, meaning that the later training is effectively fine-tuning the model. After incorporating the advection loss Ladvect , we find that the training becomes more unstable and fluctuating. In this work we fix the learning rate to 1e-4 and obtain plausible results, but we think that lowering the learning rate with a decaying scheduler (e.g., cosine) might increase the stability of training in the future.
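Such a schedule could be added as in the following sketch (illustrative; not used in the reported experiments):

```python
import torch

def make_optimizer_with_cosine(params, lr=1e-4, epochs=185):
    """Adam with a cosine-annealed learning rate, stepped once per epoch."""
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# after each training epoch: scheduler.step()
```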

Figure 6.5.: Training curves of L1 loss for velocity fields reconstruction.

6.4. Single-Frame Density Reconstruction

6.4.1. Input Sketches

Here we show different possible input settings. The inputs to our neural networks consist of multi-view sketches, density fields and velocity fields. Data fields usually have values in a fixed range, e.g., [0, 1] for density fields and [0, 255] for images, while velocity fields have both negative and positive values. These data are normalized to [−1, 1] via a linear mapping, as mentioned in the training section. The difference among the inputs therefore lies in the generated sketches. As they are generated via NPR methods, we may use different line drawing techniques and different shading options. In this work, we narrow our NPR choices down to 2 options:
• Contours and suggestive contours [DFRS03], together with Toon Shading [LMHB00]
from one pre-defined light direction.
• Sketches generated by our differentiable rendering method mentioned in the section 6.1,
where Toon Shading is blended with extracted contours.
We do not provide the option to use simpler line drawing methods (e.g., silhouette only), as we think the best way to deal with the ill-posed sketch-based 3D modelling problem is to provide as much information as possible in the input sketches. Also, with our differentiable renderer, it is

easier to augment the data on the fly without any pre-processing step. In this work, we only augment with different thresholds while generating the contours. With a lower threshold, we generate fewer and shallower contours, while a larger one generates thicker ones, as shown in Chapter 4. This simulates the human sketching process from simple to more complex contours, as shown in Figure 4.6.
In the later experiments, we use the second option, as it allows us to apply an additional loss function, the so-called sketch loss, to improve our model via differentiably-rendered sketches as input. We can then compare whether using this loss helps improve the results.

6.4.2. Network Architecture

We mainly explore the network architecture of our baseline model, where several advanced architectures are available as options: ResNet and DenseNet. Here we mainly explore these two architectures; future work would be needed to evaluate other architectures such as VGG [SZ14]. We find that DenseNet-121 performs well with fewer learnable parameters than ResNet-18, and still outperforms the latter.

• Encoder: DenseNet vs. ResNet. For the encoder part, we test two different architectures: DenseNet-121 vs. ResNet-18, as mentioned in 4.1.2. The encoder part Fs2f is fixed as a feature extractor, while Ff 2z is learnable during backpropagation. The comparison is shown in Figure 6.6, while all other parts of the network as well as the loss functions are identical. We conclude that DenseNet-121 gives slightly better results on the final reconstructed densities in terms of the MAE and PSNR metrics (a minimal code sketch of this encoder is given after this list).
• Decoder: DenseNet vs. ResNet. To explore different architectures in the decoder, we keep the other parts identical. The detailed architecture is explained in Section 4.1.3. More specifically, densely-connected blocks have fewer learnable parameters than the residual ones. However, we find that residual blocks tend to learn smoother and more accurate velocity fields in our experiments (6.8). It might still be interesting to explore how to utilize the advantages of each type of block to improve the results further.
• The convolutional variational autoencoder serves as the baseline for reconstructing single-frame densities, and also sequences of simulations via linear interpolation in the latent space. For example, we test the KL loss with this baseline model; results are shown in Table 6.1, Figures 6.8 and 6.9, etc.
• We also extend our model to cross-modal training, similar to [WCPM18], by adding one more encoder from 3D velocities to the latent space to our current architecture, in order to obtain a smoother latent space than VAE-based methods. However, we find that this additional encoder makes the networks even harder to train and yields worse results in our initial experiments. Future work would be needed to explore this direction.
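As referenced in the encoder item above, a minimal sketch of a DenseNet-121-based sketch encoder is given below; the learnable head shown here is a simplified stand-in for Ff 2z and its exact layer sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class SketchEncoder(nn.Module):
    """Fixed DenseNet-121 features followed by a small learnable head (sketch)."""
    def __init__(self, zdim=256):
        super().__init__()
        self.features = models.densenet121(pretrained=True).features  # F_s2f, frozen
        for p in self.features.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(                                    # stand-in for F_f2z
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1024, zdim))

    def forward(self, x):                 # x: (B, 3, 224, 224) sketch image
        return self.head(self.features(x))
```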


(a) Comparison of encoder architectures using ResNet-18 and DenseNet-121 in the smoke plume with obstacle scene. Mean absolute error of the reconstructed density fields is shown for the test scene.

(b) Comparison of encoder architectures using ResNet-18 and DenseNet-121 in the smoke plume with obstacle scene. PSNR of the reconstructed density fields is shown for the test scene.

Figure 6.6.: Comparison of two different encoder architectures, ResNet-18 and DenseNet-121, in the test scene of the smoke plume with obstacle dataset.


Figure 6.7.: Comparison between ResNet- and DenseNet-like decoder for density reconstruction in
terms of MAE metric in the smoke plume with obstacle test set.

6.4.3. Loss Functions

Smoothness on Latent Space

The loss function on the latent space, the so-called KL loss LKL , is also analyzed with different weights λKL . In this case, we train our baseline model for single-frame reconstruction with the full objective L and different importance of LKL controlled by λKL .

Table 6.1.: Quantitative results of different multi-frame interpolation settings by linear interpolation us-
ing different λKL .

Interpolation    λKL      MSEd          PSNRd      MAEd
single-frame     0.1      0.00269122    30.2618    0.00858043
single-frame     0.001    0.00251368    30.6805    0.00820302
10-frame         0.1      0.00258913    30.3865    0.00847102
10-frame         0.001    0.00248748    30.626     0.0081982
20-frame         0.1      0.00278183    29.8754    0.00888015
20-frame         0.001    0.00292788    29.4823    0.0091363

Theoretically, a higher λKL learns a smoother latent space for better interpolation, while at the same time it might over-regularize and degrade the single-frame reconstruction quality. So there is always a trade-off between reconstruction and interpolation. From Table 6.1,


Figure 6.8.: Qualitative results of slice views evolved in time. Top row shows the reconstructed frames
with lower λKL = 0.001 and bottom row shows the results with higher one 0.1.

we find that high λKL will degrade the single-frame density reconstruction/interpolation results
compared with the lower one. However, with higher-frame interpolation (e.g., 20-frame), the
higher λKL starts to outperform the lower one. This is not a surprise as the model learns a
smoother latent space which is better in terms of linear interpolation with higher λKL value. So
we conclude that higher λKL will introduce lower-quality density reconstruction results in the
single-frame setting. But at the same time, it can ensure smoothness and handle larger-frame
interpolation with slightly better performance than the lower one.
However, single-frame reconstruction quality is more important in our case, since we then train another network to learn velocity reconstruction on top of it. For our approach in the following, we therefore use λKL = 0.001 in the first-stage single-frame density reconstruction. To be fair, we compare the results of the two approaches with this same value. Another reason is that when we compare the 20-frame interpolation visually in a video, we do not see much difference in terms of temporal consistency, so we assume that both values learn a latent space with similar smoothness. More qualitative results on temporal reconstruction by our baseline models are displayed in the next section and compared with our approach.
We have shown that linear interpolation can be used within each simulation. To further evaluate the smoothness of the latent space, we also try to linearly interpolate density fields across different simulations. For example, in Figure 6.9 we show 10-frame interpolation results based on the last frames of two consecutive simulations, showing that walking through the latent space can generate transitions between two key-frames.

Figure 6.9.: Linear Interpolation in the latent space between two key-frames from last frame of two
consecutive simulations.


3D SSIM Loss for Density Reconstruction

We introduced the SSIM loss in Section 4.3 of the baseline model. As it already shows promising results in image restoration [ZGFK16], it is interesting to see whether it helps for 3D density reconstruction. To test this, we use our full objective function for density reconstruction Lden , with the weight λSSIM 3D set to 0 and 1.
Figure 6.10 shows the loss curves on the test scene of the plume and obstacle dataset. Here we track the average PSNR and MAE of the 3D density fields over the whole sequence of the test scene; however, no obvious improvement is shown in terms of these metrics. Therefore, we conclude that this SSIM loss function does not further improve the single-frame reconstruction quality of 3D density fields.

Effect of 2D-3D Consistency: Lsketch

We test the effect of the Lsketch proposed in 4.3 with different weights within the full objective Lden . Other weights are set to their default values, while the weight for the sketch loss λsketch is set to 0 and 1. Again, we track the average error metrics on the test scene while training our model; the results are shown in Figure 6.11. We can clearly see that with the sketch loss Lsketch , the PSNR metric of the density field has a marginal improvement, while the PSNR of the reconstructed sketches consistently outperforms the model without the sketch loss. This shows that modelling the 2D-3D re-sketching consistency helps to obtain better reconstructions from the input viewpoints and can also yield slightly more structured 3D density fields.
Besides, we show some qualitative results to compare the model trained with and without the sketch loss inside the full objective Lden . Figure 6.12 shows the rendered images from the front view of the ground truth, the reconstruction with sketch loss and without sketch loss, respectively, as well as the corresponding reconstructed sketches from the same viewpoint. We find that the middle one, with sketch loss, preserves slightly sharper details and looks closer to the ground truth in terms of the rendered sketch.
The default weight of the sketch loss Lsketch is set to 0.1, as mentioned in 6.3, to avoid NaN values. We also introduce a warm-up phase and train without Lsketch in the first 50 epochs to further avoid the NaN problem. In this way, we obtain a more stable training process. We also test different weights λsketch for Lsketch to enforce the 2D-3D consistency. Figure 6.13 shows the difference between λsketch = 0.1 and 10, and Table 6.2 shows the difference quantitatively. As expected, the PSNR metric of the sketches is higher with larger λsketch , while the density PSNR degrades.
Further, we explore different weights inside Lsketch to see if the MS-SSIM term helps to reconstruct better sketches. With the full objective, we test λSSIM 2D = 0 and 1, respectively. The quantitative and qualitative results are shown in Figure 6.14: the MS-SSIM loss consistently brings an improvement in terms of density MAE, while for the other metrics we do not see an obvious difference in the test curves. This means the MS-SSIM loss on 2D sketches does not help much overall, but it can help reconstruct densities with slightly lower MAE.


(a) Comparisons of training with and without LSSIM 3D in the smoke plume and obstacle test set, in terms of density MAE.

(b) Comparisons of training with and without LSSIM 3D in the smoke plume and obstacle test set, in terms of density PSNR.

Figure 6.10.: Qualitative comparisons of training with and without LSSIM 3D in the smoke plume and obstacle test set.


(a) Comparisons of training with and without Lsketch in the smoke plume and obstacle test
set, in terms of sketch PSNR.

(b) Comparisons of training with and without Lsketch in the smoke plume and obstacle test
set, in terms of density PSNR.

Figure 6.11.: Comparisons of training with and without Lsketch in the smoke plume and obstacle test set. It is illustrated that the sketch loss does not degrade the density reconstruction results and helps to improve the reconstructed sketch quality.


(a) From left to right: ground truth rendered density, reconstructed density training with
Lsketch , without Lsketch , respectively.

(b) From left to right: rendered ground truth sketch, reconstructed sketch training with Lsketch ,
without Lsketch , respectively.

Figure 6.12.: Qualitative comparisons of reconstructed density and corresponding sketches.


Figure 6.13.: From left to right: ground truth density, reconstructed density training with high sketch
loss weight λsketch = 10 and low one weight λsketch = 0.1, respectively.

Table 6.2.: Quantitative results in terms of PSNR on sketches and density fields with high (10) and low
(0.1) weighted sketch loss Lsketch .

λsketch PSNRs PSNRd

0.1 26.92 31.54

10 27.8 29.05


Figure 6.14.: Quantitative comparisons of training with and without LM SSSIM 2D ; only the density MAE goes lower, while the others are almost the same.


Self-supervised Fine-tuning

Besides applying the sketch loss Lsketch during the training phase, we can also fine-tune our model on newly input sketches using Lsketch , as mentioned in Section 4.4. Figure 6.15 shows the difference before and after fine-tuning with the sketch loss in a self-supervised manner. We can see that the fine-tuned result has a structure closer to the ground truth compared with the one without fine-tuning. For example, the left one has some density missing on top, while the middle one is refined to look more like the input sketches and the ground truth density.

Figure 6.15.: An example of reconstructed density field with (middle) and without (left-most) self-
supervised fine-tuning. The right-most one is the ground truth. From the zoom-in view
in the second row we find that fine-tuning can correct the output structure to make result-
ing density field closer to the ground truth.

As we optimize the network based on the front and side views only, it is necessary to check whether artifacts appear from other viewpoints. Figure 6.16 shows multi-view rendering results from the front to the side view every 30 degrees. Though we merely fine-tune the density using two-view sketches as supervision, we do not observe obvious artifacts from other viewpoints. This shows that self-supervised fine-tuning can indeed give us better density reconstruction quality.
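A minimal sketch of this self-supervised fine-tuning step is shown below; it assumes a differentiable renderer render_sketch(d, view) with that hypothetical signature, keeps the latent code fixed, and only shows the sketch loss term.

```python
import torch

def finetune_on_sketches(decoder, render_sketch, z, targets, steps=100, lr=1e-4):
    """Self-supervised refinement of the decoder for a new pair of input sketches.

    z: latent code from the trained sketch encoder (kept fixed);
    targets: dict mapping a view (e.g. 'front', 'side') to the user sketch;
    render_sketch(d, view): the differentiable sketch renderer (assumed API).
    """
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(steps):
        d_hat = decoder(z)
        loss = sum((render_sketch(d_hat, view) - target).abs().mean()
                   for view, target in targets.items())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return decoder(z).detach()
```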

Adversarial Training with Discriminator

Here we explore whether our added 3D discriminator helps to improve the reconstructed density
using our baseline model. As shown in Figure 6.17, the GAN output in the middle looks sharper
than the left-most one without adversarial training.


Figure 6.16.: From front to left view by rotating 0, 30, 60, 90 degrees, respectively. No artifacts are
observed from viewpoints between front and side views.

Figure 6.17.: An example of training with and without GAN for additional 175 epochs. From left to
right: reconstructed density through training without and with GAN, ground truth.


6.5. Multi-Frame Density Interpolation

6.5.1. Models

Compared with the baseline model, we add an additional velocity decoder Fẑ2v to decode consecutive latent codes into the corresponding velocity fields. Hence, in this section we focus on different settings for training this additional velocity decoder. For the sketch encoder and density decoder parts, we use the best model based on the ablation study in Section 6.4. As shown in the following, we make comparisons between 3 proposed models:

• We consider the encoder-decoder VAE model proposed in Chapter 4 as the baseline model, which is able to animate smoke simulations via linear interpolation. We call this the baseline wo/ MLP model in Table 6.3.
• In order to achieve better interpolation, we propose the latent space interpolation network, which aims to estimate the intermediate latent codes and decode them into real density fields. For example, we train an MLP for 20-frame interpolation and find that it can actually improve the results in terms of the loss metrics (e.g., PSNR, MAE, MSE). Table 6.3 shows better results on the test scene compared with the baseline model without the MLP. We use the full objective for training the MLP, while λKL is set to 0.001 to stay consistent with the setting used for training the velocity decoder later. We call this the baseline w/ MLP model, as shown in Table 6.3.
However, we find that this method does not solve the problem of temporal consistency, as a simple MLP network still cannot model the temporal information of the 3D density. We observe more severe abrupt changes in time in the resulting density fields. We show two pairs of consecutive frames in Figure 6.18, which contain visually obvious abrupt changes, which is undesirable.

Table 6.3.: Quantitative results of 20-frame interpolation using baseline model with and without MLP-
based interpolation.

Models MSEd PSNRd MAEd

Baseline w/ MLP 0.0026751 30.0047 0.00852248

Baseline wo/ MLP 0.00292788 29.4823 0.0091363

• The third one is our proposed velocity decoder used for temporally-consistent density
interpolation. We call this velocity-based model in the following section, which aims to
reconstruct intermediate velocity fields between key-frames and interpolate density by
advection.


Figure 6.18.: Examples of abrupt changes using MLP-based interpolation between two consecutive den-
sity frames.

6.5.2. Loss Functions

Here we mainly explore different settings in the full objective for velocity Lvel , while the setting of Lden has been determined based on the analysis of the baseline model. We build a simple test case to evaluate our velocity-based model: we use the baseline model to initialize one frame of the 3D density field and use our key-framed sketches to predict the subsequent velocity fields to generate the smoke simulation sequence.
Specifically, we initialize the first density field via the baseline model at the 110th frame of the test scene, and draw front- and side-view sketches at the 130th and 150th frames, respectively, meaning that our model is tested under the 20-frame interpolation setting. Our model should thus be able to generate the simulation starting from the 110th frame, and we compare the results both qualitatively and quantitatively.

Effect of Latent Space Loss: Lz

The idea of the latent space loss Lz is to better estimate the latent codes between two key-frames, as shown in Section 5.3. Suppose we could perfectly reconstruct the latent codes; then we would have all the intermediate information about the input sketches, which could help us reconstruct the intermediate velocity fields with better quality. Here we test the velocity decoder with the full objective for velocity Lvel , and the weight of Lz , λz , is set to 0 and 1. We train the velocity decoder Fẑ2v together with the latent space interpolation MLP Fz2ẑ for the same 185 epochs for a fair comparison.
Figure 6.19 shows the visual comparison between the different λz values under the same model, while Table 6.4 shows quantitative results in terms of density and sketch PSNR. We can see that the results are very similar, and we cannot easily tell which one is better perceptually. The sketch PSNR of λz = 0 is slightly higher while the density PSNR is lower. Hence, we conclude that the latent space loss does not bring much improvement, based on both the quantitative and qualitative results.
In terms of training progress, the intermediate latent codes can be pre-computed for fixed inputs. However, once the inputs are changed (e.g., by data augmentation), the intermediate latent codes have to be computed on the fly, which significantly slows down the training process, as we need to infer the latent codes with the pre-trained sketch encoder Fs2z . Hence, we remove this term in the default training setting.


Figure 6.19.: From left to right: full objective with λz = 0 and λz = 1, respectively. Advected density
field of 140th frame from 110th is shown using our approach.

Table 6.4.: Quantitative results of reconstructed density with λz = 0 and 1, from 110 ∼ 150th frames.

λz PSNRd PSNRs

0 21.0757 16.6100

1 21.1489 16.5963


Effect of Temporal Consistency: Ladvect

We use the same setting to test the effect of our proposed recursive advection loss Ladvect from 5.3. Notice that here we use a batch size of 1, as mentioned before, since the recursive advection operator takes extra memory for backpropagation. Again, we test with the full objective for velocity estimation Lvel and set the weight λadvect of Ladvect to 0 and 0.1. We have tested larger λadvect and found that it introduces some unexpected noise around the inflow source, since our inflow and advection scheme (first-order) is not exactly the same as in the data generation.
As shown in Figure 6.20, we find a large difference between training with and without the advection loss based on the rendered images. The result with the advection loss significantly outperforms the one without it and looks closer to the ground truth, with finer and sharper details (e.g., in the red bounding box). The zoom-in view shows the difference in the resulting density even more clearly.

Figure 6.20.: From left to right: ground truth density, advected density using velocity-based model
trained with advection loss Ladvect and without advection loss, respectively.

3D SSIM Loss for Velocity Reconstruction

Again, we want to know whether velocity reconstruction can benefit from the 3D SSIM loss; we already saw no obvious improvement in the case of 3D density reconstruction. As the results in Figure 6.20 are based on the full objective of velocity reconstruction including LSSIM 3D , we remove this loss term to see whether anything changes under the same training schedule. The results of 20-frame interpolation starting from the 110th frame are shown.


Table 6.5.: Quantitative results of reconstructed density with λSSIM 3D = 0 and 1.

λSSIM 3D PSNRd PSNRs

0 20.8243 16.6909

1 21.1489 16.5963

Long-term Evaluation

The advantage of the density-based interpolation model (baseline) is that only the intermediate frames between two key-framed sketches need to be considered, even for long-term interpolation. For the velocity-based model, the errors of the advected density field will theoretically accumulate over long-term advection.
Figure 6.21 shows the resulting advected density from the 10th to the 150th frame under the 20-frame interpolation setting. It looks plausible compared with the ground truth shown in Figure 6.1. This example also looks similar to the short-term interpolation (i.e., 110 ∼ 150th frames) shown later in Figure 6.23. More importantly, this shows that our velocity-based model is still able to reconstruct temporally-consistent smoke simulations under such a long-term interpolation setting starting from the beginning.

Figure 6.21.: Advected density from the 10th to the 150th frame under the 20-frame interpolation setting. Every 10th frame of the density is displayed.

Multi-view Evaluation

Again, we want to see whether artifacts appear in the advected density from viewpoints other than the front and side views, since Ladvect only supervises those two views. As shown in Figure 6.22, we do not observe artifacts from intermediate viewpoints.


Figure 6.22.: Multi-view evaluation of advected density using our velocity-based approach. No artifacts
are observed in the reconstructed density.

6.5.3. Comparisons

Finally, we make comparisons in this section between the baseline model and ours. In Figure 6.23 we visualize the middle slices of the last 5 reconstructed density fields for the baseline, our approach trained with Ladvect , and the ground truth. The density reconstructed via velocity and advection captures more flow structure, while the baseline might have a lower frame-by-frame error but blurrier results. Table 6.6 shows that our velocity-based model is competitive with the baseline model in this 20-frame interpolation setting from the 110 ∼ 140th frames.
Here we also give a qualitative comparison between the baseline model, which is density-based, and our velocity-based model. Figure 6.24 shows that our method (middle) generates sharper details in the advected frame, while the left-most density-based method only reconstructs a blurry one. Combined with Figure 6.23, this shows the advantage of our model in producing sharper details and temporally-coherent reconstructed densities.

Table 6.6.: Quantitative comparisons using the baseline model, velocity-based model, from 110 ∼ 150th
frames under 20-frame interpolation setting.

Models PSNRd PSNRs

Baseline 20.8243 16.6909

Ours 21.1489 16.5963

Similarly, we can find the same advantages using our method in other scenarios. For example,
in Figure 6.25, we can again see the difference between the density-based baseline model and
our velocity-based model. Also, we visualize the middle slice views of the last 5 generated
density frames.


(a) Visualization of middle slices of reconstructed density fields for last 5 frames in the
simulation, under the 20-frame interpolation setting. From top to bottom: reconstructed
density using baseline model, velocity-based model and ground truth.

(b) Visualization of middle slices of reconstructed density fields for last 5 frames in the simulation, under
the 20-frame interpolation setting. From top to bottom: reconstructed density using baseline model,
velocity-based model and ground truth.

Figure 6.23.: Qualitative comparisons of density-based and velocity-based methods.


Figure 6.24.: Qualitative comparisons of reconstructed density field at 140th frame between baseline
model, velocity-based model and ground truth.

Figure 6.25.: Comparisons of rendered 190th frame in the smoke inflow and buoyancy scene between
density-based and velocity-based model. From left to right: density-based baseline model,
our velocity-based model and ground truth.


6.6. Performance Analysis


For a sketch-based smoke simulation system, it is crucial to provide artists and users with real-time interaction between input sketches and the resulting outputs. Though we currently do not have an interactive tool for testing, we analyze the performance of predicting density fields and velocity fields in an offline manner. More specifically, given two-view sketches as input, generating a single-frame 3D density field using our model takes ∼ 5 seconds per frame on CPUs (Intel (R) Core (TM) i7 CPU 940 @ 2.93GHz) and ∼ 0.03 seconds per frame on a single Nvidia GeForce GTX 1080 Ti GPU (with 11GB RAM). As the density decoder and velocity decoder have similar architectures, the runtime for predicting a velocity field is also ∼ 5 seconds per frame on CPUs and ∼ 0.03 seconds per frame on the GPU.
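The timings above were measured offline; such a measurement could look like the following sketch (illustrative):

```python
import time
import torch

def time_inference(model, example_input, device="cuda", runs=10):
    """Average per-frame inference time of an encoder/decoder stack."""
    model = model.to(device).eval()
    example_input = example_input.to(device)
    with torch.no_grad():
        model(example_input)                      # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(example_input)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.time() - start) / runs
```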

6.7. User Study


In this section, we evaluate the robustness of our deep learning model on real human sketches. As our model is trained with synthetic sketches computed from the output of smoke simulations, it is necessary to validate our approach on real human sketches to see whether our model generalizes well to practical usage. Therefore, we collect some additional sketches from artists and novice users.
Notice that we do not run a large-scale user study in this thesis due to time constraints. Instead, we create some small test cases with sketches drawn by a novice user and an artist to see how our model behaves.

6.7.1. Human Sketches Acquisition

Firstly, we propose an evaluation protocol for users who want to validate our approach, including the process of human sketch acquisition, levels of difficulty, etc.
For users who are not familiar with our system, we first train them to reproduce some sampled synthetic sketches generated via our renderer. Then, we define 3 levels of difficulty for evaluating our models, from easy to hard:
1. Easy: users can sketch on top of the shading results generated via our sketch renderer.
2. Medium: users can sketch on top of the physically-based rendered images of ground truth
data using Mitsuba Renderer [Jak10], which looks more realistic than our shading results.
3. Hard: users can sketch arbitrary smoke shapes and dynamics without any reference from
our rendered sketches and images.

6.7.2. Results

Instead of testing our model on a large human sketch database, below we provide a small test demo with sketches drawn by a novice user in Figure 6.26. In this figure, the user skips the

easy level and goes directly to medium, sketching on the physically-based rendered images. Though these look quite different from the reference synthetic sketches, we conclude that our model is quite robust to different line drawing types.

Figure 6.26.: Comparisons of rendered sketches via our renderer and real sketches from a novice user.

Figure 6.27.: Reconstructed density field of rendered 180th frame. From left to right: user sketch, Redge
and Rblend as input, respectively.

We also test our model on an artist's sketches for the easy level. As we provide vertical plume samples to the artist, we test with our model trained on the vertical smoke plume dataset (Section 6.1.1). We can see that our model does not capture the details in this example but can reconstruct the overall shapes.


Figure 6.28.: Artist’s input sketches and reconstructed density fields using our model. As the artist only
sketches on the front viewpoints, we copy them as the side viewpoints to evaluate our
model.

7. Conclusions and Future Work

7.1. Conclusions

To conclude this thesis, we propose a deep learning-based approach to achieve 4D prototyping for smoke simulations and provide a way for artists to interactively animate smoke simulations via sparse key-framed and multi-view sketches. Though many works have been proposed in the area of sketch-based 3D shape modelling, sketch-based smoke simulation is still largely unexplored, and both single-frame reconstruction and temporally-consistent reconstruction have to be taken into account. We also explore methods to render sketches from 3D density fields and find that our differentiable sketch renderer can help improve the reconstruction quality by modelling 2D-3D consistency, and can be used to fine-tune the reconstructed fields for newly input sketches.
Experiments show that our model can successfully reconstruct single-frame density fields from multi-view sketches as well as temporally-consistent density fields between key-framed sketches. These are promising results, suggesting that in the near future artists and even novice users could design smoke scenes and smoke animations simply using sketches as input. Based on deep neural networks, our model also shows good inference speed on the GPU for reconstructing such 3D fields, indicating the possibility of real-time interaction between human sketches and synthesized smoke.

7.2. Limitations and Discussion

One inherent limitation of our model is that our network outputs fixed-resolution grids given input sketches. In this work, we test our model on 3D simulation data with a resolution of 112³ and at most 224³. In some sense, this resolution is not high enough to generate finer details of the fluid flow, such as vorticity. However, increasing the resolution brings many more difficulties

in terms of storage and GPU memory during training and introduce increasing time for both
simulation and training deep neural networks. For example, we find that running with 4483
resolution data will be painfully slow (e.g., over 3 hours per epoch) with limited batch size
(e.g., 1) when we test on a NVIDIA Tesla V100-SXM2 (32 GB). This might limit our networks
to model much higher resolution data currently.
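
To put these grid sizes into perspective, the following back-of-the-envelope calculation gives the raw
memory footprint of a single dense grid stored as 32-bit floats; it ignores network activations,
gradients, and optimizer state, which dominate the actual memory usage during training.

```python
def grid_megabytes(res, channels=1, bytes_per_value=4):
    """Raw size of one dense res^3 grid in MB (float32 by default)."""
    return res ** 3 * channels * bytes_per_value / 2 ** 20

for res in (112, 224, 448):
    d = grid_megabytes(res)               # scalar density field
    v = grid_megabytes(res, channels=3)   # 3-channel velocity field
    print(f"{res}^3: density ~{d:.0f} MB, velocity ~{v:.0f} MB")
# 112^3: density ~5 MB,   velocity ~16 MB
# 224^3: density ~43 MB,  velocity ~129 MB
# 448^3: density ~343 MB, velocity ~1029 MB
```
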
For the same reason, we have to carefully design our CNN architecture, e.g., its depth, width,
and number of feature channels, balancing the memory limit against reconstruction quality. The
introduced differentiable sketch renderer also consumes extra memory when training with the
sketch-consistency loss, so further optimizing the implementation of the proposed renderer
would be worthwhile.
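
One possible way to reduce this training-time overhead, which we have not implemented but note here
as an option, is gradient checkpointing: the renderer's intermediate activations are recomputed in the
backward pass instead of being stored. A minimal PyTorch sketch, assuming the renderer is wrapped as an
`nn.Module` called `sketch_renderer`:

```python
import torch
from torch.utils.checkpoint import checkpoint

def sketch_loss_checkpointed(sketch_renderer, density, target_sketch):
    """Sketch-consistency loss with the renderer's activations checkpointed.

    `density` is assumed to be a network output that requires gradients,
    so gradients still flow back through the checkpointed renderer.
    """
    # Recompute the renderer's intermediates during backward (trades compute for memory).
    rendered = checkpoint(sketch_renderer, density)
    return torch.nn.functional.l1_loss(rendered, target_sketch)
```
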
Another limitation concerns our datasets. Due to time constraints and the large amount of
simulation data at 112³ resolution and above, we have only generated a few scenes, such as
smoke inflow and buoyancy and a smoke plume with an obstacle. Although our model generalizes
well within each scenario, it is not yet general enough to model the variety of shapes covered by
databases such as ShapeNet [CFG+ 15] or ModelNet [WSK+ 15]. Therefore, our model might not
support arbitrary sketched shapes at the moment.

7.3. Future Work


There is still room for improvement in terms of inputs, model architecture, and beyond. A first
step would be to build a real interactive tool for artists, in which our trained model is plugged in
for real-time interaction between human sketches and the reconstructed density and velocity
fields. From the perspective of edit propagation or incremental design, we could follow the ideas
of [DAI+ 18] and [SZF+ 19] to build a system that progressively parses sketches from different
viewpoints to refine the reconstructed results. In terms of inputs, we could augment the sketches
with shading from arbitrary angles and augment the physical smoke data by rotation and scaling,
as shown in [XFCT18]. Furthermore, it would be interesting to incorporate more advanced
architectural connections, such as the combinations of DenseNet and ResNet shown in
[LCQ+ 18][JDV+ 17], although this is again restricted by memory. Similarly, our model might
benefit from state-of-the-art GAN methods, such as [KALL17], for higher-quality results.
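
For the augmentation direction mentioned above, a simple starting point (hypothetical, not part of our
current pipeline) would be to rotate and rescale the density volumes with off-the-shelf tools before
re-rendering the corresponding sketches:

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment_density(density, angle_deg, scale):
    """Rotate a 3D density grid around the vertical axis and rescale it.

    density -- numpy array of shape (D, H, W), non-negative values
    """
    # Rotate in the horizontal plane (axes 0 and 2), keeping the grid size fixed.
    rotated = rotate(density, angle_deg, axes=(0, 2), reshape=False, order=1)
    # Isotropic rescaling; the result is then cropped or zero-padded
    # (from the grid origin, for simplicity) back to the original size.
    scaled = zoom(rotated, scale, order=1)
    out = np.zeros_like(density)
    d0, d1, d2 = (min(a, b) for a, b in zip(scaled.shape, density.shape))
    out[:d0, :d1, :d2] = scaled[:d0, :d1, :d2]
    return np.clip(out, 0.0, None)
```
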
Some even more challenging directions could also be considered in the future. Given the
restriction to a fixed grid resolution, we could explore other representations of 3D smoke that
require less memory, such as point clouds. In addition, it would be interesting to extend the
work of [HXF+ 19] to 3D and to explore automatic ways of sketching 3D velocities in a
human-like manner.

A. Supplemental Material

A.1. Software
In this thesis, we mainly use MantaFlow [TP18] to generate the smoke simulation data and
PyTorch [PGC+ 17] to train our neural network models. To visualize the smoke data, we use
Mitsuba [Jak10], a physically-based renderer, to render realistic smoke. We also use Houdini
17.5.229 for interactive 3D display, together with OpenVDB [MLJ+ 13] tools for file format
conversion. Real-time suggestive contours [DFRS03] are used to generate sketches from the
meshed smoke data.
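
For reference, the data generation follows the standard MantaFlow plume examples; a minimal scene
script in MantaFlow's Python interface looks roughly as follows. The resolution, source shape, and
buoyancy strength below are illustrative and differ from the parameters actually used for our datasets.

```python
# Minimal MantaFlow smoke scene (run with the manta executable).
from manta import *

res = 112
gs = vec3(res, res, res)
s = Solver(name='plume', gridSize=gs, dim=3)
s.timestep = 0.5

flags = s.create(FlagGrid)
vel = s.create(MACGrid)
density = s.create(RealGrid)
pressure = s.create(RealGrid)

flags.initDomain()
flags.fillGrid()

# Smoke source near the bottom of the domain.
source = s.create(Cylinder, center=gs * vec3(0.5, 0.1, 0.5),
                  radius=res * 0.1, z=gs * vec3(0, 0.02, 0))

for t in range(200):
    source.applyToGrid(grid=density, value=1)
    advectSemiLagrange(flags=flags, vel=vel, grid=density, order=2)
    advectSemiLagrange(flags=flags, vel=vel, grid=vel, order=2)
    setWallBcs(flags=flags, vel=vel)
    addBuoyancy(density=density, vel=vel, gravity=vec3(0, -4e-3, 0), flags=flags)
    solvePressure(flags=flags, vel=vel, pressure=pressure)
    s.step()
    # The density and velocity grids would be saved here for training.
```
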

A.2. More Results


We also run the same baseline model on other datasets that are not shown in Section 6. Here we
provide additional visual results indicating that our model is trainable on different smoke scenes;
it would be interesting to train our models on all generated data to further improve generalization.
We also show a reconstructed simulation of the smoke inflow and buoyancy scene using our
velocity-based method in Figure A.2.


Figure A.1.: Results on the higher-resolution smoke plume with obstacle dataset and the vertical smoke
plume dataset.

Figure A.2.: Reconstructed simulation of frames 110 to 150 of the smoke inflow and buoyancy dataset.

Bibliography
[ABC+ 16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jef-
frey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Is-
ard, et al. Tensorflow: A system for large-scale machine learning. In 12th
USENIX Symposium on Operating Systems Design and Implementation
(OSDI 16), pages 265–283, 2016.
[BBS08] Seok-Hyung Bae, Ravin Balakrishnan, and Karan Singh. Ilovesketch: as-
natural-as-possible sketching system for creating 3d curve models. In Pro-
ceedings of the 21st annual ACM symposium on User interface software and
technology, pages 151–160. ACM, 2008.
[BLB+ 13] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas
Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gram-
fort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt,
and Gaël Varoquaux. API design for machine learning software: experiences
from the scikit-learn project. In ECML PKDD Workshop: Languages for Data
Mining and Machine Learning, pages 108–122, 2013.
[BMF07] Robert Bridson and Matthias Müller-Fischer. Fluid simulation: Siggraph 2007
course notes video files associated with this course are available from the cita-
tion page. In ACM SIGGRAPH 2007 courses, pages 1–81. ACM, 2007.
[BVM+ 17] Steve Bako, Thijs Vogels, Brian McWilliams, Mark Meyer, Jan Novák, Alex
Harvill, Pradeep Sen, Tony Derose, and Fabrice Rousselle. Kernel-predicting
convolutional networks for denoising monte carlo renderings. ACM Transac-
tions on Graphics (TOG), 36(4):97, 2017.
[CFG+ 15] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qix-
ing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su,
et al. Shapenet: An information-rich 3d model repository. arXiv preprint
arXiv:1512.03012, 2015.
[CT17] Mengyu Chu and Nils Thuerey. Data-driven synthesis of smoke flows
with cnn-based feature descriptors. ACM Transactions on Graphics (TOG),
36(4):69, 2017.
[CUH15] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and ac-
curate deep network learning by exponential linear units (elus). arXiv preprint
arXiv:1511.07289, 2015.
[DAI+ 18] Johanna Delanoy, Mathieu Aubry, Phillip Isola, Alexei A Efros, and Adrien
Bousseau. 3d sketching using multi-view deep volumetric prediction. Proceed-
ings of the ACM on Computer Graphics and Interactive Techniques, 1(1):21,
2018.
[DDS+ 09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Ima-
genet: A large-scale hierarchical image database. In 2009 IEEE conference on
computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[DFRS03] Doug DeCarlo, Adam Finkelstein, Szymon Rusinkiewicz, and Anthony San-
tella. Suggestive contours for conveying shape. In ACM Transactions on
Graphics (TOG), volume 22, pages 848–855. ACM, 2003.
[EHA12] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects?
ACM Trans. Graph., 31(4):44–1, 2012.
[FL04] Raanan Fattal and Dani Lischinski. Target-driven smoke animation. In ACM
Transactions on Graphics (TOG), volume 23, pages 441–448. ACM, 2004.
[GBB11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neu-
ral networks. In Proceedings of the fourteenth international conference on
artificial intelligence and statistics, pages 315–323, 2011.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016. http://www.deeplearningbook.org.
[GDDM14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In Pro-
ceedings of the IEEE conference on computer vision and pattern recognition,
pages 580–587, 2014.
[GPAM+ 14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative ad-
versarial nets. In Advances in neural information processing systems, pages
2672–2680, 2014.
[HLVDMW17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages 4700–4708, 2017.
[HXF+ 19] Zhongyuan Hu, Haoran Xie, Tsukasa Fukusato, Takahiro Sato, and Takeo
Igarashi. Sketch2vf: Sketch-based flow design with conditional generative
adversarial network. Computer Animation and Virtual Worlds, page e1889, 2019.
[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet classification. In
Proceedings of the IEEE international conference on computer vision, pages
1026–1034, 2015.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[IEGT17] Tiffany Inglis, M-L Eckert, James Gregson, and Nils Thuerey. Primal-dual
optimization for fluids. In Computer Graphics Forum, volume 36, pages 354–
368. Wiley Online Library, 2017.
[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
[IZZE17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-
image translation with conditional adversarial networks. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 1125–
1134, 2017.
[Jak10] Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.
[JDV+ 17] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua
Bengio. The one hundred layers tiramisu: Fully convolutional densenets for
semantic segmentation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, pages 11–19, 2017.
[JFA+ 15] Ondřej Jamriška, Jakub Fišer, Paul Asente, Jingwan Lu, Eli Shechtman, and
Daniel Sỳkora. Lazyfluids: appearance transfer for fluid animations. ACM
Transactions on Graphics (TOG), 34(4):92, 2015.
[JSP+ 15] SoHyeon Jeong, Barbara Solenthaler, Marc Pollefeys, Markus Gross, et al.
Data-driven fluid simulations using regression forests. ACM Transactions on
Graphics (TOG), 34(6):199, 2015.
[KAGS19] Byungsoo Kim, Vinicius C Azevedo, Markus Gross, and Barbara Solenthaler.
Transport-based neural style transfer for smoke simulations. arXiv preprint
arXiv:1905.07442, 2019.
[KALL17] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive
growing of gans for improved quality, stability, and variation. arXiv preprint
arXiv:1710.10196, 2017.
[KAT+ 19] Byungsoo Kim, Vinicius C Azevedo, Nils Thuerey, Theodore Kim, Markus
Gross, and Barbara Solenthaler. Deep fluids: A generative network for param-
eterized fluid simulations. In Computer Graphics Forum, volume 38, pages
59–70. Wiley Online Library, 2019.
[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.


[KH13] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruc-
tion. ACM Transactions on Graphics (ToG), 32(3):29, 2013.
[KHM17] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view
stereo machine. In Advances in neural information processing systems, pages
365–376, 2017.
[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv
preprint arXiv:1408.5882, 2014.
[KMM+ 17] Simon Kallweit, Thomas Müller, Brian McWilliams, Markus Gross, and
Jan Novák. Deep scattering: Rendering atmospheric clouds with radiance-
predicting neural networks. ACM Transactions on Graphics (TOG), 36(6):231,
2017.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. In Advances in neural informa-
tion processing systems, pages 1097–1105, 2012.
[KUH18] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh ren-
derer. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3907–3916, 2018.
[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv
preprint arXiv:1312.6114, 2013.
[LBB+ 98] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.
[LCQ+ 18] Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng-
Ann Heng. H-denseunet: hybrid densely connected unet for liver and tu-
mor segmentation from ct volumes. IEEE transactions on medical imaging,
37(12):2663–2674, 2018.
[LGK+ 17] Zhaoliang Lun, Matheus Gadelha, Evangelos Kalogerakis, Subhransu Maji,
and Rui Wang. 3d shape reconstruction from sketches via multi-view convolu-
tional networks. In 2017 International Conference on 3D Vision (3DV), pages
67–77. IEEE, 2017.
[LMHB00] Adam Lake, Carl Marshall, Mark Harris, and Marc Blackstein. Stylized ren-
dering techniques for scalable real-time 3d animation. In Proceedings of the
1st international symposium on Non-photorealistic animation and rendering,
pages 13–20. ACM, 2000.
[LPL+ 18] Changjian Li, Hao Pan, Yang Liu, Xin Tong, Alla Sheffer, and Wenping Wang.
Robust flow-guided neural prediction for sketch-based freeform surface mod-
eling. In SIGGRAPH Asia 2018 Technical Papers, page 238. ACM, 2018.
[LSS+ 19] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas
Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable
volumes from images. arXiv preprint arXiv:1906.07751, 2019.


[LTH+ 17] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cun-
ningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz,
Zehan Wang, et al. Photo-realistic single image super-resolution using a gener-
ative adversarial network. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4681–4690, 2017.
[MHN13] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities
improve neural network acoustic models. In Proc. icml, volume 30, page 3,
2013.
[MLJ+ 13] Ken Museth, Jeff Lait, John Johanson, Jeff Budsberg, Ron Henderson, Mihai
Alden, Peter Cucka, David Hill, and Andrew Pearce. Openvdb: an open-source
data structure and toolkit for high-resolution volumes. In ACM SIGGRAPH 2013
courses, page 19. ACM, 2013.
[ODAO15] Makoto Okabe, Yoshinori Dobashi, Ken Anjyo, and Rikio Onai. Fluid vol-
ume modeling from sparse multi-view images by appearance transfer. ACM
Transactions on Graphics (TOG), 34(4):93, 2015.
[ODO16] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and
checkerboard artifacts. Distill, 1(10):e3, 2016.
[OSSJ09] Luke Olsen, Faramarz F Samavati, Mario Costa Sousa, and Joaquim A Jorge.
Sketch-based modeling: A survey. Computers & Graphics, 33(1):85–103,
2009.
[PALvdP18] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deep-
mimic: Example-guided deep reinforcement learning of physics-based charac-
ter skills. ACM Transactions on Graphics (TOG), 37(4):143, 2018.
[PGC+ 17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam
Lerer. Automatic differentiation in pytorch. 2017.
[SCK10] David Schroeder, Dane Coffey, and Dan Keefe. Drawing with the flow: A
sketch-based interface for illustrative visualization of 2d vector fields. In Pro-
ceedings of the Seventh Sketch-Based Interfaces and Modeling Symposium,
pages 49–56. Eurographics Association, 2010.
[ST90] Takafumi Saito and Tokiichiro Takahashi. Comprehensible rendering of 3-d
shapes. In ACM SIGGRAPH Computer Graphics, volume 24, pages 197–206.
ACM, 1990.
[Sta99] Jos Stam. Stable fluids. In Siggraph, volume 99, pages 121–128, 1999.
[SZ14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[SZF+ 19] Yuefan Shen, Changgeng Zhang, Hongbo Fu, Kun Zhou, and Youyi Zheng.
Deepsketchhair: Deep sketch-based 3d hair modeling. arXiv preprint
arXiv:1908.07198, 2019.

[TMPS03] Adrien Treuille, Antoine McNamara, Zoran Popović, and Jos Stam. Keyframe
control of smoke simulations. In ACM Transactions on Graphics (TOG), vol-
ume 22, pages 716–723. ACM, 2003.
[TP18] Nils Thuerey and Tobias Pfaff. MantaFlow, 2018. http://mantaflow.com.
[TSSP17] Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken Perlin.
Accelerating eulerian fluid simulation with convolutional networks. In Pro-
ceedings of the 34th International Conference on Machine Learning-Volume
70, pages 3424–3433. JMLR.org, 2017.
[UHT18] Kiwon Um, Xiangyu Hu, and Nils Thuerey. Liquid splash modeling with neu-
ral networks. In Computer Graphics Forum, volume 37, pages 171–182. Wiley
Online Library, 2018.
[VdOKE+ 16] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex
Graves, et al. Conditional image generation with pixelcnn decoders. In Ad-
vances in neural information processing systems, pages 4790–4798, 2016.
[WCPM18] Tuanfeng Y Wang, Duygu Ceylan, Jovan Popovic, and Niloy J Mitra. Learn-
ing a shared shape space for multimodal garment design. arXiv preprint
arXiv:1806.11335, 2018.
[WSB03] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural sim-
ilarity for image quality assessment. In The Thirty-Seventh Asilomar Confer-
ence on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402.
IEEE, 2003.
[WSK+ 15] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou
Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric
shapes. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1912–1920, 2015.
[WWX+ 17] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh
Tenenbaum. Marrnet: 3d shape reconstruction via 2.5D sketches. In Advances
in neural information processing systems, pages 540–550, 2017.
[WXCT19] Maximilian Werhahn, You Xie, Mengyu Chu, and Nils Thuerey. A multi-pass
gan for fluid flow super-resolution. arXiv preprint arXiv:1906.01689, 2019.
[WZX+ 16] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenen-
baum. Learning a probabilistic latent space of object shapes via 3d generative-
adversarial modeling. In Advances in neural information processing systems,
pages 82–90, 2016.
[XFCT18] You Xie, Erik Franz, Mengyu Chu, and Nils Thuerey. tempogan: A temporally
coherent, volumetric gan for super-resolution fluid flow. ACM Transactions on
Graphics (TOG), 37(4):95, 2018.
[YYY+ 16] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspec-
tive transformer nets: Learning single-view 3d object reconstruction without
3d supervision. In Advances in Neural Information Processing Systems, pages
1696–1704, 2016.
[ZGFK16] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for im-
age restoration with neural networks. IEEE Transactions on Computational
Imaging, 3(1):47–57, 2016.
[ZIH+ 11] Bo Zhu, Michiaki Iwata, Ryo Haraguchi, Takashi Ashihara, Nobuyuki
Umetani, Takeo Igarashi, and Kazuo Nakazawa. Sketch-based dynamic illus-
tration of fluid systems. In ACM Transactions on Graphics (TOG), volume 30,
page 134. ACM, 2011.
[ZPIE17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-
to-image translation using cycle-consistent adversarial networks. In Proceed-
ings of the IEEE international conference on computer vision, pages 2223–
2232, 2017.
[ZTF+ 18] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely.
Stereo magnification: Learning view synthesis using multiplane images. arXiv
preprint arXiv:1805.09817, 2018.
