Journal Pre-Proof: Neurocomputing
PII: S0925-2312(19)31735-7
DOI: https://doi.org/10.1016/j.neucom.2019.12.032
Reference: NEUCOM 21661
Please cite this article as: Yikuan Yu, Zitian Huang, Fei Li, Haodong Zhang, Xinyi Le, Point Encoder GAN: A Deep Learning Model for 3D Point Cloud Inpainting, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.12.032
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
Yikuan Yu a,b, Zitian Huang a,b, Fei Li c,d, Haodong Zhang a,b, Xinyi Le a,b,∗
a School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
b Shanghai Key Laboratory of Advanced Manufacturing Environment, Shanghai 200240, China
c State Key Laboratory of Intelligent Manufacturing System Technology, Beijing Institute of Electronic System Engineering, Beijing 100854, China
d Beijing Complex Product Advanced Manufacturing Research Center, Beijing Simulation Center, Beijing 100854, China
Abstract
In this paper, we propose a Point Encoder GAN for 3D point cloud inpainting. Different from other 3D object inpainting networks, our network can process point cloud data directly, without any labeling or assumption. We use a max-pooling layer to handle the unordered nature of point clouds during the learning procedure. We add two T-Nets (from PointNet) to the encoder-decoder pipeline, which yield a better feature representation of the input point cloud and a more suitable rotation angle of the output point cloud. We then propose a hybrid reconstruction loss function to measure the difference between two sets of unordered data. Trained only on small samples from ModelNet40, the proposed Point Encoder GAN yields surprisingly good end-to-end inpainting results. Experimental results show a high success rate, and several quantitative measures confirm the quality of our generated models.
Keywords: Point cloud, neural network, inpainting, encoder, generative adversarial nets (GANs)
1. Introduction

Nowadays, 3D laser scanners or photo scanners are frequently used for point cloud acquisition [1]. This data structure is widely applied in engineering design [2], geographical mapping [3], and scene recognition [4, 5]. However, due to the limitation of instrument precision, point cloud sets are defective and incomplete most of the time. The missing information hinders the use of point clouds, which leads to an urgent need for point cloud inpainting. Neural networks have great information processing ability [6] and help us complete this task.

During the past few years, several inpainting methods have been proposed and proved effective for image and 3D object processing. For instance, inpainting methods based on CNN (Convolutional Neural Network) and GAN (Generative Adversarial Network) have achieved great performance for 2D images [7–11] and 3D objects [12–16]. Effective as they are, these methods for 3D object inpainting still require 3D voxel data rather than first-hand point cloud data. As shown in Figure 2(c), different from previous works in Figure 2(a) and 2(b), inpainting on point cloud sets directly is our main target in this paper.

As point clouds are irregularly defined in Euclidean space, it is difficult to feed them to typical convolutional architectures. Quite a few previous works transform the 3D point cloud to regular 3D voxel data [17–21], but voxelization often results in information reduction.

In this paper, Point Encoder GAN, an original network structure, is proposed based on PointNet and GAN. This network directly takes a defective point cloud as input and generates the missing part. Our contributions are as follows:

• The network is able to process raw defective point cloud data without voxelization, which prevents the extra information loss of the voxel-based methods [12–14].

• The proposed Point Encoder GAN is trained on a small data set of randomly corrupted point clouds constructed from ModelNet40 [22, 23], demonstrating great generalization capability.

• The Point Encoder GAN needs no structural or classification information about the objects, such as symmetry or category. It is an authentic end-to-end model for point cloud inpainting.
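As a concrete illustration of the corruption scheme behind the second contribution (erasing a block of points around a random kernel and zero-initializing the missing part, as described later in Sections 3 and 5.1), here is a minimal numpy sketch; the helper name `make_defective` and the nearest-to-kernel erasure rule are our own assumptions:

```python
import numpy as np

def make_defective(points, m=256, seed=0):
    """Erase the m points nearest to a random kernel point and replace
    them with zero-initialized placeholders.

    points: (n, 3) array of xyz coordinates.
    Returns (input_cloud, erased): input_cloud keeps the original size n,
    with the surviving n-m points followed by m all-zero rows; erased
    holds the ground-truth removed points.
    """
    rng = np.random.default_rng(seed)
    kernel = points[rng.integers(len(points))]      # random kernel point
    d = np.linalg.norm(points - kernel, axis=1)     # distance to kernel
    order = np.argsort(d)
    erased = points[order[:m]]                      # m nearest = removed
    kept = points[order[m:]]
    zeros = np.zeros((m, 3))                        # zero point cloud
    return np.concatenate([kept, zeros], axis=0), erased

cloud = np.random.default_rng(1).normal(size=(1024, 3))
inp, gt = make_defective(cloud, m=256)
# inp keeps 1024 rows; its last 256 rows are (0, 0, 0) placeholders.
```

The zero rows act as the "zero point cloud" the network learns to pull toward the erased ground truth.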
✩ The work described in the paper was jointly sponsored by Startup Fund
for Youngman Research at SJTU (SFYR at SJTU), Open Fund of State Key
Laboratory of Intelligent Manufacturing System Technology, Natural Science
Foundation of Shanghai (18ZR1420100) and National Natural Science Foun-
dation of China (61703274).
∗ Corresponding author
2.1. Generative Adversarial Network

GAN (Generative Adversarial Network), proposed by Goodfellow [24, 25], consists of two deep networks, a generator G and a discriminator D. The generator generates fake samples and the discriminator tries to distinguish real samples from the overall data. G and D are trained jointly until the discriminator cannot distinguish whether the generated samples are real or fake. We train the generator and the discriminator, which together constitute the GAN, alternately. In other words, GAN can be regarded as a game between G and D.

Because of the diversity and specificity of point cloud data, it is difficult to use regular deep learning networks directly for point cloud learning. In order to solve these problems, researchers have proposed networks that use point clouds directly as input. These networks usually have delicate structures.

PointNet [39] uses max-pooling and T-Net to obtain global features of a point cloud. PointNet++ [40] can perceive local features due to its hierarchical structure based on PointNet.
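The alternating G/D training described above can be illustrated with a deliberately tiny 1-D toy (our own construction, not the paper's network): a one-parameter generator learns to shift Gaussian noise onto the real distribution while a logistic discriminator is trained to tell the two apart:

```python
import numpy as np

# Toy 1-D GAN: real samples follow N(3, 1); the generator g(z) = z + theta
# shifts noise; the discriminator is a logistic unit d(x) = sigmoid(w*x + b).
rng = np.random.default_rng(0)
theta, w, b = 0.0, 0.1, 0.0          # generator / discriminator parameters
lr_d, lr_g, batch = 0.05, 0.01, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(600):
    z = rng.normal(size=batch)
    fake = z + theta                          # generated samples
    real = rng.normal(loc=3.0, size=batch)    # real samples

    # D step: gradient ascent on E[ln d(real)] + E[ln(1 - d(fake))]
    dr = sigmoid(w * real + b)
    df = sigmoid(w * fake + b)
    w += lr_d * (np.mean((1 - dr) * real) - np.mean(df * fake))
    b += lr_d * (np.mean(1 - dr) - np.mean(df))

    # G step: gradient ascent on E[ln d(fake)] (non-saturating variant)
    df = sigmoid(w * fake + b)
    theta += lr_g * np.mean((1 - df) * w)

# The alternating game drags the generator's shift toward the real mean 3.0.
```

With the discriminator updated a little faster than the generator, the shift parameter drifts toward the real distribution; the same alternating scheme, at much larger scale, trains G-Net and D-Net in this paper.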
Figure 3: The architecture of Point Encoder GAN. The specific explanation is shown in Section 4. This network is trained by both reconstruction loss (hybrid) and
adversarial loss.
PointCNN [41] proposes an X-Conv operation for feature acquisition of point clouds. These three networks respectively achieve accuracies of 89.2%, 90.7%, and 91.7% on the ModelNet40 classification task. Therefore, some structures of these networks are worth learning from for our purposes.

3. Task Statement

Different from 2D images, a 3D point cloud has the following unique features: it is unordered, and it requires rotational invariance.

Unordered: In essence, a point cloud is a series of points in 3D space. The overall shape of the point cloud has no concern with the order of the points. In other words, different sequences of points in the input set should theoretically result in the same output of the network.

Rotational Invariance: For the same object, the coordinates of a certain point in a point cloud vary with rotation. In our method for 3D point clouds, point cloud rotations should not alter classification results.

In our model, the primary input and output are unordered point cloud sets. A set of 3D points with size n can be represented as {P_i | i = 1, . . . , n}, where P_i is a vector (x_i, y_i, z_i) in Euclidean space.

Assume N and M are the numbers of points in the initial point cloud and the erased point cloud, respectively. The goal of the proposed Point Encoder GAN is to output the generated missing point cloud with size M. We initialize the missing point cloud as a zero point cloud with (0, 0, 0) coordinates. In other words, the initial input of Point Encoder GAN is a defective point cloud with size (N − M) and a zero point cloud (x_i, y_i, z_i = 0 | i = 1, . . . , M) with size M. During the training process, the zero point cloud gradually converges towards the erased point cloud. Thus, the trained network can generate the missing point cloud with the same number of points as were erased. We use ModelNet40 for training and validation of Point Encoder GAN.

We call such a task point cloud inpainting throughout the paper. There are two difficulties: the unique properties of point clouds and the definition of a loss function between two sets of point clouds. Our solution and mathematical derivation are given in Section 4.

4. Point Encoder GAN

4.1. Network Architecture

Overview: As illustrated in Figure 3, the proposed Point Encoder GAN consists of a generator network (G-Net) and a discriminator network (D-Net). The whole framework is inspired by Context Encoders [7]. The encoder of G-Net transforms point clouds into a compact feature representation, and the decoder of G-Net generates the missing point cloud data out of this representation. The D-Net is introduced to help the G-Net predict the missing points from the latent feature representation. T-Net is a data-dependent spatial transformer that helps to transform the input data optimally in PointNet [39], so we add T-Net to both G-Net and D-Net to address the rotation invariance property of point cloud data.

We bring in the GAN model to promote training of the encoder-decoder network (G-Net). The essence of the GAN training procedure is a game theory problem. The objective is to obtain a G-Net which can learn the data distribution from the training samples. The addition of GAN encourages the entire output of the encoder to be more realistic. In other words, during the incessant "frauds" between G-Net and D-Net, the output of G becomes more suitable.

To conclude, Point Encoder GAN enjoys the advantages of PointNet [39] for dealing with point clouds, Context Encoders
[7] for auto-encoding, and GANs [24] for discrimination and generation, thus delivering satisfactory results.

T-Net Structure: We use a max-pooling layer to handle the unordered nature of point clouds, and T-Net to overcome rotation invariance, following the structure of PointNet [39]. As shown in Figure 3, a T-Net combines serial layers of shared 64-MLP (Multi-Layer Perceptron), shared 128-MLP, shared 1024-MLP, a max-pooling layer, a 256-FCL (fully connected layer), and a 9-FCL to obtain a 3 × 3 matrix. Its output is the matrix multiplication of the input point cloud matrix and this 3 × 3 matrix.

The overall training objective combines an adversarial loss and a reconstruction loss, where $\lambda_{adv}$ and $\lambda_{rec}$ are the weights of the adversarial loss and the reconstruction loss, respectively. They satisfy $\lambda_{adv} + \lambda_{rec} = 1$.

Adversarial Loss: This loss is rooted in the GAN model [24]. We regard the G-Net and the D-Net as parametric functions. $G: X \to Y$ is considered the mapping function from input samples X to real samples Y, which approximates $G_0: X \to Y_0$, the map from input samples X to the data distribution $Y_0$. D-Net tries to distinguish the data generated by G-Net from the authentic samples. The adversarial loss function can be defined by:

$$L_{adv} = \sum_{i=1}^{S} \ln(D(y_i)) + \sum_{i=1}^{S} \ln(1 - D(G(x_i))), \qquad (2)$$

where $x_i \in X$, $y_i \in Y$, $i = 1, \ldots, S$, and S is the sample size of X and Y.

Reconstruction Loss: Pixel data (images) and voxel data (3D grids) are both organized data. The loss function for such organized data is easy to define because it belongs to a one-to-one relationship. For example, the image loss of picture A and picture B with the same size N × N can be determined by:

$$L(A, B) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} L(A_{i,j}, B_{i,j}), \qquad (3)$$

where $A_{i,j}$ and $B_{i,j}$ are the pixels at location $(i, j)$ of pictures A and B, respectively.

For two unordered point clouds $\hat{A}$ and $\hat{B}$, no such one-to-one correspondence exists, so the loss is instead weighted over both matching directions (Eq. 5), where $\omega_{\hat{A}:\hat{B}}$ is the weight of point cloud $\hat{A}$ to $\hat{B}$, and $\omega_{\hat{B}:\hat{A}}$ satisfies $\omega_{\hat{A}:\hat{B}} + \omega_{\hat{B}:\hat{A}} = 1$.

We then use the Chamfer Distance [43] to define the loss between one point P and a point cloud $\hat{S}$ with length K. It is worth noting that the Chamfer Distance is an $L_2$-norm value:

$$L(P, \hat{S}) = \min_{1 \le i \le K} |P, \hat{S}_i|_2. \qquad (6)$$

Combining Eq. 5 and Eq. 6, the loss function is determined by:

$$L_2(\hat{A}, \hat{B}) = \frac{\omega_{\hat{A}:\hat{B}}}{N} \sum_{i=1}^{N} \min_{1 \le j \le N} |\hat{A}_i, \hat{B}_j|_2 + \frac{\omega_{\hat{B}:\hat{A}}}{N} \sum_{j=1}^{N} \min_{1 \le i \le N} |\hat{A}_i, \hat{B}_j|_2. \qquad (7)$$
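The weighted bidirectional loss of Eq. (7) can be sketched in a few lines of numpy; `hybrid_loss` is our own name, and equal direction weights (ω = 0.5) are assumed:

```python
import numpy as np

def hybrid_loss(a, b, w_ab=0.5):
    """Weighted bidirectional Chamfer-style loss between point clouds a
    and b (both (n, 3)), following the L2 form of Eq. (7): each point is
    matched to its nearest neighbour in the other cloud, the distances are
    averaged per direction, and the two directions are mixed with weights
    w_ab + w_ba = 1.
    """
    # pairwise L2 distances, shape (n, n), via broadcasting
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    loss_ab = d.min(axis=1).mean()   # each point of a -> nearest in b
    loss_ba = d.min(axis=0).mean()   # each point of b -> nearest in a
    return w_ab * loss_ab + (1.0 - w_ab) * loss_ba

a = np.random.default_rng(0).normal(size=(256, 3))
assert hybrid_loss(a, a) == 0.0      # identical clouds match exactly
assert hybrid_loss(a, a[::-1]) == 0.0  # ...in any point order
```

Because every point is matched only to its nearest neighbour in the other set, the loss is invariant to point ordering, which is exactly what an unordered point cloud requires.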
5. Experimental Validation

5.1. Model Training

We use PyTorch as our deep learning framework to implement Point Encoder GAN. Our data set is composed of 12308 generated defective point clouds from ModelNet40. To acquire the training set, we take initial point clouds from ModelNet40 (1024 points) and erase 256 points around a random kernel in each one. Thus, each point cloud in our data set contains 768 points, represented as coordinates $(x_i, y_i, z_i)$. The data set is split into two subsets, 9840 samples for training and 2468 samples for test, and both subsets include all 40 categories in ModelNet40.

5.2. Evaluation Measures

Since point cloud inpainting is a frontier in computer vision, quantitative indexes for point cloud inpainting are inadequate. We therefore establish some reasonable indexes for point cloud evaluation.

Regression Ratio: Our first goal is to generate point clouds as similar as possible to the original ones. In order to quantitatively evaluate the disparity between the M missing points generated by G-Net and the M erased points, the Regression Ratio is introduced. The mathematical definition is as follows:

$$R_{reg} = \left(1 - \frac{L(\hat{S}_{Gen}, \hat{S}_{Real})}{L(\hat{S}_{Zero}, \hat{S}_{Real})}\right) \times 100\%, \qquad (8)$$

where $\hat{S}_{Gen}$, $\hat{S}_{Real}$, and $\hat{S}_{Zero}$ represent the generated, real, and zero point clouds with the same size M, respectively. $R_{reg} \in [0\%, 100\%]$ indicates the reconstruction degree of the inpainting process. Consider the two extreme situations: $R_{reg} = 100\%$ if $\hat{S}_{Gen} = \hat{S}_{Real}$, and $R_{reg} = 0\%$ if $\hat{S}_{Gen} = \hat{S}_{Zero}$. We use both the $L_1$ and $L_2$ loss for evaluation. Thus, in our experiments, we have two evaluation indexes, the Regression Ratio of the $L_1$-norm, $R_{reg,L_1}$, and the Regression Ratio of the $L_2$-norm, $R_{reg,L_2}$.

Matching Distance Ratio: In some cases, the generated point cloud is not similar to the original one but still makes sense. In such cases of generation verisimilitude, the matching effect is good although the regression ratio is not high. So we define the Matching Distance Ratio (MDR). The mathematical definition is given as follows:

$$MDR = 10 \times \left| \log_{10} \frac{D_M}{D_S} \right| \text{ (dB)}, \qquad (9)$$

where $D_M$ is the mean value of the point distances in the inpainting matching margin, and $D_S$ is the point cloud density. If we take the density value of the ground truth as $D_S$, we call this value the Fixed Matching Distance Ratio (FMDR). If we take the density value of the generated point cloud as $D_S$, we call it the Variable Matching Distance Ratio (VMDR). The optimal value of MDR is 0 dB, attained when $D_M = D_S$.

Earth Mover Ratio: To evaluate the aggregation of the generated points compared with the ground truth, we define another ratio based on the Earth Mover Distance [44]:

$$EMR = 10 \times \left| \log_{10} \frac{EMD_G}{EMD_T} \right| \text{ (dB)}, \qquad (10)$$

where $EMD_G$ and $EMD_T$ are the Earth Mover Distances of the generated point cloud and the ground truth, respectively. EMR measures the density difference between generated samples and true samples. The optimal value of EMR is 0 dB, attained when $EMD_G = EMD_T$.

5.3. Test Results

In our experiments, we test our model on 2468 samples within ModelNet40, and also examine it on data outside ModelNet40. The visualized results and the evaluation comparisons are presented, followed by relevant analysis.

Inpainting Results on ModelNet40: Some of the validation results on ModelNet40 are shown in Figure 5. The highlighted parts represent the erased points of the initial point cloud and the missing points generated by our model. Generally, the inpainting results of most categories meet our expectations. Compared with the ground truth point sets, the generated missing points match the defective point clouds well, visually and perceptually. In the two presented views, the generated points show sound similarity with the erased point cloud.

Evaluation Indexes of Different Models: After visualization, quantitative indexes are calculated to evaluate the inpainting quality of different models, which substantiates the effectiveness of our model. $R_{reg}$ represents the regression ratio of the missing points based on the loss function ($R_{reg,L_1}$ and $R_{reg,L_2}$). MDR (FMDR and VMDR) represents the quality of the generated point cloud, including completeness and homogeneity. EMR quantifies the density difference between the generated point cloud and the ground truth. A model with higher $R_{reg}$, lower MDR, and lower EMR is preferred. We calculate the indexes of four models, trained for 1, 5, 7, and 10 epoch(s). All the indexes of the above models are shown in Table 1; the visualizations of these models are given in Figure 6.

Epoch | R_reg,L1 (%) | R_reg,L2 (%) | FMDR (dB) | VMDR (dB) | EMR (dB)
  1   |    45.01     |    36.47     |   3.218   |   2.991   |  1.522
  5   |    61.81     |    55.79     |   2.328   |   1.814   |  1.005
  7   |    53.82     |    45.05     |   2.785   |   2.563   |  1.367
 10   |    51.15     |    43.07     |   2.996   |   2.565   |  1.416

Table 1: Results of Evaluation Indexes on ModelNet40 for Different Epochs.

According to Table 1, it is clear that the 5-epoch model achieves the best results in all the indexes, significantly better than the others. This model obtains a higher $R_{reg}$ and lower EMR, which is also confirmed by the visualized results in Figure 6.
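The indexes above can be sketched as follows; the symmetric nearest-neighbour `l2_set_loss` is our stand-in for the paper's $L_2$ reconstruction loss, and `db_ratio` captures the common 10·|log10(·)| form shared by Eqs. (9) and (10):

```python
import numpy as np

def l2_set_loss(a, b):
    """Symmetric nearest-neighbour L2 loss between equally sized point
    sets (a hedged stand-in for the paper's L2 loss)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return 0.5 * d.min(axis=1).mean() + 0.5 * d.min(axis=0).mean()

def regression_ratio(gen, real):
    """Eq. (8): 1 - L(gen, real) / L(zero, real), in percent."""
    zero = np.zeros_like(real)
    return (1.0 - l2_set_loss(gen, real) / l2_set_loss(zero, real)) * 100.0

def db_ratio(x, y):
    """Shared form of Eqs. (9) and (10): 10 * |log10(x / y)| in dB,
    optimal (0 dB) when the two quantities coincide."""
    return 10.0 * abs(np.log10(x / y))

# a toy "ground-truth" patch, shifted away from the origin so the
# zero-cloud baseline loss in Eq. (8) is nonzero
real = np.random.default_rng(0).normal(size=(256, 3)) + 5.0
print(round(regression_ratio(real, real), 1))   # perfect inpainting -> 100.0
print(db_ratio(2.0, 2.0))                       # matched densities  -> 0.0
```

MDR and EMR then follow by feeding `db_ratio` the appropriate density or Earth Mover Distance pair.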
Figure 5: Examples of inpainting results with two views in ModelNet40. Our network achieves an end-to-end inpainting task without label-based data preprocessing.
Figure 10: Inpainting results for a bunny and a horse outside ModelNet40. The model performs well despite some imperfections.
[42] W. Yuan, T. Khot, D. Held, C. Mertz, M. Hebert, PCN: Point completion network, International Conference on 3D Vision (2018).
[43] G. Borgefors, Hierarchical chamfer matching: A parametric edge matching algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (6) (1988) 849–865.
[44] Y. Rubner, C. Tomasi, L. J. Guibas, The earth mover's distance as a metric for image retrieval, International Journal of Computer Vision 40 (2) (2000) 99–121.
[45] Y. Yang, C. Feng, Y. Shen, D. Tian, FoldingNet: Point cloud auto-encoder via deep grid deformation (2018) 206–215.
Yikuan Yu: writing, review and editing, formal analysis
Zitian Huang: visualization, validation
Fei Li: resources, funding acquisition, investigation
Haodong Zhang: paper revision
Xinyi Le: conceptualization, project administration, supervision
Authors’ Bios
Yikuan Yu, Zitian Huang, Fei Li, Haodong Zhang, Xinyi Le