LiDAR Data Classification Using Spatial Transformation and CNN

Xin He, Aili Wang, Pedram Ghamisi, Guoyu Li, and Yushi Chen

Abstract— Light detection and ranging (LiDAR) is a useful data acquisition technique, which is widely used in a variety of practical applications. The classification of the LiDAR-derived rasterized digital surface model (LiDAR-DSM) is a fundamental task in LiDAR data processing. In recent years, deep learning methods, especially convolutional neural networks (CNNs), have shown their capability in remote sensing areas, including LiDAR data processing. Traditional deep models empirically use a fixed neighborhood system as input to the network; therefore, the width and height of the input rectangle may not be optimal. In order to replace such a handcrafted setting, a spatial transformation network is used here to identify optimal inputs. The transformed inputs are fed into a well-designed CNN to obtain the final classification results. Furthermore, morphological profiles are combined with the spatial transformation CNN to further improve the classification accuracy. The proposed frameworks are tested on two LiDAR-DSMs (i.e., the Recology and Houston data sets). The experimental results show that the proposed models provide competitive results compared to state-of-the-art methods. Furthermore, the proposed optimal input identification approach can also be beneficial for other remote sensing applications.

Index Terms— Convolutional neural networks (CNNs), deep learning, feature extraction, light detection and ranging (LiDAR), morphological profile (MP), spatial transformation network (STN).

Manuscript received April 16, 2018; revised July 1, 2018 and August 4, 2018; accepted August 29, 2018. This work was supported in part by the Natural Science Foundation of China under Grant 61771171 and Grant 61871157, in part by the Open Fund of the State Key Laboratory of Frozen Soil Engineering under Grant SKLFSE201614, and in part by the "High Potential Program" of Helmholtz-Zentrum Dresden-Rossendorf. (Corresponding author: Yushi Chen.)

X. He and A. Wang are with the Higher Education Key Laboratory for Measure and Control Technology and Instrumentations of Heilongjiang, Harbin University of Science and Technology, Harbin 150080, China (e-mail: 1091636421@qq.com; aili925@hrbust.edu.cn).

P. Ghamisi is with the Helmholtz Institute Freiberg for Resource Technology, Exploration, Helmholtz-Zentrum Dresden-Rossendorf, D-09599 Freiberg, Germany (e-mail: p.ghamisi@gmail.com).

G. Li is with the State Key Laboratory of Frozen Soil Engineering, Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences, Lanzhou 730000, China (e-mail: guoyuli@lzb.ac.cn).

Y. Chen is with the School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: chenyushi@hit.edu.cn).

Digital Object Identifier 10.1109/LGRS.2018.2868378

I. INTRODUCTION

Airborne light detection and ranging (LiDAR) is an advanced remote sensing technology that uses pulsed laser light to measure the variable distance to the earth's surface. By combining these light pulses with other data recorded by airborne systems, one can obtain accurate 3-D information (i.e., 3-D spatial point cloud data) about shape and surface characteristics [1]. Many studies have focused on point cloud data [2]–[4], while, in this letter, we focus on a specific topic, LiDAR-derived rasterized digital surface model (LiDAR-DSM) classification. A LiDAR-DSM can be obtained by sampling the data onto regular grids, and it characterizes the elevation of different objects. DSM data have been widely used for a number of applications, including building detection [5] and tree parameter identification [6]. The accurate classification of the rasterized DSM plays an important role in distinguishing different land cover classes.

The classification of LiDAR-DSM data usually involves a pixel-based approach. For example, a support vector machine (SVM) was used in [7]. In [8], a decision tree classifier was utilized to analyze the average height of the land features in each class. In [9], a random forest classifier was applied to classify savanna tree species.

In [10], it was shown that a convolutional neural network (CNN) is able to provide better classification performance than traditional classification methods in terms of classification accuracy. Until now, only very few deep learning methods have been proposed to deal with DSM data [11]. CNN methods usually choose a pixel and its neighborhood as input to the network. In order to effectively utilize the spatial information around a pixel, the size of the neighborhood (i.e., the window) needs to be reasonably selected in advance. However, there are huge differences in the shape, size, and orientation of land cover classes in remote sensing data (e.g., the road class may extend in completely different directions). Current CNN methods typically select a fixed-size rectangular area of pixels as input. The handcrafted settings, including the width and height of the input, are determined through experience, which may not be optimal for a specific data set. Furthermore, a fixed rectangular window cannot effectively utilize the spatial information of the input pixel with respect to the shape of the corresponding class. Therefore, in order to obtain better classification accuracy, a suitable method is needed to precisely identify optimal inputs.

The spatial transformation network (STN) can transform the input image by rotation, scaling, and translation, making the extraction of spatial information more suitable for the CNN [12]. The parameters of the STN can be trained with the back-propagated loss gradient. Note that the transformation parameters are not manually designed; they are automatically updated along with the weights of the CNN during training.

Neighboring pixels in the LiDAR rasterized data most probably belong to the same class, which encourages us to employ morphological profiles (MPs) [13] to extract spatial information. To make full use of the neighboring information, the MP is combined with the STN.
The main contributions are listed in the following.
1) The STN is used for the first time to identify the optimal inputs for LiDAR-DSM data classification.
2) A classification framework is proposed that combines MPs, the STN, and a CNN to obtain high classification accuracy.

II. PROPOSED METHOD

This letter explores the usefulness of the STN for the classification of LiDAR-DSM. The STN can transform the data to make it more suitable for the subsequent CNN. In addition, we combine the MP with the STN to enrich the spatial information. Finally, softmax is employed to produce the final classification map. The framework of the proposed method is shown in Fig. 1. The main parts of the proposed framework are discussed in Sections II-A–II-C.

Fig. 1. Flowchart of the proposed method. The proposed framework is composed of two subnetworks: 1) the STN and 2) the CNN. First, the localization network predicts the transformation matrix from the input image; the grid generator then uses the parameters [a1 a2 b1 b2 a0 b0] to carry out the affine transformation, and the sampler produces a proper CNN input. Second, the transformed image is used as the input for the CNN-based classification.

TABLE I. Architecture of the localization network.

TABLE II. Architecture of the deep CNNs.

A. Morphological Profiles

MPs are constructed by the repeated use of openings and closings by reconstruction with a structuring element (SE) of increasing size. A transformation of the original image is derived from each opening and closing; thus, information about the size and shape of the objects is extracted by the MP applied to the image [13]. The MP can therefore be defined as follows:

$$\mathrm{MP}(x) = [\mathrm{CP}_n(x), \ldots, I(x), \ldots, \mathrm{OP}_n(x)] \quad (1)$$

where CP_n(x) is the closing profile and OP_n(x) is the opening profile at the point x of the image I with an SE of size n.

Owing to the fact that the LiDAR-DSM data contain only one channel, we can directly apply the MP to them to extract several spatial features. The MP is used to enrich the input, and these spatial features are then combined with the STN and CNN at the next stage.
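As a concrete illustration of (1), the following is a minimal sketch of MP construction on a single-channel DSM, assuming the scikit-image library; the disk-shaped SE sizes (2, 4, 6, 8) follow Section III-B, and the function names are our own.

```python
# A minimal sketch of (1) using scikit-image's morphological reconstruction.
# Library choice and function names are assumptions; SE sizes follow Sec. III-B.
import numpy as np
from skimage.morphology import disk, erosion, dilation, reconstruction

def opening_by_reconstruction(image, se):
    # Erode, then reconstruct by dilation under the original image (the mask).
    return reconstruction(erosion(image, se), image, method='dilation')

def closing_by_reconstruction(image, se):
    # Dilate, then reconstruct by erosion above the original image.
    return reconstruction(dilation(image, se), image, method='erosion')

def morphological_profile(image, se_sizes=(2, 4, 6, 8)):
    # Stack [CP_n(x), ..., I(x), ..., OP_n(x)] along a new channel axis, as in (1).
    closings = [closing_by_reconstruction(image, disk(n)) for n in reversed(se_sizes)]
    openings = [opening_by_reconstruction(image, disk(n)) for n in se_sizes]
    return np.stack(closings + [image] + openings, axis=0)

# Example: a single-channel DSM yields a 9-band MP (4 closings, I, 4 openings).
dsm = np.random.uniform(-1.0, 1.0, size=(200, 250))
mp = morphological_profile(dsm)   # shape (9, 200, 250)
```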
B. Spatial Transformer Network

The STN is an optimal input identification system, which aims to transform the input into a form suitable for the subsequent classification. The spatial transformer mechanism consists of three parts: the localization network, the grid generator, and the sampler.

The first step is the localization network, which produces the parameter θ for the spatial transformation. The output is the matrix of the affine transformation, [a1 a2 b1 b2 a0 b0], as shown in (2). The network takes the input data or feature maps as the source image and performs a conditional transformation on it. The design of the localization network is similar to that of the CNN (i.e., two convolutional layers, two max-pooling layers, and one fully connected layer are considered to produce the transformation parameter θ applied to the source image).

The second step is a module called the grid generator. The affine transformation parameter θ is generated by the localization network, and the grid generator is used to carry out the transformation. In other words, the grid generator builds the coordinate mapping between the source image and the transformed image. Equations (3)–(5) show the coordinate mapping relationships of the rotation, translation, and scaling transformations:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} a_1 & b_1 & a_0 \\ b_2 & a_2 & b_0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (2)$$
where x and y are the source coordinates, and u and v are the target coordinates of the regular grid:

$$\begin{bmatrix} \cos\alpha \cdot x + \sin\alpha \cdot y \\ -\sin\alpha \cdot x + \cos\alpha \cdot y \end{bmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha & 0 \\ -\sin\alpha & \cos\alpha & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (3)$$

$$\begin{bmatrix} x + a_0 \\ y + b_0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & a_0 \\ 0 & 1 & b_0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (4)$$

$$\begin{bmatrix} a_1 \cdot x \\ a_2 \cdot y \end{bmatrix} = \begin{bmatrix} a_1 & 0 & 0 \\ 0 & a_2 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (5)$$

In (3), α refers to the angle of anticlockwise rotation around the origin of the rectangular coordinate system; cos α and sin α correspond to a1 and b1 in (2), respectively.
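To make the mapping concrete, here is a small worked check of (2) and (3), our own illustration rather than part of the letter: a 30° anticlockwise rotation sends the source point (x, y) = (1, 0) to (u, v) ≈ (0.866, −0.5).

```python
# A numerical check of the affine coordinate mapping in (2)/(3); illustrative only.
import numpy as np

alpha = np.deg2rad(30)                                    # anticlockwise rotation angle
theta = np.array([[ np.cos(alpha), np.sin(alpha), 0.0],   # row [a1 b1 a0]
                  [-np.sin(alpha), np.cos(alpha), 0.0]])  # row [b2 a2 b0]

u, v = theta @ np.array([1.0, 0.0, 1.0])                  # homogeneous source [x, y, 1]
print(round(u, 3), round(v, 3))                           # 0.866 -0.5
```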
The sampler is the last module. The shape of the image after the grid generator is not square, while the shape of the CNN input should be square (e.g., 27 × 27). In order to obtain the right shape, the sampler uses bilinear interpolation to produce a proper CNN input.

The parameters [a1 a2 b1 b2 a0 b0] can be learned through the back-propagation algorithm: a1, a2, b1, and b2 provide the rotation transformation; a0 and b0 provide the translation transformation; and a1 and a2 provide the scaling transformation.
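A minimal PyTorch sketch of the three-part mechanism is given below (localization network, grid generator via F.affine_grid, and bilinear sampler via F.grid_sample). The layer widths and kernel sizes are illustrative assumptions; the letter's actual localization architecture is specified in Table I.

```python
# A sketch of the STN, assuming PyTorch; layer sizes are illustrative, not Table I's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Localization network: two conv layers, two max-pooling layers,
        # and one fully connected layer producing the six affine parameters.
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc_theta = nn.Linear(16 * 6 * 6, 6)   # 27x27 input -> 6x6 feature maps
        # Start from the identity transform so early training is stable.
        nn.init.zeros_(self.fc_theta.weight)
        self.fc_theta.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                          # x: (N, 1, 27, 27)
        theta = self.fc_theta(self.localization(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # grid generator
        return F.grid_sample(x, grid, align_corners=False)           # bilinear sampler
```

Here F.grid_sample performs the bilinear interpolation described above, and gradients flow through both the grid and the localization network, so θ is learned jointly with the CNN weights rather than being designed by hand.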
C. Convolutional Neural Network-Based Feature Extraction

The optimal inputs obtained by the STN are fed into a CNN to finalize the classification task.

A standard CNN consists of different convolutional layers, max-pooling layers, and nonlinearity mapping layers; CNNs can automatically learn high-level features. The convolutional layer applies filters to generate feature maps, and the pooling layer makes the obtained features more abstract and robust:

$$z^s = \sum_i W_i^s * x_i + b^s \quad (6)$$

Equation (6) describes how the convolutional layer calculates the output feature map, where x_i is the ith input feature map, W_i^s are the convolutional filters, ∗ is the convolution operator, and b^s is the trainable bias parameter.

The nonlinearity layer then calculates the output feature map, which is defined as follows:

$$a^s = f(z^s) \quad (7)$$

where f(·) is the rectified linear unit [i.e., f(x) = max(0, x)] in this letter.
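A generic sketch of such a CNN branch follows, again in PyTorch. The channel counts are our assumptions, since the exact architecture is given in Table II; the convolutional layers implement (6), the ReLU implements (7), and the softmax of Section II is folded into the cross-entropy loss at training time.

```python
# A generic CNN classifier for the 27 x 27 transformed inputs; channel counts
# are assumptions, not the letter's Table II architecture.
import torch.nn as nn

class ClassificationCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            # Each block: z^s = sum_i W_i^s * x_i + b^s (6), then a^s = max(0, z^s) (7).
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 6 * 6, num_classes)

    def forward(self, x):                     # x: (N, 1, 27, 27) from the STN
        return self.classifier(self.features(x).flatten(1))   # logits; softmax in loss

model = ClassificationCNN(num_classes=11)     # e.g., 11 Recology classes, 15 for Houston
```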
III. EXPERIMENTAL RESULTS

A. Data Description

In this letter, we use two benchmark data sets to evaluate the performance of the proposed method. Both groundtruth maps are shown in Figs. 2 and 3.

Fig. 2. Recology data set. (a) DSM data. (b) Groundtruth map.

Fig. 3. Houston data set. (a) DSM data. (b) Groundtruth map.

The first data set, Houston, was collected by the NSF-funded Center for Airborne Laser Mapping. It was acquired over the campus of the University of Houston, Houston, TX, USA, on June 22, 2012. The spatial resolution is 2.5 m, and the scene has 349 × 1905 pixels. The available groundtruth covers 15 classes of interest.

The second data set, Recology, was acquired in June 2010. The data set consists of 200 × 250 pixels with a spatial resolution of 1.8 m. The groundtruth map includes 11 classes of interest.
B. Experimental Design

The structure of the networks is shown in Tables I and II. In this letter, a 2-D affine transformation was used. According to the parameters of the previous step, a parameterized sampling grid was formed from the feature map. Then, data with a size of 27 × 27 are used in the CNNs.

The DSM data were linearly mapped into [−1, 1]. During the training process of the CNN method, the learning rate was set to 0.1 for the Recology data set and 0.03 for the Houston data set. During the experiments, we found that a smaller learning rate suits the STN: its learning rate was set to 0.0015 for the Houston data set and 0.01 for the Recology data set.

For the MP, disk-shaped structuring elements with sizes of 2, 4, 6, and 8 were taken into account. The initial learning rate for MP-CNN was 0.01 and 0.0025 for the Recology and the Houston data sets, respectively. The learning rate of the MP-STN network was set to 0.015 and 0.009 for the Recology and the Houston data sets, respectively.

The number of epochs was set to 100 and 80 for the Houston and Recology data sets, respectively. To evaluate the performance of the proposed methods, the traditional SVM with MPs was taken into account; the SVM with the radial basis function kernel used cross-validation to determine its hyperparameters. Overall accuracy (OA), average accuracy (AA), and the kappa coefficient (K) were used to compare the classification accuracies obtained by the different methods. A sketch of this training configuration is given below.
C. Classification Performance Analysis

TABLE III. Testing data classification results (values ± standard deviation) on the Recology data set.

TABLE IV. Testing data classification results (values ± standard deviation) on the Houston data set.

Fig. 4. (a) False color map of the Recology data set and the classification maps obtained using (b) MP-SVM, (c) MAP-SVM, (d) CNN, (e) STN, and (f) MP-STN.

The experimental results are listed in Tables III and IV, and Fig. 4 provides the classification maps. The best results were obtained using the morphological methods with the MP-STN, which reached 97.10 ± 0.80% and 92.87 ± 1.42% for the Recology and the Houston data sets, respectively.

As can be seen in Table III, the STN yields better results than the original CNN on the Recology data set. When the number of training samples was limited (e.g., 40 samples per class), the results of MP-STN outperform the original CNN and MP-CNN by 8.25% and 1.42% in terms of OA, respectively. For other numbers of training samples per class (e.g., 50, 60, 70, and 80), the results of MP-STN always outperform the other studied methods, such as MP-SVM, MAP-SVM, and CNN, in terms of OA, AA, and K. The proposed MP-STN thus worked well when the number of training samples was limited. Moreover, when the number of training samples of each class was decreased from 80 to 40, the OA of MP-STN decreased only slightly, by 2.49%, while the OAs of MAP-SVM and CNN decreased by 5.91% and 4.48%, respectively. We also implemented the proposed methods on the Houston data set; the results are shown in Table IV. The results obtained by the STN outperformed the other methods in terms of OA, AA, and K.

TABLE V. Results of each transformation for the classification of the Recology data set.

TABLE VI. Results of each transformation for the classification of the Houston data set.

Tables V and VI show the results of the different affine transformations (rotation, translation, and scaling). The number of training samples per class was 50 for both the Recology and the Houston data sets. It can be seen that the performance of these transformations depended on the data set.

Fig. 5. Optimal inputs of the Recology data set. (Left) Original inputs. (Right) Optimal inputs after STN. Samples of (a) building 5 and (b) building 6. (c)–(e) and (g) Samples of trees. (f), (h), and (i) Samples of building 1.

Fig. 5 displays the optimal inputs determined by the STN. From Fig. 5, one can see that the STN excluded irrelevant objects [Fig. 5(e) and (h)].

TABLE VII. Running time of the CNNs and STNs.

Table VII shows the training and test times of the different methods when the number of training samples per class was 80. The experiments were run on a 3.2-GHz CPU with a GTX 770 GPU card. In general, the deep learning-based methods demand a long training time, but they are fast in the testing stage. Fast testing is important for practical applications.

IV. CONCLUSION

In order to determine proper inputs, which were set empirically in traditional methods, the STN is introduced for the first time for the accurate classification of LiDAR-DSM.

Furthermore, a new framework was proposed by combining MP, STN, CNN, and softmax classification. The proposed deep models, especially MP-STN, significantly improved the classification accuracy. MP-STN achieved 94.61% and 86.87% in terms of OA on the Recology and Houston data sets, respectively, when the number of training samples for each class was 40.

The proposed idea of using the STN for optimal input identification can be applied to other classification problems in the remote sensing community, which is the next step of our research.

REFERENCES

[1] S. You, J. Hu, U. Neumann, and P. Fox, "Urban site modeling from LiDAR," in Proc. Int. Conf. Comput. Vis., 2003, pp. 579–588.
[2] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. (Jun. 2017). "PointNet++: Deep hierarchical feature learning on point sets in a metric space." [Online]. Available: https://arxiv.org/abs/1706.02413
[3] L. Zhang and L. Zhang, "Deep learning-based classification and reconstruction of residential scenes from large-scale point clouds," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 1887–1897, Apr. 2018.
[4] T. Hackel, J. D. Wegner, and K. Schindler, "Fast semantic segmentation of 3D point clouds with strongly varying density," ISPRS Ann. Photogramm., Remote Sens. Spatial Inf. Sci., vol. 3, pp. 177–184, Apr. 2016.
[5] U. Weidner, "Digital surface models for building extraction," in Automatic Extraction of Man-Made Objects From Aerial and Space Images (II), A. Gruen, E. P. Baltsavias, and O. Henricsson, Eds. Basel, Switzerland: Springer, 1997, pp. 193–202.
[6] C. S. Lo and C. Lin, "Growth-competition-based stem diameter and volume modeling for tree-level forest inventory using airborne LiDAR data," IEEE Trans. Geosci. Remote Sens., vol. 51, no. 4, pp. 2216–2226, Apr. 2013.
[7] S. K. Lodha, E. J. Kreps, D. P. Helmbold, and D. N. Fitzpatrick, "Aerial LiDAR data classification using support vector machines (SVM)," in Proc. Int. Symp. 3D Data Process., Vis. Transmiss., Chapel Hill, NC, USA, 2006, pp. 567–574.
[8] T. Sasaki, J. Imanishi, K. Ioki, Y. Morimoto, and K. Kitada, "Object-based classification of land cover and tree species by integrating airborne LiDAR and high spatial resolution imagery data," Landscape Ecol. Eng., vol. 8, no. 2, pp. 157–171, 2012.
[9] L. Naidoo, M. A. Cho, R. Mathieu, and G. Asner, "Classification of savanna tree species, in the greater Kruger National Park region, by integrating hyperspectral and LiDAR data in a Random Forest data mining environment," ISPRS J. Photogramm. Remote Sens., vol. 69, pp. 167–179, Apr. 2012.
[10] I. Sutskever and G. E. Hinton, "Deep, narrow sigmoid belief networks are universal approximators," Neural Comput., vol. 20, no. 11, pp. 2629–2636, 2008.
[11] A. Wang, X. He, P. Ghamisi, and Y. Chen, "LiDAR data classification using morphological profiles and convolutional neural networks," IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 774–778, May 2018.
[12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[13] C. Debes et al., "Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS data fusion contest," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2405–2418, Jun. 2014.
