
Ecological Indicators 125 (2021) 107591


Tree counting with high spatial-resolution satellite imagery based on deep neural networks
Ling Yao a,b,d,*, Tang Liu a,c, Jun Qin a,b,e, Ning Lu a,b,d, Chenghu Zhou a,b,d

a State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China
b Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
c China University of Geosciences, Beijing, China
d Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, China
e University of Chinese Academy of Sciences, Beijing, China

A R T I C L E  I N F O

Keywords:
Forestry
Deep learning algorithms
Remote sensing
Density estimation
Tree counting

A B S T R A C T

Forest inventory at the single-tree level is of great importance to modern forest management. The inventory contains two critical parameters about trees: their numbers and their spatial locations. Traditional methods to catalogue single trees are laborious, while deep neural networks can discover the multi-scale features hidden in images and thus make it possible to count trees with remote sensing imagery. In this study, four different tree counting networks, constructed by remodeling four classical deep convolutional neural networks, were evaluated to determine their ability to grasp the relationship between remote sensing images and tree locations for automatic, end-to-end tree counting. To this end, a tree counting dataset was constructed with remote sensing images of 0.8-m spatial resolution covering distinct regions. This dataset consisted of 24 GF-II images and the corresponding manually annotated tree locations. Thereafter, a large number of experiments were conducted to examine the performance of these networks with regard to tree counting. The results demonstrated that all networks achieved competitive performance (above 0.91) in terms of the coefficient of determination (R2) between the ground truth and the estimated values. The average accuracy of the Encoder-Decoder Network (one of the four networks) was greater than 91.58% and its R2 was 0.97, the best performance among the four. These findings show that deep learning is an efficient and effective means for the tree counting task.

1. Introduction

Forests play a key role in environmental processes. For example, they help maintain the water cycle, conserve soils, sequester carbon and protect the habitats of humans and animals. By virtue of these functions, many projects have been launched to improve environmental governance and ensure sustainable ecological development, such as the Global Billion Tree Campaign, China's Natural Forest Protection Project and the Three-North Shelterbelt Project (Cao et al., 2011; Hansen et al., 2013). For a wide range of scientific and applied purposes, there is an urgent need to build a reliable single-tree level forest inventory which records information on critical parameters, such as the number of trees, canopy size and location (FAO, 2016). The number of trees can be exploited to monitor and assess forest health and to provide a basic data reference for decision-making concerning forest management conducted by relevant agencies and regulatory authorities (Anderson, 2018; Caughlin et al., 2016; Pan et al., 2011). This is of great significance for optimizing forest management and utilization in the context of climatic change, and for achieving sustainable development goals relating to forests.

The traditional field inventory of individual trees is a time-consuming, strenuous and expensive task. Although Lidar can be applied to automatically detect trees, it cannot be widely applied due to its high cost. Therefore, it is more practical to use optical remote sensing data to estimate tree numbers on a large spatial scale. For example, Crowther et al. (2015) took more than two years to collect 429,775 datasets from more than 50 countries around the world to estimate the global tree number by means of certain statistical methods.

* Corresponding author at: State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources
Research, Chinese Academy of Sciences, Beijing, China.
E-mail address: yaoling@lreis.ac.cn (L. Yao).

https://doi.org/10.1016/j.ecolind.2021.107591
Received 24 September 2020; Received in revised form 4 March 2021; Accepted 6 March 2021
Available online 21 March 2021

With the coming of the era of high spatial resolution in remote sensing, individual trees have become discernible to the naked eye. However, the traditional methods widely used to handle remote sensing data with a coarser spatial resolution are unsuitable for performing tree counting tasks based on high-spatial-resolution satellite images (Culvenor, 2002; Zhang et al., 2019).

In recent years, artificial intelligence (AI) has been widely applied and has achieved great success in the domain of computer vision, which has inspired researchers in the field of remote sensing. Li et al. (2016) adopted an oil palm tree detection method based on deep learning, using manually labeled samples to train a CNN and predicting sample labels by sliding-window techniques. Khan and Gupta (2018) compared the performance of three methods in different vegetation regions, indicating that deep learning detection could be used as a fast and effective tool to determine the number of trees with drone images. Weinstein et al. (2019) described a method to identify tree canopies in RGB images by using a semi-supervised deep learning network. Although plenty of achievements have been obtained, some problems remain in these approaches. For example, some of them were designed for a specific region and were not applicable to multiple scenarios on a large scale (Maillard and Gomes, 2016; Wang et al., 2018; Xie et al., 2018). Moreover, many of these approaches attempted to delineate clear boundaries among different canopies, thus complicating the problem unnecessarily (Pouliot et al., 2002; Wagner et al., 2018). In terms of tree counting, the foremost concern is the number of trees rather than their exact locations. In addition, it is unnecessary to detect or segment the area covered by each tree.

As a matter of fact, an AI method termed density regression has been developed for object counting (Xie et al., 2016; Boominathan et al., 2016; Koon Cheang et al., 2017; Kong et al., 2019; Wu et al., 2019), which builds a functional relationship between an image and the objects of interest in that image end-to-end. However, this method was primarily developed for people counting (Sindagi and Patel, 2017). With respect to tree counting, there are many challenges, as the spatial patterns of trees in a remote sensing image are more complicated than those of people in a photograph. For instance, in contrast with people, distinct tree canopies vary tremendously in color, shape, size and so on, making it difficult to characterize trees exactly.

Fig. 1. Locations of the Study Areas.


Furthermore, the irregular spatial distribution and partial overlap of tree canopies are also more severe. Therefore, it is impossible to directly apply the existing density regression approaches, and much more exploration is required for tree counting. In this study, four classic architectures of deep neural networks were compared to test their abilities to count trees, during which the impact of different depths and structures on tree counting was fully examined, and the effect of tree density levels was also investigated.

This article is organized as follows. In Section 2, the method of tree counting based on density regression is briefly introduced and four different CNN structures are described in detail. The results of the experiments are shown in Section 3, and the detailed analysis and discussion are presented in Section 4. In the final part, the conclusion of the paper is given.

2. Materials and methods

This section describes how to address the tree counting problem based on a counting regression network. Firstly, we constructed a dataset for training the network. Secondly, we proposed four network architectures. Thirdly, we presented how to train and evaluate the networks.

2.1. Tree counting dataset

In this study, a total of 24 GF-II remote sensing images were selected to construct a dataset for training and validation. The study areas were located in Jiangsu Province, China, and their specific locations are shown in Fig. 1. These images have a size of 2400 × 2420 pixels with a spatial resolution of 0.8 m. The centers of tree canopies in these images were annotated by visual interpretation. These annotated points not only reflect the number of trees, but also mark their spatial distribution. The tree number in these images varies greatly, from about 800 to 60,000. Fig. 2A1–C1 display some small patches of these images with different landscapes such as cropland, urban residential area and hill. At the same time, their corresponding annotated points are shown in Fig. 2A2–C2.

2.2. Tree density map

It is extremely difficult to construct a relationship between gridded images and abstract points annotated by visual interpretation. Inspired by Lempitsky and Zisserman (2010), we first rasterized these points at the same spatial resolution (0.8 m) as the image and then blurred them with a discrete Gaussian kernel to construct the density map.

Fig. 2. Examples of the tree counting dataset in distinct landscapes (A1–3 respectively show the image, tree annotation and density map in a crop field with sparse
trees; B1–3 show the scenario in a residential area with medium density of trees; C1–3 show a scenario on a hill with dense trees).


Assuming that an annotated tree is located at pixel s_i and can be expressed as an indicator function 1{s = s_i}, where s denotes the index of any pixel, the map of the gridded tree distribution H(s) with N trees can be represented as:

H(s) = \sum_{i=1}^{N} 1\{s = s_i\}    (1)

To convert it into a density function, a discrete Gaussian blur kernel was convolved with the tree distribution map to acquire the tree density map:

D(s) = H(s) * G_{\sigma}(s)    (2)

A Gaussian function contains two key parameters, the standard deviation σ and the size of the Gaussian blur kernel, which determine the size and the integrated value of individual marker points in the density map. According to relevant investigations (Deng, 2009), the average canopy width of trees in the study area is about 4–6 m. Considering the spatial resolution of the remote sensing images, the size of the Gaussian blur kernel was defined as five pixels. Meanwhile, according to the Gaussian distribution, the generated density map contains more than 99% of the information when σ is 1.5, which ensures that the integral of the tree density map amounts to the number of annotated central points of tree canopies. The corresponding density maps for the images are shown in Fig. 2A3–C3.
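As a concrete illustration of Eqs. (1) and (2), the following sketch rasterizes annotated tree centres into a gridded map and convolves it with a 5 × 5 discrete Gaussian kernel (σ = 1.5); the kernel is normalised to sum to one so that the integral of the density map stays equal to the annotated tree count. This is a minimal illustration under those assumptions, not the authors' original code, and the function names are ours.

```python
import numpy as np
from scipy.ndimage import convolve


def gaussian_kernel(size=5, sigma=1.5):
    """Discrete Gaussian kernel normalised so that its entries sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def density_map(points, height, width, size=5, sigma=1.5):
    """Rasterise annotated tree centres (row, col) and blur them.

    The sum of the returned map equals the number of annotated trees,
    up to a small truncation error at the image border.
    """
    h_map = np.zeros((height, width), dtype=np.float64)        # H(s), Eq. (1)
    for r, c in points:
        h_map[int(r), int(c)] += 1.0
    return convolve(h_map, gaussian_kernel(size, sigma),
                    mode="constant", cval=0.0)                  # D(s), Eq. (2)


# Example: three annotated trees in a 256 x 256 tile
d = density_map([(40, 50), (41, 52), (200, 180)], 256, 256)
print(round(d.sum(), 3))  # ~3.0, i.e. the tree count
```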
When an image I and its corresponding density map D are available, the end-to-end regression problem is to search for a function F such that D = F(I).

2.3. Network architecture

There has been some research on object counting based on deep neural networks. Using a similar strategy, the problem of tree counting can be formulated as a density regression. However, both the partial overlap and the multi-scale characteristics of trees make tree counting more complicated than other object counting problems. Therefore, it is necessary to investigate whether the existing networks are suitable for this task. In this study, four classic deep network structures were evaluated on the tree counting problem in order to identify the most suitable one.

2.3.1. Alex-like network

The AlexNet network was originally designed for image classification. By virtue of its ability to capture the high-level abstract features of images, it has been widely applied in a variety of vision tasks, such as object segmentation (Chen et al., 2016) and crowd counting (Boominathan et al., 2016). In this study, an Alex-like Network was constructed to perform the end-to-end density regression.

The constructed network has 3 maximum pooling layers with a stride of 2, so the size of the output feature maps is 1/8 of that of the input image. To produce a density map with the same size as the input image, the fully connected layers were replaced with convolutional layers, and the spatial size was restored by an upsampling-convolution-ReLU structure, mapping the feature maps back to the original size. A bilinear interpolation, followed by a convolutional layer, was implemented in the up-sampling process. As the original AlexNet had a tremendous number of parameters to be optimized, the network was simplified to improve computational efficiency by replacing the large convolutional kernels with small ones; the detailed structure is illustrated in Fig. 3.
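The remodeling step described above (replacing the fully connected layers with convolutions and restoring the spatial size with an upsampling-convolution-ReLU head) can be sketched in PyTorch as follows. This is a hedged illustration of the general idea rather than the paper's exact configuration; the channel widths and layer counts are assumptions.

```python
import torch
import torch.nn as nn


class AlexLikeCounter(nn.Module):
    """Simplified AlexNet-style backbone for density regression.

    Three max-pool stages (stride 2) reduce the feature maps to 1/8 of the
    input size; an upsampling-convolution-ReLU head maps them back to a
    full-resolution, single-channel density map.
    """

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                    # 1/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                    # 1/4
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                    # 1/8
        )
        # Fully connected layers are replaced by convolutions so the network
        # stays fully convolutional and accepts tiles of any size.
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1), nn.ReLU(inplace=True),         # non-negative density
        )

    def forward(self, x):
        return self.head(self.backbone(x))


# A 256 x 256 RGB tile in, a 256 x 256 density map out
print(AlexLikeCounter()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1, 256, 256])
```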
2.3.2. VGG-like network

Compared to AlexNet, VGGNet (Simonyan and Zisserman, 2014) uses smaller filter sizes and deeper network structures. In this study, VGGNet was applied to explore whether deeper networks can outperform the Alex-Like Network with its relatively shallow structure. VGGNet has 5 max-pool layers, each with a stride of 2, so the size of the output feature maps is only 1/32 of that of the input image. Inspired by CrowdNet (Boominathan et al., 2016; Wu et al., 2019), we removed the last pooling layer, set the stride of the fourth max-pool layer to 1, and used the hole (atrous convolution) technique (Chen et al., 2016) to handle the mismatch of the receptive field. Similar to the Alex-Like Network, the upsampling-convolution-ReLU method was also used to recover the dense feature maps to the original size, and a bilinear interpolation method was used in the up-sampling process. The VGG-Like Network is illustrated in detail in Fig. 4.

2.3.3. Combined network

In remote sensing images, a tree may appear as a small blob, and a deep network may lose such blob features. Since blob detection does not require high-level characteristics, the combination of a shallow net and a deep net (Boominathan et al., 2016) seems to be a possible solution. The intention behind CrowdNet was to recognize the low-level blob patterns of heads. Although such problems are less likely to appear in remote sensing images, the sizes of trees vary greatly, especially in high-resolution images in which some small trees appear only as blobs. Therefore, a shallow network was designed to capture these blob features, and the VGG-Like Network structure was applied to handle the high-level features. For the blob features, a structure with 3 convolutional layers was designed and named the 'shallow net'. The outputs of the deep and shallow nets were concatenated and then fed into a 1×1 convolutional layer to obtain the final prediction. This structure was named the Combined Network; its output was upsampled to the size of the input image based on bilinear interpolation to obtain the final tree density prediction. To ensure that no features are lost through maximum pooling, average pooling layers were used in the shallow network; the details are illustrated in Fig. 5.
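The fusion step of the Combined Network (concatenating the shallow blob branch with the deeper branch and reducing the result with a 1×1 convolution) might look like the sketch below. The branch definitions are placeholders chosen for brevity; only the concatenate-then-1×1-convolve pattern and the use of average pooling in the shallow branch are taken from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinedCounter(nn.Module):
    """Two-branch density regressor: a shallow 'blob' branch with average
    pooling plus a deeper branch, fused by a 1x1 convolution."""

    def __init__(self):
        super().__init__()
        # Deep branch: stands in for the VGG-like feature extractor (placeholder depth)
        self.deep = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # Shallow branch: three conv layers with average pooling to keep weak blob responses
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 24, 5, padding=2), nn.ReLU(inplace=True), nn.AvgPool2d(2),
            nn.Conv2d(24, 24, 5, padding=2), nn.ReLU(inplace=True), nn.AvgPool2d(2),
            nn.Conv2d(24, 24, 5, padding=2), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(256 + 24, 1, kernel_size=1)

    def forward(self, x):
        d = self.deep(x)                                        # 1/8 resolution
        s = self.shallow(x)                                     # 1/4 resolution
        d = F.interpolate(d, size=s.shape[2:], mode="bilinear", align_corners=False)
        out = self.fuse(torch.cat([d, s], dim=1))               # concatenate, then 1x1 conv
        # Final bilinear upsampling back to the input resolution
        return F.interpolate(out, size=x.shape[2:], mode="bilinear", align_corners=False)


print(CombinedCounter()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1, 256, 256])
```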
value of the bilinear interpolation was determined according to the pixel
the fully connected layers were replaced with convolutional layers, and
values of the four nearest points around the target point. Given the
the spatial size was restored by the upsampling-convolution-ReLU
uncertainty of tree distribution, this rough interpolation may cause an
structure, mapping the feature maps back to the original size. A
accumulation of errors in some regions. Inspired by image segmentation
bilinear interpolation, followed by a convolutional layer, was

Fig. 3. Diagram of the Alex-Like Network Structure (the boxes in blue represent convolution layers, in green represent dropout layers, and in orange represent max
pool layers). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 4. Diagram of the VGG-Like Network Structure.

Fig. 5. Diagram of the Combined Network Structure.

Inspired by image segmentation tasks (Ronneberger et al., 2015), we found that a longer decoder path that is symmetrical with the encoder, together with a skip-connection operation, may be helpful in predicting density maps. The results of different pooling layers were combined to optimize the output during up-sampling, which is beneficial for capturing features at different scales, a common property of trees in remote sensing images. U-net is a classic symmetrical Encoder-Decoder structure which has been widely used since its introduction, benefiting from its satisfactory performance. To conduct tree counting, the structure of the classic U-net was modified to adapt it to density map regression, and the proposed network adopted the symmetrical structure. In the encoding phase, the network settings were almost the same as those of the VGG-Like Network, and the output layer was removed. In the decoding phase, four deconvolution layers were used for up-sampling. After passing through each deconvolution layer, concatenation was used to connect the output of the corresponding pooling layer in the encoding phase with the output of the deconvolution layer, so as to obtain features at different scales. Through this symmetrical structure, the size of the output density map is the same as that of the input image, and the error caused by direct interpolation can be reduced. In addition, the skip-connections allow shallow features to be passed into the decoding process, thus avoiding the loss of blob features along long network paths. The structure of the Encoder-Decoder Network is detailed in Fig. 6.
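A single decoding step of the kind described above (deconvolution, concatenation with the corresponding encoder feature map, then convolution) might be written as follows. The channel counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class UpBlock(nn.Module):
    """One decoder stage: 2x deconvolution, skip-connection concatenation,
    then two 3x3 convolutions, as in a U-Net-style decoder."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # double the spatial size
        x = torch.cat([x, skip], dim=1)     # skip-connection from the encoder
        return self.conv(x)


# Toy check: a 32 x 32 bottleneck and its 64 x 64 encoder counterpart
block = UpBlock(in_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 128, 64, 64])
```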
2.4. Learning and implementation

To evaluate the performance of the four networks, the detailed setup of the experiment, the training and testing procedures, the loss function and the evaluation metrics are introduced in this subsection.

2.4.1. Loss function

The learning of the regression network is guided by a loss function, which was defined as follows:

L = \sum_{i=1}^{W} \sum_{j=1}^{H} (p_{ij} - t_{ij})^2    (3)

where W and H denote the width and height of the input image respectively, while p_{ij} and t_{ij} represent the estimated density value and the ground truth density at pixel (i, j) respectively.
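Equation (3) is the pixel-wise sum of squared differences between the predicted and ground-truth density maps. In PyTorch it could be written, for instance, as below; averaging over the batch is our own convention, not stated in the paper.

```python
import torch


def density_loss(pred, target):
    """Pixel-wise squared-error loss of Eq. (3).

    pred, target: tensors of shape (batch, 1, H, W). The squared differences
    are summed over each map and then averaged over the batch (an assumption).
    """
    return ((pred - target) ** 2).sum(dim=(1, 2, 3)).mean()
```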

Fig. 6. Diagram of the Encoder-Decoder Network Structure (the boxes in blue represent convolution layers, in green represent dropout layers, in orange represent
max pool layers, and in red represent up-sampling layers, and the dash arrow represents the skip-connection operation). (For interpretation of the references to color
in this figure legend, the reader is referred to the web version of this article.)


2.4.2. Experiment setup

In this study, the K-fold cross-validation method was adopted to train, validate and test the four networks on the tree counting dataset introduced in Subsection 2.1. The dataset was randomly divided into six groups, each containing four images. In each fold of the cross-validation, five groups (20 images) were treated as the training dataset while the remaining group (four images) was treated as the test dataset. It is worth noting that all the networks share the same settings. Since the full images were too large to be fed into the GPU for training, each image was clipped into sub-images with a size of 256 × 256, and the corresponding density maps were handled in the same manner.

This procedure yielded 2000 training sub-images and 400 testing sub-images in each fold of the cross-validation. In the training stage, 1800 sub-images were used for training while the rest were used for validation. In the testing stage, the 400 testing sub-images were used to evaluate the network performance in each fold. Through cross-validation, all images were used for training and testing to assess the overall accuracy of the models. All the networks were implemented under the deep learning framework PyTorch, and the Adam optimization algorithm was used to train them.
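A minimal training loop consistent with the setup described above (PyTorch, the Adam optimizer and the loss of Eq. (3)) might look like the sketch below, reusing `AlexLikeCounter` and `density_loss` from the earlier sketches. The learning rate, the number of epochs and the synthetic dataset are assumptions for illustration, not values reported in the paper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real (image, density map) tile pairs
train_set = TensorDataset(torch.randn(8, 3, 256, 256), torch.rand(8, 1, 256, 256))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AlexLikeCounter().to(device)                       # any of the four networks
loader = DataLoader(train_set, batch_size=64, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate

for epoch in range(300):
    model.train()
    running = 0.0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = density_loss(model(images), targets)
        loss.backward()
        optimizer.step()
        running += loss.item()
    print(f"epoch {epoch}: loss {running / len(loader):.4f}")
```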
2.4.3. Evaluation metrics

To examine the performance of the four networks for tree counting, a series of experiments were conducted on the tree counting dataset. In each fold of the cross-validation, the experiment was repeated three times.

In this study, three standard metrics, the Mean Absolute Error (MAE), R-Squared (R2) and Root Mean Squared Error (RMSE), were used for evaluation. They are defined as follows:

MAE = \frac{1}{N} \sum_{i=1}^{N} |t_i - p_i|    (4)

R^2 = 1 - \frac{\sum_{i=1}^{N} (t_i - p_i)^2}{\sum_{i=1}^{N} (t_i - \bar{t})^2}    (5)

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (t_i - p_i)^2}    (6)

where N denotes the number of samples, t_i denotes the ground truth, \bar{t} denotes the sample mean of the ground truth, p_i denotes the estimate, and i is the sample index.
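The three metrics of Eqs. (4) to (6) can be computed from the annotated and estimated counts as in the following straightforward NumPy transcription (not the authors' evaluation script; the example numbers are arbitrary):

```python
import numpy as np


def evaluate(t, p):
    """Return MAE, RMSE and R2 of Eqs. (4)-(6) for ground truth t and estimates p."""
    t, p = np.asarray(t, dtype=float), np.asarray(p, dtype=float)
    mae = np.mean(np.abs(t - p))
    rmse = np.sqrt(np.mean((t - p) ** 2))
    r2 = 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)
    return mae, rmse, r2


print(evaluate([100, 250, 900], [90, 240, 870]))
```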
3. Results

3.1. Evaluating the performance of the proposed networks in the training process

Firstly, the performance of the four networks (Alex-Like Network, VGG-Like Network, Combined Network and Encoder-Decoder Network) in the training process was evaluated. In addition, to prevent over-fitting, the networks with the minimal training loss, the minimal validation loss, the minimal training MAE and the minimal validation MAE were saved respectively. Taking Fold-3 as an example, the loss curves and MAE curves of the four models are shown in Fig. 7A and B. It can be observed that the deeper the model, the lower the training loss that could be achieved. The Encoder-Decoder Network obtained the minimal loss, while the losses of the other methods became quite stable after 300 epochs. It is worth noting that the VGG-Like Network had a lower loss value than the Combined Network, but the Combined Network performed better in terms of the MAE metric.

However, lower training errors do not imply lower validation errors. As shown in Fig. 8, the validation loss values of both the VGG-Like Network and the Encoder-Decoder Network fluctuated markedly and were also higher than that of the Combined Network. Even though the Alex-Like Network had a lower validation loss than the VGG-Like Network, the validation MAE showed a different result. Although the Encoder-Decoder Network fluctuated markedly, its result was better than those of the other three methods.

The experimental results demonstrated that the Encoder-Decoder Network achieved the best overall counting performance, but the behaviour of its validation loss showed that the network may be overfitted on the training dataset, which might also be caused by the fixed Gaussian kernels and the varying tree sizes. The inferiority of the Alex-Like Network may be attributed to the limited capacity of its relatively shallow architecture, which cannot capture the deep structure hidden in the data. Compared to the Combined Network, the VGG-Like Network fitted the training data better while exhibiting a higher validation error, suggesting that the VGG-Like Network may not generalize well.

3.2. Evaluating the performance of the proposed networks on the test dataset

In this subsection, we evaluated the performance of the four networks on the test dataset. Table 1 shows the corresponding metrics obtained from each fold for every 400 sub-images.

Fig. 7. Learning Curves for the Proposed Networks of (A) Loss and (B) MAE on the Training Dataset.


Fig. 8. Learning Curves for the Proposed Networks of (A) Loss and (B) MAE on the Validation Dataset.

Table 1
Statistic metrics of all four methods on all sub-images for different folds. Bold indicates the best performance in that fold.
Network Name Fold-1 Fold-2 Fold-3 Fold-4 Fold-5 Fold-6 Average

MAE Alex-Like Network 81.32 47.60 44.52 40.82 50.72 111.24 62.70
VGG-Like Network 90.45 32.48 37.73 44.88 60.52 111.25 62.89
Combined Network 75.14 34.70 40.35 43.76 46.41 92.11 55.41
E-D Network 67.52 22.66 33.94 30.70 37.93 91.28 47.34

RMSE Alex-Like Network 135.45 72.06 84.09 85.76 119.69 192.94 114.99
VGG-Like Network 143.63 49.65 66.09 86.84 101.85 189.04 106.18
Combined Network 121.50 54.89 64.96 82.41 88.04 168.20 96.67
E-D Network 114.92 36.12 65.75 67.43 76.53 138.25 83.17

R2 Alex-Like Network 0.890 0.809 0.934 0.878 0.854 0.923 0.881
VGG-Like Network 0.880 0.911 0.943 0.871 0.898 0.910 0.902
Combined Network 0.907 0.883 0.973 0.887 0.914 0.926 0.915
E-D Network 0.914 0.949 0.945 0.924 0.934 0.932 0.933

As can be seen, the average R2 values over the six folds exceeded 0.88 for the four networks, the average MAE values were below 63, and the average RMSE values were lower than 115. In particular, the Encoder-Decoder Network achieved an average MAE of 47.34, an average RMSE of 83.17 and an average R2 of 0.933, surpassing the second-best approach by 14.6% for MAE and 13.9% for RMSE. The testing results differed among the folds. For example, the Encoder-Decoder Network performed best in every fold except fold-3, where the R2 of the Combined Network exceeded 0.97, and the performance of the VGG-Like Network was better than that of the Combined Network in fold-2.

The comparisons between the annotated tree numbers and the estimated ones on the sub-images are presented in Fig. 9, where different folds are represented by their own colors. As shown in the figure, the four networks achieved R2 values as high as 0.85, 0.87, 0.90 and 0.93 respectively. However, almost all points lay below the 1:1 line, indicating that the tree numbers estimated by all four networks were lower than the annotated tree numbers. This is because the "ground truth" was obtained by visual interpretation. As is well known, a tree may have multiple canopies. In locations where trees are sparse, it is relatively easy to differentiate them and obtain an accurate number by visual interpretation; however, when tree density is high, annotators are prone to marking extra trees. It can be seen in Fig. 9 that there was almost no bias for sub-images with sparse trees, while negative biases occurred for dense trees.

3.3. Evaluating the efficacy of the proposed networks on the entire image

In practice, people are more likely to be interested in the performance of these four networks on an entire image. In this section, the performance of the four networks on the entire images was evaluated. The trained networks were run on the four testing images in each fold, using a sliding window with a moving step of 128 pixels to reduce uncertainties at the edges of the estimated density map, and a simple average was calculated for the overlapping part of two contiguous density maps. The average MAE value was lower than approximately 3733 for all four networks, among which the Encoder-Decoder Network achieved the lowest MAE and RMSE. In terms of MAE, the Encoder-Decoder Network surpassed the second-best by 28.5%, while its RMSE was lower than that of the second best by 17.0%. The Combined Network was the second-best and outperformed the VGG-Like Network by 21.6% in terms of MAE and 13.2% in terms of RMSE. The VGG-Like Network obtained an improvement of up to 18.3% over the Alex-Like Network in terms of RMSE, as shown in Table 2.
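The whole-image procedure described above (a window moved with a step of 128 pixels, with predictions averaged where windows overlap, and the count obtained as the integral of the merged density map) can be sketched as follows; `model` is assumed to map a tile to a density map as in the earlier sketches, and border remainders smaller than one tile are ignored here for simplicity.

```python
import numpy as np
import torch


def count_trees(model, image, tile=256, step=128, device="cpu"):
    """Slide a tile x tile window over an image (C, H, W array), average the
    predicted density maps where windows overlap, and return the merged
    density map together with the estimated tree count (its integral)."""
    c, h, w = image.shape
    density = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    model.eval()
    with torch.no_grad():
        for top in range(0, h - tile + 1, step):
            for left in range(0, w - tile + 1, step):
                patch = torch.from_numpy(image[:, top:top + tile, left:left + tile])
                pred = model(patch.unsqueeze(0).float().to(device))[0, 0].cpu().numpy()
                density[top:top + tile, left:left + tile] += pred
                weight[top:top + tile, left:left + tile] += 1.0
    density /= np.maximum(weight, 1.0)      # simple average in the overlaps
    return density, density.sum()
```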
As shown in Fig. 10, the Encoder-Decoder Network performed best in most cases. However, in rare cases the number of trees estimated by the Encoder-Decoder Network was not as accurate as that from the other models. For example, for the 6th image the VGG-Like Network performed better than the Encoder-Decoder Network, and for the 23rd image the Combined Network defeated the Encoder-Decoder Network. Moreover, for almost all images the Encoder-Decoder Network tended to underestimate the tree numbers.


Fig. 9. Scatter Plots of the Annotated Numbers of Trees Versus the Estimated Number of Trees Using All Four Models in All the Sub-Images (A: Alex-Like Network; B:
VGG-Like Network; C: Combined Network and D: Encoder-Decoder Network). The blue dotted line is 1:1 line. Different folds are represented by different colors and
the linear fitting line is obtained by fitting the data of all folds. (For interpretation of the references to color in this figure legend, the reader is referred to the web
version of this article.)

Even so, the accuracy of the Encoder-Decoder Network still exceeded 79% for the 6th image and 85% for the 23rd image. Therefore, the effectiveness of the Encoder-Decoder Network would be acceptable in most, if not all, cases.

Table 2
Statistic metrics of all the proposed methods.

        Alex-Like Network   VGG-Like Network   Combined Network   Encoder-Decoder Network
MAE     3732.629            3386.536           2653.536           1896.098
RMSE    5938.168            4850.052           4211.691           3497.131
R2      0.9157              0.9313             0.9457             0.9758

Fig. 11 depicts the regression relationship between the ground truth and the estimates based on the four methods for the 24 images in the dataset. All four networks performed well in estimating the number of trees. As far as the error metrics are concerned, the performance of the four networks on the original images was better than that on the clipped sub-images.


Fig. 10. Estimated Numbers Obtained from the Proposed Models Versus the Annotated Numbers for Each of the 24 Images (the green line represents the ground
truth). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The Alex-Like Network achieved an R2 of 0.9157, even though it performed worst on the clipped samples, and the best method reported an R2 value of 0.9758, indicating that all four networks can effectively estimate the number of trees, among which the Encoder-Decoder Network performed best. Furthermore, most of the points were located below the 1:1 line in Fig. 11, especially for the Alex-Like and VGG-Like Networks, implying that the tree numbers were underestimated for most of the images. Compared to the estimation on sub-images, the boundary effect can be relieved to a certain degree when the estimation is conducted on the entire images with the sliding-window method, as the errors can partially cancel each other out.

4. Discussion

In the tree counting dataset, both the spatial pattern and the location of trees vary greatly. Some of the trees queue along roadsides, some are located in residential areas, while others are randomly distributed in dense forests. It is therefore of great interest to investigate the network performance in these three distinct landscapes with different tree densities (sparse, medium and dense), as shown in Table 3.

Table 3 lists the MAE and RMSE metrics for the above three landscapes. The Encoder-Decoder Network had the lowest MAE in the medium and dense scenarios, while its MSE was also competitive. When the tree density was sparse, the Combined Network performed better. At the medium density level, the Alex-Like Network had the second-lowest MAE, and the Encoder-Decoder Network defeated the Alex-Like Network with an improvement of 25.19% in MAE and 25.86% in RMSE.

As mentioned above, the density regression method is suitable not only for tree counting but also for tree locating. Fig. 12 displays the estimated tree density maps and their corresponding ground truth values; all the density maps use the same stretching scheme. It clearly shows that the four methods can approximately reproduce the spatial distribution of trees to different degrees. The Alex-Like Network, VGG-Like Network and Combined Network mistakenly assigned density values to some places where no trees existed. This was especially true of the Alex-Like Network, for which some misalignments on the right side of the figure led to a prediction error of 12 trees. In Fig. 12B, compared to the VGG-Like Network, the Combined Network captured some shallow features, as shown in the lower-left corner of the picture; these features were successfully acquired by the Alex-Like Network but missed by the VGG-Like Network. As aforementioned, due to the multi-scale spatial characteristics of trees, the ability of a network to capture trees is a decisive factor for accurate tree counting. Therefore, more advanced network structures (for example, self-attention) deserve to be investigated in further research.

As shown in Fig. 12, although the total numbers of trees are similar, the Alex-Like Network, VGG-Like Network and Combined Network tended to underestimate the tree densities in areas with high ground truth values. The main reason is that a bilinear interpolation was used to match the output size to the ground truth map, which spread the feature values evenly; this is also the reason why the training error could not decline continuously. The Encoder-Decoder Network achieved results that look satisfactory, outputting a density map with a higher center value and a relatively low surrounding value for a single tree. But it also induced errors in some situations, as shown in Fig. 12B. This is partially because the manual annotations varied from person to person, and partially because both the shape and the size of trees change in different regions. To address this problem, the quality of the density map needs to be enhanced.

In addition, we compared the processing time of the different methods; all the experiments were conducted on the same equipment (Nvidia Tesla V100) with the same mini-batch size of 64. The training times of the different approaches are presented in Table 4. It is not surprising that the deeper the neural network, the more parameters need to be trained and the longer the training time required. Fortunately, such differences were within a small range per epoch. The number of parameters of the Encoder-Decoder Network is about three times that of the VGG-Like Network, but it took only 25% more time than the VGG-Like Network. Moreover, long training times are becoming less of a limitation with the rapid development of hardware, especially GPUs.

5. Conclusions

In this article, we explored the feasibility of deep learning for tree counting and formulated it as a density regression. A tailored tree counting dataset was constructed with 24 GF-II images, in which the locations of the central points of tree canopies were annotated manually. The performance of four classic CNN-based network structures (Alex-Like, VGG-Like, Combined Network and Encoder-Decoder Network) was evaluated with regard to tree counting.


Fig. 11. Scatter Plots of the Estimated Numbers of Trees Versus the Annotated Numbers of Trees for All Four Models on the Whole Images (A: Alex-Like Network; B: VGG-Like Network; C: Combined Network and D: Encoder-Decoder Network). The dotted yellow line is the 1:1 line. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3
Statistic metrics of all the proposed methods at different density levels.

         MAE                                                     MSE
         Alex-Like  VGG-Like  Combined  Encoder-Decoder          Alex-Like  VGG-Like  Combined  Encoder-Decoder
Sparse   32.51      29.51     25.67     26.19                    42.76      36.72     33.69     35.39
Medium   95.06      98.03     98.44     71.11                    128.71     130.244   128.24    95.43
Dense    384.40     314.42    281.33    230.50                   447.41     391.51    343.21    271.08

A large number of experiments showed that deep learning is well suited to tree counting, and the Encoder-Decoder Network achieved the best performance in a variety of scenarios, with a high coefficient of determination (R2) of around 0.975. These models can be applied to directly obtain a fine spatial distribution of trees, as well as their number, from high-resolution satellite images, which would potentially facilitate both surveys of forestry resources and the assessment of green assets.

Several interesting points deserve to be further explored in the future.


Fig. 12. Density Maps for Ground Truth, Alex-Like Network, VGG-Like Network, Combined Network and Encoder-Decoder Network.

Firstly, since this study only used RGB data, the counting method can be easily extended to plantation forests or key protected forest areas, which makes it possible to quickly monitor the dynamic changes of forests and to serve relevant policy making and enforcement. Secondly, the significance of tree counting lies not in the number of trees per se, but in the possibility of studying the relationships among trees, ecology and human activities from a new perspective, in order to deepen our understanding of terrestrial ecosystems. Finally, although we have obtained encouraging results, more work still needs to be done in further research. For example, the optimum or desirable spatial scale for detecting individual trees from satellite data can be discussed in the future. Moreover, additional spectral bands or other information might be beneficial for improving the performance.

Table 4
Training time per epoch and number of parameters for the four proposed networks.

                    Alex-Like Network   VGG-Like Network   Combined Network   Encoder-Decoder Network
Training time (s)   36.01               42.10              46.82              51.12
Parameters          0.6 M               12.39 M            12.41 M            34.73 M

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was jointly supported by the National Natural Science Foundation of China (Grant No. 41771380), the Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) (GML2019ZD0301), the National Postdoctoral Program for Innovative Talents (BX20200100), China, as well as the National Data Sharing Infrastructure of Earth System Science (http://www.geodata.cn/).

References

Anderson, C.B., 2018. Biodiversity monitoring, earth observations and the ecology of scale. Ecol. Lett. 21 (10), 1572–1585.
Boominathan, L., Kruthiventi, S.S.S., Venkatesh Babu, R., 2016. CrowdNet: A Deep Convolutional Network for Dense Crowd Counting. arXiv e-prints: arXiv:1608.06197.
Cao, S., Sun, G., Zhang, Z., Chen, L., Feng, Q., Fu, B., McNulty, S., Shankman, D., Tang, J., Wang, Y., Wei, X., 2011. Greening China naturally. Ambio 40 (7), 828–831.
Caughlin, T.T., Graves, S.J., Asner, G.P., van Breugel, M., Hall, J.S., Martin, R.E., Ashton, M.S., Bohlman, S.A., 2016. A hyperspectral image can predict tropical tree growth rates in single-species stands. Ecol. Appl. 26 (8), 2369–2375.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2016. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 834–848.
Crowther, T.W., Glick, H.B., Covey, K.R., Bettigole, C., Maynard, D.S., Thomas, S.M., Smith, J.R., Hintler, G., Duguid, M.C., Amatulli, G., Tuanmu, M.N., Jetz, W., Salas, C., Stam, C., Piotto, D., Tavani, R., Green, S., Bruce, G., Williams, S.J., Wiser, S.K., Huber, M.O., Hengeveld, G.M., Nabuurs, G.J., Tikhonova, E., Borchardt, P., Li, C.F., Powrie, L.W., Fischer, M., Hemp, A., Homeier, J., Cho, P., Vibrans, A.C., Umunay, P.M., Piao, S.L., Rowe, C.W., Ashton, M.S., Crane, P.R., Bradford, M.A., 2015. Mapping tree density at a global scale. Nature 525 (7568), 201–205.
Culvenor, D.S., 2002. TIDA: an algorithm for the delineation of tree crowns in high spatial resolution remotely sensed imagery. Comput. Geosci. 28 (1), 33–44.
Deng, G., 2009. Research on Individual Tree Identification and Crown Segmentation Algorithm in High Spatial Resolution Remote Sensing Imagery. Chinese Academy of Forestry, Beijing, China.
FAO, 2016. State of the World's Forests. Forests and Agriculture: Land-Use Challenges and Opportunities. UNO, Rome.
Hansen, M.C., Potapov, P.V., Moore, R., Hancher, M., Turubanova, S.A., Tyukavina, A., Thau, D., Stehman, S.V., Goetz, S.J., Loveland, T.R., Kommareddy, A., Egorov, A., Chini, L., Justice, C.O., Townshend, J.R.G., 2013. High-resolution global maps of 21st-century forest cover change. Science 342 (6160), 850–853.
Khan, S., Gupta, P.K., 2018. Comparitive study of tree counting algorithms in dense and sparse vegetative regions. In: ISPRS – International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Beijing, China, pp. 801–808.
Koon Cheang, E., Koon Cheang, T., Haur Tay, Y., 2017. Using Convolutional Neural Networks to Count Palm Trees in Satellite Images. arXiv e-prints: arXiv:1701.06462.
Lempitsky, V., Zisserman, A., 2010. Learning to count objects in images. In: International Conference on Neural Information Processing Systems, pp. 1324–1332.
Li, W., Fu, H., Yu, L., Cracknell, A., 2016. Deep learning based oil palm tree detection and counting for high-resolution remote sensing images. Remote Sensing 9 (1), 22.


Maillard, P., Gomes, M.F., 2016. Detection and counting of orchard trees from VHR images using a geometrical-optical model and masked template matching. XXIII ISPRS Congress, Prague, Czech Republic.
Pan, Y., Birdsey, R.A., Fang, J., Houghton, R., Kauppi, P.E., Kurz, W.A., Phillips, O.L., Shvidenko, A., Lewis, S.L., Canadell, J.G., Ciais, P., Jackson, R.B., Pacala, S.W., McGuire, A.D., Piao, S., Rautiainen, A., Sitch, S., Hayes, D., 2011. A large and persistent carbon sink in the World's forests. Science 333 (6045), 988–993.
Pouliot, D.A., King, D.J., Bell, F.W., Pitt, D.G., 2002. Automated tree crown detection and delineation in high-resolution digital camera imagery of coniferous forest regeneration. Remote Sensing Environ. 82, 322–334.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-prints: arXiv:1409.1556.
Sindagi, V.A., Patel, V.M., 2017. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognit. Lett. 107, 3–16.
Wagner, F.H., Ferreira, M.P., Sanchez, A., Hirye, M.C.M., Zortea, M., Gloor, E., Phillips, O.L., de Souza Filho, C.R., Shimabukuro, Y.E., Aragão, L.E.O.C., 2018. Individual tree crown delineation in a highly diverse tropical forest using very high resolution satellite images. ISPRS J. Photogramm. Remote Sensing 145, 362–377.
Wang, Y., Zhu, X., Wu, B., 2018. Automatic detection of individual oil palm trees from UAV images using HOG features and an SVM classifier. Int. J. Remote Sensing 40 (19), 7356–7370.
Kong, W., Li, H., Xing, G., Zhao, F., 2019. An automatic scale-adaptive approach with attention mechanism-based crowd spatial information for crowd counting. IEEE Access pp (99), 1–1.
Weinstein, B.G., Marconi, S., Bohlman, S., Zare, A., White, E., 2019. Individual tree-crown detection in RGB imagery using semi-supervised deep learning neural networks. Remote Sensing 11 (11).
Wu, J., et al., 2019. Automatic counting of in situ rice seedlings from UAV images based on a deep fully convolutional neural network. Remote Sensing 11 (6).
Xie, W., Noble, J.A., Zisserman, A., 2016. Microscopy cell counting and detection with fully convolutional regression networks. Comput. Methods Biomech. Biomed. Eng.: Imaging Visualization 6 (3), 283–292.
Xie, Y., Bao, H., Shekhar, S., Knight, J., 2018. A TIMBER framework for mining urban tree inventories using remote sensing datasets. In: 2018 IEEE International Conference on Data Mining, pp. 1344–1349.
Zhang, X., Long, T., He, G., Yantao, G., 2019. Global forest cover mapping using Landsat and Google Earth Engine cloud computing. In: 2019 8th International Conference on Agro-Geoinformatics, pp. 1–5.

