
Remote Sensing

Article
Vegetation Detection Using Deep Learning and
Conventional Methods
Bulent Ayhan 1 , Chiman Kwan 1, * , Bence Budavari 1 , Liyun Kwan 1 , Yan Lu 2 , Daniel Perez 2 ,
Jiang Li 2 , Dimitrios Skarlatos 3 and Marinos Vlachos 3
1 Applied Research LLC, Rockville, MD 20850, USA; bulent.ayhan@signalpro.net (B.A.);
bencebudavari@gmail.com (B.B.); martin.kwan.97@gmail.com (L.K.)
2 Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23259, USA;
yxxlu003@odu.edu (Y.L.); dpere013@odu.edu (D.P.); jli@odu.edu (J.L.)
3 Department of Civil Engineering and Geomatics, Cyprus University of Technology, P.O. Box 50329,
Limassol 3603, Cyprus; dimitrios.skarlatos@cut.ac.cy (D.S.); marinos.vlachos@cut.ac.cy (M.V.)
* Correspondence: chiman.kwan@signalpro.net

Received: 27 June 2020; Accepted: 31 July 2020; Published: 4 August 2020

Abstract: Land cover classification with a focus on chlorophyll-rich vegetation detection plays an
important role in urban growth monitoring and planning, autonomous navigation, drone mapping,
biodiversity conservation, etc. Conventional approaches usually apply the normalized difference
vegetation index (NDVI) for vegetation detection. In this paper, we investigate the performance of deep
learning and conventional methods for vegetation detection. Two deep learning methods, DeepLabV3+
and our customized convolutional neural network (CNN), were evaluated with respect to their
detection performance when training and testing datasets originated from different geographical sites
with different image resolutions. A novel object-based vegetation detection approach, which utilizes
NDVI, computer vision, and machine learning (ML) techniques, is also proposed. The vegetation
detection methods were applied to high-resolution airborne color images which consist of RGB and
near-infrared (NIR) bands. RGB color images alone were also used with the two deep learning
methods to examine their detection performances without the NIR band. The detection performances
of the deep learning methods with respect to the object-based detection approach are discussed and
sample images from the datasets are used for demonstrations.

Keywords: NDVI; deep learning; machine learning; vegetation; DeepLabV3+; CNN

1. Introduction
Land cover classification [1] has been widely used in change detection monitoring [2], construction
surveying [3], agricultural management [4], green vegetation classification [5], identifying emergency
landing sites for UAVs [6,7], biodiversity conservation [8], land-use [9], and urban planning [10].
One important application of land cover classification is vegetation detection. In Skarlatos et al. [3],
chlorophyll-rich vegetation detection was a crucial stepping stone to improve the accuracy of the
estimated digital terrain model (DTM). Upon detection of vegetation areas, they were automatically
removed from the digital surface model (DSM) to have better DTM estimates. Bradley et al. [11]
conducted chlorophyll-rich vegetation detection to improve autonomous navigation in natural
environments for autonomous mobile robots operating in off-road terrain. Zare et al. [12] used
vegetation detection to minimize false alarms in mine detection, since some vegetation, such as round
bushes, was mistakenly identified as mines by mine detection algorithms; they demonstrated
that vegetation detection improves mine detection results. Miura et al. [13] used vegetation detection
techniques to monitor vegetation areas in the Amazon and track the temporal and spatial changes

due to forest fires and growing population. Some conventional vegetation detection methods are
based on the normalized difference vegetation index (NDVI) [14–16], which takes advantage of the different
solar radiation absorption phenomena of green plants in the red spectral and near-infrared spectral
regions [17,18].
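For reference, the standard NDVI referred to throughout this paper is the band ratio computed from the red (R) and near-infrared (NIR) band values,

NDVI = (NIR − R) / (NIR + R)

which lies between −1 and 1; chlorophyll-rich vegetation typically yields positive values.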
Since 2012, deep learning methods have found a wide range of applications in land cover
classification after several breakthroughs were reported in a variety of computer vision
tasks, including image classification, object detection and tracking, and semantic segmentation.
Senecal et al. [19] used convolutional networks for multispectral image classification. Zhang et al. [20]
provided a comprehensive review of land cover classification and object detection approaches using
high-resolution imagery. The authors evaluated the performances of deep learning models against
traditional approaches and concluded that the deep-learning-based methods provide an end-to-end
solution and show better performance than the traditional pixel-based methods by utilizing both spatial
and spectral information, whereas the traditional pixel-based methods produce salt-and-pepper-type noisy
land cover map estimates. A number of other works have also shown that semantic segmentation
with deep learning methods is quite promising for land cover classification [21–24].
DeepLabV3+ [25] is a semantic segmentation method that provided very promising results
in the PASCAL VOC-2012 data challenge [26]. For the PASCAL VOC-2012 dataset, DeepLabV3+
currently has the best ranking among several methods, including Pyramid Scene Parsing Network
(PSP) [27], SegNet [28], and Fully Convolutional Networks (FCN) [29]. Du et al. [30] used DeepLabV3+
for crop area (CA) mapping using RGB color images only and the authors compared its performance
with three other deep learning methods (UNet, PSP, and SegNet) and three traditional machine learning
methods. The authors concluded that DeepLabV3+ was the best performing method and stated
its effectiveness in defining boundaries of the crop areas. Similarly, DeepLabV3+ was used for the
classification of remote sensing images and its performance ranking was reported to be better than
other investigated deep learning methods, including UNet, FCN, PSP, and SegNet [31–33]. These works
also proposed some original ideas to outperform DeepLabV3+ and extend existing deep learning
architectures, such as a novel FCN-based approach aimed at the fusion of deep features [31],
a multi-scale context extraction module [32], and a novel patch attention module [33].
In Lobo Torres et al. [34], a contradictory result was presented in which the authors found
DeepLabV3+’s performance worse than FCN. The authors stated that DeepLabV3+ most likely demands
far more training samples than their dataset provided, thus causing DeepLabV3+
to perform below its potential. One other factor contributing to this lower-than-expected performance
could be because they trained a DeepLabV3+ model from scratch without using any pre-trained
models as the initial model. There are also some recent works that modify DeepLabV3+’s architecture
so that more than three input channels can be used with it. Huang et al. [35] inserted 3D features in the
form of point clouds as a fourth input channel in addition to RGB image data. The authors replaced
the Xception backbone of DeepLabV3+ with ResNet101, along with some other modifications, in order to be
able to use more than three input channels with DeepLabV3+.
In Ayhan et al. [5], three separate DeepLabV3+ models were trained using RGB color images
originating from three different datasets, which were collected at different resolutions and with
different image capturing hardware. The first model was trained using the Slovenia dataset [36] with
10 m per pixel resolution. The second model was trained using the DeepGlobe dataset [37] with 50 cm
per pixel resolution. The third model was trained using the airborne high-resolution image dataset
with two large images, named Vasiliko and Kimisala. The Vasiliko image with 20 cm per
pixel resolution was used for training and the two Kimisala images with 10 cm and 20 cm per pixel
resolutions were used for testing. It is reported in Ayhan et al. [5] that, among the three DeepLabV3+
models, the model trained with the Vasiliko dataset provided the best performance and the vegetation
detection results with DeepLabV3+ looked very promising.
This paper extends the work in Ayhan et al. [5] by applying DeepLabV3+, a custom CNN
method [38–40] and a novel object-based method, which utilizes NDVI, computer vision and machine
learning techniques on the Vasiliko and Kimisala dataset that was used in Ayhan et al. [5] for comparison
of vegetation detection performance. Different from Ayhan et al. [5], NIR band information in the
Vasiliko and Kimisala dataset is also utilized in this work. The detection performances of DeepLabV3+,
the custom CNN method, and the object-based method are compared. In contrast to DeepLabV3+,
which uses only color images (RGB), our customized CNN method is capable of using RGB and NIR
bands (four bands total). In principle, DeepLabV3+ can handle more than three bands after a number
of architecture modifications as was discussed in Huang et al. [35] and Hassan [41]. In this work, we
used the default DeepLabV3+ architecture with three input channels. However, to utilize all four input
channels with DeepLabV3+ to some degree, we replaced the red (R) band in the training and test datasets
with the NDVI band, which is estimated from the red (R) and NIR image bands, and kept the blue and
green image bands, making three input channels in total, NDVI-GB.
The contributions of this paper are as follows:

• We introduced a novel object-based vegetation detection method, NDVI-ML, which utilized NDVI,
computer vision, and machine learning techniques with no need for training. The method is
simple and outperformed the two investigated deep learning methods in detection performance.
• We demonstrated the potential use of the NDVI band image as a replacement for the red (R) band in
the color image for DeepLabV3+ model training, to take advantage of the NIR band while fulfilling
the three-input-channel restriction in DeepLabV3+ and enabling transfer learning from a pre-trained
RGB model.
• We compared the detection performances of DeepLabV3+ (RGB and NDVI-GB bands),
our CNN-based deep learning method (RGB and RGB-NIR bands), and NDVI-ML (RGB-NIR).
This demonstrated that DeepLabV3+ detection results using RGB color bands only were better
than those obtained by conventional methods using the NDVI index only and were also quite
close to NDVI-ML’s results, which used the NIR band and several sophisticated machine learning and
computer vision techniques.
• We discussed the underlying reasons why NDVI-ML could be performing better than
the deep learning methods and potential strategies to further boost the deep learning
methods’ performances.

This paper is organized as follows. In Section 2, we describe the dataset and the vegetation
detection methods used for training and testing. Section 3 summarizes the results using various
methods. Section 4 contains some discussions about the results. A few concluding remarks are
provided in Section 5.

2. Materials and Methods


We first introduce the dataset in Section 2.1 followed by the two deep learning methods and our
object-based vegetation detection method, NDVI-ML, in Sections 2.2–2.4. A block diagram of the used
dataset and the applied methods in this paper can be seen in Figure 1.

Figure 1. Block diagram showing the used dataset and applied methods. mIoU is the mean-intersection-over-union metric.

2.1. Dataset Used for Training and Testing

The dataset used in this work was originally used in Skarlatos and Vlachos [3] and belongs to two studied sites known as Vasiliko in Cyprus and Kimisala in Rhodes Island. UAV photography and a modified non-calibrated near-infrared camera were used to acquire the data with two separate UAV flights. Both flights were performed with the SwingletCam UAV on different days. In the first flight, a Canon IXUS 220HS camera was flown at an average flight height of 78 m. In the second flight, a modified near-infrared Canon PowerShot ELPH 300HS camera was used with a flight height of 100 m. Both cameras were provided by SensFly, which is a UAV manufacturer. Agisoft's Photoscan was used to process the captured UAV photography and to create two orthophotos. Using the extracted Digital Surface Model of the two sites, color RGB and NIR orthophotos were generated. For the overlapping and co-registration of the orthophotos, a common bundle adjustment was performed with all the RGB and NIR photos [3].
2.1.1. Vasiliko Site Data Used for Training

The height and width dimensions of the Vasiliko image (RGB and NIR) used in the investigations as training data are 3450 × 3645. The image resolution of the Vasiliko image is 20 cm per pixel. The image area used in the investigations thus corresponds to an area of ~0.5 km2. For investigations with DeepLabV3+, this image is partitioned into 1764 overlapping image tiles of size 512 × 512. The number of overlapping rows for two consecutive tiles along the column direction is set to 440, and the number of overlapping columns for two consecutive tiles along the row direction is set to 435. This partitioning is conducted to increase the number of image patches in the Vasiliko training dataset and can be considered as data augmentation by introducing shifted versions of the image patches. The number of overlapping pixels in the row and column directions is set to high numbers to increase the number of image patches in the training data set as much as possible. This image is annotated with respect to four land covers. Two color images from the Vasiliko dataset and their land cover annotations can be seen in Figure 2.
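The partitioning described above can be illustrated with the following minimal Python/NumPy sketch, which extracts overlapping 512 × 512 tiles using row/column steps of 512 − 440 = 72 and 512 − 435 = 77 pixels. Anchoring an extra tile at the right and bottom image borders is an assumption here, although with that choice the 3450 × 3645 Vasiliko image yields exactly 1764 tiles.

import numpy as np

def extract_tiles(image, tile=512, overlap_rows=440, overlap_cols=435):
    # Steps of 512-440=72 rows and 512-435=77 columns give heavily
    # overlapping tiles (shifted copies acting as data augmentation).
    step_r, step_c = tile - overlap_rows, tile - overlap_cols
    h, w = image.shape[:2]
    row_starts = sorted(set(list(range(0, h - tile + 1, step_r)) + [h - tile]))
    col_starts = sorted(set(list(range(0, w - tile + 1, step_c)) + [w - tile]))
    return [image[r:r + tile, c:c + tile] for r in row_starts for c in col_starts]

# Example: a 3450 x 3645 x 3 orthophoto gives len(tiles) == 1764 tiles of 512 x 512.
# tiles = extract_tiles(np.zeros((3450, 3645, 3), dtype=np.uint8))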
Figure 2. Sample images from the Vasiliko dataset and their annotations (silver color corresponds to barren land, green color corresponds to tree/shrub/grass, red color corresponds to urban land, and blue color corresponds to water in the land cover map annotations). (a) Color image for mari_20_3_8; (b) ground truth land cover map for mari_20_3_8; (c) color image for mari_20_42_4; (d) ground truth land cover map for mari_20_42_4.

2.1.2. Kimisala Site Data Used for Testing

This site is part of the Kimisala area in the southwestern part of the island of Rhodes and contains many scattered archaeological sites. Regarding the Kimisala image used in the investigations, it corresponds to an area of ~0.05 km2. There are two images of the same site with two different resolutions (10 cm per pixel and 20 cm per pixel). In the Kimisala test images, land covers in the form of trees, shrubs, barren land, and archaeological sites are present. These land covers are categorized into two major land cover classes, which are vegetation (tree/shrub) and non-vegetation (barren land and archaeological site). The 10 cm resolution Kimisala test image is annotated with respect to the two land covers (vegetation and non-vegetation). The land cover map for the Kimisala-20 test image is generated by resizing the annotated land cover map generated for the Kimisala-10 test image. The color Kimisala images for the 10 and 20 cm resolutions and their land cover annotations can be seen in Figure 3. It is worth mentioning that the Kimisala-10 test image is 1920 × 2680 in size and the Kimisala-20 test image is 960 × 1340 in size. Both test images are split into non-overlapping 512 × 512 image tiles when testing with the trained DeepLabV3+ models, with the exception of the tiles that form the last portion of the rows and columns of the image, in which there is a slight overlapping. In the ground truth and estimated land cover maps for the Kimisala test images, a yellow color is used to annotate vegetation and a blue color is used to annotate non-vegetation land covers.

Figure 3. The color Kimisala images for the 10 and 20 cm resolutions and their ground truth land cover annotations (yellow: Vegetation (tree/shrub), blue: Non-vegetation (barren land and archaeological sites)). (a) Color image (Kimisala-10); (b) color image (Kimisala-20); (c) land cover map (Kimisala-10); (d) land cover map (Kimisala-20).

2.2. DeepLabV3+

DeepLabV3+ uses the Atrous Spatial Pyramid Pooling (ASPP) mechanism, which exploits multi-scale contextual information to improve segmentation [42]. Atrous (which means holes) convolution has advantages over the standard convolution by providing responses at all image positions while the number of filter parameters and the number of operations stays constant [42]. DeepLabV3+ has an encoder-decoder network structure. The encoder part consists of a set of processes that reduce the feature maps and capture semantic information, and the decoder part recovers the spatial information and results in sharper segmentations. The block diagram of DeepLabV3+ can be seen in Figure 4.

Figure 4. Block diagram of DeepLabV3+ [42].
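To make the role of atrous convolution concrete, the sketch below builds a minimal ASPP-style block in Keras: parallel 3 × 3 convolutions with different dilation rates (the rates 6, 12, and 18 match Table 1) whose outputs are concatenated. This is an illustrative simplification, not the full DeepLabV3+ encoder, which also includes a 1 × 1 branch, image-level pooling, and an Xception backbone.

import tensorflow as tf

def simple_aspp(inputs, filters=256, rates=(6, 12, 18)):
    # A dilated (atrous) kernel samples the input with gaps, enlarging the
    # receptive field without adding parameters or operations, which is the
    # property DeepLabV3+ exploits for multi-scale context.
    branches = [
        tf.keras.layers.Conv2D(filters, 3, padding="same",
                               dilation_rate=r, activation="relu")(inputs)
        for r in rates
    ]
    return tf.keras.layers.Concatenate()(branches)

# Example usage on a backbone feature map of shape (batch, 33, 33, 2048):
# x = tf.keras.Input(shape=(33, 33, 2048)); y = simple_aspp(x)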

A PC with a Windows 10 operating system, a GPU card (RTX2070), and 16 GB memory is used for DeepLabV3+ model training and testing, which uses the TensorFlow framework to run. For training a land cover model using the training datasets, the weights of a pre-trained model, with the exception of the logits, are used as the starting point and these weights are fine-tuned with further training. These initial weights belong to a pre-trained model for the PASCAL VOC 2012 dataset (“deeplabv3_pascal_train_aug_2018_01_04.tar.gz”). This model was based on the Xception-65 backbone [42]. Because the number of land covers in the Vasiliko and Kimisala dataset is different from the number of classes in the PASCAL VOC-2012 dataset, the logit weights in the pre-trained model are excluded. The training parameters used for training a model with DeepLabV3+ in this work can be seen in Table 1.
Table 1. Training parameters used in DeepLabV3+.

Training Parameter                      Value
Learning policy                         Poly
Base learning rate                      0.0001
Learning rate decay factor              0.1
Learning rate decay step                2000
Learning power                          0.9
Training number of steps                ≥100,000
Momentum                                0.9
Train batch size                        2
Weight decay                            0.00004
Train crop size                         '513,513'
Last layer gradient multiplier          1
Upsample logits                         True
Drop path keep prob                     1
tf_initial_checkpoint                   deeplabv3_pascal_train_aug
initialize_last_layer                   False
last_layers_contain_logits_only         True
slow_start_step                         0
slow_start_learning_rate                1 × 10−4
fine_tune_batch_norm                    False
min_scale_factor                        0.5
max_scale_factor                        2
scale_factor_step_size                  0.25
atrous_rates                            [6,12,18]
output_stride                           16
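For readers using the open-source TensorFlow DeepLab implementation, the parameters in Table 1 correspond to command-line flags of its train.py script. The sketch below shows one possible way to launch such a run from Python; the directory paths and the dataset name are placeholders (registering the Vasiliko dataset with the DeepLab data loader is a separate step not shown here), and flag spellings may vary between repository versions.

import subprocess

# Placeholder paths; "--dataset=vasiliko" assumes a custom dataset
# registration (not shown) in the DeepLab data loader.
flags = [
    "--model_variant=xception_65",
    "--learning_policy=poly",
    "--base_learning_rate=0.0001",
    "--learning_rate_decay_factor=0.1",
    "--learning_rate_decay_step=2000",
    "--learning_power=0.9",
    "--training_number_of_steps=100000",
    "--momentum=0.9",
    "--train_batch_size=2",
    "--weight_decay=0.00004",
    "--train_crop_size=513,513",
    "--atrous_rates=6", "--atrous_rates=12", "--atrous_rates=18",
    "--output_stride=16",
    "--fine_tune_batch_norm=false",
    "--initialize_last_layer=false",
    "--last_layers_contain_logits_only=true",
    "--tf_initial_checkpoint=./deeplabv3_pascal_train_aug/model.ckpt",
    "--train_logdir=./train_logs",
    "--dataset=vasiliko",
    "--dataset_dir=./tfrecord",
]
subprocess.run(["python", "deeplab/train.py"] + flags, check=True)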

2.3. CNN Model

Since the used DeepLabV3+ pre-trained model (“deeplabv3_pascal_train_aug_2018_01_04.tar.gz”) utilized thousands of RGB images in the PASCAL VOC 2012 dataset for training [26], it is difficult to retrain it from scratch to accommodate four bands (RGB+NIR) because we do not have many NIR images that are co-registered with those RGB bands. Our own customized CNN model, on the other hand, can handle up to four bands. In addition to using our CNN model with RGB-NIR bands, we also used it with three bands (RGB) to compare with the DeepLabV3+ results (RGB).

We used the same structure for the CNN model as in our previous work [38] for soil detection, except the filter size in the first convolution layer changed accordingly to be consistent with the input patch sizes. The input image with N bands is extracted into 7 × 7 image patches, each with a size of 7 × 7 × N. These image patches are input to the CNN model. For the case when RGB image bands are used, N is 3, and when the NIR band is included, N becomes 4. The CNN model has four convolutional layers and a fully connected layer with 100 hidden units. The CNN model structure is shown in Figure 5. The 3D convolution filters in the convolutional layers are set to 3 × 3 × N in the first convolutional layer, 3 × 3 × 20 in the second and third layers, and 1 × 1 × 100 in the fourth layer. The naming convention used for the convolutional layers in Figure 5 indicates the number of filters and the filter size. As an example, 20 @ 3 × 3 × N in the first layer indicates 20 convolutional filters with a size of 3 × 3 × N. The stride in all four convolution layers is set to 1 (shown as 1 × 1 × 1), meaning the convolution filter is moved one pixel at a time in each dimension. When we designed the network, we tried different configurations for the number of layers and the size of each layer and selected the one that provided the best results. We did this for all the layers (convolutional and fully connected layers). The choice of “100 hidden units” in the fully connected layer was the outcome of our design studies.

Each convolutional layer utilizes the Rectified Linear Unit (ReLU) as an activation function, and the last fully connected layer uses the SoftMax function for classification. We added a dropout layer for each convolutional layer with a dropout rate of ‘0.1’ to mitigate overfitting [43] after observing that the ‘0.1’ dropout value performed better than two other dropout values, ‘0.05’ and ‘0.2’.

Figure 5. Our customized convolutional neural network (CNN) model structure.
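For concreteness, a minimal Keras sketch of this patch-based CNN is given below. The filter counts of the second and third convolutional layers and the reading of the fourth layer as 100 filters of size 1 × 1 are assumptions where the text and Figure 5 are not fully explicit; optimizer and learning-rate settings are omitted.

import tensorflow as tf

def build_patch_cnn(n_bands=4, n_classes=2, dropout_rate=0.1):
    # 7x7xN input patches; 'valid' 3x3 convolutions shrink 7 -> 5 -> 3 -> 1.
    inputs = tf.keras.Input(shape=(7, 7, n_bands))
    x = tf.keras.layers.Conv2D(20, (3, 3), activation="relu")(inputs)   # 20 @ 3x3xN
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Conv2D(20, (3, 3), activation="relu")(x)        # 3x3x20 (assumed 20 filters)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Conv2D(20, (3, 3), activation="relu")(x)        # 3x3x20 (assumed 20 filters)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Conv2D(100, (1, 1), activation="relu")(x)       # 1x1 convolution, 100 filters
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(100, activation="relu")(x)                # fully connected, 100 hidden units
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x) # SoftMax classifier
    return tf.keras.Model(inputs, outputs)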
2.4. NDVI-ML

We developed an object-based vegetation detection method, NDVI-ML, which utilizes NDVI [44], machine learning (ML) techniques for classification, and computer vision techniques for segmentation. The block diagram of NDVI-ML can be found in Figure 6.

The NDVI-ML method identifies the potential vegetation candidates using an NDVI threshold of zero. The candidate vegetation pixels are split into connected components using the Dulmage-Mendelsohn decomposition [45] of the node pairs’ adjacency matrix [46] by assigning each candidate vegetation pixel as a node and forming the neighboring node pairs using 8-neighborhood connectivity. Each connected component is considered as a separate vegetation object entity with its own vegetation map, a sub-vegetation map.

For each vegetation object, a number of rules with respect to the size of the vegetation objects
and the amplitude of RGB content are applied. If the connected component vegetation object contains
only a few pixels, these pixels are labeled as ‘non-vegetation’ since such a small vegetation object
candidate is not of interest to detect. For this, the number of pixels
of the object is compared with a threshold, minVegObj. If the number of pixels in the vegetation
object is bigger than minVegObj, but smaller than another set threshold, medVegObj, among these
pixels the ones that have red or blue content larger than green content are labeled as ‘non-vegetation’.
If, on the other hand, the number of pixels is larger than medVegObj, a more sophisticated process is
applied to classify these pixels. This process includes a two-class Gaussian Mixture Model (GMM) [47],
which is fit to the RGB values of the connected component object pixels. The two GMM components split these
pixels into ‘vegetation’ and ‘non-vegetation’ classes. Among the identified GMM components, the one with the
higher green content value is considered the ‘vegetation’ class and the other the ‘non-vegetation’ class.
If the difference between the mean green content values of the two GMM components exceeds a set threshold,
thrGMMGreen, the spatial information of the identified non-vegetation class pixels is then checked to
decide whether they are ‘non-vegetation’ pixels (such as shadow) or dark-toned vegetation
pixels which happen to be located inside the inner sections of the vegetation object. For extracting the
spatial information, an average filter is used with consideration of the image resolution (which is set to
5 × 5 in our investigations). When applying this average filter, if the pixel of interest is a dark-toned
vegetation pixel that happens to fall in the inner parts of the vegetation object, the average-filtered value
is expected to have higher green content since it is assumed that there will be several vegetation pixels
around it with green content being dominant and the average filtering would thus increase the green
content value of this pixel. Similarly, if it is a shadow pixel that happens to fall on the boundary sections
of the vegetation object, then because the neighborhood of the pixel would have more ‘non-vegetation’
pixels, the average filtering would result in a decrease in the green content value of this pixel. Thus,
the averaging filter helps to extract spatial information which is utilized to separate the shadow-like
non-vegetation pixels from the dark-toned vegetation pixels that happen to fall in inner parts of the
vegetation object.
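The following sketch illustrates the two-class GMM split for a single large vegetation object using scikit-learn; the thresholds and the subsequent spatial filtering step are omitted, and the variable names are illustrative only.

import numpy as np
from sklearn.mixture import GaussianMixture

def split_object_by_gmm(rgb_pixels):
    # Fit a 2-component GMM to the (num_pixels, 3) RGB values of one
    # connected component and return a boolean mask that is True for the
    # component whose mean green content is higher (the 'vegetation' class).
    gmm = GaussianMixture(n_components=2, random_state=0).fit(rgb_pixels)
    labels = gmm.predict(rgb_pixels)
    veg_component = int(np.argmax(gmm.means_[:, 1]))  # green is index 1 of RGB
    return labels == veg_component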
The pseudocode of the NDVI-ML processing steps can be seen in Table 2. In addition to these
processing steps, NDVI-ML applies one final cleaning process to the estimated vegetation map.
The pseudocode of this final decision map cleaning process can be seen in Table 3. In this cleaning
process, it is checked whether any connected components with a very small number of pixels remain
after the applied processing steps and whether there are any connected components whose green
content is lower than the red or blue content. If such cases exist, the pixels of these connected
components are also labeled as ‘non-vegetation’ in the final estimated vegetation map.

Table 2. Pseudocode of the machine learning processing steps in the normalized difference vegetation
index-machine learning (NDVI-ML) method.

1: for Each connected component vegetation object, ccvo,


2: if number of pixels in ccvo < minVegObj,
3: Assign ccvo pixels as “Non-vegetation” in its sub-vegetation map
4: end
5: if minVegObj < number of pixels in ccvo < medVegObj,
6: Find the pixels in ccvo with red content (R) > green content (G), or blue content (B) > green content (G)
7: Remove these identified pixels from ccvo
8: Assign all the remaining pixels in ccvo as “Vegetation” in its sub-vegetation map
9: end
10: if number of pixels in ccvo > medVegObj,
11: Identify the pixels in ccvo with red content (R) > green content (G), or blue content (B) > green content (G)
12: Label the identified pixels as “Non-vegetation” in its sub-vegetation map
13: Exclude the identified pixels from ccvo
14: if the number of pixels in ccvo > minGMM
15: Fit a two-class GMM to split ccvo pixels into “Vegetation” and “Non-vegetation” classes
16: Assign the class with lower green content (G) as “Non-vegetation”, higher (G) content as “Vegetation”
17: if (averaged green content difference in Vegetation and Non-vegetation class) > thrGMMGreen
18: Extract spatial information of the identified Non-vegetation pixels to decide whether they are
shadow-related Non-vegetation pixels or dark-toned Vegetation pixels which are located
inside the vegetation object.
19: - Apply a 5x5 average filter to the Non-vegetation class pixels to form spatial statistical features
20: - Apply a two-class GMM to the spatial features to split them into two classes, Dark-toned
vegetation and Shadow.
21: - Among the two GMM classes, assign the one with the lower green content as Non-vegetation.
22: Exclude the identified Non-vegetation pixels from ccvo
23: Apply a closing morphology operation to ccvo
24: Assign the remaining pixels in ccvo after closing operation as “Vegetation” in its sub-veg. map
25: else
26: Apply a closing morphology operation to ccvo
27: Assign the remaining pixels in ccvo after closing operation as “Vegetation” in its sub-veg. map
28: end
29: else
30: Apply a closing morphology operation to ccvo pixels
31: Assign the remaining pixels in ccvo after closing operation as “Vegetation” in its sub-veg. map
32: end
33: end
34: Generate final vegetation map using all sub-veg. maps

Table 3. Final vegetation map cleaning process in the NDVI-ML method.

1: Find the connected component vegetation objects in the final vegetation map
2: for Each connected component,
3: if the number of pixels in the connected component < minVegObj,
4: Label all pixels of the connected component to Non-vegetation
5: else
6: Compute the mean RGB values for the pixels of the connected component
7: if (R) or (B) content in the mean RGB values of the connected component is higher than (G) content,
8: Label the pixels of the connected component as “Non-vegetation”
9: else
10: Label the pixels of the connected component as “Vegetation”
11: end
12: end
13: end

The NDVI-ML processing flow shown in Figure 6 consists of the following steps:

1. Calculate the NDVI image using the red and NIR bands.
2. Threshold the NDVI image with an NDVI threshold of 0.0 to get a vegetation map (a binary map in which 1 marks all candidate vegetation pixel locations and 0 marks non-vegetation pixels).
3. Apply an opening morphology operator to the vegetation map (to reduce salt-and-pepper-like noise).
4. Find the connected components in the vegetation map and consider each connected component as a vegetation object.
5. For each connected component vegetation object, determine shadow and non-vegetation pixels using machine learning techniques and form a sub-vegetation map for the connected component vegetation object.
6. Merge all the sub-vegetation maps from the connected components into a single vegetation map.
7. Apply a final cleaning process to the vegetation map.
8. Apply a closing morphology operator to get rid of holes in the vegetation map.
9. Output the final vegetation map.

Figure 6. Block diagram for the NDVI-ML method.
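A minimal sketch of the front end of this pipeline (NDVI thresholding, opening, and connected-component extraction) is shown below. SciPy's 8-connected labeling is used here as a stand-in for the Dulmage-Mendelsohn-based grouping described above, and the per-object classification and cleaning stages are only indicated.

import numpy as np
from scipy import ndimage

def candidate_vegetation_objects(red, nir, min_obj=70):
    # NDVI > 0 marks candidate vegetation pixels.
    ndvi = (nir.astype(float) - red) / (nir.astype(float) + red + 1e-9)
    veg = ndvi > 0.0
    # 3x3 opening removes salt-and-pepper-like noise.
    veg = ndimage.binary_opening(veg, structure=np.ones((3, 3)))
    # 8-connected labeling groups pixels into vegetation objects
    # (stand-in for the Dulmage-Mendelsohn decomposition in the paper).
    labels, n = ndimage.label(veg, structure=np.ones((3, 3)))
    # Objects smaller than min_obj pixels are dropped ('non-vegetation');
    # the remaining objects go on to the per-object GMM/cleaning stages.
    sizes = ndimage.sum(veg, labels, index=np.arange(1, n + 1))
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_obj))
    return labels * keep, n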

2.5. Performance Comparison Metrics


Accuracy and mean-intersection-over-union (mIoU) measures [48] are used to assess the
performance of DeepLabV3+, custom CNN, and NDVI-ML methods on the two Kimisala test
images. Suppose TP corresponds to the true positives, FP is the false positives, FN is the false negatives,
and TN is the true negatives. The accuracy measure is computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

The intersection-over-union (IoU) measure, also known as the Jaccard similarity coefficient [48], for
a two-class problem can be mathematically expressed using the same notations as:

IoU = TP / (TP + FP + FN)    (2)

The mean-intersection-over-union (mIoU) measure simply takes the average of the IoU values over all classes.
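A compact implementation of these measures for the two-class (vegetation/non-vegetation) case is sketched below; it assumes the ground truth and predicted maps are binary arrays of the same shape.

import numpy as np

def accuracy_and_miou(gt, pred):
    # Accuracy follows Eq. (1); mIoU averages Eq. (2) evaluated for the
    # vegetation class and for the complementary non-vegetation class.
    gt, pred = np.asarray(gt, bool), np.asarray(pred, bool)
    tp = np.sum(gt & pred)
    tn = np.sum(~gt & ~pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou_veg = tp / (tp + fp + fn)
    iou_nonveg = tn / (tn + fn + fp)
    return accuracy, (iou_veg + iou_nonveg) / 2.0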

3. Results

3.1. DeepLabV3+

Results Using DeepLabV3+ Model Trained with Vasiliko Dataset


In our recent work [5], DeepLabV3+ models were trained using RGB color images of three datasets,
which are the Slovenia [36], DeepGlobe [37], and Vasiliko datasets, and the trained models were
all tested on the Kimisala test data. For completeness of this paper, the DeepLabV3+ results with these
three datasets are shown in Table 4. Due to significant resolution differences between the training
data (Slovenia data: 10 m resolution; DeepGlobe: 0.5 m resolution) and the testing data (10 cm and
20 cm), the results of these two DeepLabV3+ models on the Kimisala test dataset were found very
poor. Table 4 shows the performance metrics obtained with the three DeepLabV3+ models which are
trained using three different datasets of different image resolutions and image capturing hardware for
the Kimisala-10 test image. It can be noticed that the model trained with the Vasiliko dataset has the
highest detection scores. The DeepLabV3+ model trained with Vasiliko dataset also provided the best
performance on the Kimisala-20 test image. The resultant performance metrics for the Kimisala-20
test image with three DeepLabV3+ models can be seen in Table 5. The detection results for the two
Kimisala test images using the DeepLabV3+ model trained with the Vasiliko dataset and the estimated
vegetation maps can be seen in Figure 7. In the DeepLabV3+ results shown in Figure 7a,b, green color
pixels correspond to tree/shrub/grass, the silver color corresponds to the barren land and the red color
corresponds to urban land.

Table 4. Accuracy and mIoU measures for Kimisala-10 vegetation detection.

Method Accuracy mIoU (Vegetation & Non-Vegetation)


DeepLabV3+ (model trained with Slovenia) [5] 0.6171 0.4454
DeepLabV3+ (model trained with DeepGlobe) [5] Very poor Very poor
DeepLabV3+ (model trained with Vasiliko) 0.8578 0.7435

Table 5. Accuracy and mIoU measures for Kimisala-20 vegetation detection.

Method Accuracy mIoU (Vegetation & Non-Vegetation)


DeepLabV3+ (model trained with Slovenia) [5] 0.6304 0.4355
DeepLabV3+ (model trained with DeepGlobe) [5] Very poor Very poor
DeepLabV3+ (model trained with Vasiliko) 0.8015 0.6541

Figure 7. DeepLabV3+ vegetation detection results for the two Kimisala test images using the model trained with the Vasiliko dataset. (a) DeepLabV3+ results for Kimisala-10 (green: Tree/shrub/grass, silver: Barren land, red: Urban land); (b) DeepLabV3+ results for Kimisala-20 (green: Tree/shrub/grass, silver: Barren land, red: Urban land); (c) DeepLabV3+ estimated vegetation map for Kimisala-10 (yellow: Vegetation, blue: Non-vegetation); (d) DeepLabV3+ estimated vegetation map for Kimisala-20 (yellow: Vegetation, blue: Non-vegetation); (e) detected vegetation with DeepLabV3+ for Kimisala-10; (f) detected vegetation with DeepLabV3+ for Kimisala-20.

As mentioned earlier, the DeepLabV3+ pre-trained model that was used for the initialization of our own training model's weights was trained with thousands of RGB images. It is, however, challenging to extend the DeepLabV3+ model to incorporate more than three input channels, such as including a NIR band in addition to the RGB bands. This is because, in order to build a satisfactory model, not only are a lot of training data needed that include the NIR band in addition to the RGB color images, but also significant GPU power that can conduct a proper training with batch sizes larger than 16, since larger batch sizes are recommended for efficient model training in DeepLabV3+ [49]. Moreover, the DeepLabV3+ architecture, which is originally designed for three input channels (RGB), needs to be adjusted accordingly to accommodate four-channel input images. With four-channel input images, the existing pre-trained models, which are for RGB, cannot be used directly, and there is a need for training a model from scratch or at least modifying the DeepLabV3+ architecture such that only the weights for the newly added image bands are learned and the weights of the RGB input channels are initialized with pre-trained model weights via transfer learning [41].

In this work, we used the default DeepLabV3+ architecture and considered using more than three input channels with DeepLabV3+ as future work. However, we conducted an interesting investigation in which we replaced the red (R) band with the NDVI band and kept the green (G) and blue (B) bands (NDVI-GB) in the training data when training a DeepLabV3+ model. Since the NDVI band is computed using the red (R) and NIR bands, all four bands are involved in model training to some extent, while also fulfilling DeepLabV3+'s three-input-channel restriction. The resultant NDVI values, which originally take values between −1 and 1, are scaled such that they take values between 0 and 255, which is the case for the color band image values.

When we started DeepLabV3+ model training for the NDVI-GB bands from scratch with a higher learning rate, 0.1, even though the total loss values during training dropped nicely after 200K training steps, as can be seen in Figure 8, the final DeepLabV3+ predictions for the test dataset and even for the training set were all black, indicating that the model trained from scratch was not reliable.

Figure 8. The total loss plot for DeepLabV3+ model training from scratch for NDVI-GB bands.
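A minimal sketch of building the NDVI-GB input described above is given below: NDVI is computed from the red and NIR bands, rescaled from [−1, 1] to [0, 255], and stacked with the original green and blue bands so that the standard three-channel DeepLabV3+ input format is preserved. The linear rescaling shown here is one reasonable choice; the paper only states that NDVI is scaled to 0-255.

import numpy as np

def make_ndvi_gb(rgb, nir):
    # Replace the R channel of an RGB image with NDVI scaled to 0-255.
    r = rgb[..., 0].astype(float)
    ndvi = (nir.astype(float) - r) / (nir.astype(float) + r + 1e-9)    # in [-1, 1]
    ndvi_255 = np.clip((ndvi + 1.0) * 127.5, 0, 255).astype(np.uint8)  # to [0, 255]
    return np.dstack([ndvi_255, rgb[..., 1], rgb[..., 2]])             # NDVI-G-B image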
Next, we used the same NDVI-GB training data set and used the pre-trained model's weights (pre-trained model for the PASCAL VOC 2012 dataset for RGB images) to initialize our training model's weights. Even though the training dataset has the NDVI band instead of the red (R) band, the total loss converged nicely during training, as can be seen in Figure 9, and the detection results also improved slightly for both Kimisala test datasets in comparison to the results using the DeepLabV3+ model trained using RGB bands. The results for the two Kimisala test datasets can be seen in Table 6. Considering that an RGB-based pre-trained model is used as the initial model whereas the training dataset does not contain the R band but NDVI instead, it is found highly interesting that slightly better detection results can still be obtained.

Figure 9. The total loss plot for DeepLabV3+ model training with the R band replaced with the NDVI band and with the pre-trained RGB model as the initial model.
Table 6. Accuracy and mIoU measures for Kimisala-10 and Kimisala-20 vegetation detection using DeepLabV3+ models trained with RGB and NDVI-G-B bands (both models were initialized using pre-trained models for RGB color images when starting training).

Test Dataset    Method                                                Accuracy    mIoU
Kimisala-10     DeepLabV3+ (model trained with Vasiliko, RGB)         0.8578      0.7435
Kimisala-10     DeepLabV3+ (model trained with Vasiliko, NDVI-G-B)    0.8601      0.7495
Kimisala-20     DeepLabV3+ (model trained with Vasiliko, RGB)         0.8015      0.6541
Kimisala-20     DeepLabV3+ (model trained with Vasiliko, NDVI-G-B)    0.8089      0.6677
3.2. CNN Results
The first investigation with the CNN model for the Vasiliko training and two Kimisala test datasets was to examine its detection performance when RGB (color images) and RGB-NIR bands are used. Using a subset of 30,000 training samples per class out of the complete training data (Class 1—Barren: 1,469,178 and Class 2—Trees: 935,340), that is, 7 × 7 RGB patches for each class, it was observed that for this case the overall classification rate is 76.2% (Class 0, vegetation, has a 90.1% accuracy and Class 1, non-vegetation, has a 65.8% accuracy). Table 7 shows the classification accuracy with the CNN for the RGB only and RGB-NIR cases. It can be noticed that with the addition of the NIR band, the overall accuracy improves by ~5% with our custom CNN method.

Table 7. Classification accuracy with CNN (two-class training model, vegetation and non-vegetation) using 30,000 training samples.

Bands                    Patch Size    Training Data Per Class    Overall Accuracy
RGB and NIR (4 bands)    7 × 7         30,000                     0.8091
RGB (3 bands)            7 × 7         30,000                     0.7620

Next, we conducted a full investigation with the CNN model on the Vasiliko/Kimisala datasets to
further improve the classification accuracy. Due to memory issues, we could not feed the whole Vasiliko
image for training. Instead, we had to divide the training image into four quadrants. The performance
metrics were generated by sequentially training the model on each of the four quadrants of the Vasiliko
dataset. In order to do this, we trained an initial model on the first quadrant and then updated the
weights using the following quadrant. This step was repeated for the final three quadrants of the
image. For training the model sequentially, we used a patch size of 7, a learning rate of 0.01, and all
the possible training samples in each quadrant. Table 8 summarizes the results for our sequential
approach. We only generated results using the Kimisala-20 test image because Kimisala-10 is very large
in size. The average overall accuracy for the CNN model reached 0.8298, which was better than the
earlier results in Table 7, in which only 30,000 samples were used to train the CNN model. Here, all the
training samples in the Vasiliko image were used. The resultant vegetation map is shown in Figure 10.
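A minimal sketch of this quadrant-by-quadrant sequential training loop is given below. It reuses the hypothetical build_patch_cnn and extract_patches helpers from the previous sketch and relies on the fact that repeatedly calling fit on the same Keras model keeps updating the same weights; the epoch count and batch size are assumptions, not the settings used in the paper.

```python
# Illustrative sketch only: sequential training over the four Vasiliko quadrants.
import numpy as np
import tensorflow as tf

def quadrants(image, labels):
    """Split an (H, W, bands) image and its (H, W) label map into four spatial quadrants."""
    h, w = labels.shape
    for rs, cs in [(slice(0, h // 2), slice(0, w // 2)),
                   (slice(0, h // 2), slice(w // 2, w)),
                   (slice(h // 2, h), slice(0, w // 2)),
                   (slice(h // 2, h), slice(w // 2, w))]:
        yield image[rs, cs, :], labels[rs, cs]

def train_sequentially(model, image, labels, lr=0.01):
    """Keep updating the same weights while visiting the quadrants one by one."""
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    for q_img, q_lab in quadrants(image, labels):
        x, y = extract_patches(q_img, q_lab)   # in the paper, all patches in the quadrant are used
        model.fit(x, y, epochs=5, batch_size=256)
    return model

# model = train_sequentially(build_patch_cnn(n_bands=4), vasiliko_image, vasiliko_labels)
```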
Table 8. Performance metrics using different quadrants of the Vasiliko dataset for sequential training.
Kimisala 20-cm image was used in testing. Four bands were used.

Quadrant    Accuracy    mIoU
Q1          0.8126      0.6979
Q2          0.8342      0.6775
Q3          0.8470      0.7041
Q4          0.8253      0.6861
Avg         0.8298      0.6914
Figure 10. CNN vegetation detection results for the Kimisala-20 test image using the CNN model
trained with the Vasiliko dataset. (a) Estimated vegetation binary map for Kimisala-20 using CNN
(yellow: Vegetation, blue: Non-vegetation); (b) detected vegetation using CNN for Kimisala-20.

3.3. NDVI-ML Results
Table 9 summarizes the results of the NDVI-ML method for the Kimisala-10 test image. As a baseline
comparison benchmark to NDVI-ML, the NDVI band image alone is transformed into binary detection
maps using two different NDVI thresholds, 0.0 and 0.09. These NDVI-only detection results with the
two thresholds are also included in Table 9. It can be seen that the proposed NDVI-ML improved the
performance of the NDVI-only results significantly. We also observed the same trend for the
Kimisala-20 image, as can be seen in Table 10. The vegetation detection results for the two Kimisala
test images using NDVI-only and NDVI-ML can be seen in Figures 11 and 12, respectively. When
applying the NDVI-ML approach, the parameter minVegObj is set to 70 for the Kimisala-10 test image
and to 35 for the Kimisala-20 test image. Regarding the other NDVI-ML parameters, medVegObj is set
to 250 and minGMM is set to 200 for both Kimisala-10 and Kimisala-20 test images. In comparison to
the ground truth land cover map, the NDVI-ML results are found to be highly accurate and better than
the detection results of the two deep learning methods.

Table 9. Accuracy and mIoU measures for Kimisala-10 vegetation detection.

Method                                Accuracy    mIoU (Vegetation & Non-Vegetation)
NDVI only (NDVI threshold = 0.0)      0.7565      0.6047
NDVI only (NDVI threshold = 0.09)     0.7234      0.5662
NDVI - ML                             0.8730      0.7737
Table 10. Accuracy and mIoU measures for Kimisala-20 vegetation detection.

Method                                Accuracy    mIoU (Vegetation & Non-Vegetation)
NDVI only (NDVI threshold = 0.0)      0.7563      0.6045
NDVI only (NDVI threshold = 0.09)     0.7232      0.5659
NDVI - ML                             0.8578      0.7487

Figure 11. NDVI only (threshold = 0) vegetation detection results for the two Kimisala test images.
(a) NDVI-only estimated vegetation map (Kimisala-10) (yellow: Vegetation, blue: Non-vegetation);
(b) NDVI-only estimated vegetation map (Kimisala-20) (yellow: Vegetation, blue: Non-vegetation);
(c) detected vegetation with NDVI only (Kimisala-10); (d) detected vegetation with NDVI only
(Kimisala-20).
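For reference, the NDVI-only baseline reported in Tables 9 and 10 and shown in Figure 11 amounts to thresholding the NDVI band, and the minVegObj parameter suggests a minimum-object-size cleanup of the resulting binary map. The sketch below illustrates these two steps only; interpreting minVegObj as a connected-component pixel count is our assumption, and the full NDVI-ML pipeline additionally applies machine learning (GMM-based) refinement [47].

```python
# Illustrative sketch: NDVI thresholding baseline plus a minimum-object-size filter.
import numpy as np
from scipy import ndimage

def ndvi(red, nir, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red), following Rouse et al. [44]."""
    red = red.astype("float32")
    nir = nir.astype("float32")
    return (nir - red) / (nir + red + eps)

def ndvi_baseline(red, nir, threshold=0.0):
    """Binary vegetation map from a plain NDVI threshold (0.0 or 0.09 in Tables 9 and 10)."""
    return ndvi(red, nir) > threshold

def drop_small_objects(mask, min_size=70):
    """Remove connected components smaller than min_size pixels (minVegObj-style cleanup, assumed)."""
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, index=np.arange(1, n + 1))
    keep_labels = np.flatnonzero(sizes >= min_size) + 1
    return np.isin(labeled, keep_labels)

# Usage with a hypothetical 4-band array ordered R, G, B, NIR:
# veg = ndvi_baseline(img[..., 0], img[..., 3], threshold=0.0)
# veg = drop_small_objects(veg, min_size=70)   # 70 for Kimisala-10, 35 for Kimisala-20
```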
Figure 12. NDVI + ML vegetation detection results for the two Kimisala test images. (a) NDVI + ML
estimated vegetation map (Kimisala-10) (yellow: Vegetation, blue: Non-vegetation); (b) NDVI + ML
estimated vegetation map (Kimisala-20) (yellow: Vegetation, blue: Non-vegetation); (c) detected
vegetation with NDVI + ML (Kimisala-10); (d) detected vegetation with NDVI + ML (Kimisala-20).

3.4. Performance Comparisons
Tables 11 and 12 summarize the resultant performance metrics of NDVI-ML and the deep learning-based
methods for the two Kimisala test datasets. It can be seen that, in both metrics, NDVI-ML performs
better than the deep learning methods. The two DeepLabV3+ models trained with the Vasiliko dataset
(RGB and NDVI-G-B input channels) follow the NDVI-ML approach closely in terms of accuracy.
The DeepLabV3+ results with NDVI-G-B input channels are found to be slightly better than the
DeepLabV3+ results with RGB input channels only. Among the three DeepLabV3+ models, the Vasiliko
model performs the best. Because the image resolutions of the Slovenia (10 m) and DeepGlobe (50 cm)
training datasets are quite different from the resolutions of the Kimisala test images (10 cm and 20 cm),
and also because the image characteristics of the Kimisala test images are quite different from those in
the Slovenia and DeepGlobe datasets, these two training models were considered to be performing
poorly. One other aspect that must be favoring the Vasiliko DeepLabV3+ model over the other two
models is that both Kimisala test images and the Vasiliko training dataset images were collected with
the same camera system. Our customized CNN model (RGB+NIR) performance was evaluated only on
the Kimisala-20 test image. The customized CNN results were found to be better than DeepLabV3+,
but still worse than NDVI-ML.
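The accuracy and mIoU values in Tables 11 and 12 follow the usual semantic segmentation definitions and were obtained with the MATLAB evaluateSemanticSegmentation tool [48]; the sketch below shows an equivalent computation for a binary vegetation map, as our own illustration rather than the exact evaluation code used in this work.

```python
# Illustrative sketch: overall accuracy and mean IoU for a binary vegetation map.
import numpy as np

def accuracy_and_miou(pred, truth):
    """Overall accuracy and mean IoU over the vegetation/non-vegetation classes."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    acc = float(np.mean(pred == truth))
    ious = []
    for cls in (True, False):                      # vegetation, then non-vegetation
        inter = np.logical_and(pred == cls, truth == cls).sum()
        union = np.logical_or(pred == cls, truth == cls).sum()
        ious.append(inter / union if union else 1.0)
    return acc, float(np.mean(ious))

# acc, miou = accuracy_and_miou(estimated_map, ground_truth_map)
```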
Table 11. Accuracy and mIoU measures for Kimisala-10 vegetation detection.

Method                                                  Accuracy     mIoU (Vegetation & Non-Vegetation)
NDVI only (NDVI threshold = 0.0)                        0.7565       0.6047
NDVI only (NDVI threshold = 0.09)                       0.7234       0.5662
NDVI - ML (4 bands, RGB+NIR)                            0.8730       0.7737
DeepLabV3+ (model trained with Slovenia, RGB) [5]       0.6171       0.4454
DeepLabV3+ (model trained with DeepGlobe, RGB) [5]      Very poor    Very poor
DeepLabV3+ (model trained with Vasiliko, RGB) [5]       0.8578       0.7435
DeepLabV3+ (model trained with Vasiliko, NDVI-G-B)      0.8601       0.7495
Table 12. Accuracy and mIoU measures for Kimisala-20 vegetation detection.

Method                                                  Accuracy     mIoU (Vegetation & Non-Vegetation)
NDVI only (NDVI threshold = 0.0)                        0.7563       0.6045
NDVI only (NDVI threshold = 0.09)                       0.7232       0.5659
NDVI - ML (4 bands, RGB+NIR)                            0.8578       0.7487
DeepLabV3+ (model trained with Slovenia, RGB) [5]       0.6304       0.4355
DeepLabV3+ (model trained with DeepGlobe, RGB) [5]      Very poor    Very poor
DeepLabV3+ (model trained with Vasiliko, RGB) [5]       0.8015       0.6541
DeepLabV3+ (model trained with Vasiliko, NDVI-G-B)      0.8089       0.6677
Our customized CNN (4 bands, RGB+NIR)                   0.8298       0.6914
4. Discussion
In the Kimisala-20 (20 cm resolution) test dataset, DeepLabV3+ (RGB only) had an overall accuracy of 0.8015,
whereas our customized CNN model achieved an accuracy of 0.7620 with RGB bands and reached an
accuracy of 0.8298 if the RGB and NIR bands were all used (four bands). The DeepLabV3+ model
with NDVI-GB input channels provided an overall accuracy of 0.8089, which was slightly better than
DeepLabV3+ model accuracy with RGB channels but lower than our CNN model’s accuracy. Similar
trends were also observed in the second Kimisala test dataset that has a 10 cm resolution (Kimisala-10).
The detection results for the NDVI-ML method are found to be better than DeepLabV3+ and our
customized CNN model for both Kimisala test datasets. For the Kimisala-20 cm test dataset, NDVI-ML
provided an accuracy of 0.8578 which was considerably higher than the two deep learning methods.
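The NDVI-G-B variant referred to above keeps DeepLabV3+'s three-channel input by substituting the NDVI band for the red channel. A minimal sketch of building such a composite is shown below; the band ordering and the rescaling of NDVI to the 0–255 range are our assumptions, not details taken from the paper.

```python
# Illustrative sketch: build a 3-channel NDVI-G-B composite from a 4-band image.
import numpy as np

def ndvi_gb_composite(rgbnir):
    """rgbnir: (H, W, 4) array assumed to be ordered R, G, B, NIR."""
    red, green, blue, nir = (rgbnir[..., i].astype("float32") for i in range(4))
    ndvi = (nir - red) / (nir + red + 1e-6)        # NDVI in [-1, 1]
    ndvi = (ndvi + 1.0) / 2.0 * 255.0              # rescaled to match 8-bit G and B (assumption)
    return np.stack([ndvi, green, blue], axis=-1)
```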
The NDVI-ML method performed considerably better than the investigated deep learning methods
for vegetation detection and does not need any training data. However, NDVI-ML consists of several
rules and thresholds that need to be selected properly by the user, and the parameters and thresholds
used in these rules will most likely need to be revisited for test images other than the Kimisala data.
NDVI-ML also treats vegetation detection as a binary classification problem (vegetation vs.
non-vegetation), since it depends on NDVI for detecting candidate vegetation pixels in its first step,
whereas the deep learning-based methods offer the flexibility to classify different vegetation types
(such as tree, shrub, and grass). Of the two deep learning methods, DeepLabV3+ provided very good
detection performance using only RGB images without the NIR band, showing that for low-budget
land cover classification applications using drones with low-cost onboard RGB cameras, DeepLabV3+
could certainly be a viable method.
Comparing the deep learning and NDVI-based approaches, we observe that the NDVI-ML method
provided significantly better results than the two deep learning methods. This may look surprising
because deep learning methods are normally expected to perform better than conventional techniques.
However, a close look at the results and images reveals that these findings are actually reasonable from
two perspectives. First, for deep learning methods to work well, a large amount of training data is
necessary; otherwise, the performance will not be good. Second, for deep learning methods to work
decently, it is better for the training and testing images to have a close resemblance. However, in our
case, the training and testing images are somewhat different even though they were captured by the
same camera system, as can be seen from Figure 13, making the vegetation classification challenging
for deep learning methods. In our recent study [5], we observed more serious detection performance
drops with DeepLabV3+ when the training and testing datasets had different image resolutions and the
camera systems that captured these images were different.
Figure 13. Zoomed-in sections of the Vasiliko (training) and Kimisala (test) images (non-vegetation,
vegetation, and mixed examples; 25 × 25 and 200 × 200 pixel crops).
Another limitation of DeepLabV3+ is that it accepts only three input channels and requires
architecture modifications when more than three channels are to be used. Even if these modifications
are done properly, the training will have to start from scratch since there are no pre-trained
DeepLabV3+ models other than those with RGB input channels. Moreover, one needs to find a
significant number of training images that contain all of these additional input channels. This may not
be practical since the existing RGB pre-trained model utilized thousands, if not millions, of RGB images
in the training process. Our customized CNN method, on the other hand, can handle more than three
channels; however, the training needs to start from scratch since there are no pre-trained models
available for the NIR band.
One other challenge with deep learning methods is when the dataset is imbalanced. With heavily
imbalanced datasets, the error from the overrepresented classes contributes much more to the loss
value than the error contribution from the underrepresented classes. This biases the deep learning
method's loss function toward the overrepresented classes, resulting in poor classification performance
for the underrepresented classes [50]. One should also pay attention when applying deep learning
methods to new applications, because one requirement for deep learning is the availability of a vast
amount of training data. Moreover, the training data needs to have similar characteristics to the testing
data; otherwise, deep learning methods may not yield good performance. Augmenting the training
dataset using different brightness levels, adding vertically and horizontally flipped versions, and
shifting, rotating, or adding noisy versions of the training images could be potential strategies to
mitigate these issues when the test data characteristics differ from the training data.
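Two of the mitigations mentioned above, weighting the loss against class imbalance and augmenting the training patches, can be prototyped in a few lines. The sketch below is illustrative only; the inverse-frequency weighting scheme and the augmentation parameters are assumptions, not the settings used in this work.

```python
# Illustrative sketch: inverse-frequency class weights and simple patch augmentation.
import numpy as np

def class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def augment(patches, labels, rng=None):
    """Add flipped and brightness-scaled copies of (N, H, W, C) training patches."""
    rng = rng or np.random.default_rng(0)
    flipped_lr = patches[:, :, ::-1, :]            # horizontal flip
    flipped_ud = patches[:, ::-1, :, :]            # vertical flip
    gain = rng.uniform(0.8, 1.2, size=(len(patches), 1, 1, 1)).astype("float32")
    brighter = np.clip(patches * gain, 0.0, 1.0)
    x = np.concatenate([patches, flipped_lr, flipped_ud, brighter], axis=0)
    y = np.concatenate([labels] * 4, axis=0)       # labels repeated to match the copies
    return x, y

# With Keras, the weights can be passed directly to training:
# model.fit(x, y, class_weight=class_weights(y), epochs=10, batch_size=256)
```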
5. Conclusions
In this paper, we investigated the performance of three methods for vegetation detection. Two of
these methods are based on deep learning, and the other is an object-based method that utilizes NDVI,
computer vision, and machine learning techniques. Experimental results showed that the DeepLabV3+
model that used the RGB bands performed reasonably well. However, it is challenging to extend that
model to include the NIR band in addition to the three RGB bands. When the red band is replaced with
the NDVI band to enable the use of all four input channels to some extent while fulfilling DeepLabV3+'s
restriction of three input channels only, we noticed some slight detection improvements; yet, this is not
fully equivalent to using all four bands at once. In contrast to DeepLabV3+, our
customized CNN model can be easily adapted to use RGB+NIR bands. With our customized CNN
model, slightly better results than DeepLabV3+ were obtained for the Kimisala-20 dataset when RGB
and NIR bands were used. Overall, we found the NDVI-ML approach to perform better than both
deep learning models. We anticipate that the reason is that the training data and testing data are
different in appearance, making it challenging for deep learning methods. In contrast, NDVI-ML does
not require any training data and may be more practical in real-world applications. However, NDVI-ML
is not applicable to situations where the NIR band is not available and might need special care when
choosing optimal parameters that might vary for different test images with different resolutions. Even
though vegetation detection with a reasonable level of accuracy is possible with DeepLabV3+ using
RGB bands only, one of the future research directions would be the customization of the DeepLabV3+
framework to accept more than three channels so that the NIR band can be used together with the
three color channels, RGB. Another direction would be using augmentation techniques with deep
learning methods to diversify the training data so that more robust responses can be obtained when
the test data characteristics considerably differ from training data.

Author Contributions: Conceptualization, B.A.; methodology, B.A.; software, B.A., Y.L., D.P. and J.L.; validation,
B.A.; investigation, B.A. and B.B.; resources, D.S. and M.V.; data curation, B.A. and L.K.; writing—original draft
preparation, C.K. and B.A.; supervision, C.K.; project administration, C.K.; funding acquisition, C.K. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was supported by US Department of Energy under grant # DE-SC0019936. The views,
opinions and/or findings expressed are those of the author and should not be interpreted as representing the
official views or policies of the Department of Defense or the U.S. Government.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Kwan, C.; Gribben, D.; Ayhan, B.; Bernabe, S.; Plaza, A.; Selva, M. Improving Land Cover Classification Using
Extended Multi-Attribute Profiles (EMAP) Enhanced Color, Near Infrared, and LiDAR Data. Remote Sens.
2020, 12, 1392. [CrossRef]
2. Tan, K.; Zhang, Y.; Wang, X.; Chen, Y. Object-Based Change Detection Using Multiple Classifiers and
Multi-Scale Uncertainty Analysis. Remote Sens. 2019, 11, 359. [CrossRef]
3. Skarlatos, D.; Vlachos, M. Vegetation removal from UAV derived DSMS, using combination of RGB and NIR
imagery. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, IV-2, 255–262.
4. Hellesen, T.; Matikainen, L. An Object-Based Approach for Mapping Shrub and Tree Cover on Grassland
Habitats by Use of LiDAR and CIR Orthoimages. Remote Sens. 2013, 5, 558–583. [CrossRef]
5. Ayhan, B.; Kwan, C.; Kwan, L.; Skarlatos, D.; Vlachos, M. Deep learning models for accurate vegetation
classification using RGB image only. In Proceedings of the Geospatial Informatics X (Conference SI113),
Anaheim, CA, USA, 21 April 2020.
6. Ayhan, B.; Kwan, C. A Comparative Study of Two Approaches for UAV Emergency Landing Site Surface
Type Estimation. In Proceedings of the 44th Annual Conference of the IEEE Industrial Electronics Society,
Washington, DC, USA, 21–23 October 2018.
7. Ayhan, B.; Kwan, C.; Budavari, B.; Larkin, J.; Gribben, D. Semi-automated emergency landing site selection
approach for UAVs. IEEE Trans. Aerosp. Electron. Syst. 2019, 55, 1892–1906. [CrossRef]
8. Guirado, E.; Tabik, S.; Alcaraz-Segura, D.; Cabello, J.; Herrera, F. Deep-learning versus OBIA for scattered
shrub detection with Google earth imagery: Ziziphus Lotus as case study. Remote Sens. 2017, 9, 1220.
[CrossRef]
9. Lindgren, D. Land Use Planning and Remote Sensing; Taylor & Francis: Milton, UK, 1984; Volume 2.
10. Yang, L.; Wu, X.; Praun, E.; Ma, X. Tree detection from aerial imagery. In Proceedings of the 17th ACM
SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA,
4–6 November 2009; pp. 131–137.
11. Bradley, D.M.; Unnikrishnan, R.; Bagnell, J. Vegetation detection for driving in complex environments.
In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy,
10–14 April 2007; pp. 503–508.
12. Zare, A.; Bolton, J.; Gader, P.; Schatten, M. Vegetation mapping for landmine detection using long-wave
hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2007, 46, 172–178. [CrossRef]
13. Miura, T.; Huete, A.R.; Van Leeuwen, W.J.D.; Didan, K. Vegetation detection through smoke-filled AVIRIS
images: An assessment using MODIS band passes. J. Geophys. Res. Atmos. 1998, 103, 32001–32011. [CrossRef]
14. Gandhi, G.M.; Parthiban, S.; Thummalu, N.; Christy, A. Ndvi: Vegetation change detection using remote
sensing and gis–A case study of Vellore District. Procedia Comput. Sci. 2015, 57, 1199–1210. [CrossRef]
15. Bhandari, A.K.; Kumar, A.; Singh, G.K. Feature extraction using Normalized Difference Vegetation Index
(NDVI): A case study of Jabalpur city. Procedia Technol. 2012, 6, 612–621. [CrossRef]
16. Zhou, H.; Van Rompaey, A. Detecting the impact of the “Grain for Green” program on the mean annual
vegetation cover in the Shaanxi province, China using SPOT-VGT NDVI data. Land Use Policy 2009, 26,
954–960. [CrossRef]
17. Gates, D.M. Biophysical Ecology; Springer: New York, NY, USA, 1980; p. 611.
18. Salamati, N.; Larlus, D.; Csurka, G.; Süsstrunk, S. Incorporating near-infrared information into semantic
image segmentation. arXiv 2014, arXiv:1406.6147.
19. Senecal, J.J.; Sheppard, J.W.; Shaw, J.A. Efficient Convolutional Neural Networks for Multi-Spectral Image
Classification. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN),
Budapest, Hungary, 14–19 July 2019; pp. 1–8.
20. Zhang, X.; Han, L.; Han, L.; Zhu, L. How Well Do Deep Learning-Based Methods for Land Cover Classification
and Object Detection Perform on High Resolution Remote Sensing Imagery? Remote Sens. 2020, 12, 417.
[CrossRef]
21. Audebert, N.; Saux, B.L.; Lefevre, S. Semantic Segmentation of Earth Observation Data Using Multimodal
and Multi-scale Deep Networks. arXiv 2016, arXiv:1609.06846.
22. Huang, B.; Zhao, B.; Song, Y. Urban land-use mapping using a deep convolutional neural network with high
spatial resolution multispectral remote sensing imagery. Remote Sens. Environ. 2018, 214, 73–86. [CrossRef]
23. Kemker, R.; Salvaggio, C.; Kanan, C. Algorithms for Semantic Segmentation of Multispectral Remote Sensing
Imagery using Deep Learning. ISPRS J. Photogramm. Remote Sens. 2018, 145, 60–77. [CrossRef]
24. Zheng, C.; Wang, L. Semantic Segmentation of Remote Sensing Imagery Using Object-Based Markov Random
Field Model With Regional Penalties. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 1924–1935.
[CrossRef]
25. Chen, L.-C.; Papandreou, G.; Kokkinos, K.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal.
Mach. Intell. 2017, 40, 834–848. [CrossRef]
26. Available online: http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&
compid=6&submid=15347#KEY_DeepLabv3+_JFT (accessed on 10 April 2020).
27. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
28. Vijay, B.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image
segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
29. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 633–641.
30. Du, Z.; Yang, J.; Ou, C.; Zhang, T. Smallholder Crop Area Mapped with a Semantic Segmentation Deep
Learning Method. Remote Sens. 2019, 11, 888. [CrossRef]
31. Wang, J.; Shen, L.; Qiao, W.; Dai, Y.; Li, Z. Deep feature fusion with integration of residual connection and
attention model for classification of VHR remote sensing images. Remote Sens. 2019, 11, 1617. [CrossRef]
32. Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale Adaptive Feature Fusion Network for
Semantic Segmentation in Remote Sensing Images. Remote Sens. 2020, 12, 872. [CrossRef]
33. Ding, L.; Tang, H.; Bruzzone, L. Improving Semantic Segmentation of Aerial Images Using Patch-based
Attention. arXiv 2019, arXiv:1911.08877.
34. Torres, D.L.; Feitosa, R.Q.; Happ, P.N.; Rosa, L.E.C.L.; Junior, J.M.; Martins, J.; Bressan, P.O.; Gonçalves, W.N.;
Liesenberg, V. Applying Fully Convolutional Architectures for Semantic Segmentation of a Single Tree
Species in Urban Environment on High Resolution UAV Optical Imagery. Sensors 2020, 20, 563. [CrossRef]
[PubMed]
35. Huang, S.; Nex, F.; Lin, Y.; Yang, M.Y. Semantic segmentation of building in airborne images. Int. Arch.
Photogramm. Remote Sens. Spat. Inf. Sci. 2019, 42, 35–42. [CrossRef]
36. Example Dataset of EOPatches for Slovenia. 2017. Available online: http://eo-learn.sentinel-hub.com/
(accessed on 10 April 2020).
37. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Raskar, R.D. A challenge to parse the
earth through satellite images. arXiv 2018, arXiv:1805.06561.
38. Perez, D.; Banerjee, D.; Kwan, C.; Dao, M.; Shen, Y.; Koperski, K.; Marchisio, G.; Li, J. Deep Learning for
Effective Detection of Excavated Soil Related to Illegal Tunnel Activities. In Proceedings of the IEEE Ubiquitous
Computing Electronics and Mobile Communication Conference, New York, NY, USA, 19–21 October 2017;
pp. 626–632.
39. Lu, Y.; Koperski, K.; Kwan, C.; Li, J. Deep Learning for Effective Refugee Tent Extraction Near Syria-Jordan
Border. IEEE Geosci. Remote Sens. Lett. Early Access 2020. [CrossRef]
40. Kwan, C.; Ayhan, B.; Budavari, B.; Lu, Y.; Perez, D.; Li, J.; Bernabe, S.; Plaza, A. Deep Learning for Land
Cover Classification Using Only a Few Bands. Remote Sens. 2020, 12, 2000. [CrossRef]
41. Transfer Learning from RGB to Multi-Band Imagery. Available online: https://www.azavea.com/blog/2019/
08/30/transfer-learning-from-rgb-to-multi-band-imagery/ (accessed on 10 April 2020).
42. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution
for semantic image segmentation. In Proceedings of the European Conference on Computer vision (ECCV),
Munich, Germany, 8–14 September 2018; pp. 801–818.
43. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent
neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
44. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring vegetation systems in the Great Plains with
ERTS. NASA Spec. Publ. 1974, 1, 309–317.
45. Pothen, A.; Fan, C.J. Computing the block triangular form of a sparse matrix. ACM Trans. Math. Softw. (TOMS)
1990, 16, 303–324. [CrossRef]
46. MathWorks Image Processing Toolbox. Available online: https://www.mathworks.com/help/images/ref/
bwconncomp.html (accessed on 10 April 2020).
47. Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker verification using adapted Gaussian mixture models.
Digit. Signal. Process. 2000, 10, 19–41. [CrossRef]
48. Mathworks Computer Vision Toolbox, Evaluate Semantic Segmentation Data Set against Ground
Truth. Available online: https://www.mathworks.com/help/vision/ref/evaluatesemanticsegmentation.html
(accessed on 10 April 2020).
49. Tensorflow Models, DeepLabV3+ Training Details. Available online: https://github.com/tensorflow/models/
issues/4345 (accessed on 10 April 2020).
50. Ayhan, B.; Kwan, C. Tree, Shrub, and Grass Classification Using Only RGB Images. Remote Sens. 2020,
12, 1333. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
