Journal of Ambient Intelligence and Humanized Computing (2023) 14:1259–1268

https://doi.org/10.1007/s12652-021-03377-5

ORIGINAL RESEARCH

An accurate car counting in aerial images based on convolutional neural networks

Ersin Kilic · Serkan Ozturk
Erciyes University, Kayseri, Turkey (correspondence: ersinkilic@erciyes.edu.tr)

Received: 19 October 2020 / Accepted: 5 July 2021 / Published online: 13 July 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
This paper proposes a simple and effective single-shot detector model to detect and count cars in aerial images. The proposed model, called heatmap learner convolutional neural network (HLCNN), is used to predict the heatmap of target car instances. In order to learn the heatmap of the target cars, we have improved the CNN architecture by adding three convolutional layers as adaptation layers instead of fully connected layers. VGG-16 has been used as the backbone convolutional neural network in the proposed model. The proposed method successfully determines the number of cars and precisely detects the centers of the target cars. Experiments on two different car datasets (PUCPR+ and CARPK) show the state-of-the-art counting and localization performance of the proposed method in comparison with existing methods. Also, experiments have been conducted to examine the effect of data augmentation and batch normalization on the success of the proposed method. The code and data will be made available at https://www.github.com/ekilic/Heatmap-Learner-CNN-for-Object-Counting.

Keywords Deep learning · Car counting · Object counting · Convolutional neural networks

1 Introduction

With the development of unmanned flying vehicles, new potential applications have emerged for aerial view cameras, such as urban management, traffic monitoring (Wang et al. 2019), farm management, and parking lot utilization (Chen et al. 2019). Generally, these applications need to find out the number of objects, such as cars on the road or in parking lots (Mundhenk et al. 2016; Hsieh et al. 2017; Sun et al. 2017; Di Mauro et al. 2019; Chen et al. 2020), people in a crowd (Nogueira et al. 2019; Revathi and Rajalaxmi 2019; Razakarivony and Jurie 2016), sheep or cattle in a grazing area (Sarwar et al. 2018; Shao et al. 2019), or tobacco or banana plants in a plantation area (Fan et al. 2018; Neupane et al. 2019).

Object counting methods aim to estimate the number of objects in a still image or video frame. Object detection identifies visual objects of a specific class and determines where the objects are in a digital image (Zou et al. 2019). The determination of the number of objects in an image without precise object positions separates object counting methods from object detection and instance segmentation (Kang et al. 2019). Moreover, the object counting problem can be considered a sub-problem of the object detection problem. Thus, existing object detection and instance segmentation methods can be used to estimate the number of objects in an image: counting by detection and segmentation is a natural consequence of methods that localize individual object instances in the image.

Predominant approaches for object counting learn a regression model that predicts a non-negative scalar count. A single-output convolutional neural network (CNN) model that predicts a positive object count is frequently used for object counting tasks. These CNN-based approaches are called single-shot regression models, which predict a scalar count for an input image without explicit localization information for the objects. Unlike the regression-based approaches, detection-based approaches localize the objects to determine their number. Anchor-based object detection models such as Faster R-CNN (Ren et al. 2015), SSD (Liu et al. 2016), YOLO (Redmon et al. 2016) and RetinaNet (Lin et al. 2017), and anchor-free object detection models such as CornerNet (Law and Deng 2018) and CenterNet (Zhou et al. 2019a), have shown state-of-the-art performance in
many object detection tasks (Cazzato et al. 2020). These methods and their variants have also shown promising results for car counting in aerial images (Hsieh et al. 2017; Goldman et al. 2019; Li et al. 2019; Cai et al. 2019). In fact, counting without localizing is not difficult for CNN models, since these models can classify and detect objects from dozens of different classes. Indeed, visualizations of the feature maps of regression-based models have demonstrated that feature maps carry information about car locations (Aich and Stavness 2018b).

Aich and Stavness (2018b) used the class activation map (CAM) to regulate the feature maps from the final convolutional layer of a CNN in order to enhance a pure single-shot regression model for car counting. CAM is a method that visualizes the class-specific discriminative regions detected by a CNN on an input image; it makes it possible to see which regions of the image are relevant to a particular class. Zhou et al. (2016) describe the procedure for generating CAMs using global average pooling (GAP) in CNNs. They proposed a simple modification of the global average pooling layer that allows a CNN to both classify the image and localize class-specific image regions in a single forward pass. The motivation of CAM lies in the idea that a CNN can produce a heatmap that shows the correct position of an object. Oquab et al. (2015) proposed a weakly supervised object classification approach to demonstrate that a CNN can learn object localization without object-level annotation; they treated the fully connected layers as convolutions. Chen et al. (2019) proposed a novel anchor-free CNN model called the target heatmap network, which consists of a shallow feature-extracting network, a decoder network, and a location method. In their approach, the target heatmap network generates a 2-D heatmap of the objects in the remote sensing image.

To learn to generate the heatmap with a supervised approach, ground-truth heatmaps should be created for training. Pfister et al. (2015) and Chu et al. (2017) used a Gaussian function with fixed size and fixed variance to mark each joint of the human body in a heatmap. Chen et al. (2019) used a normalized Gaussian function whose parameters vary with the object size to train a deep network that detects oil tanks, vehicles, and aircraft in remote sensing images. The process of creating a heatmap using the Gaussian function is called a Gaussian activation map (GAM).

By considering the localization ability of CNNs mentioned above, a novel single-shot detection model based on CNN is proposed in this paper. The proposed model is called the heatmap learner CNN (HLCNN). HLCNN is trained using GAMs for heatmap prediction of the target cars. The proposed model is more successful than existing regression-based methods for car counting. Also, the proposed method localizes the objects by detecting the center points of the counted vehicles. Besides facilitating the explanation of the counting result, it reveals how the model works and where it makes mistakes. HLCNN is an anchor-free CNN model that does not represent objects with anchor boxes. The anchor-box representation has three main hyperparameters, namely the number, size, and aspect ratios of the boxes (Law and Deng 2018). The optimal adjustment of these parameters directly affects the success of a model. HLCNN has fewer hyperparameters and is therefore easy to use and robust. For evaluating the success of our model, we used two car counting datasets: CARPK and PUCPR+ (Aich and Stavness 2018b).

The main contributions of this paper are summarized as follows: (i) We propose a novel CNN model that produces a heatmap for car locations in a single shot. (ii) The effect of data augmentation and batch normalization (BN) on the success of the HLCNN in predicting heatmaps has been demonstrated by detailed experiments. (iii) We improve the accuracy of state-of-the-art car counting methods for the CARPK and PUCPR+ datasets: we improved the mean absolute error (MAE) for car counting on the CARPK and PUCPR+ datasets from 4.80 to 2.12 and from 3.68 to 2.52, respectively. Finally, our model has obtained impressive results in detecting the center points of cars. These results demonstrate that the proposed method has significant potential for object detection problems.

2 Literature review

A number of published studies in the literature that focus on tackling counting tasks can be roughly categorized into two categories: counting by detection and counting by regression. Popular object detection frameworks train CNN models to extract features of candidate object proposals and then predict the class probabilities and regress the bounding boxes of those proposals. These approaches are called two-stage object detection methods, and the most well-known methods belong to the R-CNN family (Girshick 2015; Ren et al. 2015). Although these two-stage methods are successful in detection, they are slower than single-shot methods such as YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016; Zou et al. 2019). YOLO and SSD implement object classifiers and bounding box regressors in an end-to-end manner without extracting proposals. Saribas et al. (2018) and Ammar et al. (2019) have used Faster R-CNN and YOLO for detecting cars in aerial images. The direct application of most state-of-the-art object detection frameworks to object counting requires a particular focus on densely packed small-scale objects such as cars. The layout proposal network presented in Hsieh et al. (2017) proposes a novel bounding box generation method with spatial kernels to simultaneously count and localize target objects. Goldman et al. (2019) propose learning the Jaccard index with a soft Intersection over Union (Soft-IoU) network layer for car counting in densely packed scenes. A

novel scale-adaptive strategy for generating anchors and an effective loss function that pushes the anchors towards matching the ground-truth boxes are proposed by Li et al. (2019) for detecting and counting dense vehicles in drone images. The guided attention network presented in Cai et al. (2019) enhanced the feature representation with background attention and foreground attention modules to facilitate accurate car localization. Objects were represented by bounding boxes (anchors) in these methods, which treat object detection as image classification over an extensive number of potential object bounding boxes (Wu et al. 2019). Anchor-based methods include a post-processing stage, namely non-maximum suppression (NMS), which serves to eliminate duplicated bounding boxes (Kilic and Ozturk 2019).

The anchor has long been the predominant form of object representation in object detection tasks. Recently, there have been some attempts to develop different object representations for building more effective deep models. A rotated bounding box to better handle rotational variations is proposed in Zhou et al. (2017). ExtremeNet (Zhou et al. 2019b) was proposed to locate the extreme points of objects in the x- and y-directions. To overcome the limitations of the anchor representation, RepPoints was proposed by Yang et al. (2019) for finer localization and better object feature extraction. Zhou et al. (2019a) proposed CenterNet, which uses keypoint estimation to find the center points of objects. dos Santos de Arruda et al. (2021) proposed an anchor-free CNN-based approach for counting and locating objects in high-density imagery using Gaussian kernels, and evaluated the proposed method on two counting datasets: cars and trees.

Counting-by-regression methods aim to estimate the number of objects without dealing with object localization. Thus, instead of solving the challenging object detection task, regression-based methods learn a direct mapping from some global image characteristics to the non-negative number of objects. A number of approaches using global regressors (Chan et al. 2008; Chen et al. 2012, 2013) have not achieved the desired success because they were trained with low-level features. Several methods (Lempitsky and Zisserman 2010; Fiaschi et al. 2012; Arteta et al. 2014) try to predict the count by density map estimation. Lempitsky and Zisserman (2010) proposed to generate a pixel-level ground-truth density map from the dot annotations using one Gaussian kernel per object. Based on this ground-truth density map, Aich and Stavness (2018b) enhance a one-look regression counting model by regulating the activation maps of the CNN model. Xie et al. (2016) propose fully convolutional regression networks that predict the final density map to count microscopy cells.

In this paper, an efficient CNN-based single-shot car detector model for aerial images is proposed. Unlike the regression-based methods, the proposed method can detect the center points of the target objects while counting cars accurately. The proposed method has been tested with the CARPK and PUCPR+ datasets, and state-of-the-art results have been obtained.

3 Proposed approach

A car counting system based on HLCNN consists of two stages: predicting the heatmap of an image by HLCNN, and generating a peak-map from the heatmap to locate and count the cars in the image. Figure 1 gives an overview of the proposed framework. To train the HLCNN model, GAM is used to create heatmaps that show the size and location of cars. The GAM, HLCNN, and peak-map generation methods are described in detail in the ensuing subsections.

3.1 Gaussian activation map (GAM)

The Gaussian function is used to generate the heatmap to be used during training. The created heatmap is called the GAM. The Gaussian function (kernel) is given in the following form:

G(x, y) = \alpha \exp\left(-\left(a(x - x_0)^2 + 2b(x - x_0)(y - y_0) + c(y - y_0)^2\right)\right)   (1)

G(x, y) = G(x, y) / \max(G(x, y))   (2)

where \alpha is a normalization factor and x_0 and y_0 are the coordinates of the kernel center. a, b and c are coefficients that control the size and orientation of the kernel. In this study, a, b and c are calculated using the equations given below:

a = \frac{\cos^2\theta}{2\sigma_x^2} + \frac{\sin^2\theta}{2\sigma_y^2}   (3)

b = -\frac{\sin 2\theta}{4\sigma_x^2} + \frac{\sin 2\theta}{4\sigma_y^2}   (4)

c = \frac{\sin^2\theta}{2\sigma_x^2} + \frac{\cos^2\theta}{2\sigma_y^2}   (5)

The \theta value determines the orientation of the kernel, whereas \sigma_x and \sigma_y are determined by the width and length of the object, respectively. The maximum value calculated by the Gaussian function depends on the size of the object; therefore, the maximum value of the Gaussian function is set to 1 with Eq. 2. The GAM is a heatmap consisting of the distribution of the synthesized Gaussian functions calculated for each object. Figure 2 shows a sample image and the heatmap of its objects created by GAM. Oriented bounding box (OBB) annotation must be included in the dataset to determine the object orientation. If the dataset does not include OBBs, \theta is set to zero.
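To make the construction concrete, the following Python sketch renders a GAM from object annotations using Eqs. 1-5. It is our illustrative reconstruction, not the authors' released code: the helper name, the mapping of the box width and length to \sigma_x and \sigma_y, and the pixel-wise maximum used to combine overlapping kernels are assumptions.

import numpy as np

def gaussian_activation_map(shape, objects):
    # shape: (h, w) of the target heatmap (i.e. the w/r x h/r grid, with
    # annotations scaled accordingly); objects: iterable of
    # (x0, y0, width, length, theta), theta in radians
    # (theta = 0 when the dataset provides no OBB annotation).
    h, w = shape
    gam = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    for x0, y0, width, length, theta in objects:
        # Assumed mapping of object size to the kernel spread.
        sx = max(width / 2.0, 1.0)
        sy = max(length / 2.0, 1.0)
        a = np.cos(theta) ** 2 / (2 * sx ** 2) + np.sin(theta) ** 2 / (2 * sy ** 2)  # Eq. 3
        b = -np.sin(2 * theta) / (4 * sx ** 2) + np.sin(2 * theta) / (4 * sy ** 2)   # Eq. 4
        c = np.sin(theta) ** 2 / (2 * sx ** 2) + np.cos(theta) ** 2 / (2 * sy ** 2)  # Eq. 5
        g = np.exp(-(a * (xs - x0) ** 2
                     + 2 * b * (xs - x0) * (ys - y0)
                     + c * (ys - y0) ** 2))  # Eq. 1 (alpha cancels in the division below)
        g /= g.max()                         # Eq. 2: peak of each kernel becomes 1
        gam = np.maximum(gam, g)             # assumed overlay of the per-object kernels
    return gam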

Fig. 1 Overview of the proposed car counting framework. Detected center points of the target cars are indicated by red points in the output samples

3.2 Heatmap learner convolutional neural networks (HLCNNs)

The input image to HLCNN can be of any size, whereas the output size of HLCNN varies depending on the downsampling ratio of the model and the number of classes. A model with a downsampling ratio r generates an output with dimensions w/r × h/r × C for an input with dimensions w × h × c. The variables w, h and c denote the width, height and number of color channels of the input image, respectively, and C denotes the number of target object categories. The model generates a separate heatmap for each object class.

The proposed model consists of two parts: a backbone network and adaptation layers. Any standard CNN model without fully connected layers can be used as a backbone network. The adaptation layers consist of convolution layers with 1×1 filters and different numbers of feature maps. In the adaptation layers, all convolutional layers are followed by the leaky ReLU activation function (Eq. 6) (Xu et al. 2015), where the \alpha value determines the coefficient of leakage. The adaptation layers generate a number of heatmaps depending on the number of object categories. In addition, a batch normalization (BN) (Ioffe and Szegedy 2015) layer has been added to the backbone network output.

\phi(x) = \begin{cases} \alpha x, & x \le 0 \\ x, & x > 0 \end{cases}   (6)

The parameters of the pooling layers and convolution layers contained in the model determine the downsampling ratio. The downsampling ratio for both the VGG-16 and ResNet-18 models is 32. The downsampling factor is crucial for the success of the HLCNN model and should be determined according to the size of the objects to be counted; this parameter significantly affects the success of a model. The architecture of the HLCNN model with VGG-16 with 3 max-pooling layers as the backbone network is illustrated in Fig. 3. For example, a 720 × 1280 pixel input leads to a 90 × 160 heatmap for each class. While training, we back-propagate the L1 loss (Eq. 7) between the predicted heatmap and the GAM. In Eq. 7, w and h denote the width and height of the GAM, respectively, and H_{pred} and GAM denote the predicted heatmap and the calculated heatmap of the target cars, respectively.

L_1 = \frac{1}{w \cdot h} \sum_{i=1}^{w} \sum_{j=1}^{h} \left| H_{pred,ij} - GAM_{ij} \right|   (7)
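A minimal PyTorch sketch of this design is given below, under stated assumptions: the backbone is VGG-16 truncated after its third max-pooling stage (downsampling ratio 8, 256 output channels), and the channel widths of the three 1×1 adaptation layers are illustrative guesses, since the paper does not list them. The loss and optimizer follow Eq. 7 and the settings reported in Sect. 4.2.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class HLCNN(nn.Module):
    # Heatmap learner CNN: truncated VGG-16 backbone + BN + 1x1 adaptation layers.
    def __init__(self, num_classes=1, leak=0.01):
        super().__init__()
        # VGG-16 features up to (and including) the 3rd max-pool:
        # downsampling ratio 8, 256 output channels.
        self.backbone = vgg16(pretrained=True).features[:17]
        self.bn = nn.BatchNorm2d(256)  # BN added to the backbone output
        # Three 1x1 adaptation convolutions, each followed by leaky ReLU (Eq. 6);
        # the intermediate widths 256 and 128 are assumptions.
        self.adapt = nn.Sequential(
            nn.Conv2d(256, 256, 1), nn.LeakyReLU(leak),
            nn.Conv2d(256, 128, 1), nn.LeakyReLU(leak),
            nn.Conv2d(128, num_classes, 1), nn.LeakyReLU(leak),
        )

    def forward(self, x):  # x: (n, 3, h, w)
        return self.adapt(self.bn(self.backbone(x)))  # (n, C, h/8, w/8)

model = HLCNN()
criterion = nn.L1Loss()  # mean |H_pred - GAM|, i.e. Eq. 7
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # ADAM and lr of Sect. 4.2

One training step would then back-propagate criterion(model(image), gam), with the GAM rendered at the heatmap resolution as in Sect. 3.1.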

3.3 Peak-map generation

We propose a simple algorithm to detect the center points of objects from the heatmap. The proposed method takes a predicted heatmap and detects the peaks using a local maximum filter. The maximum filter detects the highest value of the pixels within a local region of an image. Algorithm 1 describes the process of the proposed peak-map generation method.
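Since Algorithm 1 itself is not reproduced here, the sketch below shows one plausible realization of this step; the 3 × 3 filter window and the use of scipy.ndimage.maximum_filter are our assumptions, with the threshold t playing the role described in Sect. 4 (e.g., t = 0.05 on CARPK).

import numpy as np
from scipy.ndimage import maximum_filter

def peak_map(heatmap, t=0.05, size=3):
    # A pixel is a peak if it equals the local maximum of its
    # size x size neighborhood and exceeds the threshold t.
    local_max = maximum_filter(heatmap, size=size)
    peaks = (heatmap == local_max) & (heatmap > t)
    return peaks.astype(np.uint8)  # white points at detected car centers

# The car count is then simply the number of peaks:
# count = int(peak_map(pred).sum())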

Fig. 2 a Sample image from the CARPK dataset. b Heatmap of cars created by GAM

Fig. 3  Architecture of the proposed HLCNN model


4 Experimental study

We report the experimental results of object counting with MAE and root mean square error (RMSE) (Aich and Stavness 2018b). The object localization results are reported using peak-maps, which show the center of each object as a white point on a black background. We implemented our model using PyTorch (Paszke et al. 2019). To analyze the effectiveness of our model against different approaches fairly, we implemented our model with VGG-16 (Simonyan and Zisserman 2014) as the backbone network. In order to train the models with CARPK and PUCPR+, we employed transfer learning by using the pre-trained weights obtained from training on the ImageNet dataset (Deng et al. 2009). We ran our experiments in a Linux environment with 16 GB of memory and an NVidia GTX1080-Ti GPU. The following numerical metrics (Eq. 8) are used:

MAE = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - y_i \right|, \qquad RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}   (8)

where x_i and y_i are the actual and predicted counts for the i-th image and N is the number of images.
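As a quick sketch (an illustrative helper of ours, not code from the paper), both metrics of Eq. 8 can be computed from per-image counts as follows:

import numpy as np

def count_metrics(actual, predicted):
    # MAE and RMSE over per-image car counts (Eq. 8).
    x = np.asarray(actual, dtype=float)
    y = np.asarray(predicted, dtype=float)
    mae = np.abs(x - y).mean()
    rmse = np.sqrt(((x - y) ** 2).mean())
    return mae, rmse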
4.1 Datasets

Experimental studies have been conducted using the CARPK and PUCPR+ datasets. The PUCPR+ dataset is a revised version of the original PUCPR dataset. The PUCPR dataset consists of images that were captured with a fixed camera, which makes the images very similar to each other, so that PUCPR is not suitable for training deep learning models for counting and localization tasks. Hence, the PUCPR dataset was reorganized and annotated by Hsieh et al. (2017) to create the PUCPR+ dataset, with a total of 17,000 cars.

The CARPK dataset contains 989 training and 459 testing images with annotations for 89,777 cars in various scenes from 4 different parking lots. Unlike the PUCPR+ dataset, the images were captured using a drone, at a height of approximately 40 m. The CARPK dataset was introduced by Hsieh et al. (2017).
4.2 Results on CARPK

For evaluating the performance of the proposed method on the CARPK dataset, we have used three models with different downsampling ratios. The VGG-16 architecture with different numbers of max-pooling layers has been used as the backbone network. Three models with downsampling ratios of 16, 8 and 4 were trained to measure the effect of the model downsampling ratio on counting success. The ADAM optimizer has been used to train all networks, with the initial learning rate set to 0.0001 and the batch size set to 1. The networks have been trained for 30 epochs with all training images, using input images of size 540 × 960. The results of the experiments are shown in Table 1. The model trained with a downsampling ratio of 8 achieved the best MAE and RMSE values of 2.12 and 3.02, respectively. Thus, with a downsampling ratio of 8, the model generates a 90 × 160 heatmap for a 720 × 1280 input.

Table 1 Analysis of the effect of the downsampling ratio on performance for object counting in the CARPK dataset (the lowest counting errors are written in bold)

Downsampling ratio   MAE    RMSE
4                    5.29   7.15
8                    2.12   3.02
16                   6.00   9.80

We employed data augmentation methods including image rotation and color jitter. The color jitter augmentation randomly shifts the hue channel of an image with a probability of 0.5. The image rotation augmentation rotates images by a random angle (between 0 and +45 degrees) with a probability of 0.2. In order to demonstrate the effects of data augmentation and the BN layer, the model with a downsampling ratio of 8 has been trained with and without data augmentation and the BN layer. Firstly, the model called HLCNN-WOBNAU was trained without data augmentation and without the BN layer; it achieved MAE and RMSE values of 3.19 and 4.33, respectively. Secondly, the model called HLCNN-WOBN was trained without the BN layer; it achieved MAE and RMSE values of 2.61 and 3.52, respectively. Thirdly, the model called HLCNN-WOAU was trained without data augmentation; it achieved MAE and RMSE values of 2.73 and 3.93, respectively. The results show that the model trained with data augmentation and the BN layer achieved the best scores in terms of MAE and RMSE (Table 2).

Table 2 Performance analysis of the data augmentation methods and the BN layer for object counting on the CARPK dataset

Model          MAE    RMSE
HLCNN-WOBNAU   3.19   4.33
HLCNN-WOAU     2.73   3.93
HLCNN-WOBN     2.61   3.52
HLCNN          2.12   3.02
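A sketch of this augmentation pipeline in torchvision terms is shown below. The exact transform classes are our assumption (the paper does not name its implementation), but the hue-only jitter with probability 0.5 and the 0-45 degree rotation with probability 0.2 follow the text above. Note that a sampled rotation would have to be applied jointly to the image and its GAM target.

import torchvision.transforms as T

# Hue-only color jitter applied with probability 0.5;
# rotation by a random angle in [0, 45] degrees with probability 0.2.
augment = T.Compose([
    T.RandomApply([T.ColorJitter(hue=0.5)], p=0.5),
    T.RandomApply([T.RandomRotation(degrees=(0, 45))], p=0.2),
])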

Fig. 4 Sample visual results of the HLCNN-WOAU. a and b Sample images; red points indicate the detected center points of the target objects. c and d Heatmaps of the sample images predicted by HLCNN-WOAU (color figure online)

Fig. 5 Sample visual results of the HLCNN. a and b Sample images; red points indicate the detected center points of the target objects. c and d Heatmaps of the sample images predicted by HLCNN (color figure online)

Fig. 6 Sample visual results of the HLCNN-WOBNAU. a and b Sample images; red points indicate the detected center points of the target objects. c and d Heatmaps of the sample images predicted by HLCNN-WOBNAU (color figure online)

Fig. 7 Sample visual results of the HLCNN. a and b Sample images; red points indicate the detected center points of the target objects. c and d Heatmaps of the sample images predicted by HLCNN (color figure online)

Color-based data augmentation enhances the detection performance of the method by reducing false detections on the shadows of vehicles. The visual results of the HLCNN and HLCNN-WOAU models are presented in Figs. 4 and 5; they show the positive effect of color-based data augmentation. Figure 5 shows that the HLCNN has reduced the false detections caused by vehicle shadows. Also, the BN layer and data augmentation reduced the false detection rate of the proposed method: comparing the visual results in Figs. 6 and 7, this technique reduced the false detections of other objects such as graffiti and park toys. Cars parked side by side, or partly covered by trees or shadows, can remain undetected; mainly, partly visible cars cause the model to miss detections.

The threshold (t) value of peak-map generation has been specifically set for each model to maximize counting success. The t value was chosen as 0.05 for HLCNN with a downsampling ratio of 8. We compare our method with state-of-the-art algorithms in Table 3. The results show that our approach achieves the best MAE and RMSE. Figure 8 shows the object localization performance of the proposed method.

Fig. 8 Sample images from the CARPK dataset (top) and predicted heatmaps of cars (bottom). Red points indicate the detected center points of the target cars (color figure online)

Fig. 9 Sample images from the PUCPR+ dataset (top) and predicted heatmaps of cars (bottom). Red points indicate the detected center points of the target cars (color figure online)

Table 3 CARPK dataset results (the lowest counting errors are written in bold)

Method                                                          MAE     RMSE
Faster R-CNN (Ren et al. 2015; Hsieh et al. 2017)               24.32   37.62
YOLO (Redmon et al. 2016; Hsieh et al. 2017)                    48.89   57.55
One-Look Regression (Mundhenk et al. 2016; Hsieh et al. 2017)   59.46   66.84
LPN (Hsieh et al. 2017)                                         23.80   36.79
RetinaNet (Lin et al. 2018)                                     16.62   22.30
IEP Counting (Stahl et al. 2019)                                51.83   –
IoUNet (Goldman et al. 2019)                                    6.77    8.52
YOLOv3 (Redmon and Farhadi 2018)                                7.92    11.08
VGG-GAP-HR (Aich and Stavness 2018b)                            5.88    9.30
GSP-224 (Aich and Stavness 2018a)                               5.46    8.09
SA+CF+CRT (Li et al. 2019)                                      5.42    7.38
GANet (VGG-16) (Cai et al. 2019)                                4.80    6.90
dos Santos de Arruda et al. (2021)                              4.45    6.18
HLCNN (VGG-16)                                                  2.12    3.02

4.3 Results on PUCPR+

In order to evaluate the performance of our method on the PUCPR+ dataset, we used three models with different downsampling ratios. The VGG-16 with different numbers of max-pooling layers served as the backbone network. Similar data augmentation methods were utilized in both models. The models have been trained for 100 epochs with the same optimizer and learning rate as described for the experiments on the CARPK dataset. The threshold (t) value of peak-map generation was set to 0.45. The results of the experiments measuring the effect of the downsampling ratio are shown in Table 4. The model trained with a downsampling factor of 4 achieved the best RMSE value of 3.40.

Table 4 Analysis of the effect of the downsampling ratio on performance for object counting in the PUCPR+ dataset (the lowest counting errors are written in bold)

Downsampling ratio   MAE     RMSE
4                    2.52    3.40
8                    2.64    4.05
16                   29.92   41.62

The results of the experiments are compared with state-of-the-art methods. It can be seen from Table 5 that the proposed method achieves the best MAE and RMSE values, 2.52 and 3.40 respectively, among all the methods. Figure 9 shows the object localization performance of the proposed method.

Table 5 PUCPR+ dataset results (the lowest counting errors are written in bold)

Method                                                          MAE      RMSE
Faster R-CNN (Ren et al. 2015; Hsieh et al. 2017)               39.88    47.67
YOLO (Redmon et al. 2016; Hsieh et al. 2017)                    156.00   200.42
One-Look Regression (Mundhenk et al. 2016; Hsieh et al. 2017)   21.88    36.73
LPN (Hsieh et al. 2017)                                         22.76    34.46
RetinaNet (Lin et al. 2018)                                     24.58    33.12
IEP Counting (Stahl et al. 2019)                                15.17    –
IoUNet (Goldman et al. 2019)                                    7.16     12.00
VGG-GAP-HR (Aich and Stavness 2018b)                            5.24     6.67
YOLOv3 (Redmon and Farhadi 2018)                                5.24     7.14
SA+CF+CRT (Li et al. 2019)                                      3.92     5.06
GANet (VGG-16) (Cai et al. 2019)                                3.68     5.47
dos Santos de Arruda et al. (2021)                              3.16     4.39
HLCNN (VGG-16)                                                  2.52     3.40

5 Conclusion

In this paper, a novel CNN-based model to deal with object counting in remote sensing images has been proposed. The proposed method predicts a heatmap in a single shot to indicate the locations of cars. Existing regression-based object counting approaches predict the number of objects but cannot localize the target objects. In the experiments, the proposed model obtained impressive results in detecting the center points of the target objects. Also, experiments have been conducted to show the effect of the BN layer and data augmentation. The performance comparison of the proposed method on two challenging datasets has demonstrated its effectiveness: the HLCNN has improved on the state-of-the-art counting performance of existing methods. The dependence of the HLCNN on the downsampling ratio is the main disadvantage of the method. To solve this problem, we intend to improve the proposed HLCNN model with feature pyramid networks as future work. We also plan to test the model on multi-class object detection and counting datasets.

Acknowledgements This work is supported by Erciyes University, the Department of Research Projects under Contract FDK-2018-8624.

References

Aich S, Stavness I (2018a) Object counting with small datasets of large images. arXiv preprint arXiv:1805.11123
Aich S, Stavness I (2018b) Improving object counting with heatmap regulation. arXiv preprint arXiv:1803.05494
Ammar A, Koubaa A, Ahmed M, Saad A (2019) Aerial images processing for car detection using convolutional neural networks: comparison between Faster R-CNN and YOLOv3. arXiv preprint arXiv:1910.07234
Arteta C, Lempitsky V, Alison NJ, Zisserman A (2014) Interactive object counting. In: David F, Tomas P, Bernt S, Tinne T (eds) Computer vision—ECCV. Springer International Publishing, Berlin
Cai Y, Du D, Zhang L, Wen L, Wang W, Wu Y, Lyu S (2019) Guided attention network for object detection and counting on drones. arXiv preprint arXiv:1909.11307
Cazzato D, Claudio C, Jose Luis S-L, Holger V, Marco L (2020) A survey of computer vision methods for 2d object detection from unmanned aerial vehicles. J Imaging 6(8):78
Chan AB, Liang Z-SJ, Vasconcelos N (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. IEEE Conf Comput Vis Pattern Recognit. https://doi.org/10.1109/cvpr.2008.4587569
Chen K, Gong S, Xiang T, Loy CC (2013) Cumulative attribute space for age and crowd density estimation. In: 2013 IEEE conference on computer vision and pattern recognition (CVPR)
Chen H, Libao Z, Jie M, Jue Z (2019) Target heat-map network: an end-to-end deep network for target detection in remote sensing images. Neurocomputing 331:375–387. https://doi.org/10.1016/j.neucom.2018.11.044
Chen K, Loy CC, Gong S, Xiang T (2012) Feature mining for localised crowd counting. In: British machine vision conference (BMVC)
Chen W, Qiao Y, Li Y (2020) Inception-SSD: an improved single shot detector for vehicle detection. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-020-02085-w
Chu X, Yang W, Ouyang W, Ma C, Yuille AL, Wang X (2017) Multi-context attention for human pose estimation. IEEE Conf Comput Vis Pattern Recognit (CVPR). https://doi.org/10.1109/cvpr.2017.601
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition (CVPR)
Di Mauro D, Furnari A, Patanè G, Battiato S, Farinella GM (2019) Estimating the occupancy status of parking areas by counting cars and non-empty stalls. J Vis Commun Image Represent 62:234–244. https://doi.org/10.1016/j.jvcir.2019.05.015
dos Santos de Arruda M, Lucas PO, Plabiany RA, Diogo NG, José MJ, Ana P, Marques R, Matsubara ET, Zhipeng L, Jonathan L, Jonathan de Andrade S, Wesley NG (2021) Counting and locating high-density objects using convolutional neural network. arXiv preprint arXiv:2102.04366
Fan Z, Jiewei L, Gong M, Xie H, Goodman ED (2018) Automatic tobacco plant detection in UAV images via deep neural networks. IEEE J Sel Top Appl Earth Observ Remote Sens 11(3):876–887. https://doi.org/10.1109/jstars.2018.2793849
Fiaschi L, Nair R, Köthe U, Hamprecht FA (2012) Learning to count with regression forest and structured labels. In: Proceedings of the 21st international conference on pattern recognition (ICPR2012), pp 2685–2688. ISBN 978-1-4673-2216-4
Girshick RB (2015) Fast R-CNN. arXiv preprint arXiv:1504.08083
Goldman E, Herzig R, Eisenschtat A, Ratzon O, Levi I, Goldberger J, Hassner T (2019) Precise detection in densely packed scenes. arXiv preprint arXiv:1904.00853
Hsieh M-R, Lin Y-L, Hsu WH (2017) Drone-based object counting by spatially regularized regional proposal network. arXiv preprint arXiv:1707.05972
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Kang D, Ma Z, Chan AB (2019) Beyond counting: comparisons of density maps for crowd analysis tasks—counting, detection, and tracking. IEEE Trans Circuits Syst Video Technol 29(5):1408–1422
Kilic E, Ozturk S (2019) A subclass supported convolutional neural network for object detection and localization in remote-sensing images. Int J Remote Sens 40(11):4193–4212. https://doi.org/10.1080/01431161.2018.1562260
Law H, Deng J (2018) CornerNet: detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), pp 734–750
Lempitsky V, Zisserman A (2010) Learning to count objects in images. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems, vol 23. Curran Associates Inc., pp 1324–1332
Li W, Li H, Wu Q, Chen X, Ngan KN (2019) Simultaneously detecting and counting dense vehicles from drone images. IEEE Trans Ind Electron 66(12):9651–9662. https://doi.org/10.1109/tie.2019.2899548
Lin T-Y, Goyal P, Girshick RB, He K, Dollár P (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002
Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2018) Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/tpami.2018.2858826
Liu W, Anguelov D, Erhan D, Szegedy C, Reed SE, Fu C-Y, Berg AC (2016) SSD: single shot multibox detector. In: Computer vision—ECCV 2016. Lecture notes in computer science, vol 9905. Springer, Cham. https://doi.org/10.1007/978-3-319-46448-0_2

Mundhenk NT, Konjevod G, Sakla WA, Boakye K (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In: Computer vision—ECCV 2016. Lecture notes in computer science, vol 9907. Springer, Cham. https://doi.org/10.1007/978-3-319-46487-9_48
Neupane B, Horanont T, Hung ND (2019) Deep learning based banana plant detection and counting using high-resolution red-green-blue (RGB) images collected from unmanned aerial vehicle (UAV). PLOS One 14(10):e0223906. https://doi.org/10.1371/journal.pone.0223906
Nogueira V, Oliveira H, Augusto Silva J, Vieira T, Oliveira K (2019) RetailNet: a deep learning approach for people counting and hot spots detection in retail stores. In: 2019 32nd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). https://doi.org/10.1109/sibgrapi.2019.00029
Oquab M, Bottou L, Laptev I, Sivic J (2015) Is object localization for free? Weakly-supervised learning with convolutional neural networks. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 685–694
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
Pfister T, Charles J, Zisserman A (2015) Flowing ConvNets for human pose estimation in videos. IEEE Int Conf Comput Vis (ICCV). https://doi.org/10.1109/iccv.2015.222
Razakarivony S, Jurie F (2016) Vehicle detection in aerial imagery: a small target detection benchmark. J Vis Commun Image Represent 34:187–203. https://doi.org/10.1016/j.jvcir.2015.11.002
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. IEEE Conf Comput Vis Pattern Recognit (CVPR). https://doi.org/10.1109/cvpr.2016.91
Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R (eds) Advances in neural information processing systems, vol 28. Curran Associates Inc., pp 91–99
Revathi T, Rajalaxmi TM (2019) Deep learning for people counting model. Adv Intell Syst Comput. https://doi.org/10.1007/978-981-15-0035-0_43
Saribas H, Hakan C, Sinem K (2018) Car detection in images taken from unmanned aerial vehicles. Signal Process Commun Appl Conf (SIU). https://doi.org/10.1109/siu.2018.8404201
Sarwar F, Griffin A, Periasamy P, Portas K, Law J (2018) Detecting and counting sheep with a convolutional neural network. IEEE Int Conf Adv Video Signal Based Surveill (AVSS). https://doi.org/10.1109/avss.2018.8639306
Shao W, Kawakami R, Yoshihashi R, You S, Kawase H, Naemura T (2019) Cattle detection and counting in UAV images based on convolutional neural networks. Int J Remote Sens 41(1):31–52. https://doi.org/10.1080/01431161.2019.1624858
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Stahl T, Pintea SL, van Gemert JC (2019) Divide and count: generic object counting by image divisions. IEEE Trans Image Process 28(2):1035–1044. https://doi.org/10.1109/tip.2018.2875353
Sun M, Yan W, Teng L, Jing L, Jun W (2017) Vehicle counting in crowded scenes with multi-channel and multi-task convolutional neural networks. J Vis Commun Image Represent 49:412–419. https://doi.org/10.1016/j.jvcir.2017.10.002
Wang J, Liu C, Tian F, Zheng L (2019) Research on automatic target detection and recognition based on deep learning. J Vis Commun Image Represent 60:44–50. https://doi.org/10.1016/j.jvcir.2019.01.017
Wu Y, Yinpeng C, Lu Y, Zicheng L, Lijuan W, Hongzhi L, Yun F (2019) Rethinking classification and localization in R-CNN. arXiv preprint arXiv:1904.06493
Xie W, Alison JN, Andrew Z (2016) Microscopy cell counting and detection with fully convolutional regression networks. Comput Methods Biomech Biomed Eng 6(3):283–292. https://doi.org/10.1080/21681163.2016.1149104
Xu B, Wang N, Chen T, Li M (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853
Yang Z, Liu S, Hu H, Wang L, Lin S (2019) RepPoints: point set representation for object detection. arXiv preprint arXiv:1904.11490
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. IEEE Conf Comput Vis Pattern Recognit (CVPR). https://doi.org/10.1109/cvpr.2016.319
Zhou Y, Qixiang Y, Qiang Q, Jianbin J (2017) Oriented response networks. arXiv preprint arXiv:1701.01833
Zhou X, Wang D, Krähenbühl P (2019a) Objects as points. arXiv preprint arXiv:1904.07850
Zhou X, Wang D, Krähenbühl P (2019b) Bottom-up object detection by grouping extreme and center points. arXiv preprint arXiv:1901.08043
Zou Z, Zhenwei S, Yuhong G, Jieping Y (2019) Object detection in 20 years: a survey. arXiv preprint arXiv:1905.05055

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
