A New High Performance Approach For Crowd Counting Using Human Filter
Abstract— One of the tasks of a crowd monitoring system is to estimate the number of people in the crowd and issue a warning when it exceeds the allowed threshold. Previous approaches often used multi-column CNNs to estimate density maps and thereby estimate the count. However, the amount of information learned from crowd datasets is very small. On the other hand, the confusion between people and other objects such as buildings, trees, rocks, etc. (background noise) affects density map estimation. In this paper, we focus on solving these two problems and propose a model called Counting using Human Filter (CHF), which consists of two modules: the first is a feature extractor, based on the VGG-16 model, that estimates the density map from a crowd image. This module takes advantage of the features learned from the ImageNet dataset. The second is the human filter, used to weight each pixel of the density map. The two modules are combined by element-wise multiplication. We evaluate the estimated counts of the model with the MAE and MSE metrics and assess the quality of the density maps with PSNR and SSIM. Experiments show that our approach estimates the number of people better than previous methods when evaluated on the ShanghaiTech, UCF_CC_50 and UCF-QNRF datasets. Regarding model complexity, our method shares parameters between the two modules, so it halves the number of parameters compared to previous methods such as Switch-CNN, SSC and ADCrowdNet.

Keywords— crowd counting, convolutional neural networks, density map, feature extractor.

I. INTRODUCTION

Crowd counting is one of the important tasks in applications such as video surveillance, traffic control and public safety. It aims to figure out the number of people in crowd images or videos. Most crowd counting methods are based on detection [7], [24], regression [1], [17], [5], and especially density maps, whose values are aggregated to give the entire crowd count [11], [27], [21], [3]. Recently, inspired by the success of convolutional neural networks (CNNs) in computer vision tasks such as detection, recognition and segmentation, many CNN-based methods have been proposed to increase the quality of density maps. In fact, the accuracy of crowd counting depends on the quality of the density maps. One of the challenges in estimating density maps is background noise such as buildings, trees and rocks (Fig. 1). To tackle this challenge, we propose a model that can filter out background noise. In addition, we use the VGG-16 model [10] as the baseline network architecture because of the large amount of data in the ImageNet dataset, instead of relying only on the small amount of data in crowd datasets.

The remainder of the paper is organized as follows. In Section II, we conduct a literature review of crowd counting and density estimation models. Section III presents our approach to estimating the density map and the human filter. In Section IV, we show the experimental results and evaluate our model. The conclusion and future work are drawn in Section V.

Fig. 1. Sample images in the UCF-QNRF dataset and their density maps.

II. RELATED WORK

The existing crowd counting methods can be classified into two categories: traditional methods (detection-based, regression-based) and modern methods (density-based, in particular CNN-based density estimation). Detection-based methods use a detector to traverse the image, which localizes and counts the targets [7], [24]. Regression-based methods focus on learning a mapping from features to the count; they are more robust in high-density scenes, but their hand-crafted representations make the results suboptimal [1], [17], [5]. The density-map-based method proposed by Lempitsky [11] learns a linear mapping between image features and the density map, whose values are aggregated to give the count. Recently, many CNN-based crowd counting methods have been proposed and have shown performance superior to the traditional methods. To overcome perspective issues, some methods [16], [27] proposed a multi-column network (MCNN) that uses convolution filters of different sizes in each column to generate the density map. Sam et al. [21] improved the MCNN architecture by adding a classifier that adaptively selects the most suitable regressor for a particular image patch. Do et al. [3] introduced a human filter that filters out non-human
image patches. Li et al. [12] proposed dilated convolutions to aggregate multi-scale contextual information. Currently, the crowd counting task faces main challenges such as background noise, perspective distortion and varying crowd density within a single image. However, most of the above methods are easily affected by background noise. Liu et al. [13] added a binary classification network, trained on images from crowd datasets and background images from the Internet, to resolve background noise. In this paper, we focus on addressing background noise.

III. PROPOSED METHOD

Our proposed model consists of two main modules. The first is a model that estimates the density map by extracting image features, inherited from the VGG-16 model. The second is the human filter. The final density map is calculated by element-wise multiplication between the density map and the human filter.

A. Human filter

In the previous approach of Do et al. [3], the image is divided into 9 patches and each patch is categorized into one of two types: human or non-human. However, this division is not really effective. For example, patches 1 and 2 in Fig. 2 belong to the non-human type but still contain a small number of humans. Therefore, we changed the approach by weighting each pixel of the density map; the classification term becomes a filter term. To do this, we use the Sigmoid activation function when estimating density maps. The Sigmoid function gives a value in the range between 0 and 1, representing the probability that the pixel under consideration belongs to a human region rather than a non-human region.

Fig. 2. Disadvantages of dividing the image into nine patches and classifying them into two types: human patch and non-human patch.

B. Generating ground truth density map

Similar to previous approaches, we generate the ground truth density map by imposing a normalized 2D Gaussian at each head position:

D^{GT}(x) = \sum_{i=1}^{n} \mathcal{N}(x - x_i; \sigma_i^2)

where D^{GT} is the ground truth density map, \mathcal{N} denotes a normalized Gaussian kernel with standard deviation \sigma_i, and n is the number of annotated points. Each Gaussian kernel sums to one, so the number of people in the ground truth density map is the total value of the ground truth density map. Therefore, we do not need a full picture of the crowd to estimate the count. We choose \sigma_i as \beta times the average distance from the point under consideration to its k nearest points. This makes the density map suitable for both dense and sparse crowds. In our experiments we choose k = 4, \beta = 0.1.

C. Generating ground truth human filter

As mentioned earlier, the human filter weights each pixel value of the density map. Creating the ground truth human filter is based on the ground truth density map: after the ground truth density map is computed, pixel values greater than the threshold \alpha are set to 1, and the remaining pixels are set to 0:

\forall p \in D^{GT}: \quad F^{GT}(p) = \begin{cases} 1, & D^{GT}(p) > \alpha \\ 0, & D^{GT}(p) \le \alpha \end{cases}

where F^{GT} is the ground truth human filter with threshold \alpha. In our experimental process, we choose \alpha = 0.001. The details of generating the ground truth density map and the human filter are given in Algorithm 1 (Fig. 3).

Algorithm 1. Generating ground truth of density map and human filter
Input: annotated image with n points
Output: ground truth density map D and human filter F
Begin
  init D = zeros[], F = zeros[]
  foreach point i in annotated points:
    A = zeros[]; A[i] = 1
    // d_i: average distance from head i to its k nearest neighbors
    d_i = (1/k) * sum_{j=1..k} d_{ij}
    sigma = beta * d_i        // k = 4, beta = 0.1
    D += gaussian_filter(A, sigma)
  foreach pixel p in D:
    if D[p] > alpha:          // alpha = 0.001
      F[p] = 1
    else:
      F[p] = 0
  return D, F
End

Fig. 3. Algorithm for generating the ground truth density map and human filter.

D. Density map estimation model

Previous methods [27], [26], [21], [3] often trained models from scratch on relatively small datasets. This keeps their models simple, but the crowd estimation performance is not high. Some improvements based on the VGG-16 model are proposed in [21], [3], but they use this powerful model only for classification. Therefore, we have changed the purpose of the VGG-16 model from classification to density map estimation.
2020 7th NAFOSTED Conference on Information and Computer Science (NICS)
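Algorithm 1 above can be sketched in plain NumPy. This is a minimal illustrative sketch: the function name and signature are assumptions, and each Gaussian is stamped directly with a window truncated at three standard deviations rather than calling a library `gaussian_filter`, so the per-head mass is exact only for kernels that fit inside the image.

```python
import numpy as np

def ground_truth_maps(points, shape, k=4, beta=0.1, alpha=0.001):
    """Generate a ground truth density map D and human filter F.

    points: list of (row, col) head annotations
    shape:  (H, W) of the image
    k, beta: geometry-adaptive kernel parameters (k nearest neighbors)
    alpha:  threshold for the binary human filter
    """
    H, W = shape
    D = np.zeros((H, W))
    pts = np.asarray(points, dtype=float)
    for i, (r, c) in enumerate(pts):
        # sigma = beta * average distance to the k nearest neighbors
        dists = np.sort(np.hypot(*(pts - pts[i]).T))[1:k + 1]
        sigma = beta * dists.mean() if len(dists) else 1.0
        # stamp a normalized 2D Gaussian, truncated at 3 * sigma
        rad = max(1, int(3 * sigma))
        yy, xx = np.mgrid[-rad:rad + 1, -rad:rad + 1]
        g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        g /= g.sum()  # each head contributes a total mass of 1
        r0, c0 = int(r), int(c)
        ys = slice(max(r0 - rad, 0), min(r0 + rad + 1, H))
        xs = slice(max(c0 - rad, 0), min(c0 + rad + 1, W))
        gy = slice(ys.start - (r0 - rad), g.shape[0] - ((r0 + rad + 1) - ys.stop))
        gx = slice(xs.start - (c0 - rad), g.shape[1] - ((c0 + rad + 1) - xs.stop))
        D[ys, xs] += g[gy, gx]
    F = (D > alpha).astype(np.float32)  # binary human filter
    return D, F
```

Because every kernel is normalized, the sum of D equals the number of annotated heads, which is the property Section III.B relies on.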
Fig. 5. Human filter and density map estimation model in the offline stage. MP is max pooling. US is upsampling. Small numbers are the numbers of filters. The ReLU layers following the convolution layers are not shown in the figure.
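The shared-parameter design can be sketched in PyTorch. This is a hypothetical sketch, not the paper's exact architecture: the upsampling (US) path and per-layer filter counts of Fig. 5 are omitted (so the output map is downsampled), but it shows the key idea of one VGG-style front end feeding both a density head and a Sigmoid human-filter head, combined by element-wise multiplication.

```python
import torch
import torch.nn as nn

class CHFSketch(nn.Module):
    """Sketch of CHF: a shared VGG-style front end with two heads,
    one for the density map and one (Sigmoid) for the human filter."""

    def __init__(self):
        super().__init__()
        # First convolutional stages of VGG-16 ('M' = max pooling)
        cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
        layers, in_ch = [], 3
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(2))
            else:
                layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
                in_ch = v
        self.frontend = nn.Sequential(*layers)          # parameters shared by both heads
        self.density_head = nn.Conv2d(512, 1, 1)        # density map estimate
        self.filter_head = nn.Sequential(nn.Conv2d(512, 1, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.frontend(x)
        density = self.density_head(feat)
        weight = self.filter_head(feat)                 # per-pixel human weight in (0, 1)
        return density * weight                         # element-wise multiplication
```

Because the two heads share the front end, adding the human filter costs only two extra 1x1 convolutions, which is the parameter-sharing argument made in the ablation study.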
IV. EXPERIMENTS

We performed experiments on three datasets: ShanghaiTech, UCF_CC_50 and UCF-QNRF. We implemented the model in Python using PyTorch. We used the Adam optimizer [9] with learning rate 1e-4, weight decay 5e-3 and 400 epochs. The evaluation metrics and experimental results are shown below.

Fig. 6. Proposed model in the online stage. The final density map is predicted by element-wise multiplication between the human filter and the estimated density map.

A. Evaluation Metric

For comparison with previous methods, we use two evaluation metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE):
MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert C_i^{est} - C_i^{GT} \rvert, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i^{est} - C_i^{GT} \right)^2}

where N is the number of test images, C_i^{est} is the estimated count and C_i^{GT} is the ground truth count. As the two equations show, the smaller the MAE and MSE are, the better the model is.

B. ShanghaiTech dataset

The ShanghaiTech dataset is composed of 1198 images with 330,165 annotations [27]. The dataset is divided into Part A and Part B. Part A contains images randomly selected from the Internet, while Part B includes images taken from a busy street of a metropolitan area in Shanghai. Both parts are divided into training and test sets. Part A consists of 482 images (300 for training and 182 for testing). Part B consists of 716 images (400 for training and 316 for testing). The density in Part A is much larger than that in Part B. The results are shown in TABLE I.

TABLE I. COMPARISON OF OUR METHOD AND OTHER METHODS ON THE SHANGHAITECH DATASET. BOLD INDICATES THE FIRST BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE

Method              | Part A MAE | Part A MSE | Part B MAE | Part B MSE
MCNN [27]           | 110.2      | 173.2      | 26.4       | 41.3
Switch-CNN [21]     | 90.4       | 135.0      | 21.6       | 33.4
Do et al. [3]       | 81.9       | 122.1      | 20.9       | 33.1
SSC [4]             | 69.7       | 120.2      | 11.8       | 17.2
CSRNet [12]         | 68.2       | 115.0      | 10.6       | 16.0
AMG-bAttn-DME [13]  | 63.2       | 98.9       | 8.2        | 15.7
TEDnet [8]          | 64.2       | 109.1      | 8.2        | 12.8
CHF (Ours)          | 65.1       | 99.7       | 7.9        | 11.2
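Both metrics operate only on per-image counts, so they can be sketched in a few lines of NumPy. This is an illustrative helper (the function name is an assumption); the MSE follows the root-mean-squared convention commonly used in crowd counting work.

```python
import numpy as np

def mae_mse(est_counts, gt_counts):
    """MAE and (root) MSE over per-image estimated and ground truth counts."""
    est = np.asarray(est_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.abs(est - gt).mean()
    mse = np.sqrt(((est - gt) ** 2).mean())  # RMS convention used in crowd counting
    return mae, mse
```

For example, estimates [10, 20] against ground truths [12, 16] give MAE = 3.0 and MSE = sqrt(10).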
Fig. 7. Sample images and our corresponding predicted density maps from the ShanghaiTech Part A dataset.

Fig. 8. Sample images and our corresponding predicted density maps from the ShanghaiTech Part B dataset.

Fig. 9. Sample images and our corresponding predicted density maps from the UCF_CC_50 dataset.

TABLE II. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE UCF_CC_50 DATASET. BOLD INDICATES THE FIRST BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE
Our method is much improved compared to the non-VGG-16 baseline network, reducing MAE by 16.8 points and MSE by 22.4 points [3]. On a sparse dataset like Part B, where there is more background, our method achieves quite low MAE and MSE, showing its ability to tackle the background noise problem.

C. UCF_CC_50 dataset

The UCF_CC_50 dataset [5] contains extremely crowded scenes in only 50 images of different resolutions. It includes a variety of scenes such as concerts, stadiums and protests. The number of annotations ranges from 94 to 4543, with an average of 1280. 5-fold cross-validation is performed following the standard setting in [5]. TABLE II shows the experimental results in MAE and MSE.

D. UCF-QNRF dataset

The UCF-QNRF dataset [6] is a new dataset for crowd counting and localization. It contains 1535 high-resolution images (1201 for training and 334 for testing) with 1.25 million annotations. The dataset comes with a variety of scenes containing a diverse set of viewpoints, densities and background noise. It contains extremely congested scenes where the maximum count is up to 12,865. TABLE III shows results on the UCF-QNRF dataset compared with recent methods.

E. Evaluate the quality of density map

Density maps represent the distribution of objects in an image. Therefore, a high-quality density map will partly show this distribution when run in real time. By weighting each pixel value of the density map, we also compared the density map quality of the proposed method and previous methods. We used PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) [25] to evaluate the quality of the predicted density maps on the ShanghaiTech Part A dataset. The results are shown in TABLE IV.

These results show that adding the human filter and taking advantage of the feature extraction ability of the VGG-16 model greatly improved the quality of the density maps (by 2.14 points on PSNR and 0.11 points on SSIM) compared to the previous method [3].

TABLE IV. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE SHANGHAITECH PART A DATASET. BOLD INDICATES THE FIRST BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE

Method           | PSNR  | SSIM
MCNN [27]        | 21.4  | 0.52
Switch-CNN [21]  | 21.91 | 0.67
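The PSNR comparison between a predicted and a ground truth density map can be sketched as below. This is an illustrative sketch: taking the ground truth map's maximum as the peak value is an assumption about normalization, and SSIM requires a windowed computation that is omitted here.

```python
import numpy as np

def psnr(pred, gt):
    """Peak Signal-to-Noise Ratio between predicted and ground truth density maps."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mse = ((pred - gt) ** 2).mean()
    if mse == 0:
        return float("inf")  # identical maps
    peak = gt.max()  # assumed dynamic range of the ground truth map
    return 10 * np.log10(peak ** 2 / mse)
```

A uniform error of 0.1 against a map with peak 1.0 yields a PSNR of 20 dB, which gives a feel for the scale of the values in TABLE IV.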
F. Ablation Study

Difference between attention map [13] and human filter: Our approach and ADCrowdNet both focus on solving the challenge of background noise. The most basic difference between the two models is that the Attention Map of ADCrowdNet is a separate model that uses the softmax function to classify the background, while the proposed model uses the human filter as a component of the model and uses the Sigmoid function to filter out non-human regions of the image. Because the Attention Map is a separate model, the number of parameters of ADCrowdNet is quite large, while the human filter shares the parameters of the front-end VGG-16 architecture.

With the number of parameters shown in TABLE V, our method slightly increases the parameters, but it produces quite good results compared to the previous methods [21], [4], [12]. Besides, our method reduces the computational complexity considerably compared to these methods [3], [13].

Fig. 10. Sample images and our corresponding predicted density maps from the UCF-QNRF dataset.

TABLE III. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE UCF-QNRF DATASET. BOLD INDICATES THE FIRST BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE

Method           | MAE   | MSE
Switch-CNN [21]  | 252.0 | 514.0
Do et al. [3]    | 245.3 | 512.7
SSC [4]          | 125.7 | 213.1
CSRNet [12]      | 120.3 | 208.5

Choose threshold F: We use the following loss function to
where N is the number of images and D_i^{GT} is the ground truth

[6] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot,