
2020 7th NAFOSTED Conference on Information and Computer Science (NICS)

A New High Performance Approach for Crowd Counting Using Human Filter

Phuc Thinh Do, Dong Nai Technology University, Dong Nai, Vietnam, dophucthinh@dntu.edu.vn
Ngoc Quoc Ly, VNUHCM - University of Science, Ho Chi Minh City, Vietnam, lqngoc@fit.hcmus.edu.vn

Abstract— One of the tasks of a crowd monitoring system is to estimate the number of people in a crowd and issue a warning when it exceeds an allowed threshold. Previous approaches often used multi-column CNNs to estimate density maps and thereby estimate the count. However, the amount of information that can be learned from crowd datasets is very small. In addition, confusion between people and other objects such as buildings, trees, and rocks (background noise) affects density map estimation. In this paper, we focus on solving these two problems and propose a model called Counting using Human Filter (CHF), which consists of two modules. The first is a feature extractor for crowd images based on the VGG-16 model, used to estimate the density map; this module takes advantage of features learned from the ImageNet dataset. The second is a human filter used to weight each pixel of the density map. The two modules are combined by element-wise multiplication. We evaluate the counting results of the model with the MAE and MSE metrics and assess the quality of the density maps with PSNR and SSIM. Experiments show that our approach estimates the number of people better than previous methods on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets. Regarding model complexity, our method shares parameters between the two modules, so it halves the number of parameters compared to previous methods such as Switch-CNN, SSC, and ADCrowdNet.

Keywords— crowd counting, convolutional neural networks, density map, feature extractor.

I. INTRODUCTION

Crowd counting is one of the important tasks in applications such as video surveillance, traffic control, and public safety. It aims to figure out the number of people in crowd images or videos. Most crowd counting methods are based on detection [7], [24], regression [1], [17], [5], and especially density maps, whose values are aggregated to give the entire crowd count [11], [27], [21], [3]. Recently, inspired by the success of convolutional neural networks (CNN) in computer vision tasks such as detection, recognition, and segmentation, many CNN-based methods have been proposed to increase the quality of density maps. In fact, the accuracy of crowd counting depends on the quality of the density maps. One of the challenges in estimating density maps is background noise such as buildings, trees, and rocks (Fig. 1). To tackle this challenge, we propose a model that can filter out background noise. In addition, we use the VGG-16 model [10] as the baseline network architecture in order to exploit the large amount of data in the ImageNet dataset instead of just using the small amount of data in crowd datasets.

Fig. 1. Sample images in the UCF-QNRF dataset and their density maps.

The remainder of the paper is organized as follows. In Section II, we conduct a literature review of crowd counting and density estimation models. Section III presents our approach to estimating the density map and the human filter. In Section IV, we show the experimental results and evaluate our model. The conclusion and future work are drawn in Section V.

II. RELATED WORK

Existing crowd counting methods can be classified into two categories: traditional methods (detection-based and regression-based) and modern methods (density-based methods and CNN-based density estimation). Detection-based methods use a detector to traverse the image, localizing and counting the targets [7], [24]. Regression-based methods focus on learning a mapping from features to the count; they are more robust in high-density scenes, but their hand-crafted representations make the results suboptimal [1], [17], [5]. The density map-based method proposed by Lempitsky [11] learns a linear mapping between image features and the density map, whose values are aggregated to give the count. Recently, many CNN-based crowd counting methods have been proposed and have shown performance superior to the traditional methods. To overcome perspective issues, some methods [16], [27] proposed multi-column networks (MCNN) that use convolution filters of different sizes in each column to generate the density map. Sam et al. [21] improved the MCNN architecture by adding a classifier that adaptively selects the most suitable regressor for a particular image patch. Do et al. [3] introduced a human filter that filters out non-human image patches.




Li et al. [12] proposed dilated convolutions to aggregate multi-scale contextual information. Currently, the crowd counting task faces challenges such as background noise, perspective distortion, and varying crowd density within a single image. However, most of the above methods are easily affected by background noise. Liu et al. [13] added a binary classification network, trained on images from crowd datasets and on background images from the Internet, to resolve background noise. In this paper, we focus on addressing background noise.

III. PROPOSED METHOD

Our proposed model consists of two main modules. The first is a model that estimates the density map by extracting image features, inherited from the VGG-16 model. The second is the human filter. The final density map is calculated by element-wise multiplication between the density map and the human filter.

A. Human filter

In the previous approach of Do et al. [3], the image is divided into nine patches, and each patch is categorized into one of two types: human or non-human. However, this division is not really effective. For example, patches 1 and 2 in Fig. 2 belong to the non-human patch type, yet they still contain a small number of humans. Therefore, we change the approach to weighting each pixel of the density map, and the classification term is replaced by a filter term. To do this, we use the sigmoid activation function when estimating density maps. The sigmoid function gives a value between 0 and 1 expressing the probability that the pixel under consideration belongs to a human rather than to the background.

Fig. 2. Disadvantages of dividing the image into nine patches and classifying them into two types: human patch and non-human patch.

B. Generating ground truth density map

Similar to previous approaches, we generate the ground truth density map by imposing a normalized 2D Gaussian at each head position:

$D^{GT}(x) = \sum_{i=1}^{n} \mathcal{N}(x - x_i;\, \sigma_i)$    (1)

where $D^{GT}$ is the ground truth density map, $\mathcal{N}$ denotes a normalized Gaussian kernel with standard deviation $\sigma_i$, and $n$ is the number of annotated points. The sum of each Gaussian kernel equals one, so the number of people in the ground truth density map is the total value of the ground truth density map. Therefore, we do not need a full picture of the crowd to estimate the count. We choose $\sigma_i$ as $\beta$ times the average distance $\bar{d}_i = \frac{1}{k}\sum_{j=1}^{k} d_i^j$ from the point being considered to its $k$ nearest neighboring points, i.e., $\sigma_i = \beta \bar{d}_i$. This makes the density map suitable for both dense and sparse crowds. In the experiments we choose $k = 4$, $\beta = 0.1$.

C. Generating ground truth human filter

As mentioned earlier, the human filter weights each pixel value of the density map. Creating the ground truth of the human filter is based on the ground truth of the density map. After the ground truth density map is created, pixel values of the density map greater than a threshold $\alpha$ are set to 1, and the remaining pixels are set to 0:

$\forall p \in D^{GT}: \quad F^{GT}(p) = \begin{cases} 1, & D^{GT}(p) > \alpha \\ 0, & D^{GT}(p) \le \alpha \end{cases}$    (2)

where $F^{GT}$ is the ground truth human filter with threshold $\alpha$. In our experiments, we choose $\alpha = 0.001$. The details of generating the ground truth density map and the algorithm for generating the ground truth human filter are depicted in Fig. 3.

Algorithm 1. Generating ground truth of density map and human filter
Input: Annotated image with n points
Output: Ground truth density map D and human filter F
Begin
  init D = zeros, F = zeros
  foreach point in annotated points:
      A = zeros; A[point] = 1      // impulse at the head position
      // d_i^j: distances from the head position to its k nearest neighbors
      d_mean = (1 / k) * sum_{j=1..k} d_i^j
      sigma = beta * d_mean        // where k = 4, beta = 0.1
      D += gaussian_filter(A, sigma)
  foreach pixel in D:
      if D[pixel] > alpha:         // alpha = 0.001
          F[pixel] = 1
      else:
          F[pixel] = 0
  return D, F
End

Fig. 3. Algorithm for generating ground truth of density map and human filter.
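As an illustration, the procedure of Fig. 3 can be implemented in a few lines of Python with SciPy. This is a minimal sketch under the paper's stated settings (k = 4, beta = 0.1, alpha = 0.001); the function name and the use of a KD-tree for the nearest-neighbor distances are our own choices, not taken from the authors' implementation, and we assume each image has more than k annotated heads.

import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def generate_ground_truth(points, shape, k=4, beta=0.1, alpha=0.001):
    # Ground truth density map D (Eq. 1) and human filter F (Eq. 2).
    density = np.zeros(shape, dtype=np.float32)
    tree = KDTree(points)
    for x, y in points:
        # Distances to the k nearest neighbors; the query returns the
        # point itself first, so ask for k + 1 and drop that entry.
        dists, _ = tree.query([x, y], k=k + 1)
        sigma = beta * float(np.mean(dists[1:]))    # sigma_i = beta * d_bar_i
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[int(y), int(x)] = 1.0               # unit mass at the head
        density += gaussian_filter(impulse, sigma)  # normalized Gaussian
    human_filter = (density > alpha).astype(np.float32)
    return density, human_filter

Because each Gaussian integrates to one, density.sum() approximates the number of annotated heads in the image, which is the property Eq. (1) relies on.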
D. Density map estimation model

Previous methods [27], [26], [21], [3] often trained models from scratch on relatively small datasets. This keeps their models simple, but the crowd estimation performance is not high. Some improvements based on the VGG-16 model were proposed in [21], [3], but they use this powerful model for classification. Therefore, we change the purpose of the VGG-16 model from classification to estimation of the density map.

The proposed model consists of the VGG-16 layers except the fully connected layers. To prevent the image from being downsampled too much, we use only three max-pooling layers. After going through the first ten layers of the model, the feature maps are reduced in size eight times; they are then upsampled three times by a factor of two using bilinear interpolation. Details of the model are described in Fig. 5. The outputs of the two modules are combined by element-wise multiplication to create the final density map (Fig. 6). We use two loss functions: a Mean Squared Error (MSE) loss for the density map estimation module and a Binary Cross Entropy (BCE) loss for the human filter:

$L_D(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \Theta) - D_i^{GT} \right\|_2^2$    (3)

where $N$ is the number of images, $D_i^{GT}$ is the ground truth density map of the $i$-th image, and $D(X_i; \Theta)$ is the density map estimated with parameters $\Theta$ for the $i$-th image.

$L_F = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$    (4)

where $N$ is the number of images, $y_i$ is the ground truth human filter of the $i$-th image, and $p_i$ is the per-pixel probability activated by the sigmoid function.

So the loss function used during training is defined as follows:

$L_{loss} = L_D(\Theta) + L_F$    (5)

The algorithm used to train the model is described in Fig. 4.

Algorithm 2. Density map training model
Input: Annotated images with ground truth density maps dm and ground truth human filters hf
Output: Trained model
Begin
  optimizer = Adam(lr=1e-4)
  // Density map estimation model training
  foreach image in dataset:
      // Estimated density map and estimated human filter
      est, est_hf = model(image)
      l2loss = MSEloss(est, dm)
      bceloss = BCEloss(est_hf, hf)
      loss = l2loss + bceloss
      // Optimized using the PyTorch library
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
End

Fig. 4. Algorithm for density map training model.

Fig. 5. Human filter and density map estimation model in the offline stage. MP is max pooling; US is upsampling. Small numbers are the numbers of filters. The ReLU layers following the convolution layers are not shown in the figure.
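To make the architecture concrete, the following PyTorch sketch wires the two modules together under our reading of Fig. 5 and Algorithm 2: the first ten convolutional layers of VGG-16 (with only three max-poolings) serve as a shared front end, followed by two 1x1 heads for the density map and the sigmoid human filter. The head widths and the single x8 bilinear upsampling (equivalent to three x2 steps) are our assumptions where the figure is not fully legible; this is an illustrative sketch, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class CHF(nn.Module):
    def __init__(self):
        super().__init__()
        # ImageNet-pretrained features (older torchvision: pretrained=True).
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # First ten conv layers of VGG-16, keeping only three
        # max-pooling layers (total downsampling factor of eight).
        self.frontend = nn.Sequential(*list(vgg.features.children())[:23])
        # 1x1 output heads sharing the front-end parameters (assumed widths).
        self.density_head = nn.Conv2d(512, 1, kernel_size=1)
        self.filter_head = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x):
        feat = self.frontend(x)
        # Bilinear upsampling back to input resolution (x8 overall).
        feat = F.interpolate(feat, scale_factor=8, mode='bilinear',
                             align_corners=False)
        est = self.density_head(feat)                   # density map
        est_hf = torch.sigmoid(self.filter_head(feat))  # human filter in (0, 1)
        return est * est_hf, est, est_hf                # online-stage output, heads

model = CHF()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-3)
mse, bce = nn.MSELoss(), nn.BCELoss()

def train_step(image, dm, hf):
    # One iteration of Algorithm 2 with the joint loss of Eq. (5).
    _, est, est_hf = model(image)
    loss = mse(est, dm) + bce(est_hf, hf)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()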

IV. EXPERIMENTS

We performed experiments on three datasets: ShanghaiTech, UCF_CC_50, and UCF-QNRF. We implemented the model in Python using PyTorch. We used the Adam optimizer [9] with learning rate 1e-4, weight decay 5e-3, and 400 epochs. The evaluation metrics and the results of the experiments are shown below.

Fig. 6. Proposed model in the online stage. The final density map is predicted by element-wise multiplication between the human filter and the estimated density map.

A. Evaluation Metric

For comparison with previous methods, we use two evaluation metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE):


 <=> = ∑0?@:ABC − @:ABC ?  TABLE I. COMPARISON OF OUR METHOD AND OTHER METHODS ON
0 SHANGHAITECH DATASET. BOLD INDICATES THE FIRST BEST PERFORMANCE
AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE

 <D> = E ∑0 @:ABC − @:ABC .  Part A Part B


0 Method
MAE MSE MAE MSE

where N is the number of images, @:ABC is the crowd count


Zhang et al. [26] 181.8 277.7 32.0 49.8

estimated, @:ABC is the ground truth of counting. From the MCNN [27] 110.2 173.2 26.4 41.3
two equations above shows, the smaller the MAE and the MSE Switch-CNN [21] 90.4 135.0 21.6 33.4
are, the better the models are.
Do et at. [3] 81.9 122.1 20.9 33.1
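Both metrics can be computed directly from per-image counts. The following small NumPy helper is ours, added for illustration; note that Eq. (7) is the rooted form standard in the crowd counting literature.

import numpy as np

def mae_mse(est_counts, gt_counts):
    # Eq. (6) and Eq. (7) over the N test images.
    est = np.asarray(est_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.mean(np.abs(est - gt))
    mse = np.sqrt(np.mean((est - gt) ** 2))
    return mae, mse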
B. ShanghaiTech dataset

The ShanghaiTech dataset is composed of 1198 images with 330,165 annotations [27]. The dataset is divided into Part A and Part B. Part A contains images randomly selected from the Internet, while Part B contains images taken from a busy street of a metropolitan area in Shanghai. Both parts are divided into training and test sets. Part A consists of 482 images (300 for training and 182 for testing). Part B consists of 716 images (400 for training and 316 for testing). The density in Part A is much higher than that in Part B. The results are shown in TABLE I.

TABLE I. COMPARISON OF OUR METHOD AND OTHER METHODS ON THE SHANGHAITECH DATASET. BOLD INDICATES THE BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE.

Method             | Part A MAE | Part A MSE | Part B MAE | Part B MSE
Zhang et al. [26]  | 181.8      | 277.7      | 32.0       | 49.8
MCNN [27]          | 110.2      | 173.2      | 26.4       | 41.3
Switch-CNN [21]    | 90.4       | 135.0      | 21.6       | 33.4
Do et al. [3]      | 81.9       | 122.1      | 20.9       | 33.1
SSC [4]            | 69.7       | 120.2      | 11.8       | 17.2
CSRNet [12]        | 68.2       | 115.0      | 10.6       | 16.0
AMG-bAttn-DME [13] | 63.2       | 98.9       | 8.2        | 15.7
TEDnet [8]         | 64.2       | 109.1      | 8.2        | 12.8
CHF (Ours)         | 65.1       | 99.7       | 7.9        | 11.2

Fig. 7. Sample images and our corresponding predicted density maps from the ShanghaiTech Part A dataset.

Fig. 9. Sample images and our corresponding predicted density maps from the UCF_CC_50 dataset.

TABLE II. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE UCF_CC_50 DATASET. BOLD INDICATES THE BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE.

Method             | MAE   | MSE
Idrees et al. [5]  | 468.0 | 590.3
Zhang et al. [26]  | 467.0 | 498.5
MCNN [27]          | 377.6 | 509.1
Switch-CNN [21]    | 318.1 | 439.2
Do et al. [3]      | 250.5 | 383.7
SSC [4]            | 270.4 | 401.8
CSRNet [12]        | 266.1 | 397.5
AMG-attn-DME [13]  | 273.6 | 362.0
TEDnet [8]         | 249.4 | 354.5
CHF (Ours)         | 231.6 | 323.4

Fig. 8. Sample images and our corresponding predicted density maps from the ShanghaiTech Part B dataset.

39
2020 7th NAFOSTED Conference on Information and
Computer Science (NICS)

Our method improves considerably over the baseline that does not use a VGG-16 network, reducing MAE by 16.8 points and MSE by 22.4 points compared to [3]. On a sparse dataset like Part B, where background regions are larger, our method achieves quite low MAE and MSE, showing its ability to tackle the background noise problem.

C. UCF_CC_50 dataset

The UCF_CC_50 dataset [5] contains extremely crowded scenes in only 50 images of different resolutions. It includes a variety of scenes such as concerts, stadiums, and protests. The number of annotations ranges from 94 to 4543, with an average of 1280. Five-fold cross-validation is performed following the standard setting in [5]. TABLE II shows the experimental results in terms of MAE and MSE.

D. UCF-QNRF dataset

The UCF-QNRF dataset [6] is a new dataset for crowd counting and localization. It contains 1535 high-resolution images (1201 for training and 334 for testing) with 1.25 million annotations. The dataset comes with a variety of scenes containing a diverse set of viewpoints, densities, and background noise. It contains extremely congested scenes where the maximum count is up to 12,865. TABLE III shows the results on the UCF-QNRF dataset compared with recent methods.

Fig. 10. Sample images and our corresponding predicted density maps from the UCF-QNRF dataset.

TABLE III. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE UCF-QNRF DATASET. BOLD INDICATES THE BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE.

Method            | MAE   | MSE
Switch-CNN [21]   | 252.0 | 514.0
Do et al. [3]     | 245.3 | 512.7
SSC [4]           | 125.7 | 213.1
CSRNet [12]       | 120.3 | 208.5
Idrees et al. [6] | 132.0 | 191.0
TEDnet [8]        | 113.0 | 188.0
CHF (Ours)        | 110.2 | 179.6

E. Evaluating the quality of density maps

Density maps represent the distribution of objects in an image; therefore, a high-quality density map partly reveals this distribution when the model is run in real time. Since our method weights each pixel value of the density map, we also compared the density map quality of the proposed method with that of previous methods. We used PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) [25] to evaluate the quality of the predicted density maps on the ShanghaiTech Part A dataset. The results are shown in TABLE IV.

These results show that adding the human filter and taking advantage of the feature extraction ability of the VGG-16 model for the crowd counting problem greatly improve the quality of the density maps (by 2.14 points in PSNR and 0.11 points in SSIM) compared to the previous method [3].

TABLE IV. COMPARISON OF OUR METHOD WITH OTHER METHODS ON THE SHANGHAITECH PART A DATASET. BOLD INDICATES THE BEST PERFORMANCE AND UNDERLINE INDICATES THE SECOND BEST PERFORMANCE.

Method                    | PSNR  | SSIM
MCNN [27]                 | 21.4  | 0.52
Switch-CNN [21]           | 21.91 | 0.67
Do et al. [3]             | 21.98 | 0.68
SSC [4]                   | 23.6  | 0.75
CSRNet [12]               | 23.79 | 0.76
ADCrowdNet (AMG-DME) [13] | 24.48 | 0.88
TEDnet [8]                | 25.88 | 0.83
CHF (Ours)                | 24.12 | 0.79
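For reference, PSNR and SSIM between a predicted and a ground truth density map can be computed with scikit-image. The sketch below is ours; normalizing by the ground truth range is our assumption, as the paper does not state the exact preprocessing.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def density_map_quality(est, gt):
    # PSNR and SSIM between estimated and ground truth density maps.
    data_range = float(max(gt.max() - gt.min(), 1e-8))  # guard zero range
    psnr = peak_signal_noise_ratio(gt, est, data_range=data_range)
    ssim = structural_similarity(gt, est, data_range=data_range)
    return psnr, ssim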

F. Ablation Study

Difference between the attention map [13] and the human filter: Our approach and ADCrowdNet both focus on solving the challenge of background noise. The most basic difference between the two models is that the Attention Map of ADCrowdNet is a separate model that uses the softmax function to classify the background, whereas our proposed model uses the human filter as a component of the model and uses the sigmoid function to filter out non-human image regions. Because the Attention Map is a separate model, the number of parameters of ADCrowdNet is quite large, while the human filter shares the parameters of the front-end VGG-16 architecture.

With the numbers of parameters shown in TABLE V, our method slightly increases the parameter count, but it produces quite good results compared to previous methods [21], [4], [12]. Moreover, our method greatly reduces the computational complexity compared to the methods of [3], [13].

Choosing the threshold α: We use the following loss function to find the value of α:

$Loss_\alpha = \frac{1}{N} \sum_{i=1}^{N} \left| F_i(\alpha) - F_i^{GT} \right|$    (8)


where $N$ is the number of images, $F_i^{GT}$ is the ground truth human filter of the $i$-th image, and $F_i(\alpha)$ is the human filter obtained with threshold $\alpha$ for the $i$-th image.
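A straightforward way to apply Eq. (8) is a grid search over candidate thresholds. In the sketch below, the candidate values are our assumption, and we read $F_i(\alpha)$ as the estimated density map thresholded at $\alpha$; the paper does not spell out this procedure, so the code is illustrative only.

import numpy as np

def choose_alpha(est_density_maps, gt_filters,
                 candidates=(1e-4, 5e-4, 1e-3, 5e-3, 1e-2)):
    # Grid search for the alpha minimizing Eq. (8): the mean absolute
    # difference between thresholded filters F_i(alpha) and the ground
    # truth filters F_i^GT.
    best_alpha, best_loss = None, float('inf')
    for alpha in candidates:
        loss = np.mean([np.mean(np.abs((d > alpha).astype(np.float32) - f))
                        for d, f in zip(est_density_maps, gt_filters)])
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss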
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE
TABLE V. NUMBER OF PARAMETERS (IN MILLIONS). BOLD AND UNDERLINE INDICATE THE MODELS WITH THE LEAST AND THE MOST PARAMETERS, RESPECTIVELY.

Method          | Parameters (millions)
MCNN [27]       | 0.13
Switch-CNN [21] | 15.11
Do et al. [3]   | 30.07
SSC [4]         | 14.98
CSRNet [12]     | 16.26
ADCrowdNet [13] | 30.2
TEDnet [8]      | 1.63
CHF (Ours)      | 17.0

V. CONCLUSION AND FUTURE WORK

We have proposed a model with two modules: a feature extractor for crowd images based on the VGG-16 model and a human filter used to weight each pixel of the density map. The final density map is obtained by element-wise multiplication of the outputs of these two modules. Experiments show that weighting the density map improves both the density map quality and the ability to estimate the number of people in the crowd. Based on the analysis of the number of parameters, our method can run in real time. In future work, we will tackle other challenges such as perspective distortion, varying crowd density within one image, and counting the number of people in a crowd passing through a given area.

ACKNOWLEDGMENT

Thanks to Viet Nam National University Ho Chi Minh City (VNUHCM) for the valuable support of the internship under project grant no. B2018-18-01.

REFERENCES

[1] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, 2012.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, vol. 1, pp. 886-893. IEEE, 2005.
[3] P. T. Do and N. Q. Ly. A new framework for crowded scene counting based on weighted sum of regressors and human classifier. In SoICT '18: Ninth International Symposium on Information and Communication Technology, 2018.
[4] P. T. Do, M. T. Phan, and T. T. C. Le. A single-column convolutional neural network for crowd counting. In 2019 6th NAFOSTED Conference on Information and Computer Science (NICS'19), pp. 477-482. IEEE, 2019.
[5] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pp. 2547-2554, 2013.
[6] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, pp. 532-546, 2018.
[7] W. Ge and R. T. Collins. Marked point processes for crowd counting. In CVPR, pp. 2913-2920. IEEE, 2009.
[8] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao. Crowd counting and density estimation by trellis encoder-decoder network. In CVPR, 2019.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[11] V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems, pp. 1324-1332, 2010.
[12] Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, 2018.
[13] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu. ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding. In CVPR, 2019.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.
[15] A. N. Marana, L. F. Costa, R. A. Lotufo, and S. A. Velastin. On the efficacy of texture analysis for crowd monitoring. In Proceedings of SIBGRAPI'98, pp. 354-361. IEEE, 1998.
[16] D. Onoro-Rubio and R. J. Lopez-Sastre. Towards perspective-free object counting with deep learning. In ECCV, pp. 615-629. Springer, 2016.
[17] N. Paragios and V. Ramesh. A MRF-based approach for real-time subway monitoring. In CVPR, 2001.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
[19] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, pp. 6517-6525. IEEE, 2017.
[20] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.
[21] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017.
[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[23] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao. Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1299-1302. ACM, 2015.
[24] M. Wang and X. Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR, pp. 3401-3408. IEEE, 2011.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[26] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pp. 833-841, 2015.
[27] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pp. 589-597, 2016.

