Thesis

A study on Deep Learning Approaches, Architectures and
Training Methods for Crowd Analysis
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Deepak Babu Sam
Computational and Data Sciences

Indian Institute of Science
Bangalore – 560 012 (INDIA)
December 2020
c Deepak Babu Sam
December 2020
All rights reserved
Signature of the Author: .............................................
Deepak Babu Sam
Department of Computational and Data Sciences
Indian Institute of Science, Bangalore
Signature of the Thesis Supervisor: .............................................

R. Venkatesh Babu
Associate Professor
Department of Computational and Data Sciences
Indian Institute of Science, Bangalore
Acknowledgments
I would like to thank my research advisor, Prof. R. Venkatesh Babu, for his consistent support,
motivation and suggestions, that greatly helped to materialize this thesis.
I extend my gratitude to the Department of Computational and Data Sciences (CDS) and Indian
Institute of Science (IISc) for facilitating a conducive environment for my research.
Further, I must acknowledge the effort taken by people who have worked with me, in shaping all
my publications.
Finally, thanks to all the members of Video Analytics Lab (VAL), especially to K. Ram Prab-
hakar, Konda Reddy Mopuri and S. Santosh Ravi Kiran, who have provided relentless support during
difficult times.
i
Abstract
Analyzing large crowds quickly is one of the highly sought-after capabilities nowadays. Especially
in terms of public security and planning, this assumes prime importance. But automated reasoning of
crowd images or videos is a challenging Computer Vision task. The difficulty is so extreme in dense
crowds that the task is typically narrowed down to estimating the number of people. Since the count
or distribution of people in the scene itself can be very valuable information, this field of research
has gained traction. The difficulty mostly stems from the drastic variability in crowd density as any
prospective approach has to scale across crowds formed by few tens to thousands of people. This
results in large diversity in the way people appear in crowded scenes. Often people are only seen as a
bunch of blobs in highly dense crowds, whereas facial or body features might be visible in less dense
gatherings. Hence, the visibility and scale of features for crowd discrimination varies drastically with
the density of the crowd. Severe occlusion, pose changes and view-point variations further compound
the problem. Typical head or body detection-based methods fail to adapt with such a huge diversity,
paving way for the simpler crowd density regression models. Add to these, the practical difficulty
of annotating millions of head locations in dense crowds. This implies creating large-scale labeled
crowd data is expensive and directly takes a toll on the performance of existing CNN based counting
models.
Given these challenges, this thesis tackles the problem of crowd counting in multiple perspectives.
Detailed inquiry is done to address the three major issues: diversity, data scarcity and localization.
Addressing Diversity: First, the diversity issue is considered as it causes significant prediction
errors on account of failure to scale well across the density categories. In the diverse scenario, dis-
criminating persons requires larger spatial context and semantics of the scene, instead of local crowd
patterns. A set of brain-inspired top-down feedback connections from high-level layers is proposed.
This feedback is shown to deliver global context for initial layers of CNN and help correct prediction
errors in an iterative manner. Next an alternative mixture of experts approach is devised, where a
differential training regime jointly clusters and fine-tunes a set of experts to capture the huge diversity
seen in crowd images. This approach results in significant boost in counting performance as differ-
ent regions of the images are processed by the appropriate expert regressor based on the local density.
ii
Abstract
Further performance improvement is obtained through a growing CNN that can progressively increase
its capacity depending on the diversity exhibited in the given crowd dataset.
Addressing Data Scarcity: Dense crowd counting demands millions of head annotations for
training models. This annotation difficulty could be mitigated using a Grid Winner-Take-All au-
toencoder, which is designed to learn almost 99% of the parameters from unlabeled crowd images.
The model achieves superior results compared to other unsupervised methods and beats the fully su-
pervised baselines in limited data scenarios. In an alternate approach, a binary labeling scheme is
conceived. Every image is simply labeled to either dense or sparse crowd category, instead of anno-
tating every single person in the scene. This leads to dramatic reduction in the amount of annotations
required and delivers good performance at a very low labeling cost. The objective is pushed fur-
ther to fully eliminate the dependency on instance-level labeled data. The proposed completely self-
supervised architecture does not require any annotation for training, but uses a distribution matching
technique to learn the required features. The only input required to train, apart from a large set of
unlabeled crowd images, is the approximate upper limit of the crowd count for the given dataset.
Experiments show that the model results in effective learning of crowd features and delivers signifi-
cant counting performance. Furthermore, the superiority of the method is established in limited data
settings as well.
Addressing Localization: Typical counting models predict crowd density for an image as op-
posed to detecting every person. These regression methods, in general, fail to localize persons accu-
rate enough for most applications other than counting. Hence, two detection frameworks for dense
crowd counting are developed, such that they obviate the need for the prevalent density regression
paradigm. The first approach reformulates the task as localized dot prediction in dense crowds, where
the model is trained for pixel-wise binary classification to pinpoint people, instead of regressing local
crowd density. In the second dense detection architecture, apart from locating persons, the spotted
heads are sized with bounding boxes. This approach could detect individual persons consistently
across the diversity spectrum. Moreover, this improved localization is achieved without requiring any
additional bounding box annotations for training.
iii
Publications from the Thesis
Part of the work described in this thesis has previously been presented in the following publications.
1. Deepak Babu Sam, Shiv Surya and R. Venkatesh Babu, “Switching Convolutional Neural Network for
Crowd Counting”, in IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2017.
2. Deepak Babu Sam and R. Venkatesh Babu, “Top-down Feedback for Crowd Counting Convolutional
Neural Network”, in Thirty-Second AAAI conference on Artificial Intelligence (AAAI-18), 2018.
3. Deepak Babu Sam, Neeraj N. Sajjan, R. Venkatesh Babu and Mukundhan Srinivasan, “Divide and
Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN”, in IEEE
conference on Computer Vision and Pattern Recognition (CVPR), 2018.
4. Deepak Babu Sam, Neeraj N. Sajjan, Himanshu Maurya and R. Venkatesh Babu, “Almost Unsupervised
Learning for Dense Crowd Counting”, in Thirty-Third AAAI conference on Artificial Inteligence
(AAAI-19), 2019.
5. Deepak Babu Sam, Skand Vishwanath Peri, Mukuntha N. S. and R. Venkatesh Babu, “Going Beyond the
Regression Paradigm with Accurate Dot Prediction for Dense Crowds”, in IEEE Winter Conference
on Applications of Computer Vision (WACV), 2020.
6. Deepak Babu Sam, Skand Vishwanath Peri, Mukuntha N. S., Amogh Kamath and R. Venkatesh Babu,
“Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection”, in IEEE
Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2020.
7. Deepak Babu Sam, Abhinav Agarwalla, Jimmy Joseph, Vishwanath A. Sindagi, R. Venkatesh Babu and
Vishal M. Patel, “Completely Self-Supervised Crowd Counting via Distribution Matching”, under
review, 2020.
8. Deepak Babu Sam, Jimmy Joseph, Abhinav Agarwalla and R. Venkatesh Babu, “Dense or Sparse:
Crowd Counting with Binary Supervision”, under review, 2020.
iv
Publications from the Thesis
Other publications during PhD which are not part of this thesis are as follows:
1. Vishwanath A. Sindagi, Rajeev Yasarla, Deepak Babu Sam, R. Venkatesh Babu and Vishal M. Patel,
“Learning to Count in the Crowd from Limited Labeled Data”, in European Conference on Com-
puter Vision (ECCV), 2020.
2. Deepak Babu Sam, Abhinaya Kandasamy, Sudharsan K. A. and R. Venkatesh Babu, “Generating Uni-
versal Adversarial Perturbations without using Data Samples”, under review, 2020.
3. Deepak Babu Sam, Abhinav Agarwalla and R. Venkatesh Babu, “Beyond Learning Features: Training
a Fully-functional Classifier with ZERO Instance-level Labels”, under review, 2020.
v
Contents
Acknowledgments i
Abstract ii
Publications from the Thesis iv
Contents vi
List of Figures xiii
List of Tables xviii
1 Introduction 1
1.1 Challenges in Dense Crowd Counting . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Density Regression Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Datasets and Evaluation for Dense Crowd Analysis . . . . . . . . . . . . . . . . . . 6
1.4 Related Works in Dense Crowd Counting . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
I Addressing Crowd Diversity 12
2 Top-Down Feedback to correct Errors 13

2.1 Related Works for Top-Down Architectures . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Feedback as a Correcting Signal . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Top-Down Feedback CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Training of Bottom-Up CNN . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.4 Training of Top-Down CNN . . . . . . . . . . . . . . . . . . . . . . . . . . 18
vi
CONTENTS
2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Shanghaitech dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 UCF CC 50 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 WorldExpo’10 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Ablations and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Effectiveness of Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Switching CNN to capture Crowd Diversity 25

3.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Switch-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.3 Differential Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.4 Switch Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.5 Coupled Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 ShanghaiTech dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 UCF CC 50 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 The UCSD dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 The WorldExpo’10 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Effect of number of regressors on Switch-CNN . . . . . . . . . . . . . . . . 35
3.3.2 Specialty Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Attribute Clustering Vs Differential Training . . . . . . . . . . . . . . . . . 38
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Incrementally Growing CNN to adapt with larger Crowd Varieties 40

4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Creating Experts with Hierarchical Differential Training . . . . . . . . . . . 42
4.2.2 Growing CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Pretraining of Base CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.4 Training Algorithm for IG-CNN . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 Shanghaitech dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
vii
CONTENTS
4.3.2 UCF CC 50 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3.3 WorldExpo’10 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.1 Effect of Growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 Expert Specialty Characteristics . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.3 Hierarchical Training Vs Baseline Methods . . . . . . . . . . . . . . . . . . 52
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
II Addressing Data Scarcity 54
5 Almost Unsupervised Learning 55

5.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.1 Grid Winner-Take-All Autoencoders for Unsupervised Learning . . . . . . . 57
5.2.2 Architecture of GWTA Counting CNN . . . . . . . . . . . . . . . . . . . . 59
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Shanghaitech Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 UCF CC 50 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.1 Supervised Vs Unsupervised Features . . . . . . . . . . . . . . . . . . . . . 63
5.4.2 Comparison with Self-Supervised Methods . . . . . . . . . . . . . . . . . . 65
5.4.3 Effect of labeled data on performance . . . . . . . . . . . . . . . . . . . . . 66
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Binary Supervision for Density Regression 68

6.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Binary Labeling Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Noisy Ground Truths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1.3 Modeling Noisy Density Maps . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.4 Noise Rectifier Network (NRN) . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.5 Regressor Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2.1 Evaluation Metrics and Baselines . . . . . . . . . . . . . . . . . . . . . . . 75
6.2.3 UCF-QNRF Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
viii
CONTENTS
6.2.4 UCF-CC-50 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.3.1 Cross Dataset Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.2 Noisy Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.3 Architectural Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7 Complete Self-Supervision via Distribution Matching 82

7.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2.1 Natural Crowds and Density Distribution . . . . . . . . . . . . . . . . . . . 85
7.2.2 Stage 1: Learning Features with Self-Supervision . . . . . . . . . . . . . . . 87
7.2.3 Stage 2: Sinkhorn Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2.4 Improving Sinkhorn Matching . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.3.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.3.3 UCF-QNRF Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3.4 UCF-CC-50 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.5 JHU-CROWD++ Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.6 Cross Data Performance and Generalization . . . . . . . . . . . . . . . . . . 95
7.3.7 CSS-CCNN in True Practical Setting . . . . . . . . . . . . . . . . . . . . . 96
7.3.8 Performance with Limited Data . . . . . . . . . . . . . . . . . . . . . . . . 97
7.4.1 Ablations on Architectural Choices . . . . . . . . . . . . . . . . . . . . . . 97
7.4.2 Analysis of the Prior Distribution . . . . . . . . . . . . . . . . . . . . . . . 99
7.4.3 Sensitivity Analysis for the Crowd Parameter . . . . . . . . . . . . . . . . . 100
7.4.4 Analysis of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
III Addressing Person Localization 103
8 Spot-on Dot Prediction for Dense Crowds 104

8.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.1.1 Crowd Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
ix
CONTENTS
8.1.2 Multi-Scale Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8.1.3 Multi-Scale Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.1.4 Adaptive Scale Fusion and Dot Detection . . . . . . . . . . . . . . . . . . . 109
8.1.5 Adaptive Scale Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2.1 UCF-QNRF Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.2.2 UCF CC 50 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.3.1 Effect of Multi-Scale Architecture . . . . . . . . . . . . . . . . . . . . . . . 114
8.3.2 Dot vs Density Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3.3 Localization of Detections . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9 Dense Detection to accurately resolve People in Crowds 118

9.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2.1 Locate Heads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2.1.1 Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2.1.2 Top-down Feature Modulator . . . . . . . . . . . . . . . . . . . . 124
9.2.2 Size Heads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.2.2.1 Box classification . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.2.2.2 GWTA Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.2.3 Count Heads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.2.3.1 Prediction Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.3.1 Experimental Setup and Datasets . . . . . . . . . . . . . . . . . . . . . . . . 132
9.3.2 Evaluation of Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.3.3 Evaluation of Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.3.4 Evaluation of Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.4.1 Effect of Multi-Scale Box Classification . . . . . . . . . . . . . . . . . . . . 137
9.4.2 Architectural Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.4.3 Comparison with Object/Face Detectors . . . . . . . . . . . . . . . . . . . . 139
9.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
x
CONTENTS
10 Summary, Conclusion and Future Directions 141

10.1 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Bibliography 144
xi
CONTENTS
xii
List of Figures
1.1 Sample crowd scenes from the ShanghaiTech dataset [1] are shown. . . . . . . . . . 1
1.2 Diversity in appearance of people across different crowd density. One could only see
head blobs in extreme dense regions, but facial and body features might be visible in
other cases. Hence, the features available for crowd discrimination varies with density. 2
1.3 Sample crowd images from UCF CC 50 [2] dataset, along with corresponding head
annotations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Depiction of the density regression paradigm, where the model takes an input image
and predicts crowd density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Density maps predicted by a typical CNN regressor (third column) have a lot of false
detections. Many crowd like patterns are identified as humans. First column displays
the input scene and middle column holds the ground truth density map. . . . . . . . . 13
2.2 Architecture of Top-down feedback CNN. (a) displays bottom-up CNN, (b) depicts
the feedback generation by the top-down CNN and (c) shows how the bottom-up CNN
re-evaluates its prediction using the gate features. (Best viewed in colour.) . . . . . . 16
2.3 The unrolled computation graph used for training top-down network. The top-down
CNN uses features of the bottom-up CNN to apply feedback so that the bottom-up
network can be re-evaluated. Consequently, the bottom-up CNN appears twice in
the computation graph and the loss is calculated on the corrected prediction. Only
parameters of the top-down CNN are updated. (Best viewed in color.) . . . . . . . . 18
2.4 Sample predictions of TDF-CNN on images of Part A of Shanghaitech dataset [1]. . 20
2.5 Some of the feedback gate maps for the input image shown in first column. Red to
blue colour scale maps to 0-1 range. . . . . . . . . . . . . . . . . . . . . . . . . . . 23
xiii
LIST OF FIGURES
3.1 Architecture of the proposed model, Switch-CNN is shown. A patch from the crowd
scene is highlighted in red. This patch is relayed to one of the three CNN regressor
networks based on the CNN label inferred from Switch. The highlighted patch is
relayed to regressor R3 which predicts the corresponding crowd density map. The
element-wise sum over the entire density map gives the crowd count of the crowd
scene patch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Sample predictions by Switch-CNN for crowd scenes from the ShanghaiTech dataset [1]
is shown. The top and bottom rows depict a crowd image, corresponding ground truth
and prediction from Part A and Part B of dataset respectively. . . . . . . . . . . . . 32
3.3 Histogram of average inter-head distance for crowd scene patches from Part A test
set of ShanghaiTech dataset [1] is shown. We see that the multichotomy of space of
crowd scene patches inferred from the switch separates patches based on latent factors
correlated with crowd density. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Sample crowd scene patches from Part A test set of ShanghaiTech dataset [1] are
shown. We see that the density of crowd in the patches increases from CNN regressor
R1 –R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Predictions of a typical regressor fine-tuned for sparse or dense crowds. Models per-
form better on their own specialties. . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Hierarchical Differential Training in IG-CNN. Regressors are recursively replicated
and specialized forming a CNN tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Test time architecture of IG-CNN. The expert classifier routes crowd patches to the
appropriate specialized regressor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Predictions made by IG-CNN on images of Shanghaitech dataset [1]. . . . . . . . . . 48
4.5 Mean and standard deviation of crowd count distribution preferred by expert regres-
sors at different hierarchies of IG-CNN. Computed on patches from Shanghaitech [1]
Part A test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.1 Grid Winner-Take-All architecture proposed in this work. Only the maximally acti-
vated neuron in a cell is allowed to pass its activation, creating sparse updates during
backpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 GWTA output of Conv1 layer for a sample image. Note that the reconstruction by
GWTA autoencoder is very sparse compared to normal autoencoder. . . . . . . . . . 59
5.3 Architecture of GWTA based Crowd Counting CNN (GWTA-CCNN). Unsupervised
training is done in stages, updating every layer by reconstructing its own input regu-
larized by the GWTA sparsity. Last two layers are trained with supervision. . . . . . 60
xiv
LIST OF FIGURES
5.4 Sample predictions given by GWTA-CCNN on images from Shanghaitech dataset.

The predicted density maps closely resemble that of the supervised CCNN model,
emphasizing the ability of our unsupervised approach to learn useful features. . . . . 62
5.5 Qualitative comparison of features learned by GWTA autoencoder with that of the
fully supervised CCNN. The images are sum maps of all the features in a layer. . . . 64
5.6 Some of the individual feature maps of Conv1 for GWTA and supervised CCNN. . . 65
5.7 Amount of labeled data vs MAE. CCNN is trained in fully and almost supervised
fashion with different amounts of labeled data of Part A Shanghaitech dataset. We
see that at less data scenarios our almost unsupervised approach performs better than
fully supervised. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1 Contrasting the proposed binary labeling paradigm with existing head annotation
framework. Our method requires only one binary label per crowd image to train a
density regressor as opposed to annotating all the heads. . . . . . . . . . . . . . . . 68
6.2 Samples of noisy density maps extracted from edge details of crowd images are
shown. They seem to roughly correlate with the crowd density. . . . . . . . . . . . . 70
6.3 The distribution counts computed from the noisy density maps, evaluated on crowd
images from Shanghaitech Part A [1]. . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 The training pipeline and architecture of the Noise Rectifier Network (NRN) is shown.
NRN takes noisy density maps and outputs scale factors to improve the maps. It is
trained on the synthetic data sampled from the parametric distribution. . . . . . . . . 73
6.5 Overall architecture of the Binary Supervised Density Regressor (BSDR) is depicted.
Binary labels of the crowd images are used to select appropriate NRN to generate
density maps. These noise rectified maps act as ground truths to train the density
regressor, completely avoiding the need to have head annotations. . . . . . . . . . . 74
6.6 Density maps regressed by the proposed BSDR model. Though our model is trained
on weak labeled and noisy data, the density predictions closely follow the ground truths. 75
7.1 Self-Supervision Vs Complete Self-Supervision: Normal self-supervision techniques

has a mandatory labeled training stage to map the learned features to the end task
of interest (in blue). But the proposed complete self-supervision is devoid of such
an instance-wise labeled supervision, instead relies on matching the statistics of the
predictions to a prior distribution (in green). . . . . . . . . . . . . . . . . . . . . . . 83
xv
LIST OF FIGURES
7.2 Computing the distribution of natural crowds: crops from dense crowd images are
framed to a spatial grid of cells and crowd counts of all the cells are aggregated to a
histogram (obtained on Shanghaitech Part A dataset [1]). The distribution is certainly
long tailed and could be approximated to a power law. . . . . . . . . . . . . . . . . . 85
7.3 The architecture of CSS-CCNN is shown. CSS-CCNN has two stages of training:
the first trains the base feature extraction network in a self-supervised manner with
rotation task and the second stage optimizes the model for matching the statistics of
the density predictions to that of the prior distribution using optimal transport. . . . . 87
7.4 Density maps estimated by CSS-CCNN along with that of baseline methods. Despite
being trained without a single annotated image, CSS-CCNN is seen to be quite good
at discriminating the crowd regions as well as regressing the density values. . . . . . 92
7.5 Comparing our completely self-supervised method to fully supervised and self-supervised
approaches under a limited amount of labeled training data. The x-axis denotes the
number of training images along with the count (in thousands) of head annotations
available for training, while the y-axis represents the MAE thus obtained. At low data
scenarios, CSS-CCNN has significantly superior performance than others. . . . . . . 98
7.6 Double logarithmic representation of maximum likelihood fit for the crowd counts
from Shanghaitech Part A [1] and UCF-QNRF [3]. . . . . . . . . . . . . . . . . . . 100
7.7 Visualization of mean features extracted from different convolutional blocks of CSS-
CCNN and the supervised baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.1 Dot Detection Vs Density Regression. The top row shows crowds with dot predictions
from the proposed DD-CNN, while bottom row has corresponding density maps. The
dot detection has better localization of individuals across density ranges. . . . . . . . 105
8.2 The architecture of the proposed dot detection network. DD-CNN has a multi-scale
architecture with dot predictions at different resolutions, which are combined through
Adaptive Scale Fusion. The networks are trained with pixel-wise binary cross-entropy
loss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3 Predictions made by DD-CNN on images of Shanghaitech dataset [1]. The results
emphasize the ability of our dot detection approach to localize people in crowds (zoom
in the dot maps to see the difference). . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.4 Dot predictions made by individual scale columns of DD-CNN on Shanghaitech
dataset [1]. The outputs clearly shows that the multi-scale training improves sig-
nificantly the dot prediction quality. (zoom in to see the difference) . . . . . . . . . . 113
xvi
LIST OF FIGURES
8.5 Detection by thresholding density maps of CSR-A-thr-dot net; results show almost no
detections in sparse regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.1 Face detection vs. Crowd counting. Tiny Face detector [4], trained on face dataset [5]
with box annotations, is able to capture 731 out of the 1151 people in the first image
[6], losing mainly in highly dense regions. In contrast, despite being trained on crowd
dataset [1] having only point head annotations, our LSC-CNN detects 999 persons
(second image) consistently across density ranges and provides fairly accurate boxes. 119
9.2 The architecture of the proposed LSC-CNN is shown. LSC-CNN jointly processes
multi-scale information from the feature extractor and provides predictions at mul-
tiple resolutions, which are combined to form the final detections. The model is
optimized for per-pixel classification of pseudo ground truth boxes generated in the
GWTA training phase (indicated with dotted lines). . . . . . . . . . . . . . . . . . . 122
9.3 The exact configuration of Feature Extractor, which is a modified version of VGG-16
[7] and outputs feature maps at multiple scales. . . . . . . . . . . . . . . . . . . . . 123
9.4 The implementation of the TFM module is depicted. TFM(s) processes the features
from scale s (terminal 1) along with s multi-scale inputs from higher branches (ter-
minal 3) to output head detections (terminal 4) and the features (terminal 2) for the
next scale branch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.5 Samples of generated pseudo box ground truth. Boxes with same color belong to one
scale branch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.6 Illustration of the operations in GWTA training. GWTA only selects the highest loss
making cell in every scale. The per-pixel cross-entropy loss is computed between the
prediction and pseudo ground truth maps. . . . . . . . . . . . . . . . . . . . . . . . 130
9.7 Predictions made by LSC-CNN on images from Shanghaitech, UCF-QNRF and UCF-
CC-50 datasets. The results emphasize the ability of our approach to pinpoint peo-
ple consistently across crowds of different types than the baseline density regression
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.8 Demonstrating the effectiveness of GWTA in proper training of high resolution scale
branches (notice the highlighted region). . . . . . . . . . . . . . . . . . . . . . . . . 138
9.9 Comparison of predictions made by face detectors SSH [8] and TinyFaces [4] against
LSC-CNN. Note that the Ground Truth shown for WIDERFACE dataset is the actual
and not the pseudo box ground truth. Normal face detectors are seen to fail on dense
crowds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
xvii
List of Tables
2.1 Comparison of TDF-CNN to other methods on Part A and Part B of Shanghaitech

dataset [1]. Our model performs better on all metrics. LBP+RR refers to a model that
uses Local Binary Pattern and Ridge Regression for estimating crowd count [1]. . . . 20
2.2 Benchmarking of TDF-CNN on UCF CC 50 dataset [2]. TDF-CNN performs com-
petitively with fewer model parameters. . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 MAEs computed for five test scenes of WorldExpo’10 dataset [9]. Our top-down
feedback model has better MAE for three scenes and delivers lower average MAE. . 22
2.4 Comparison of models with and without top-down feedback. The lower MAE deliv-
ered by models with feedback is indicative of its effectiveness. . . . . . . . . . . . . 23
3.1 Comparison of Switch-CNN with other state-of-the-art crowd counting methods on

ShanghaiTech dataset [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
UCF CC 50 dataset [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
UCSD crowd-counting dataset [10]. . . . . . . . . . . . . . . . . . . . . . . . . . . 34
WorldExpo’10 dataset [9]. Mean Absolute Error (MAE) for individual test scenes
and average performance across scenes is shown. . . . . . . . . . . . . . . . . . . . 35
3.5 Comparison of MAE for Switch-CNN variants and CNN regressors R1 through R3
on Part A of the ShanghaiTech dataset [1]. . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Comparison of MAE for Switch-CNN and manual clustering of patches based on
patch attributes on Part A of the ShanghaiTech dataset [1]. . . . . . . . . . . . . . . 38
4.1 Performance of IG-CNN on Part A and Part B of Shanghaitech dataset [1]. IG-CNN
outperforms other methods in MAE. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
xviii
LIST OF TABLES
4.2 Comparison of IG-CNN with other methods on UCF CC 50 dataset [2]. Our model
gives lower error than other methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 MAEs obtained by models for the 5 test scenes of WorldExpo’10 dataset [9]. . . . . 50
4.4 Effect of hierarchical growth of IG-CNN on Part A of Shanghaitech [1] dataset.
Though the oracle loss is steadily decreasing with depth, classifier error is increas-
ing leading to higher MAE at test time. . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Comparison of IG-CNN with other specialization based methods on Part A of Shang-
haitech [1] dataset. IG-CNN outperforms other architectures. . . . . . . . . . . . . . 52
5.1 Performance of GWTA-CCNN on Part A of Shanghaitech dataset. . . . . . . . . . . 62

5.2 Comparison of GWTA-CCNN with other methods on UCF CC 50 dataset [2]. Our
model delivers superior performance than other unsupervised methods. . . . . . . . . 63
5.3 Performance of GWTA-CCNN on Part A of Shanghaitech dataset [1] compared with
self-supervised methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1 Comparison of BSDR against other methods on Shanghaitech PartA [1]. Our model
delivers significant counting performance at the lowest annotation cost evident from
JLMC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Benchmarking BSDR on UCF-QNRF dataset [3]. Our approach beats other methods
in JLMC metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Evaluation BSDR on UCF-CC-50 dataset [2]. Despite a challenging dataset, BSDR
stands better in JLMC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.4 Cross dataset performance of our model; the reported entries are the MAEs obtained
for BSDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5 The performance of BSDR under changes in various hyper-parameters to validate
the architectural choices. BSDR seems to be robust under reasonable amount label
corruption. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.1 Performance comparison of CSS-CCNN with other methods on Shanghaitech PartA

[1]. Our model outperforms all the baselines. . . . . . . . . . . . . . . . . . . . . . 94
7.2 Benchmarking CSS-CCNN on UCF-QNRF dataset [3]. Our approach beats the base-
line methods in counting performance. . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3 Performance CSS-CCNN on UCF-CC-50 [2]. Despite being very challenging dataset,
CSS-CCNN achieves better MAE than baselines. . . . . . . . . . . . . . . . . . . . 95
7.4 Evaluation of CSS-CCNN on JHU-CROWD++ [11, 12] dataset. . . . . . . . . . . . 96
xix
LIST OF TABLES
7.5 Cross dataset performance of our model; the reported entries are the MAEs obtained
for CSS-CCNN and CSS-CCNN++ respectively. . . . . . . . . . . . . . . . . . . . 96
7.6 Evaluating CSS-CCNN in a true practical setting: the model is trained on images
crawled from the web, but evaluated on crowd datasets. The counting performance
appears similar to that of training on the dataset. . . . . . . . . . . . . . . . . . . . . 97
7.7 Validating different architectural design choices made for CSS-CCNN evaluated on
the Shanghaitech Part A [1] (computed on single run). . . . . . . . . . . . . . . . . 99
7.8 Sensitivity analysis for the hyper-parameters on CSS-CCNN. Our model is robust to
fairly large change in the max count parameter. . . . . . . . . . . . . . . . . . . . . 101
8.1 Performance of DD-CNN along with other methods on UCF-QNRF dataset [2]. Our
model has better count estimation than all other methods. . . . . . . . . . . . . . . . 111
8.2 Comparison of DD-CNN performance on UCF CC 50 [2]. DD-CNN beats other
models in terms of MAE and MSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.3 Comparison of DD-CNN performance on Shanghaitech Part A and Part B dataset [1].
DD-CNN delivers very competitive count accuracy relative to other regression models. 114
8.4 Results for DD-CNN model ablative experiments. The results evidence the effective-
ness of the design choices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.5 Evaluation of DD-CNN and baseline regression on the localization metrics to analyse
the dot prediction performance. Our model seems to achieve better localization of
predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.1 Comparison of LSC-CNN on localization metrics against the baseline regression method.
Our model seems to pinpoint persons more accurately. . . . . . . . . . . . . . . . . 133
9.2 Comparison of LSC-CNN on localization metrics against the baseline regression method.
Our model seems to pinpoint persons more accurately. . . . . . . . . . . . . . . . . 133
9.3 Evaluation of LSC-CNN box prediction on WIDERFACE [5]. Our model and PS-
DNN are trained on pseudo ground truths, while others use full supervision. LSC-
CNN has impressive mAP in Medium and Hard sets. . . . . . . . . . . . . . . . . . 134
9.4 Counting performance comparison of LSC-CNN on UCF-QNRF [2]. . . . . . . . . . 135
9.5 LSC-CNN on UCF CC 50 [2] dataset. LSC-CNN stands state-of-the-art in UCF CC 50,
except to DD-CNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
9.6 Benchmarking LSC-CNN counting accuracy on Shanghaitech [1] datasets. LSC-
CNN stands state-of-the-art in ST PartB, with very competitive MAE on ST PartA. . 136
9.7 LSC-CNN on WorldExpo’10 [2] beats other methods in average MAE. . . . . . . . . 136
9.8 Evaluation of LSC-CNN on TRANCOS [13] vehicle counting dataset. . . . . . . . . 136
xx
LIST OF TABLES
9.9 MAE obtained by LSC-CNN with different hyper-parameter settings. . . . . . . . . 137

9.10 Validating various architectural design choices of LSC-CNN. . . . . . . . . . . . . . 138
9.11 LSC-CNN compared with existing detectors trained on crowd datasets. . . . . . . . . 139
9.12 Efficiency of detectors in terms of inference speed and model size. . . . . . . . . . . 140
xxi
LIST OF TABLES
xxii
Chapter 1
Introduction
ROWDS are common in public places; be it due to the daily city commuters or some special
C gatherings, analyzing crowds is gradually becoming important both in terms of security and
planning. This is primarily because of the frequent formation of massive crowds nowadays as a pos-
sible result of rapid growth in world population and urbanization. Huge gatherings usually arise due
to religious activities, democratic protests, political rallies, public demonstrations, sporting events, art
performances, etc. Civic agencies and planners rely on crowd estimates to regulate access points and
plan disaster contingency for such events. Critical to such analysis is crowd count and density.
Figure 1.1: Sample crowd scenes from the ShanghaiTech dataset [1] are shown.
1
In principle, the key idea behind crowd counting is self-evident: density times area. But crowds
are not uniform across the scene. They cluster in certain regions and are spread out in others. Typical
crowd scenes from the ShanghaiTech Dataset [1] are shown in Figure 1.1. We see extreme crowd-
ing, high visual resemblance between people and background elements (e.g. urban facade) in these
images. Different camera view-points create perspective effects resulting in large variability of scales
of people. Hence, counting people in crowds, especially dense gatherings, is a difficult task even
for humans. Coming to crowd analysis as a Computer Vision task, the major challenge is the drastic
diversity in the way people appear in crowded scenes. In highly dense crowds, people are only seen
as blobs, whereas facial features might be visible in sparse gatherings. The visibility of features for
crowd discrimination is directly related to the density of the crowd. Severe occlusion, pose changes
and illumination variations further compound the problem. Conventional head or body detection
based methods fail to adapt to such huge diversity. Since any candidate counting algorithm has to
address these issues that are specific to crowds, we enumerate and describe them in the next section.
1.1 Challenges in Dense Crowd Counting

Extracting any useful information from crowd images requires prospective models to first locate and
ascertain every person in the scene. This is true even to obtain seemingly simple statistics like the
crowd density distribution or overall count. Developing a framework that works consistently across
the entire variety of crowds is a challenging task and cannot be easily achieved with trivial changes to
existing person detection methods. This is because of the following major reasons:
• Diversity
Appearance Variety: Any counting model has to handle huge diversity in appearance of
Figure 1.2: Diversity in appearance of people across different crowd density. One could only see head
blobs in extreme dense regions, but facial and body features might be visible in other cases. Hence,
the features available for crowd discrimination varies with density.
2
individual persons and their assemblies. There exist an interplay of multiple variables, including
but not limited to pose, view-point and illumination variations within the same crowd as well
as across crowd images (see Figure 1.1).
Scale Diversity: The extreme scale and density variations in crowd scenes pose unique
challenges in formulating a person detection framework. In normal detection scenarios, this
could be mitigated using a multi-scale architecture, where images are fed to the model at dif-
ferent scales and trained. A large face in a sparse crowd is not simply a scaled up version of
that of persons in dense regions as evident from Figure 1.2. The pattern of appearance itself is
changing across scales or density.
Extreme Head Sizes: Since the densities vary drastically, so are the sizes of heads of people.
This stands as one of the bottleneck in designing a bounding box based detection approach. The
size of boxes must vary from as small as 1 pixel in highly dense crowds to more than 300 pixels
in sparser regions, which is several folds beyond the setting under which normal detectors
operate.
• Data Scarcity
Limited Annotations: Training models for crowd analysis require large datasets covering
the vast spectrum of diversity. However, this is difficult in practice as thousands of people
needs to be annotated per crowd image. It is tedious and time-consuming to locate every person
in highly dense crowd scenes as evident from Figure 1.3. Consequently, typical dense crowd
datasets are small and span only limited scenarios. This directly affects the performance of deep
learning models.
Figure 1.3: Sample crowd images from UCF CC 50 [2] dataset, along with corresponding head an-
notations.
3
Data Imbalance: Another problem due to drastic diversity is the imbalance in distribution
of people across density categories. The distribution is so skewed that the majority of samples
are crowded to certain set of densities while only a few are available for the remaining. There is
significantly more number of instances of people in dense crowds than sparse, requiring serious
data balancing schemes for proper training of models.
Point Annotation: Since it is quite cumbersome to get bounding boxes for people in dense
crowds, majority crowd datasets have only head locations (see Figure 1.3). These point annota-
tions are much easier to obtain. But this seriously limits the learning to head locations and does
not support further useful downstream applications like detection, face recognition etc.
• Localization
Person Localization: Though localizing every person accurately serves better for crowd
analysis, it proves to be a challenging task in the dense setting. But to just estimate the density
distribution or the overall count, accurate localization might not be required. One could model
the distribution of people in local regions instead of discrimination every person. This naturally
points to a trade-off where the performance on specific tasks could be improved at the cost of
localization.
Resolution: Usual detection models predict at a down-sampled spatial resolution, typically
one-sixteenth or one-eighth of the input dimensions. But this approach does not scale across
density ranges. Especially, highly dense regions require fine grained detection of people, with
the possibility of hundreds of instances being present in a small region, at a level difficult with
the conventional frameworks.
Local Minima: Training the model to predict at higher resolutions causes the gradient up-
dates to be averaged over a larger spatial area. This, especially with the diverse crowd data,
increases the chances of optimization being stuck in local minimas, leading to suboptimal per-
formance.
Given these issues, counting by detecting humans is perceived difficult [2], especially to scale
satisfactorily across the whole density spectrum seen in typical crowd scenes. In fact, due to extreme
pose, scale and view point variations, learning a consistent feature set to discriminate people seems
hard. Though faces might be largely visible in sparse assemblies, people become tiny blobs in highly
dense crowds. This makes it cumbersome to put bounding boxes in dense crowds, not to mention the
sheer number of people, in the order of thousands, that need to be annotated per image. Consequently,
the problem is more conveniently reduced to that of density regression.
4
Figure 1.4: Depiction of the density regression paradigm, where the model takes an input image and
predicts crowd density.
1.2 Density Regression Paradigm

In density estimation, a model is trained to map an input image to its crowd density, where the spatial
values indicate the number of people per unit pixel. Often Convolutional Neural Networks (CNN)
are used to predict crowd density map with advent of deep learning. This widely used regression
paradigm is graphically depicted in Figure 1.4. Density maps provide rich spatial information and
aids better feature learning than attempting to regress the crowd counts directly. The models trained
for density regression becomes discriminative to crowd features (head or shoulder patterns as they
appear in crowds) to its density as opposed to detecting individuals with facial or body features.
To facilitate training for density regression, the heads of people are annotated, which is much
easier than specifying bounding box for crowd images [2]. The annotations are provided as locations
of the center of the heads as shown in Figure 1.3. These point annotations are converted to density
map by convolving with a Gaussian kernel such that simple spatial summation gives out the crowd
count. Mathematically, let P be the set of all N annotated (x, y) locations of people in the given
image. Now given a Gaussian kernel Gσ of variance σ, the density map D is generated as,
N
X
D[x] = δ(x − xi ) ∗ Gσ (x), (1.1)
i=1
where δ(x) is the Dirac delta function and ∗ is the 2D convolution operation at location x. The kernel
Gσ is chosen to sum to unity so that spatially summing the resultant density map D gives the crowd
count.
Density maps ease the difficulty of regression for the CNN as the task of predicting the exact
point of head annotation is reduced to predicting a coarse location. The spread of the Gaussian in
the above density map is fixed. The variance σ is typically chosen such that the density maps looks
approximately smooth. Larger the σ, smoother the density maps at the cost of localization. However,
5
there are alternate ways to fix the σ; one way is to use the geometry-adaptive kernels [1]. In adaptive
kernels, the spread parameter of the Gaussian kernel is varied depending on the local crowd density.
It basically sets σ in proportion to the average distance of k-nearest neighboring head annotations.
This results in lower degree of Gaussian blur for dense areas, but a higher degree for regions of sparse
density in crowd scenes. We do not find any significant difference between the variants and the fixed
variance version is widely used as it is simple as well as effective.
Typical CNN regressors consist of a set of sequential convolution blocks with in-between max
pooling layers. ReLU activation function is employed except at the last layer that outputs the single
channel density map. These regressors are trained to minimize the Euclidean distance between the
estimated density map and ground truth. Let DXi (·; Θ) represent the output of a CNN regressor for
an input image Xi . The l2 loss function is given by
N
1 X GT
Ll2 (Θ) = kDXi (·; Θ) − DX (·)k22 , (1.2)
2N i=1 i
GT
where N is the number of training samples and DX i
(·) indicates ground truth density map for image
Xi . Note that Θ stands for the trainable parameters of the network. The loss Ll2 is optimized by
backpropagating the CNN via stochastic gradient descent (SGD). Here, l2 loss function acts as a
proxy for count error between the regressor estimated count and true count. It indirectly minimizes
the count error. The regressor is trained until the validation accuracy plateaus.
1.3 Datasets and Evaluation for Dense Crowd Analysis

Since quickly analyzing crowds via computing the distribution of people is important, this field of
research has gained traction in recent times. A couple of datasets have been created to propel studies
in dense crowd counting. Here we introduce a few popular ones.
The first is UCF CC 50 [2] dataset, which is a 50 image collection of annotated crowd scenes.
The annotations are in the form of locations of human heads in the images (as shown in Figure 1.3).
The dataset exhibits a large variance in the crowd density with counts varying between 94 and 4543.
The small size of the dataset and huge diversity in crowd counts makes it very challenging. There
is no separate test set for this dataset, but a 5-fold cross-validation protocol is used to evaluate the
performance of any given model.
Another important dataset is Shanghaitech (ST) [1]. It consists of total 1,198 crowd images with
more than 0.3 million head annotations. It is divided into two sets, namely, Part A and Part B. Part A
crowd scenes are parsed from the Internet and have a density variation ranging from 33 to 3139 people
6
per image with average count being 501.4. In contrast, images in Part B are scenes are captured from
surveillance cameras in city streets. They are relatively less diverse and sparse with an average density
of 123.6. Part A has 482 images, of which 300 images are used for training and the rest are used for
testing. Similarly, the 716 images of Part B are partitioned into chunks of 400 training and 316 testing
images.
The WorldExpo’10 dataset [9] contains 1132 video sequences captured with 108 surveillance
cameras in Shanghai 2010 WorldExpo. It has 3980 frames, out of which 3380 are used for training.
Test set includes five different video sequence with 120 frames each. The crowds are relatively sparse
compared with other dataset. There are only 50 people per image on an average. Region of interest
(ROI) is provided for both training and test scenes. In addition, the authors also provide perspective
maps for all scenes.
More recent datasets include significantly more annotations across diverse scenarios. But all these
datasets, in general, follow the point head annotation methodology employed in UCF CC 50. For
instance, UCF-QNRF, introduced by [3], is a large dense crowd counting dataset. The images are
collected from Internet as well as from Hajj footage. There are 1201 images for training and 334 for
the test set. The density of crowd varies between 49 to as high as 12,865.
JHU-CROWD++ [11, 12] is another new comprehensive dataset with 1.51 million head annota-
tions spanning 4372 images. The crowd scenes are obtained under various scenarios and weather con-
ditions, making it one of the challenging dataset in terms of diversity. Furthermore, JHU-CROWD++
has a richer set of annotations at head level as well as image level.
Typical training pipeline involves taking random crops from the crowd dataset images, computing
the l2 loss (equation 1.2) and updating the parameters of the density regressor. Simple augmentation
like flipping the training crops are done to improve generalization. The trained models are evaluated
on certain standard metrics. The widely used metric for crowd counting is the Mean Absolute Error or
MAE. MAE is simply the absolute difference between the predicted and actual crowd counts averaged
over all the images in the test set. Mathematically,
N
1 X
MAE = |Ci − CiGT |, (1.3)
N i=1
where Ci is the count predicted for input image i, for which the ground truth is CiGT . The counting
performance of a model is directly evident from the MAE value. Further to estimate the variance and
7
hence robustness of the count prediction, Mean Squared Error or MSE is used. It is given by
v
u
u1 X N
MSE = t (Ci − CiGT )2 . (1.4)
N i=1
For every proposed models, we benchmark it against the existing methods on MAE and MSE metrics.
Though these metrics measure the accuracy of overall count prediction, other aspects of the prediction
are not considered such as localization. Hence, apart from standard MAE, we introduce alternate
metrics as and when required in the rest of the thesis.
1.4 Related Works in Dense Crowd Counting

There are numerous works in the area of dense crowd counting. Here we provide a brief overview of
the major related works. Other relevant works will be discussed in the appropriate chapters.
Person Detection: The topic of crowd counting broadly might have started with the detection
of people in crowded scenes. These methods use appearance features from still images or motion
vectors in videos to detect individual persons [14, 15, 16]. Idrees et al. [17] leverage local scale
prior and global occlusion reasoning to detect humans in crowds. With features extracted from a
deep network, [18] run a recurrent framework to sequentially detect and count people. In general,
the person detection based methods are limited by their inability to operate faithfully in highly dense
crowds and require bounding box annotations for training. Consequently, density regression becomes
popular.
Density Regression: Idrees et al. [2] introduce an approach where features from head detections,
interest points and frequency analysis are used to regress the crowd density. A shallow CNN is
employed as density regressor in [9], where the training is done by alternatively backpropagating the
regression and direct count loss. There are works like [19], where the model directly regresses crowd
count instead of density map. But such methods are shown to perform inferior due to the lack of
spatial information in the loss.
Multiple and Multi-column CNNs: The next wave of methods focuses on addressing the huge
diversity of appearance of people in crowd images through multiple networks. Walach et al. [20] use
a cascade of CNN regressors to sequentially correct the errors of previous networks. The outputs of
multiple networks, each being trained with images of different scales, are fused in [21] to generate the
final density map. Extending the trend, architecture with multiple columns of CNN having different
receptive fields starts to emerge. The receptive field determines the affinity towards certain density
types. For example, the deep network in [22] is supposed to capture sparse crowds, while the shallow
8
network is for the blob like people in dense regions. The MCNN [1] model leverages three networks
with filter sizes tuned for different density types. Going further, the multiple columns are combined
into a single network, with parallel convolutional blocks of different filters by [23] and is trained along
with an additional consistency loss.
Leveraging context and other information: Improving density regression by incorporating addi-
tional information forms another line of thought. Works like [24, 25] supply local or global level
auxiliary information through a dedicated classifier. An alternate approach uses context through iter-
ative prediction of density maps. Ranjan et al. [26] show that this incremental refinement of density
leads to better regression performance.
Better and easier architectures: Since density regression suits better for denser crowds, Decide-
Net architecture from [27] adaptively leverages predictions from Faster RCNN [28] detector in sparse
regions to improve performance. Though the predictions seems to be better in sparse crowds, the
performance on dense datasets is not very evident. Also note that the focus of this work is to aid
better regression with a detector and is not a person detection model. In fact, Decide-Net requires
some bounding box annotation for training, which is infeasible for dense crowds. Striving for simpler
architectures have always been a theme. Li et al. [29] employ a VGG based model with additional
dilated convolutional layers and obtain better count estimation. Further, a DenseNet [30] model is
trained in [3] for density regression at different resolutions with composition loss.
Other counting works: An alternate set of works try to incorporate flavours of unsupervised
learning and mitigate the issue of annotation difficulty. Liu et al. [31] use count ranking as a self-
supervision task on unlabeled data in a multitask framework along with regression supervision on la-
beled data. Other counting works employ Negative Correlation Learning [32] and Adversarial training
to improve regression [33].
Object/face Detectors: Since later chapters explore crowd counting via head detection as well,
here we briefly compare with related detection works. Object detection has seen a shift from early
methods relying on interest points (like SIFT [34]) to CNNs. Early CNN based detectors operate on
the philosophy of first extracting features from a deep CNN and then have a classifier on the region
proposals [35, 36, 37] or a Region Proposal Network (RPN) [28] to jointly predict the presence and
boxes for objects. But the current dominant methods [38, 39] have simpler end-to-end architecture
without region proposals. They divide the input image in to a grid of cells and boxes are predicted
with confidences for each cell. But these works generally suit for relatively large objects, with less
number instances per image. Hence to capture multiple small objects (like faces), many models are
proposed. Zhu et al. [40] adapt Faster RCNN with multi-scale ROI features to better detect small
faces. On similar lines, a pyramid of images is leveraged in [4], with each scale being separately
processed to detect faces of varied sizes. The SSH model [8] detects faces from multiple scales in a
9
single stage using features from different layers. More recently, Sindagi et al. [41] improves small
face detections by enriching features with density information.
1.5 Organization of the Thesis

This thesis is broadly divided into three parts, each dedicated to address one of the major challenges
in dense crowd analysis. The exact organization is as follows:
• Addressing Diversity: The first part discusses approaches to address the Diversity issue in
crowd counting. It consists of Chapters 2, 3 and 4, where three different architectures and
training methods are proposed to handle the huge diversity visible in crowd scenes. Especially,
Chapter 2 builds up on the observation that discriminating persons in highly diverse settings
requires larger spatial context and semantics of the scene. A specialized architecture incor-
porating brain-inspired top-down feedback connections is formulated to extract global context
and correct prediction errors in an iterative manner. To further scale across the diversity and
improve performance, Chapter 3 and 4 employs a mixture of experts approach. The differential
training regime introduced in Chapter 3, jointly clusters and fine-tunes a set of experts to capture
huge diversity seen in crowd images. This boosts counting performance as different regions of
the images are processed by an expert regressor suited for the density type. Chapter 4 extends
the idea to a growing CNN that can progressively increase its capacity to account for the wide
variability seen in crowd scenes.
• Addressing Data Scarcity: Part 2 of the thesis deals with mitigating the Data scarcity is-
sue. Chapter 5 tackles this annotation difficulty through the Grid Winner-Take-All autoencoder,
which is designed to learn almost 99% of the parameters from unlabeled crowd images. Next a
binary labeling scheme is introduced in Chapter 6, where each image is labeled as either dense
or sparse, instead of annotating every person in the scene. This implies reduced amount of an-
notations, but yet delivers good performance at a very low labeling cost. The objective is pushed
further to fully eliminate the dependency on instance-level labeled data in Chapter 7. The pro-
posed completely self-supervised architecture does not require any annotation for training, but
uses a distribution matching technique to learn the required features.
• Addressing Localization: The third and final part of the thesis improves Localization of crowd
predictions. Two detection frameworks for dense crowd counting are developed, which elim-
inate the need for the prevalent density regression paradigm and improve person localization.
Chapter 8 reformulates the density regression task to localized dot prediction in dense crowds,
10
where the model is trained for pixel-wise binary classification to detect people. In the second
architecture described in Chapter 9, apart from locating persons, the spotted heads are sized
with bounding boxes. This approach could detect individual persons consistently across the di-
versity spectrum as opposed to regressing local crowd density values. Moreover, this improves
localization is achieved without requiring any additional box annotations for training.
• Finally, Chapter 10 concludes the results of the thesis and discusses possible directions for
future research.
11
Part I
Addressing Crowd Diversity
12
Chapter 2
Top-Down Feedback to correct Errors
S described in Section 1.2, counting CNNs are trained to regress spatial crowd density for an
A input image. They are forced to learn a hierarchy of filters specific to crowd features as op-
posed to individual human features. The discriminatory features that needs to be learned also varies
drastically with the crowd density. For example, the CNNs model edges to detect head-shoulder pair
as they appear in a highly crowded scene, rather than facial features like eyes or nose. However,
these dense crowd motifs closely resemble many patterns created by leaves of trees, buildings, clut-
tered backgrounds etc. and could be misclassified as people. This inability to learn a consistent set
Figure 2.1: Density maps predicted by a typical CNN regressor (third column) have a lot of false
detections. Many crowd like patterns are identified as humans. First column displays the input scene
and middle column holds the ground truth density map.
13
of features across diverse scenarios causes a lot of false predictions as shown in Figure 2.1. Some
high-level context information should have flagged the irrelevant patterns. In general, this is true for
any feedforward bottom-up systems, where the low-level feature detectors do not have enough con-
text information to decide on the input. Useful information might be lost in the initial layers (say, by
spatial max-pooling etc.) of a neural network, which are necessary for the correct prediction. Ideally,
once high-level context of the image is known, the system has to evaluate it and arrive at the final
decision.
One way to address this problem is to look at how humans do crowd counting. Whenever there
is difficulty in identifying people, say at extremely dense crowd, people use high-level scene under-
standing to judge whether the crowd like blobs are indeed humans. Also, there is ample evidence from
neuroscience researches that brain has complex feedback networks [42, 43, 44]. Information flows in
both directions; high-level cortical areas can influence low-level feature detectors. In this work, we
try to mimic some aspects of the top-down feedback mechanism to solve crowd counting task.
Our primary aim is to use high-level context information about the scene to correct false density
predictions of the counting CNN. To that end, we construct a regular CNN density regressor as the
bottom-up network and a separate top-down CNN to utilize scene context. Top-down information
comes in the form of feature maps output by higher layers of the bottom-up CNN. This information
is used by the top-down network to generate feedback. The feedback here acts as a correcting signal
to the lower layers of the bottom-up network. In our work, feedback is applied in the form of mul-
tiplicative gating to the low-level feature activations of the bottom-up network. The density map is
generated once again after applying the feedback to the bottom-up CNN. In this way, the lower layers
of the CNN regressor receive high-level context information in the form gating. The feedback gates
the lower layer feature activations of the bottom-up CNN to correct the density prediction.
In summary, the major contributions of this chapter are:
• A generic architecture to deliver top-down information in the form of feedback to the bottom-up
network.
• A crowd counting system that uses top-down feedback framework to correct its density predic-
tions.
2.1 Related Works for Top-Down Architectures

Most of the works in crowd analysis do not leverage any top-down feedback approach as evident from
Section 1.4. They use only a single-pass feedforward computation of images and incur significant
false positives. In contrast, top-down processing exist in the human brain [43, 44]. These top-down
14
modulations are evoked by a complex network of horizontal and feedback connections [42]. They help
the lower visual areas to attain attentional selectivity, so that only relevant information is combined
and spurious responses are removed [45, 46]. Some works in Computer Vision that try to incorporate
certain of these aspects. Many works [47, 48, 49, 50] use a series of networks to iteratively perform
their tasks of interests. Initial output by the first network is fed to the next network along with context
information. Feature maps of the previous network is concatenated with its output to supply top-down
information to the next network in series. Main drawback of this approach is that each stage requires
a separate network to be trained. Instead of iteratively improving predictions with multiple similar
network, [51, 52] have a separate top-down network which takes features from different layers of the
bottom-up network as context information. Note that the top-down network generates the final output
and no feedback is given back to the bottom-up network. Hence the top-down network also learns
to do the task of interest using high-level context, rather than providing feedback to the bottom-up
network.
On the other hand, in our architecture, the top-down network learns to drive feedback to the lower
layers of the bottom-up network. Here, the form of feedback is also important. Works like [53, 54, 55]
use multiplicative mask on feature maps to suppress unwanted activations. We also apply feedback in
the form of multiplicative gating. The final prediction is output by the bottom-up network using the
feedback gated activations.
2.2 Our Approach
2.2.1 Feedback as a Correcting Signal

As discussed previously, the majority of the crowd counting systems rely on a CNN to regress the
crowd density map. However, all these regressing models do not individually detect and count people.
They learn to look for crowd features (how the heads of a bunch of people appear) as opposed to
individual person features (eyes, nose, body parts etc.). Consequently, many crowd like patterns
(trees, cluttered backgrounds, etc.) are detected as people. Also, the density prediction at various
regions could be wrong due to occlusions or background interferences. Many of these problems
could be solved if high-level context information is available to the density regressor. Hence, our
approach is to use top-down feedback as a correcting signal to re-evaluate the density prediction of
the CNN. A separate top-down network learns to correlate the high-level context of the scene with
the low-level responses of the CNN regressor. It generates masks, which weigh spatially all feature
activations of the lower layers. This suppresses spurious detections and pass legitimate responses to
generate the corrected density map.
15
Initial Corrected
Density Map Density Map
Ce 1x1 | 1 Ce 1x1 | 1 Ce 1x1 | 1
C1d 7x7 | 8 C2d 3x3 | 12 C1d 7x7 | 8 C2d 3x3 | 12 C1d 7x7 | 8 C2d 3x3 | 12
MP3a 2x2 MP3a 2x2

C1c 7x7 | 16 C2c 3x3 | 24 C1c 7x7 | 16 C3a 3x3 | 16 C2c 3x3 | 24 C1c 7x7 | 16 C3a 3x3 | 16 C2c 3x3 | 24
MP1b 2x2 MP2b 2x2 MP1b 2x2 UnP3a 4x4 MP2b 2x2 MP1b 2x2 UnP3a 4x4 MP2b 2x2
C1b 7x7 | 32 C2b 3x3 | 48 C1b 7x7 | 32 C3b 3x3 | 32 C2b 3x3 | 48 C1b 7x7 | 32 C3b 3x3 | 32 C2b 3x3 | 48
C3c 3x3|16 C3d 3x3|24 C3c 3x3|16 C3d 3x3|24
Multiplicative
Feedback Gate Features
Gating
MP1a 2x2 MP2a 2x2 MP1a 2x2 MP2a 2x2 MP1a 2x2 MP2a 2x2
C1a 9x9 | 16 C2a 5x5 | 24 C1a 9x9 | 16 C2a 5x5 | 24 C1a 9x9 | 16 C2a 5x5 | 24
C : Convolution : Bottom-Up CNN

MP : Max-Pool
UnP : Un-Pool : Top-Down CNN
(a) Bottom-Up CNN (b) Feedback Generation (c) Applying Feedback
Figure 2.2: Architecture of Top-down feedback CNN. (a) displays bottom-up CNN, (b) depicts the
feedback generation by the top-down CNN and (c) shows how the bottom-up CNN re-evaluates its
prediction using the gate features. (Best viewed in colour.)
2.2.2 Top-Down Feedback CNN

Our model has a bottom-up CNN for density prediction and a separate top-down CNN for feedback
generation. A two column CNN forms the bottom-up network. Feature maps from final layers of the
CNN columns are fused with a 1 × 1 filter to obtain the density map. This CNN regressor is similar
to the architecture introduced in [1]. We use it because of the simplicity and performance it offers.
The CNN columns basically vary in filter sizes and hence the receptive fields. These are designed
to capture crowd at different scales and partly address the Scale Diversity (Section 1.1). The first
network has large initial filter size of 9 × 9 and can attend patterns that span large receptive fields
like big faces etc. The other column with initial filter size 5 × 5 is meant for dense crowds. Both are
shallow networks with only four convolutional layers and two pooling layers. Figure 2.1(a) shows the
bottom-up network.
The top-down CNN runs down parallel to the bottom-up network. It consumes high-level fea-
ture maps from the bottom-up network to generate feedback. Generally, the feature maps are taken
from the layers immediately before pooling layers of the bottom-up CNN. This is to ensure that the
top-down network has access to any relevant information lost in spatial pooling. As depicted in Fig-
ure 2.2(b), the feature maps from the convolutional layers C1c and C2c of the bottom-up CNN are
16
concatenated and pooled. Max-pooling MP3a is done so that convolutional layer C3a receives larger
spatial context about the crowd scene. Since the bottom-up network is pretrained, the features of the
last layers C1d and C2d resemble more like density maps. Those features are not given to the top-
down network as they do not add useful context information. The features maps of C3a are unpooled
(values are repeated) to make it same size as the corresponding features of the bottom-up network.
The resultant feature maps are again concatenated to that of the bottom-up network. These are again
operated by two convolutional layers to generate the feedback gate features. The activation function
for C3c and C3d are chosen to be sigmoid so that the values are in 0-1 range. All other convolutional
layers use ReLU non-linearity. Hence, the gate features can strongly damp spurious activations or
allow legitimate responses to pass. It has the same spatial and feature dimensions as that of the output
of the first pool layers MP1a and MP2a.
The feedback is applied by element-wise multiplication of the gate feature maps with that of the
bottom-up CNN. It provides different spatial gating for each of the feature maps of the first convo-
lutional layers C1a and C2a. So, the top-down network influences the bottom-up CNN at spatial as
well as individual feature level. Moreover, it could selectively control information flow through the
two columns of the bottom-up CNN. Depending on the scale of the crowd, one CNN column can be
given prominence over the other. The 5 × 5 CNN column which performs better for dense crowds,
needs to have more effect in the final prediction of such scenes. Note that it is not useful to apply
feedback gating directly on the input image. Doing so would mask many regions of the image and
destroy relevant context required for higher layers (see Section 2.4).
Density prediction happens in two steps as shown in Figure 2.2(b,c); first the image is passed
through the bottom-up network. Its features are used to compute the gate feedback maps. The gate
features are then element-wise multiplied with feature maps of the bottom-up network to generate
the corrected density map. The model is trained also in two stages. Initially, the bottom-up CNN is
trained and its parameters are fixed. This is followed by training of the feedback network.
2.2.3 Training of Bottom-Up CNN

The bottom-up CNN is initially trained alone to regress crowd density maps. Both columns of the
bottom-up network are individually pretrained. This ensures learning of better features and makes
later finetuning more effective. The l2 distance between the predicted density map and ground truth
is used as the loss to train the CNN regressor. If DXi (x; Θ) stands for the output of a CNN regressor
17
Count loss
computed here
Tied
Parameters
:Bottom-Up :Top-Down
CNN Layers CNN Layers
Figure 2.3: The unrolled computation graph used for training top-down network. The top-down CNN
uses features of the bottom-up CNN to apply feedback so that the bottom-up network can be re-
evaluated. Consequently, the bottom-up CNN appears twice in the computation graph and the loss
is calculated on the corrected prediction. Only parameters of the top-down CNN are updated. (Best
viewed in color.)
with parameters Θ, the l2 loss function is given by
N
1 X GT
Ll2 (Θ) = kDXi (x; Θ) − DX (x)k22 , (2.1)
2N i=1 i
GT
where N is the number of training samples. Here, ground truth density map is DX i
(x) for the input
image Xi . Standard Stochastic Gradient Descent (SGD) algorithm is applied on the parameters Θ
to optimize Ll2 . The l2 loss function implicitly captures the count error between the predicted and
ground truth count.
2.2.4 Training of Top-Down CNN

After the bottom-up network is trained, its parameters are fixed. The top-down network is trained
by backpropagating the loss incurred by the estimated density after applying feedback. The unrolled
computation graph is shown in Figure 2.3. While training, only the parameters of the top-down
18
network are updated. The parameters are updated so as to reduce the count error of the final prediction.
The top-down CNN, thus learns to use context information to gate the activations of the bottom-up
network and correct the density prediction.
The performance of any counting CNN is measured using count error. The crowd count of an
P
image Xi is computed from its density prediction DXi as CXi = x DXi (x; Θ). Let its actual count
GT
P GT
be CX i
= x DX i
(x). Then, count loss is the squared difference between predicted and true count,
N
λ X GT 2
LC (Θ) = (CXi − CX ), (2.2)
2N i=1 i
where λ is a constant multiplier to check the magnitude of the loss. We use l2 -loss between the density
maps to train the bottom-up network, since it accounts for spatial distribution of the density and helps
learn better features. But, reducing the count error between the estimated and ground truth map is our
primary objective. Hence, it is beneficial to use count loss rather than Ll2 (eq: 2.1) loss for training
top-down network. In this way, top-down CNN receives complementary information to improve the
overall performance. Experimentally, we find that using Ll2 (eq: 2.1) loss for top-down training does
not result in much improvement. Note that training top-down network with count loss, still learns
good features as the bottom-up network is already trained.
We also add l1 regularizer to the loss function to impose sparse activations for the feedback gate
features. This aids the top-down network to train effectively and allows only relevant features to be
active. So, the final loss function for the top-down CNN is,
N N
λ X GT 2 µ X
LT D (Θ) = (CXi − CX ) + GXi , (2.3)
2N i=1 i
N i=1
where the scalar GXi is the sum of all feedback gate features generated for image Xi and µ is a
regularization constant. The values of the regularizer constants are chosen empirically. In all our
experiments, these regularizers are fixed as λ = 10−2 and µ = 10−3 . The loss LT D is used to
backpropagate the top-down CNN till the validation accuracy plateaus.
2.3 Experiments
TDF-CNN is evaluated on four major crowd counting datasets. During testing, the input image passes
through the TDF-CNN to generate feedback gate features. The feedback features are then applied on
the bottom-up CNN to predict the final density map. The predicted density maps are 1/4th size of the
image because of the two pooling layers.
19
Input Image Ground truth Initial Prediction Corrected Prediction
Figure 2.4: Sample predictions of TDF-CNN on images of Part A of Shanghaitech dataset [1].
2.3.1 Shanghaitech dataset

Table 2.1 reports performance of TDF-CNN along with other models. It is seen that TDF-CNN
outperforms all other models by a significant margin both in terms of MAE and MSE. This emphasizes
effectiveness of top-down feedback in correcting density predictions. Figure 2.4 shows some density
predictions of TDF-CNN along with the initial predictions. Comparing predictions without feedback,
it is observed that many false detections are removed in the corrected maps.
Part A Part B
Method MAE MSE MAE MSE
LBP+RR 303.2 371.0 59.1 81.7
Zhang et al. [9] 181.8 277.7 32.0 49.8
MCNN [1] 110.2 173.2 26.4 41.3
TDF-CNN 97.5 145.1 20.7 32.8
Table 2.1: Comparison of TDF-CNN to other methods on Part A and Part B of Shanghaitech dataset
[1]. Our model performs better on all metrics. LBP+RR refers to a model that uses Local Binary
Pattern and Ridge Regression for estimating crowd count [1].
20
Method MAE MSE
Lempitsky et al. [56] 493.4 487.1
Idrees et al. [2] 419.5 487.1
Zhang et al. [9] 467.0 498.5
CrowdNet [22] 452.5 -
MCNN [1] 377.6 509.1
Hydra2s [21] 333.7 425.3
TDF-CNN 354.7 491.4
Table 2.2: Benchmarking of TDF-CNN on UCF CC 50 dataset [2]. TDF-CNN performs competitively
with fewer model parameters.
2.3.2 UCF CC 50 dataset

For this dataset, we use 3 × 3 filters for the last convolutional layer (Ce in Figure 2.2) of TDF-CNN.
It is evident from Table 2.2 that TDF-CNN performs better than all other models except Hydra2s
[21]. The difference between the two models in terms of MAE is 21.1. But note that, TDF-CNN
has roughly 93% ( 0.13 million parameters) less number of parameters than Hydra2s network ( 1.82
million parameters). This indicates that our model with top-down feedback performs competitively
with quite few parameters.
2.3.3 WorldExpo’10 dataset

The WorldExpo’10 dataset [9] contains crowds that are relatively sparse compared with other dataset.
There are only 50 people per image on an average. Region of interest (ROI) is provided for both
training and test scenes. In addition, the authors also provide perspective maps for all scenes. TDF-
CNN is trained and tested with ROI as done in [1, 9]. While training, error is backpropagated only for
areas in the ROI. Similarly, only ROI regions are evaluated for testing. MAE is computed for each of
the 5 test scenes and averaged.
Table 2.3 lists the performance of all methods. Despite the dataset being relatively sparse in crowd
density, TDF-CNN is able to offer better MAE in three scenes as well as in average terms. This further
underlines the need for top-down feedback in crowd counting.
21
Method Scene-1 Scene-2 Scene-3 Scene-4 Scene-5 Average MAE
LBP+RR [1] 13.6 59.8 37.1 21.8 23.4 31.0
Zhang et al. [9] 9.8 14.1 14.3 22.2 3.7 12.9
MCNN [1] 3.4 20.6 12.9 13.0 8.1 11.6
TDF-CNN 2.7 23.4 10.7 17.6 3.3 11.5
Table 2.3: MAEs computed for five test scenes of WorldExpo’10 dataset [9]. Our top-down feedback
model has better MAE for three scenes and delivers lower average MAE.
2.4 Ablations and Analysis
2.4.1 Effectiveness of Feedback

In this section, the effectiveness of top-down feedback in crowd counting is demonstrated with ab-
lations. First, we study the performance improvement offered by top-down feedback. To that end,
prediction accuracy of the bottom-up CNN is compared to that with top-down feedback in Table 2.4.
The bottom-up CNN trained on Part A of the Shanghaitech dataset, gives an MAE of 147.4. Interest-
ingly, adding top-down feedback to the bottom-up network decreases the MAE to 97.5. Thus, in this
case, top-down feedback is able to correct the counting error by a significant factor.
As described earlier (Section 2.2.4), the top-down CNN is trained with count loss LC (eq: 2.2).
This ensures that the top-down network learns complementary information to correct the prediction
of the bottom-up CNN. For fair comparison, we train the bottom-up CNN with count loss. The
pretrained CNN columns in Figure 2.2(a) are fine-tuned with the final 1 × 1 filter using count loss.
Again, from Table 2.4, this network performs inferior to the TDF-CNN. This clearly indicates that the
performance gain observed for TDF-CNN is due to the feedback mechanism and not because of using
count loss alone. Note that, fine-tuning with count loss distorts to some extent the spatial quality of the
predicted density maps, which does not happen with TDF-CNN. Further, to make parameters roughly
the same as TDF-CNN, we train MCNN [1] with count loss. MCNN has three shallow CNN columns
which are pretrained. Their outputs are fused with a 1 × 1 filter and fine-tuned by back-propagating
the count loss. Still, TDF-CNN has lower MAE.
The bottom-up CNN that we use, has two CNN columns with different field-of-view. Such a
design allows the top-down network to not only gate features of the individual column, but also be
selective about the CNN columns themselves. We show in Table 2.4 that the top-down feedback
results in considerable performance gain even without a multi-column architecture for the bottom-
up network. For this experiment, the CNN column with initial filter size 9 × 9 alone is used as the
bottom-up network. The top-down network remains same as in Figure 2.2(b), except for layer C3d,
which is removed. The MAE for this network with top-down feedback is 21.4% less than that without
22
Method MAE
Bottom-up CNN 147.4
Bottom-up CNN fine-tuned with count loss LC 116.2
MCNN [1] fine-tuned with count loss LC 116.7
Only 9 × 9 CNN column 158.5
TDF-CNN with only 9 × 9 column 124.6
TDF-CNN with 9 × 9 & 5 × 5 columns 97.5
Table 2.4: Comparison of models with and without top-down feedback. The lower MAE delivered by
models with feedback is indicative of its effectiveness.
feedback. This means that the top-down framework is generic and can be applied on a variety of
bottom-up networks.
In order to shed light on how the gate feedback features work, we display some of the gate feed-
back features in Figure 2.5. The images show that the gate feature maps indeed act as masks. Majority
of the values of the gates are near zero. This is like selective gating, where only relevant information
is allowed to pass. Note that there are many gate feature maps, only some are shown. They could be
complementary also, i.e. a region blocked in one feature map may be open in another. Activations
of spatial regions are not fully suppressed in all feature maps. Instead, they weigh differently across
feature maps to supply relevant detections to higher layers. This is precisely the reason why we do not
apply the feedback directly on the input. Gating on the input blanks out many regions of the image
that might provide crucial context cues for higher layers. Also, it is interesting that some gate maps
offer rough localisations of the crowd. This demonstrates that the top-down network actually learns
to disambiguate people in the scene to correct the bottom-up prediction.
Figure 2.5: Some of the feedback gate maps for the input image shown in first column. Red to blue
colour scale maps to 0-1 range.
23
2.5 Conclusion
Typical crowd counting CNNs are trained to look for crowd patterns, instead of accounting every
person and count. It is difficult to learn patterns that consistently works across diverse crowd sce-
narios. Consequently, many crowd-like patterns are detected as people in dense crowds leading to
massive false predictions. In this chapter, we propose top-down feedback, which carries high-level
scene context to correct spurious detections. Our architecture consists of a bottom-up CNN, which has
connections to another top-down CNN. The top-down CNN generates feedback in the form of gating
to the lower level activations of the bottom-up CNN. This selectively passes legitimate activations and
damps spurious responses. We show that such a feedback model delivers better or competitive results
on all major crowd counting datasets. Furthermore, we demonstrate the effectiveness of top-down
feedback with ablations.
However, the proposed framework is not enough to scale across the full diversity spectrum. The
multi-column networks along with the feedback mechanism could jointly handle the Appearance
Variety and Scale Diversity (Section 1.1) to certain extent. But the scale columns are trained together
and specialization is not explicitly provided. More performance improvement can be obtained by
specializing the columns in the network for appropriate crowd categories as shown in the next chapter.
24
Chapter 3
Switching CNN to capture Crowd Diversity
ROWD analysis via density regression has been tackled in computer vision by a myriad of tech-
C niques. From early HOG based head detections [2] to CNN regressors [1, 9, 21], density estima-
tion has witnessed drastic changes in approaches. CNN based regressors have largely outperformed
traditional crowd counting approaches on the back of improved feature representations. Especially,
multi-column architectures [1, 21] are shown to mitigate the Appearance Variety and Scale Diversity
issues (Section 1.1) to some degree. We build on the multi-column architectures for crowd count-
ing and propose Switching Convolutional Neural Network (Switch-CNN) to aggressively solve the
Diversity challenge in density regression and improve performance.
Switch-CNN leverages the variation of crowd density within an image to improve the quality and
localization of the predicted crowd count. Independent CNN crowd density regressors are trained on
patches sampled from a grid in a given crowd scene. The independent regressors are chosen such that
they have different receptive fields and field of view. This ensures that the features learned by each
CNN regressor are adapted to a particular scale. This renders Switch-CNN robust to large scale and
perspective variations of people observed in a typical crowd scene. A particular CNN regressor is
trained on a crowd scene patch if the performance of the regressor on the patch is the best. A switch
classifier is trained alternately with the training of multiple CNN regressors to correctly relay a patch
to a particular regressor. The joint training of the switch and regressors helps augment the ability of
the switch to learn the complex multichotomy of space of crowd scenes acquired in the differential
training stage.
While switching architectures have not been used for counting, expert classifiers is used in [57] to
improve single object image classification across depiction styles using a deep switching mechanism.
However, unlike [57], we do not have labels (for eg: depiction styles like ”art” and ”photo”) to train
the switch classifier. To overcome this challenge, we propose a training regime that exploits regressor
performance statistics to generate the label (see Section 3.1.1).
25
To summarize, this chapter presents:
• A novel CNN architecture, Switch-CNN, trained to predict crowd density for a given crowd
scene.
• Switch-CNN maps crowd patches from a crowd scene to independent CNN regressors to min-
imize count error and improve density localization exploiting the density variation within a
scene.
• We evidence state-of-the-art performance (at the time of publication) on all major crowd count-
ing datasets including ShanghaiTech dataset [1], UCF CC 50 dataset [2] and WorldExpo’10
dataset [9].
3.1 Our Approach

In this work, we consider switching CNN architecture (Switch-CNN) that relays patches from a grid
within a crowd scene to independent CNN regressors based on a switch classifier. The independent
CNN regressors are chosen with different receptive fields and field-of-view as in multi-column CNN
networks to augment the ability to model large scale variations. A particular CNN regressor is trained
on a crowd scene patch if the performance of the regressor on the patch is the best. A switch classifier
is trained alternately with the training of multiple CNN regressors to correctly relay a patch to a
particular regressor. The salient properties that make this model excellent for crowd analysis are (1)
the ability to model large scale variations (2) the facility to leverage local variations in density within a
crowd scene. The ability to leverage local variations in density is important as the weighted averaging
technique used in multi-column networks to fuse the features is global in nature.
3.1.1 Switch-CNN
Our proposed architecture, Switch-CNN consists of three CNN regressors with different architectures
and a classifier (switch) to select the optimal regressor for an input crowd scene patch. Figure 3.1
shows the overall architecture of Switch-CNN. The input image is divided into nine non-overlapping
patches such that each patch is 13 rd of the image. Crowd characteristics like density, appearance etc.
can be assumed to be consistent in a given patch for a crowd scene. Feeding patches as input to the
network helps in regressing different regions of the image independently by a CNN regressor most
suited to patch attributes like density, background, scale and perspective variations of crowd in the
patch.
26
Figure 3.1: Architecture of the proposed model, Switch-CNN is shown. A patch from the crowd
scene is highlighted in red. This patch is relayed to one of the three CNN regressor networks based on
the CNN label inferred from Switch. The highlighted patch is relayed to regressor R3 which predicts
the corresponding crowd density map. The element-wise sum over the entire density map gives the
crowd count of the crowd scene patch.
We use three CNN regressors introduced in [1], R1 through R3 , in Switch-CNN to predict the
density of crowd. These CNN regressors have varying receptive fields that can capture people at
different scales. The architecture of each of the shallow CNN regressor is similar: four convolutional
layers with two pooling layers. R1 has a large initial filter size of 9×9 which can capture high level
abstractions within the scene like faces, urban facade etc. R2 and R3 with initial filter sizes 7×7 and
5×5 capture crowds at lower scales detecting blob like abstractions.
Patches are relayed to a regressor using a switch. The switch consists of a switch classifier and
a switch layer. The switch classifier infers the label of the regressor to which the patch is to be
relayed to. A switch layer takes the label inferred from the switch classifier and relays it to the correct
27
regressor. For example, in Figure 3.1, the switch classifier relays the patch highlighted in red to
regressor R3 . This patch has a very high crowd density. Switch relays it to regressor R3 which has
smaller receptive field: ideal for detecting blob like patterns that are characteristic of patches with
high crowd density. We use an adaptation of VGG16 [7] network as the switch classifier to perform
3-way classification. The fully-connected layers in VGG16 are removed. We use global average pool
(GAP) on Conv5 features to remove the spatial information and aggregate discriminative features.
GAP is followed by a smaller fully connected layer and 3-class softmax classifier corresponding to
the three regressor networks in Switch-CNN.
Training of Switch-CNN is done in three stages, namely pretraining, differential training and
coupled training as described in the following sections.
3.1.2 Pretraining
The three CNN regressors R1 through R3 are pretrained separately to regress density maps. Pretrain-
ing helps in learning good initial features which improves later fine-tuning stages. Individual CNN
regressors are trained to minimize the Euclidean distance between the estimated density map and
ground truth. Let DXi (·; Θ) represent the output of a CNN regressor with parameters Θ for an input
image Xi . The l2 loss function is given by
N
1 X GT
Ll2 (Θ) = kDXi (·; Θ) − DX (·)k22 , (3.1)
2N i=1 i
GT
where N is the number of training samples and DX i
(·) indicates ground truth density map for image
Xi . The loss Ll2 is optimized by backpropagating the CNN via stochastic gradient descent (SGD).
Here, l2 loss function acts as a proxy for count error between the regressor estimated count and true
count. It indirectly minimizes count error. The regressors Rk are pretrained until the validation
accuracy plateaus.
3.1.3 Differential Training

CNN regressors R1−3 are pretrained with the entire training data. The count prediction performance
varies due to the inherent difference in initialization and network structure of R1−3 like receptive
field and effective field-of-view. Though we optimize the l2 -loss between the estimated and ground
truth density maps for training CNN regressor, factoring in count error during training leads to better
crowd counting performance. Hence, we measure CNN performance using count error. Let the count
estimated by kth regressor for ith image be Cik = x DXi (x; Θk ). Let the reference count inferred
P
28
Algorithm 1 Switch-CNN training algorithm is shown. The training algorithm is divided into stages
coded by color. Color code index: Differential Training, Coupled Training, Switch Training
input : N training image patches {Xi }N GT N
i=1 with ground truth density maps {DXi }i=1
output: Trained parameters {Θk }3k=1 for Rk and Θsw for the switch
Initialize Θk ∀ k with random Gaussian weights
Pretrain {Rk }3k=1 for Tp epochs : Rk ← fk (·; Θk ) ;
/*Differential Training for Td epochs*/

/*Cik is count predicted by Rk for input Xi */
/*CiGT is ground truth count for input Xi */
for t = 1 to Td do
for i = 1 to N do
libest = argmin|Cik − CiGT | Backpropagate Rlibest and update Θlibest
k
end
end
/*Coupled Training for Tc epochs*/
Initialize Θsw with VGG-16 weights for t = 1 to Tc do
/*generate labels for training switch*/
for i = 1 to N do
libest = argmin|Cik − CiGT |
k
end
Strain = {(Xi , libest ) | i ∈ [1, N ]}
/*Training switch for 1 epoch*/
Train switch with Strain and update Θsw
/*Switched Differential Training*/
for i = 1 to N do
/*Infer choice of Rk from switch*/
lisw = argmax fswitch (Xi ; Θsw ) Backpropagate Rliswitch and update Θlisw
end
end
from ground truth be CiGT = GT

P
x DXi
(x). Then count error for ith sample evaluated by Rk is
ECi (k) = |Cik − CiGT |, (3.2)
the absolute count difference between prediction and true count. Patches with particular crowd at-
tributes give lower count error on a regressor having complementary network structure. For example,
29
a CNN regressor with large receptive field capture high-level abstractions like background elements
and faces. To amplify the network differences, differential training is proposed (shown in blue in Al-
gorithm 1). The key idea in differential training is to backpropagate the regressor Rk with minimum
count error for a given training crowd scene patch. For every training patch i, we choose the regressor
libest such that ECi (libest ) is lowest across all regressors R1−3 . This amounts to greedily choosing the
regressor that predicts the most accurate count amongst k regressors. Formally, we define the label of
chosen regressor libest as:
libest = argmin|Cik − CiGT | (3.3)
k
The count error for ith sample is

ECi = min|Cik − CiGT |. (3.4)
k
This training regime encourages a regressor Rk to prefer a particular set of the training data patches
having a certain attribute so as to minimize the loss. While the backpropagation of independent
regressor Rk is still done with l2 -loss, the choice of CNN regressor for backpropagation is based on
the count error. Differential training indirectly minimizes the mean absolute count error (MAE) over
the training images. For N images, MAE in this case is given by
N
1 X
EC = min|Cik − CiGT |, (3.5)
N i=1 k
which can be thought as the minimum count error achievable if each sample is relayed correctly to
the right CNN. However during testing, achieving this full accuracy may not be possible as the switch
classifier is not ideal. To summarize, differential training generates three disjoint groups of training
patches and each network is finetuned on its own group. The regressors Rk are differentially trained
until the validation accuracy plateaus.
3.1.4 Switch Training

Once the multichotomy of space of patches is inferred via differential training, a patch classifier
(switch) is trained to relay a patch to the correct regressor Rk . The manifold that separates the space
of crowd scene patches is complex and hence a deep classifier is required to infer the group of patches
in the multichotomy. We use VGG16 [7] network as the switch classifier to perform 3-way classifica-
tion. The classifier is trained on the labels of multichotomy generated from differential training. The
number of training patches in each group can be highly skewed, with the majority of patches being
relayed to a single regressor depending on the attributes of crowd scene. To alleviate class imbalance
during switch classifier training, the labels collected from the differential training are equalized so
30
that the number of samples in each group is the same. This is done by randomly sampling from the
smaller group to balance the training set of switch classifier.
3.1.5 Coupled Training

Differential training on the CNN regressors R1 through R3 generates a multichotomy that minimizes
the predicted count by choosing the best regressor for a given crowd scene patch. However, the trained
switch is not ideal and the manifold separating the space of patches is complex to learn. To mitigate
the effect of switch inaccuracy and inherent complexity of task, we co-adapt the patch classifier and
the CNN regressors by training the switch and regressors in an alternating fashion. We refer to this
stage of training as Coupled training (shown in green in Algorithm 1).
The switch classifier is first trained with labels from the multichotomy inferred in differential
training for one epoch (shown in red in Algorithm 1). In the next stage, the three CNN regressors
are made to co-adapt with switch classifier (shown in blue in Algorithm 1). We refer to this stage of
training that enforces co-adaption of switch and regressor R1−3 as Switched differential training.
In switched differential training, the individual CNN regressors are trained using crowd scene
patches relayed by switch for one epoch. For a given training crowd scene patch Xi , switch is forward
propagated on Xi to infer the choice of regressor Rk . The switch layer then relays Xi to the particular
regressor and backpropagates Rk using the loss defined in Equation 3.1 and θk is updated. This
training regime is executed for an epoch.
In the next epoch, the labels for training the switch classifier are recomputed using criterion in
Equation 3.3 and the switch is again trained as described above. This process of alternating switch
training and switched training of CNN regressors is repeated every epoch until the validation accuracy
plateaus.
3.2 Experiments
We evaluate the performance of our proposed architecture, Switch-CNN, on four major crowd count-
ing datasets. At test time, the image patches are fed to the switch classifier which relays the patch to
the best CNN regressor Rk . The selected CNN regressor predicts a crowd density map for the relayed
crowd scene patch. The generated density maps are assembled into an image to get the final density
map for the entire scene. Because of the two pooling layers in the CNN regressors, the predicted
density maps are 14 th size of the input.
31
Figure 3.2: Sample predictions by Switch-CNN for crowd scenes from the ShanghaiTech dataset [1]
is shown. The top and bottom rows depict a crowd image, corresponding ground truth and prediction
from Part A and Part B of dataset respectively.
3.2.1 ShanghaiTech dataset

We perform extensive experiments on the ShanghaiTech crowd counting dataset [1]. We train Switch-
CNN as elucidated by Algorithm 1 on both parts of the dataset. Ground truth is generated using
geometry-adaptive kernels method. With an ideal switch (100% switching accuracy), Switch-CNN
performs with an MAE of 51.4. However, the accuracy of the switch is 73.2% in Part A and 76.3%
in Part B of the dataset resulting in a lower MAE.
Table 3.1 shows that Switch-CNN outperforms all other state-of-the art methods by a significant
margin on both the MAE and MSE metric. Switch-CNN shows a 19.8 point improvement in MAE
on Part A and 4.8 point improvement in Part B of the dataset over MCNN [1]. Switch-CNN also
outperforms all other models on MSE metric indicating that the predictions have a lower variance
than MCNN across the dataset. This is an indicator of the robustness of crowd count predictions by
Switch-CNN.
We show sample predictions of Switch-CNN for sample test scenes from the ShanghaiTech dataset
along with the ground truth in Figure 3.2. The predicted density maps closely follow the crowd
distribution visually. This indicates that Switch-CNN is able to localize the spatial distribution of
crowd within a scene accurately.
32
Part A Part B
LBP+RR [1] 303.2 371.0 59.1 81.7
Zhang et al. [9] 181.8 277.7 32.0 49.8
MCNN [1] 110.2 173.2 26.4 41.3
TDF-CNN (Chapter 2) 97.5 145.1 20.7 32.8
Switch-CNN 90.4 135.0 21.6 33.4
Table 3.1: Comparison of Switch-CNN with other state-of-the-art crowd counting methods on Shang-
haiTech dataset [1].

UCF CC 50 [2] is a 50 image collection of annotated crowd scenes. In Table 3.2, we compare the
performance of Switch-CNN with other methods using MAE and MSE as metrics. Switch-CNN
outperforms all other methods and evidences a 15.7 point improvement in MAE over Hydra2s [21].
Switch-CNN also gets a competitive MSE score compared to Hydra2s indicating the robustness of
the predicted count. The accuracy of the switch is 54.3%. The switch accuracy is relatively low as the
dataset has very few training examples and a large variation in crowd density. This limits the ability
of the switch to learn the multichotomy of space of crowd scene patches.
Method MAE MSE

Lempitsky et al.[56] 493.4 487.1
Idrees et al.[2] 419.5 487.1
Zhang et al. [9] 467.0 498.5
CrowdNet [22] 452.5 –
MCNN [1] 377.6 509.1
Hydra2s [21] 333.73 425.26
TDF-CNN (Chapter 2) 354.7 491.4
Switch-CNN 318.1 439.2
Table 3.2: Comparison of Switch-CNN with other state-of-the-art crowd counting methods on
UCF CC 50 dataset [2].
3.2.3 The UCSD dataset

The UCSD dataset crowd counting dataset consists of 2000 frames from a single scene. The scenes
are characterized by sparse crowd with the number of people ranging from 11 to 46 per frame. A
region of interest (ROI) is provided for the scene in the dataset. We use the train-test splits used by
33
Method MAE MSE
Kernel Ridge Regression [58] 2.16 7.45
Cumulative Attribute Regression [59] 2.07 6.86
Zhang et al. [9] 1.60 3.31
MCNN [1] 1.07 1.35
CCNN [21] 1.51 –
Switch-CNN 1.62 2.10
Table 3.3: Comparison of Switch-CNN with other state-of-the-art crowd counting methods on UCSD
crowd-counting dataset [10].
[10]. Of the 2000 frames, frames 601 through 1400 are used for training while the remaining frames
are held out for testing. Following the setting used in [1], we prune the feature maps of the last layer
with the ROI provided. Hence, error is backpropagated during training for areas inside the ROI. We
use a fixed spread Gaussian to generate ground truth density maps for training Switch-CNN as the
crowd is relatively sparse. At test time, MAE is computed only for the specified ROI in test images
for benchmarking Switch-CNN against other approaches.
Table 3.3 reports the MAE and MSE results for Switch-CNN and other state-of-the-art approaches.
Switch-CNN performs competitively compared to other approaches with an MAE of 1.62. The switch
accuracy in relaying the patches to regressors R1 through R3 is 60.9%. However, the dataset is
characterized by low variability of crowd density set in a single scene. This limits the performance
gain achieved by Switch-CNN from leveraging intra-scene crowd density variation.
3.2.4 The WorldExpo’10 dataset

In addition to head annotations, perspective maps are provided for all scenes in the WorldExpo’10
dataset [9]. The maps specify the number of pixels in the image that cover one square meter at every
location in the frame. These maps are used by [1, 9] to adaptively choose the spread of the Gaussian
while generating ground truth density maps. We evaluate performance of the Switch-CNN using
ground truth generated with and without perspective maps.
We prune the feature maps of the last layer with the ROI provided. Hence, error is backpropagated
during training for areas inside the ROI. Similarly at test time, MAE is computed only for the specified
ROI in test images for benchmarking Switch-CNN against other approaches.
MAE is computed separately for each test scene and averaged to determine the overall perfor-
mance of Switch-CNN across test scenes. Table 3.4 shows that the average MAE of Switch-CNN
across scenes is better by a margin of 2.2 point over the performance obtained by the state-of-the-art
approach MCNN [1]. The switch accuracy is 52.72%.
34
Method Scene1 Scene2 Scene3 Scene4 Scene5 Avg.
MAE
LBP+RR [1] 13.6 59.8 37.1 21.8 23.4 31.0
Zhang et al. [9] 9.8 14.1 14.3 22.2 3.7 12.9
MCNN [1] 3.4 20.6 12.9 13.0 8.1 11.6
TDF-CNN (Chapter 2) 2.7 23.4 10.7 17.6 3.3 11.5
Switch-CNN 4.2 14.9 14.2 18.7 4.3 11.2
(GT with perspective map)
Switch-CNN 4.4 15.7 10.0 11.0 5.9 9.4
(GT without perspective)
Table 3.4: Comparison of Switch-CNN with other state-of-the-art crowd counting methods on World-
Expo’10 dataset [9]. Mean Absolute Error (MAE) for individual test scenes and average performance
across scenes is shown.
3.3.1 Effect of number of regressors on Switch-CNN

Differential training makes use of the structural variations across the individual regressors to learn a
multichotomy of the training data. To investigate the effect of structural variations of the regressors
R1 through R3 , we train Switch-CNN with combinations of regressors (R1 ,R2 ), (R2 ,R3 ), (R1 ,R3 ) and
(R1 ,R2 ,R3 ) on Part A of ShanghaiTech dataset. Table 3.5 shows the MAE performance of Switch-
CNN for different combinations of regressors Rk . Switch-CNN with CNN regressors R1 and R3
has lower MAE than Switch-CNN with regressors R1 –R2 and R2 –R3 . This can be attributed to
the former model having a higher switching accuracy than the latter. Switch-CNN with all three
regressors outperforms both the models as it is able to model the scale and perspective variations
Method MAE
R1 157.61
R2 178.82
R3 178.10
Switch-CNN with (R1 ,R3 ) 98.87
Switch-CNN with (R1 ,R2 ,R3 ) 90.41
Table 3.5: Comparison of MAE for Switch-CNN variants and CNN regressors R1 through R3 on
Part A of the ShanghaiTech dataset [1].
35
better with three independent CNN regressors R1 , R2 and R3 that are structurally distinct. Switch-
CNN leverages multiple independent CNN regressors with different receptive fields. In Table 3.5,
we also compare the performance of individual CNN regressors with Switch-CNN. Here each of the
individual regressors are trained on the full training data from Part A of Shanghaitech dataset. The
higher MAE of the individual CNN regressor is attributed to the inability of a single regressor to
model the scale and perspective variations in the crowd scene.
3.3.2 Specialty Characteristics

The principal idea of Switch-CNN is to divide the training patches into disjoint groups to train in-
dividual CNN regressors so that overall count accuracy is maximized. This multichotomy in space
of crowd scene patches is created automatically through differential training. We examine the un-
derlying structure of the patches to understand the correlation between the learnt multichotomy and
attributes of the patch like crowd count, density etc. However, the unavailability of perspective maps
renders computation of actual density intractable. We believe inter-head distance between people is
a candidate measure of crowd density. In a highly dense crowd, the separation between people is
low and hence density is high. On the other hand, for low density scenes, people are far away and
R1:9x9
100 R2:7x7
R3:5x5
80
No of Patches
60
40
20
0
0 50 100 150 200 250 No People
Mean inter-head distance per patch
Figure 3.3: Histogram of average inter-head distance for crowd scene patches from Part A test set of
ShanghaiTech dataset [1] is shown. We see that the multichotomy of space of crowd scene patches
inferred from the switch separates patches based on latent factors correlated with crowd density.
36
Figure 3.4: Sample crowd scene patches from Part A test set of ShanghaiTech dataset [1] are shown.
We see that the density of crowd in the patches increases from CNN regressor R1 –R3 .
mean inter-head distance is large. Thus mean inter-head distance is a proxy for crowd density. This
measure of density is robust to scale variations as the inter-head distance naturally subsumes the scale
variations.
To analyze the multichotomy in space of patches, we compute the average inter-head distance of
each patch in Part A of ShanghaiTech test set. For each head annotation, the average distance to its
10 nearest neighbors is calculated. These distances are averaged over the entire patch representing
the density of the patch. We plot a histogram of these distances in Figure 3.3 and group the patches
by color on the basis of the regressor Rk used to infer the count of the patch. A separation of patch
space based on crowd density is observed in Figure 3.3. R1 , which has the largest receptive field of
9×9, evaluates patches of low crowd density (corresponding to large mean inter-head distance). An
interesting observation is that patches from the crowd scene that have no people in them (patches in
Figure 3.3 with zero average inter-head distance) are relayed to R1 by the switch. We believe that the
patches with no people are relayed to R1 as it has a large receptive field that helps capture background
attributes in such patches like urban facade and foliage. Figure 3.4 displays some sample patches
that are relayed to each of the CNN regressors R1 through R3 . The density of crowd in the patches
increases from CNN regressor R1 through R3 .
37
Method MAE
Cluster by count 99.56
Cluster by mean inter-head distance 94.93
Mixture of Experts 111.6
Switch-CNN 90.41
Table 3.6: Comparison of MAE for Switch-CNN and manual clustering of patches based on patch
attributes on Part A of the ShanghaiTech dataset [1].
3.3.3 Attribute Clustering Vs Differential Training

We saw in Section 3.3.2 that differential training approximately divides training set patches into a
multichotomy based on density. We investigate the effect of manually clustering the patches based on
patch attribute like crowd count or density. We use patch count as metric to cluster patches. Training
patches are divided into three groups based on the patch count such that the total number of training
patches are equally distributed amongst the three CNN regressors R1−3 . R1 , having a large receptive
field, is trained on patches with low crowd count. R2 is trained on medium count patches while high
count patches are relayed to R3 . The training procedure for this experiment is identical to Switch-
CNN, except for the differential training stage. We repeat this experiment with average inter-head
distance of the patches as a metric for grouping the patches. Patches with high mean inter-head
distance are relayed to R1 . R2 is relayed patches with low inter-head distance by the switch while the
remaining patches are relayed to R3 .
Table 3.6 reports MAE performance for the two clustering methods. Both crowd count and av-
erage inter-head distance based clustering give a higher MAE than Switch-CNN. Average inter-head
distance based clustering performs comparable to Switch-CNN. This evidence reinforces the fact that
Switch-CNN learns a multichotomy in the space of patches that is highly correlated with mean inter-
head distance of the crowd scene. The differential training regime employed by Switch-CNN is able
to infer this grouping automatically, independent of the dataset.
Alternatively, we also compare our framework with the standard mixture of experts (MoE) ap-
proach [60]. MoE leverages a gating network to generate the weights which are used to fuse the
outputs of the three regressors. We use the same setting as that of SCNN and VGG -16 classifier is
employed as gating CNN to output softmax confidences. The regressors are initialized with pretrain-
ing and their outputs are multiplied by the classifier confidences before summing together to get the
final prediction. Clearly, SCNN is able to bring significantly more specialization among the regressors
and achieves better MAE than MoE approach.
38
3.4 Conclusion
In this chapter, we propose switching convolutional neural network that leverages intra-image crowd
density variation to improve the accuracy and localization of the predicted crowd count. We utilize
the inherent structural and functional differences in multiple CNN regressors capable of tackling large
scale and perspective variations by enforcing a differential training regime. Experiments on multiple
datasets show that our model exhibits superior performance on major datasets. Further, we show that
our model learns to group crowd patches based on latent factors correlated with crowd density.
Though SCNN framework addresses the Appearance Variety and Scale Diversity (Section 1.1)
significantly, still there exists performance gap due to the inability to scale across the full diversity
spectrum. Moreover, the proposed method is limited to three regressors with different architecture,
which needs to be extended. In short, a more generic scheme is required and one solution is being
described in the following chapter.
39
Chapter 4
Incrementally Growing CNN to adapt with

larger Crowd Varieties
YPICAL CNN density regressors [1, 21, 22] are optimized over an entire dataset containing im-
T ages of all densities. And in many datasets, the density of crowd is not uniform. For example,
in Part A of Shanghaitech dataset [1], the distribution is skewed with less number of dense crowd im-
ages than sparse ones. Consequently, performance of models varies widely across different categories
of crowds. This usually results in over-estimating count for sparse images and under predicting for
dense images as shown in [25].
One obvious solution to address this wide variability is having multiple regressors, each special-
ized for a particular type of crowd. This is illustrated in Figure 4.1 with predictions made by regressors
INPUT IMAGE GROUND TRUTH SPARSE EXPERT DENSE EXPERT
Figure 4.1: Predictions of a typical regressor fine-tuned for sparse or dense crowds. Models perform
better on their own specialties.
40
fine-tuned for sparse and dense crowds. The experts perform well in their own specialties but worse
in others. The major difficulty in such approaches is determining a criteria for creating experts. For
Figure 4.1, we choose division based on density, but it can be based on other characteristics too. Even
if the criteria is chosen, what would be the basis of division (how many people make a crowd dense
or sparse)? These metrics are dataset as well as model dependent and need to be manually specified.
In crowd counting, till now no principled method has been proposed to do this. Moreover, improper
divisions can lead to suboptimal solutions. Learning experts automatically with classical mixture of
expert [60] models do not work well in this scenario as shown by some works like [61].
The aim of this work is to introduce a principled methodology for creating experts, without any
handcrafted dataset dependent criteria for specialization. Hence, we propose an Incrementally Grow-
ing CNN (IG-CNN) for crowd counting. IG-CNN starts from a base CNN density regressor which
is trained on the entire dataset. Then we replicate the base CNN into two child networks by copy-
ing the weights of the parent. To make these child regressors specialized, we do differential training
(Chapter 3), where clustering of the dataset is done jointly with fine-tuning of the child networks
(Section 4.2.4). In the next growing step, we replicate each of the child regressors again into two
networks and perform differential training. This procedure is done recursively forming a hierarchical
CNN tree where each node has two child nodes which are more specialized than their parent. At the
end of the growing, a set of experts are created at the leaf nodes of the CNN tree. At test time, a
classifier routes the input image patches to the appropriate expert regressors.
In a nutshell, this chapter introduces the following:
• A hierarchical clustering method that jointly creates image clusters and a set of expert neural
networks specialized on their respective clusters.
• A crowd counting system that can adapt and grow based on the complexity of the dataset.
4.1 Related Works

Since existing multi-column approaches [1, 21, 22] train the entire model together, specialization
gained among the columns or component networks need not be drastic. Chapter 3 partly address this
issue by performing a differential training procedure to accentuate the specialization between CNN
columns of varied architecture. But the model is limited by the availability of regressors with differ-
ent architectures. In contrast, our method requires only one base regressor. More specific standard
mixture of experts [60] based model is used in [61] for direct count regression, but performs worse
than the hard switching mechanism of Chapter 3.
41
Growing Networks: The concept of a neural network that incrementally enlarges its capacity
while learning has been there for a while. Several such Growing Neural Network models have been
proposed in the literature [62, 63] for supervised as well as unsupervised learning. Few works like
[64] grow a CNN progressively by adding new neurons in a data-dependent fashion. In the domain of
transfer learning, [65] analyse different approaches for developmental networks which can increase
its model capacity as and when new tasks are given. They explore adding new layers along the depth
or width of the neural network.
Specialization based Methods: Expert specialization approaches like [66], increase classifica-
tion performance by imposing coarse and fine class hierarchy onto a deep CNN. But this method
requires coarse labels which is not required by method introduced in [67]. In a generalist-specialist
setting, Ahmed et al. [67] jointly train specialty branches along with a generalist CNN which can
classify the specialties. Specialty groups are formed such that they can be easily discriminated by
the classifier. The algorithm proposed in [68] can learn a CNN tree where the nodes down the tree
capture progressively fine-grained features. Our model also hierarchically grows a CNN tree, but it
is employed only as a method to create a set of experts without any manually specified specialization
criteria. In contrast to works like [68], the hierarchy is discarded in IG-CNN after training and only
the networks at the leaf nodes of the CNN tree are kept. These leaf networks are finer experts and are
selectively used at test time to evaluate inputs corresponding to their specialties.
4.2 Our Approach
4.2.1 Creating Experts with Hierarchical Differential Training

As motivated in Section 1.1, counting models have to handle severe diversity in the way people appear
in crowds. We try to mitigate this issue with a set of expert regressors, each of which are specialized
on one particular subset of the training data. Many previous works leveraging such specialization
methods require the specialty information to be given either in the form of priors [24, 25] or indirectly
as regressors with different architectures (Chapter 3). In this scenario, naive mixture of experts based
methods are shown to perform subpar [61]. Hence, we design a model which does not require any
specialty criteria to be manually specified for training experts and yet achieves better count estimates.
Our incrementally growing CNN or IG-CNN architecture is shown in Figure 4.2. IG-CNN training
begins with a base CNN regressor, which is trained on the full dataset to estimate crowd density. To
create specialties, a hierarchical training procedure is employed. Let R0 represent the base CNN and
N
D0 = {Xi=0 } denotes the dataset of N images on which the base regressor R0 is trained. Initially,
the base CNN is replicated into two networks R00 and R01 . Now with differential training, R00 and
42
Base
CNN R0
D0
Level 1
Differential
Training
R00 R01
D00 D01
Level 2
R000 R001 R010 R011

D000 D001 D010 D011
Figure 4.2: Hierarchical Differential Training in IG-CNN. Regressors are recursively replicated and
specialized forming a CNN tree.
R01 need to be made experts in separate specialties. The differential training procedure divides the
dataset into two and fine-tunes the two regressors on their respective clusters. For a given image patch,
only the regressor with the best count accuracy is trained. This way the oracle loss [69] of the set of
regressors is minimized. Oracle loss is the minimum loss achievable if the correct regressor is chosen
for all samples (see Section 4.2.4). We use a modified version of the differential training introduced
in [70]. Our algorithm does not require regressors with different architectures as in [70] and also uses
count loss function for fine-tuning.
After the first level of training, we have two expert regressors R00 and R01 along with their cor-
responding specialty subsets D00 and D01 . Each of the networks R00 and R01 , is replicated again
to form respective child networks in the second iteration of growing. Regressors R000 and R001 will
have same weights as of R00 while R01 is copied to R010 and R011 . Differential training is performed
on R000 and R001 with dataset D00 only. Similarly, D01 is used for fine-tuning R010 and R011 . This
43
2-CONV
3x3 | 64
MAX-POOL
2x2
2-CONV
3x3 | 128
MAX-POOL
EXPERT 2x2
3-CONV
CLASSIFIER 3x3 | 256
MAX-POOL
2x2
“R1”
3-CONV
3x3 | 512
MAX-POOL
2x2
ROUTE 3-CONV
LAYER ... 3x3 | 512
GAP
FC
512
FC
n
SOFTMAX
R1 R2 R3
... Rn
1-CONV
9x9 | 16
MAX-POOL
2x2
1-CONV
7x7 | 32
MAX-POOL
2x2
DENSITY 1-CONV
7x7 | 16
MAP
1-CONV
7x7 | 8
Σ 1-CONV
1x1 | 1
ESTIMATED COUNT
Figure 4.3: Test time architecture of IG-CNN. The expert classifier routes crowd patches to the ap-
propriate specialized regressor.
makes sure that specialization acquired by the parent is propagated to its child networks. The two
way replicating and specialization process is recursively continued forming a CNN tree. A child node
in the tree is more specialized than any of their parent network. More deeper the tree, more finer
the specialties with leaf nodes being the finest experts. Section 4.2.4 elucidates training algorithm in
detail.
4.2.2 Growing CNN Architecture

The hierarchical differential training procedure results in the creation of a set of regressors at the leaf
nodes of the CNN tree. All the regressors have the same architecture but give better performance on
44
their specialties. Additionally, a classifier is trained for selecting the right expert for a given scene
patch. Figure 4.3 shows the test time architecture of IG-CNN.
A neural network with five convolutional layers is used as the base CNN. Because of two pooling
layers, the density prediction is at 41 th scale of the input image. All convolutional layers use ReLU
activation function. This simple regressor is introduced in [1] and delivers reasonable performance.
But it is to be noted that our training methodology is generic and is not limited by any particular base
CNN. For expert classifier, we use a modified VGG-16 [7] network. Features of the last convolution
layer of VGG-16 are reduced via global average pooling followed by two fully connected layers and
softmax prediction at the end. The number of units in the last fully connected layer depends on how
many expert regressors are generated with the hierarchical training.
During testing, we take overlapping patches from the image. Each patch extracted is of the size
PW × PH . A region of interest (RoI) of size RW × RH is defined for the patch on which the CNN
regressor predicts the crowd density. Area outside the RoI acts as context and aids in better regression.
The RoI is slided over the entire image with an overlap. The predictions of the overlapping areas are
averaged to get the final density. Characteristics of crowd in the RoI is assumed to be constant. For
IG-CNN hierarchical differential training, patches are extracted at random locations from images and
the loss is computed only on the RoI. The expert classifier also uses the RoI part of a patch to select
the suitable regressor. Typical patch size is 224 × 224 with an RoI size of 80 × 80.
4.2.3 Pretraining of Base CNN

The base CNN is trained on the entire dataset to regress crowd density map. The network is trained
by backpropagating l2 loss between the predicted and the ground truth density maps. Let MXi (x; Θ)
denote the density map predicted by the CNN regressor and MXGTi (x) be the ground truth for image
Xi . Then, the l2 loss function is defined as
N
1 X
Ll2 (Θ) = kMXi (x; Θ) − MXGTi (x)k22 , (4.1)
2N i=1
where Θ refers to the trainable parameters of the CNN and N is the total number of training samples.
The parameters Θ are found by optimizing Ll2 using standard stochastic gradient descent (SGD) with
momentum. Even though our objective is to minimize the count error, l2 loss acts as proxy for the
count loss. Reduction in l2 distance implicitly lowers the count error between the prediction and
ground truth. For pretraining, we crop patches at different locations from every image and apply flip
augmentation.
45
4.2.4 Training Algorithm for IG-CNN
The overall training procedure of IG-CNN is depicted in Algorithm 2. The first step is the pretraining
of the base CNN R0 . For any regressor, the final metric of performance is the count error which
R
needs to be minimized. Count predicted by a regressor R for an image Xi is computed as CX i
=
GT GT
P P
x MXi (x; ΘR ), where its ground truth count is CXi = x MXi (x). The count error for regressor
R on image Xi is the absolute difference of predicted and actual count or mathematically, EXi (R) =
R GT
|CX i
− CX i
|.
After pretraining of the base CNN, a CNN tree is progressively built where each node represents
a regressor fine-tuned on a subset of the dataset. This is done by replicating each regressor at the tree
leaves into two and specializing the child networks with differential training. At any node m, let Rm0
and Rm1 be the child regressors and Dm be the subset of dataset available for the node. Now Rm0 and
Rm1 need to be made experts in the specialty sets Dm0 and Dm1 respectively. But we have neither the
specialty sets nor the expert regressors. Differential training allows to jointly obtain the specialties
and the expert regressors by minimizing the oracle count error. The oracle count error for patch Xi is
oracle R GT
EX i
= min |CX i
− CX i
|, the minimum of the count errors obtained by the two regressors.
R∈[Rm0 ,Rm1 ]
The basic idea is to evaluate both the regressors on a particular image patch and fine-tune only the
best R GT
one giving lesser count error. Choose the best regressor by rX i
= argmin |CX i
− CX i
|. Note that
R∈[Rm0 ,Rm1 ]
when count predictions by both networks are same, which mostly happens at the start of the training,
the first regressor is chosen to break the tie. This makes sure that the differentiation between the
networks builds up progressively. By selectively fine-tuning Rm0 and Rm1 based on its performance
on the training patches, the regressors become more and more specialized in two groups Dm0 and
Dm1 . These specialty subsets might be skewed and completely depends on the dataset as well as
the regressors. The fine-tuning is done with lower learning rate (10−6 ) and continue till validation
accuracy stops improving.
Unlike differential training in Chapter 3, count loss is used instead of l2 loss for fine-tuning re-
gressors. We define the count loss as,
N
λ X GT 2
LC (Θ) = (CXi − CX ). (4.2)
2N i=1 i
Here constant λ is used to check the magnitude of the loss. For all experiments, λ is set as 10−2 .
Since the CNN is pretrained with l2 loss, it has good initial features and fine-tuning with count loss
provides complementary information. This is found to give better clustering and more accurate count
estimation.
46
Algorithm 2 IG-CNN training algorithm.
input : Dataset D0 = {Xi , MXGTi }N i=1 (image & map)
output: Parameters {Θr } of experts and classifier Θc
Random initialize Θ0 for base CNN R0 Pretrain R0 Rleaf = {R0 } Dleaf [R0 ] = D0
/* Hierarchical Differential Training */
for l = 0 to max tree depth do
/* Replicate R twice */
for R in Rleaf do
Rchild [R] = {R, R}
end
/* Differential Training */
/* R predicts count CiR while CiGT is the actual */
for i = 1 to max iterations do
for Rl in Rleaf do
for X, M in Dleaf [Rl ] do
R GT
r = argmin |CX − CX | Fine-tune Rr with X to update Θr
R∈Rchild [Rl ]
end
end
Break if validation Oracle MAE stagnates;
end
/* Dataset division for leaf regressors */
Dleaf = [] for X, M in D0 do
for Rl in Rleaf do
R GT
r = argmin |CX − CX | Add (X, M ) to Dleaf [r]
R∈Rchild [Rl ]
end
end
/* Training of Expert Classifier */
Initialize Θc with VGG-16 weights for X, M in D0 do
R GT
r = argmin|CX − CX | Add (X, r) to Dc
R∈Rleaf
end
Train classifier with Dc and update Θc Break if validation Actual MAE stagnates;
end
Differential training minimizes oracle error over the training set. This count error is achievable
only if there is an oracle to classify a test patch to the correct regressor. But the ability of a classifier
to achieve high results in determining the specialty depends on the quality of the specialization. If the
expert specialties do not have any generalizable features, the performance might decay on the test set.
The leaf regressors (Rleaf ) at a particular level of growth are experts on specific specialties. As
shown in Figure 4.3, at test time, a classifier selects the right expert regressor for the image patch. The
47
classifier is trained on the labels obtained from the expert regressors. For a given image patch Xi , the
best R
corresponding label is attributed to the regressor with minimum count error, RX i
= argmin|CX i
−
R∈Rleaf
GT
CX i
|. As the samples per expert specialty need not be uniform, class balancing is done before training
the classifier.
At every increment of the growing process, regressors at the leaf nodes of the CNN tree are split
and new expert regressors are created. We monitor the Oracle MAE and Actual MAE for the leaf
regressors over a validation set. While Oracle MAE indicates the count error incurred when right
expert is always chosen for regression, Actual MAE is obtained with the expert classifier. Note that
the validation set is randomly sampled from the training images and is fixed across entire training
procedure (irrespective of tree level). The hierarchical tree splitting is stopped when the Actual MAE
on validation set is not improving (see Table 4.4).
4.3 Experiments
We benchmark our IG-CNN model on three crowd counting datasets. For a given test image, patches
are extracted and evaluated by the expert classifier to route them to the regressors specialized for the
specific crowd types.
760.0 664.6 764.5
216.0 186.4 224.2
Input Image Ground Truth Base CNN Prediction IG-CNN Prediction
Figure 4.4: Predictions made by IG-CNN on images of Shanghaitech dataset [1].
48
Part A Part B
Zhang [9] 181.8 277.7 32.0 49.8
MCNN [1] 110.2 173.2 26.4 41.3
TDF-CNN (Chapter 2) 97.5 145.1 20.7 32.8
SCNN (Chapter 3) 90.4 135.0 21.6 33.4
Cascaded-MTL [24] 101.3 152.4 20.0 31.1
CP-CNN [25] 73.6 106.4 20.1 30.1
IG-CNN 72.5 118.2 13.6 21.1
Table 4.1: Performance of IG-CNN on Part A and Part B of Shanghaitech dataset [1]. IG-CNN
outperforms other methods in MAE.
Method MAE MSE

Lempitsky et al. [56] 493.4 487.1
Idrees et al. [2] 419.5 541.6
Zhang et al. [9] 467.0 498.5
CrowdNet [22] 452.5 -
MCNN [1] 377.6 509.1
Hydra2s [21] 333.7 425.3
SCNN (Chapter 3) 318.1 439.2
Cascaded-MTL [24] 322.8 397.9
CP-CNN [25] 295.8 320.9
IG-CNN 291.4 349.4
Table 4.2: Comparison of IG-CNN with other methods on UCF CC 50 dataset [2]. Our model gives
lower error than other methods.
4.3.1 Shanghaitech dataset

For both Part A and Part B of the dataset, we grow IG-CNN to 3 levels resulting in 8 expert regres-
sors. Table 4.1 tabulates the performance metrics for IG-CNN on the dataset along with that of other
models. It can be observed that IG-CNN outperforms all other methods in Part B by a significant
margin both in terms of MAE and MSE. IG-CNN achieves better count accuracy in Part A as well.
Though our model narrowly outperforms CP-CNN [25], it is to be noted that the authors of CP-CNN
use adversarial training to boost their base performance from 76.1 to 73.6. Figure 4.4 shows density
maps predicted by IG-CNN and the base CNN along with the corresponding ground truths. The pre-
dicted density maps closely resemble the ground truth as well as have accurate count estimates. This
demonstrates the ability of IG-CNN to better capture the crowd density.
49
Method Scene-1 Scene-2 Scene-3 Scene-4 Scene-5 Average
Zhang et al. [9] 9.8 14.1 14.3 22.2 3.7 12.9
MCNN [1] 3.4 20.6 12.9 13.0 8.1 11.6
TDF-CNN (Chapter 2) 2.7 23.4 10.7 17.6 3.3 11.5
SCNN (Chapter 3) 4.4 15.7 10.0 11.0 5.9 9.4
CP-CNN [25] 2.9 14.7 10.5 10.4 5.8 8.9
IG-CNN 2.6 16.1 10.15 20.2 7.6 11.3
Table 4.3: MAEs obtained by models for the 5 test scenes of WorldExpo’10 dataset [9].

IG-CNN hierarchical growing is done for two levels, creating 4 expert regressor on UCF CC 50
dataset. It can be seen from Table 4.2 that IG-CNN has the lowest count error. Despite being a
challenging dataset, our model delivers an improvement of 4.4 points in MAE and has comparable
performance in MSE metric also.
4.3.3 WorldExpo’10 dataset

Table 4.3 lists the performance of all major methods. IG-CNN is grown only just one level with two
experts. World Expo’10 dataset proves to be extremely challenging for our model due to the sparse
nature of the crowd with the lack of significant variability in crowd density. This affects the ability of
our model to generate experts catering to different crowd types. Despite these limitations, our model
shows comparable performance with respect to other models.
4.4.1 Effect of Growing

In this section, we study the effect of the hierarchical CNN tree growing on the oracle accuracy and
the final accuracy at test time. All ablations are performed on Part A of the Shanghaitech dataset [1]
as it is sufficiently large and has high variation in crowd density. Table 4.4 lists count errors for the
base CNN along with that of the IG-CNN at different levels of growth. It also shows for each level,
the classifier accuracy and Oracle MAE. This oracle error is the MAE that the model would achieve
if the expert classifier is 100% accurate. There is significant improvement of MAE for IG-CNN at
higher levels than the base CNN but saturates after level 3. Although the oracle error decreases down
drastically with each increment of the growth, the expert classifier is unable to keep up and causes
50
Method Oracle Actual Classifier
MAE MAE Accuracy
Base CNN - 120.9 -
2 Experts (Level 1) 38.1 115.3 77.2
4 Experts (Level 2) 17.8 80.3 62.3
7 Experts (Level 3) 11.4 78.1 45.7
8 Experts (Level 3) 8.5 72.5 45.5
16 Experts (Level 4) 4.4 74.6 21.8
Table 4.4: Effect of hierarchical growth of IG-CNN on Part A of Shanghaitech [1] dataset. Though
the oracle loss is steadily decreasing with depth, classifier error is increasing leading to higher MAE
at test time.
more switching error as evident from the Table 4.4. This is primarily due to the reduction in number
of training samples per expert regressor at higher tree levels. For example at level 2, the distribution
of samples for the four regressors is so skewed that one of the expert gets only 2.9% of the total test
patches. This is more severe for level 3 with only 0.5% for the expert with the least number of samples
and the corresponding class wise classifier accuracy is just around 2%. The number of samples for
some of the regressors are so small that the classifier is unable to generalize significant discriminative
features for the specialties. We also show in Table 4.4, the performance when the regressor with the
least number of samples is not split, leading to an unbalanced tree. In this way, there are only 7 expert
regressors at level 3 instead of 8 experts. The MAE in this case is comparable to IG-CNN at level 3,
but higher.
4.4.2 Expert Specialty Characteristics

It is important to shed more light on the specialization process involved in the IG-CNN training.
Hence, we analyse the features of specialty groups automatically inferred in the hierarchical differ-
ential training. The specialty groupings might be based on some latent features such that the oracle
error is minimized. But are there observable characteristics based on crowd types? Crowd density
could be one of the factors for specialization. Number of people in a image patch is a proxy for crowd
density. We compute the distribution of crowd counts on the specialty subsets of the expert regressors
(see Section 4.2.4 for classifier label creation). For the experiment, the test set of Part A Shanghaitech
dataset [1] is used. Figure 4.5 indicates a possible clustering of crowd patches based on count. Note
that patches with few people go to one regressor while more denser ones get distributed across the
other experts. This multichotomy observed in the specialties reinforces the fact that IG-CNN training
creates experts based on certain latent factors. Some of the factors could be correlated with crowd
51
R
(5.2, 8.3) Mean
Std Deviation
R0 R1
(3.5, 5.5) (10.0, 11.9)
R00 R01 R10 R11

(1.9, 4.2) (6.0, 7.9) (11.0, 11.1) (23.1, 24.3)
R000 R001 R010 R011 R100 R101 R110 R111

(0.6, 2.1) (4.0, 5.9) (5.4, 7.1) (7.1, 8.7) (10.1, 10.6) (13.8, 12.5) (22.4, 23.7) (33.7, 38.1)
Figure 4.5: Mean and standard deviation of crowd count distribution preferred by expert regressors at
different hierarchies of IG-CNN. Computed on patches from Shanghaitech [1] Part A test set.
density as density variation accounts for much of the variability seen in crowd images.
4.4.3 Hierarchical Training Vs Baseline Methods

In short, IG-CNN training mines latent specialties hierarchically and creates a set of expert regressors.
Here we compare this methodology with other similar methods. The standard mixture of experts
(MoE) approach uses a gating network to weigh the output of the set of regressors. In the same setting
as that of IG-CNN, we use VGG-16 classifier as gating CNN to output softmax confidences. The 8
regressors are initialized with base CNN weights and their outputs are multiplied by the classifier
confidences. Table 4.5 shows MAE numbers for MoE and is clearly inferior to IG-CNN. MoE is
unable to bring significant specialization among the regressors.
We also compare with differential training introduced in Chapter 3. Instead of performing hierar-
Method Oracle Actual

MAE MAE
Mixture of Experts - 281.8
4-way Differential Training 20.6 99.0
8-way Differential Training 9.9 75.1
IG-CNN (Level 3) 8.5 72.5
Table 4.5: Comparison of IG-CNN with other specialization based methods on Part A of Shang-
haitech [1] dataset. IG-CNN outperforms other architectures.
52
chical training, N-way differential training is done on the set of regressors as in Chapter 3. For this
ablation, we use four and eight regressors which are exact copies of the base CNN, is comparable to
IG-CNN with the same number of experts. The oracle loss of the expert set is minimized by selec-
tively fine-tuning the best regressor for the given training sample. It can be observed from Table 4.5,
that the oracle MAE is lower for IG-CNN than that of N-way differentially trained model. In fact, the
final performance with the expert classifier is also inferior in the case of N-way differential training.
This emphasizes that the hierarchical training creates specialties with better discriminative features.
4.5 Conclusion
We address the problem of better capturing large diversity seen in crowd scenes for accurate regres-
sion of crowd density. The proposed model, IG-CNN iteratively expands its model capacity based
on the complexity of the training data. IG-CNN starts growing from a base CNN, which is trained to
regress crowd density. The base CNN is replicated into two child regressors, each of which are im-
posed specialization with differential training and recursively divided again forming a CNN tree. The
regressors at the leaf nodes of the tree are finer experts on certain specialties mined without any manu-
ally specified criteria. An expert classifier predicts the right expert for a given test patch. We evaluate
on standard benchmarks and show significantly better performance for the model. Additionally, anal-
ysis of the specialties created by IG-CNN reveals correlation with observable crowd characteristics
such as crowd density.
53
Part II
Addressing Data Scarcity
54
Chapter 5
Almost Unsupervised Learning
YPICAL crowd counting systems have to model the huge diversity of appearance of people, de-
T manding large annotated datasets for training (see Figure 1.3). The performance of models
based on Convolutional Neural Networks (CNN), in general, is directly related to the availability of
large datasets encompassing the entire diversity. However, due to annotation difficulty, the datasets
available for dense crowd counting are small, with the current largest one having only 482 images
with 0.2 million person annotations. This seriously limits the advances in annotation intensive prob-
lems like dense crowd counting. Hence we formulate the objective of this work as to train crowd
counting models to the maximum extent with unlabeled data. To the best of our knowledge, there are
no other works in this direction for dense crowd counting and is expected to fuel more research in the
area.
Existing unsupervised methodologies are mostly based on autoencoders. They learn features by
training to predict its own input [71, 72] or some function of the input [73, 74, 75, 76, 77, 78]. It
has been shown that many autoencoder based approaches fail to learn useful features [79]. When
applied on highly diverse dense crowd images, we show that current unsupervised methods do not
learn enough useful features for density regression as evidenced from their performance scores. In
order to improve feature learning from unlabeled crowd images, we consider winner-take-all (WTA)
regularization for autoencoders. WTA autoencoder proposed by [79], is inspired from the behavior of
actual neuron adaptation in human brain. The basic idea of WTA approach is to selectively perform
learning for neurons in the autoencoder. This means not all neurons are allowed to update their
weights at a particular iteration, creating a race among neurons to learn a feature and get specialized.
The “winner” neuron is the one which has the highest activation value. This loosely tries to model
the inhibition mechanism seen in brain neurons. It has been shown that WTA auto-encoders acquire
better features than normal autoencoders [79]. Till now WTA models have only been evaluated on
datasets like MNIST, CIFAR etc. and are not scalable to highly diverse scenarios like dense crowds.
55
Hence we significantly modify the WTA training methodology and develop Grid Winner-Take-All
(GWTA) convolutional autoencoders to handle huge diversity in crowd scenes.
In a nutshell, GWTA spatially divides each convolutional feature map into a grid of cells, where
WTA is applied in each cell. This allows local winners in a fixed neighborhood rather than global ones
as in WTA autoencoder. Hence, GWTA autoencoder is able to leverage diversity of features across
space, allowing scalable and efficient training with diverse crowd data. Our crowd counting system is
composed of a CNN regressor, for which we train several layers in an unsupervised manner, i.e. using
only crowd images and no annotation. Each layer of the model is trained separately as an GWTA
autoencoder to reconstruct its own input. This stacked autoencoder training progressively learns a
hierarchy of discriminative features frequently appearing in crowd images. Majority of the parameters
of the network, almost 99.9%, are trained in this manner without any labeled supervision. This is
followed by supervised training of the remaining parameters to get the final crowd density regressor.
But note that the layers trained in unsupervised manner are frozen and only the last two layers which
take unsupervised representations as input are tuned with labeled data. This way our model leverages
unlabeled data for training majority of its parameters and only require labeled examples to adjust very
few parameters (less than 0.1%).
As a summary, this chapter contributes the following:
• A stacked convolutional autoencoder model based on grid winner-take-all (GWTA) paradigm

for large-scale unsupervised feature learning.
• The first crowd counting system that can train almost 99.9% of its parameters without any
annotated data.
5.1 Related Works

All learning based previous works in dense crowd counting require labeled data and hence we briefly
discusses related unsupervised methods. The importance of unsupervised learning has been realized
long back, resulting in numerous works. While the traditional clustering based methods try to infer
groups in the data, modern approaches effort to learn good features by training with reconstruction
objective. An autoencoder [71] consists of an encoder and a decoder. The encoder generates a la-
tent representation for the input, which is constrained by the decoder to have enough information to
reconstruct the input back. In order to avoid overfitting, several variations are proposed. Vincent et
al. [72] employ denoising autoencoders that force the network to learn random noise removal. Vari-
ational autoencoders of [80], model input distribution in a variational Bayesian approach. Restricted
56
Boltzmann machines (RBMs) [81] and deep Boltzmann machines (DBMs) [82] are other genera-
tive models for the same. Convolutional neural network based approaches like Pixel-RNN [83] and
Pixel-CNN [84] learn image density models and can generate diverse scenes. Furthermore, generative
adversarial training techniques are used for density modeling in [85] and [86]. More recent paradigm
is that of self-supervision, where instead of reconstructing the input image, some label that can be
computed from the input is used for supervision. In colorization works like [73, 87, 88], the network
is trained to output colored image from its gray-scale version, thereby hopefully learning representa-
tions useful for other tasks. Self-supervisory labels are computed from motion cues in [74, 89, 90].
Other works obtain self-supervision labels from videos [75, 91], inpainting [76], co-occurrence [92],
context [77, 78], etc. Zhang et al. [93] argue that cross-channel prediction of raw data itself outper-
forms other task based self-supervision. Recent work of [94] formulate the task of spotting artifacts in
images for learning useful features. One limitation of these self-supervised approaches is the need for
defining certain pseudo label objectives compatible with the end task. If the objectives does not align,
the final performance might suffer as we find in the case of density estimation. Hence in this work we
prefer an unsupervised method for crowd counting. Especially, we leverage on winner-take-all [79]
paradigm, which we develop further to suite large-scale training with diverse crowd scenes.
5.2 Our Approach
5.2.1 Grid Winner-Take-All Autoencoders for Unsupervised Learning

Most of the unsupervised learning models are based on a reconstruction loss. Any normal autoen-
coder [71, 72] learns features from unlabeled data in an attempt to reconstruct the input through a
representational bottleneck. However, the representation acquired by the encoder is constrained to
only have enough information for the decoder to reconstruct the input. This results in many cases,
especially with convolutional neural networks, the encoder to learn delta or identity filters. These
pass-through filters are degenerate and simply passes the input as such without applying any signifi-
cant transformation [79]. Though these near-identity filters causes trivial reduction in reconstruction
objective, they are almost useless for any other tasks. It is hence apparent that normal reconstruction
objective might not result in useful feature learning. One way to mitigate this effect is by increasing
the task difficulty from input reconstruction to predict pseudo labels that can be easily obtained from
the input [73, 74, 75, 76, 77, 78]. Another possible way is to constrain the encoder filters directly with
some regularizers. In this work, we follow the approach pioneered by [79], where the encoder filters
are constrained to fire only at the maximally activated locations. We make following crucial changes
to WTA to create GWTA autoencoder:
57
.. . . . .. . .
Max value in
. . . . ..
h
. .... . .. ... ...

h x w cell w
. . ..
GWTA . ....... .. ...... ....
. . .. .
. ... .. .. .... .. . . .. .. . ..
hxw
. .. .. .. ... ..
Feature Maps
. . . .. .
Figure 5.1: Grid Winner-Take-All architecture proposed in this work. Only the maximally activated
neuron in a cell is allowed to pass its activation, creating sparse updates during backpropagation.
• The WTA method is adapted for large scale training with highly diverse data. Instead of ap-
plying WTA sparsity over the entire spatial map, we apply only over fixed neighborhood. This
helps in more efficient training and avoid extreme sparsity which is better for highly diverse
crowd data.
• More model constraining. While Makhzani et al. [79] use separate decoders with large filters,
we show that for our task of interest, a tied decoder gives improved results.
Figure 5.1 illustrates our proposed GWTA architecture. GWTA is applied during the unsuper-
vised training phase on the activation maps of the convolutional encoder. GWTA sparsity is applied
independently over each channel. Any given feature map is divided into a grid of rectangular cells
of pre-defined size h × w. During forward propagation of the input, only the “winner” neuron in the
h × w cell is allowed to pass the activation. The “winner” neuron is the one having the maximum
value of activation in the cell and activations of all other neurons in the h × w cell are set to zero. Now
the task of the decoder is to reconstruct the encoder input from such a sparse activation map, which is
extremely hard. Hence, the encoder cannot simply learn near identity filters and get minimum recon-
struction cost, but are forced to acquire useful features recurring frequently in the input data. Figure
5.2 shows an exemplar GWTA output and corresponding reconstruction. In GWTA, the weight update
comes from few “winner” neurons in the entire feature map rather than receiving contribution from
all the neurons in a normal autoencoder. This prevents filters from trying to reconstruct all parts of
the input equally as in a normal autoencoder, but are forced to get specialized for certain patterns,
resulting in more useful feature learning. Note that GWTA sparsity is applied only while training
and is removed during testing. Since features learned are mostly non-trivial and not near identity, the
encoder outputs carries significant abstractions.
The architecture for GWTA is motivated from unique characteristics of highly diverse crowd im-
ages. There exists severe variation in appearance of people even within a crowd image due to per-
58
Input Image GWTA Output Reconstruction by GWTA Reconstruction by
Autoencoder Normal Autoencoder
Figure 5.2: GWTA output of Conv1 layer for a sample image. Note that the reconstruction by GWTA
autoencoder is very sparse compared to normal autoencoder.
spective changes, density gradients or occlusions. Hence the feature sets needed for faithful crowd
density estimation mostly rely on local crowd patterns. Since GWTA is done in a grid fashion, we
are allowing local winners to update themselves and better learn specific crowd patterns. Normal
autoencoders or approaches like [79] do not explicitly take into account this spatial locality, but learn
features globally to reconstruct the entire input, which might not be very useful for density regression.
At present, we do not have any theoretical measure of feature usefulness for density estimation, other
than computing final regression performance.
5.2.2 Architecture of GWTA Counting CNN

To demonstrate the merit of the proposed architecture, we use a simple crowd counting CNN and
train almost all parameters with unlabeled data followed by supervised training of the remaining
parameters. We use a modified version of the CNN regressor introduced by [1]. The network consists
of six convolutional layers with three pooling layers in-between. The first four layers accounting
around 99.9% of the total parameters, are trained in an unsupervised manner and are then frozen. The
remaining layers are trained with labeled data to regress crowd density map.
The unsupervised training is performed in stages, stacking a hierarchy of GWTA autoencoders
as elucidated in Figure 5.3. For the first stage, random patches of size 224 × 224 are extracted from
crowd images and are fed to the first GWTA autoencoder. This autoencoder has the convolutional
layer Conv1 as enocoder followed by the GWTA regularizer layer. The GWTA cell size is chosen
to be 32 × 32 and is subsequently halved after every pooling layer so that grid dimensions remain
same across layers. The decoder DeConv1 is a transposed convolution with its weight tied with that
of Conv1. Note that we do not have bias for the encoder and decoder, which we find to be empirically
59
L2 RECONSTRUCTION LOSS
Conv1 GWTA DeConv1

STAGE-1
9x9 | 32 32x32 9x9 | 3
Pool1
2x2
Conv2 GWTA DeConv2
STAGE-2
7x7 | 64 16x16 7x7 | 32
Pool2
2x2
Conv3 GWTA DeConv3
STAGE-3
7x7 | 128 8x8 7x7 | 64
Pool2
2x2
Conv4 GWTA DeConv4
7x7 | 256 4x4 7x7 | 128
STAGE-4
SUPERVISED LAYERS
Conv6 Conv5
Crowd count Σ 3x3 | 1 3x3 | 64
Density map
Figure 5.3: Architecture of GWTA based Crowd Counting CNN (GWTA-CCNN). Unsupervised
training is done in stages, updating every layer by reconstructing its own input regularized by the
GWTA sparsity. Last two layers are trained with supervision.
better. The parameters of Conv1 are updated by backpropagating the l2 loss between the input and the
DeConv1 output. In general, if FXl i (x; Θ) denotes the output of layer l for input Xi and F̃Xl i (x; Θ) be
the corresponding GWTA decoder reconstruction, then the loss function is given by,
N
1 X l−1
Ll2 (Θ) = kF (x; Θ) − F̃Xl i (x; Θ)k22 , (5.1)
2N i=1 Xi
where N is the number of training samples and Θ refers the learnable parameters. Parameters Θ
are obtained by optimizing Ll2 with stochastic gradient descent (SGD). The reconstruction loss tries
to maximize the similarity between the reconstruction and the input, but is severely limited by the
GWTA sparsity. This prevents the filters being learned from reaching near pass-through. The training
is continued till loss Ll2 on the validation set stops improving.
After the first stage encoder-decoder is trained, the Conv1 weights are frozen and the Conv1 output
(without GWTA) after pooling is fed to the next stage encoder. The Conv1 activations are scaled for
training stability to be in 0-1 range by dividing by the maximum response in every feature map.
The maximum values are computed from the train set and are fixed for subsequent stages of training
as well as for testing. Conv2 along with the corresponding deconvolution DeConv2 forms another
GWTA autoencoder and is trained with the objective to reconstruct Conv1 output. This stage-wise
training of GWTA autoencoders is continued till Conv4, each one learning useful representation for
the output of previous layer. In this way, 99.9% of the parameters are trained without supervision and
60
the feature representation of Conv4 is mapped to density map with supervision.
The supervised stage is required since the unsupervised training can result in some features not so
useful for the end task of crowd counting. So, some level of supervision is needed to select appropriate
features for density map estimation. There are many methods in the literature on how to generate
density maps from head annotation available with the datasets. Most common method is to blur the
head annotation with a Gaussian of fixed variance summing to one. In this work, we use a sigma of
8.0 for generating ground truth density maps. The supervised training is performed on the last two
layers with simple 3 × 3 filters accounting for less than 0.1% of the total parameters (see Figure 5.3).
These layers are trained to regress the density map by backpropagating l2 loss between between the
predicted and ground truth map. Here the l2 loss function is defined as
N
1 X
LD
l2 (ΘS ) = GT
kDXi (x; ΘS ) − DX (x)k22 , (5.2)
2N i=1 i
GT
where DXi (x; ΘS ) stands for the output of the supervised layers with parameters ΘS and DX i
(x) is
corresponding ground truth density map for the input image Xi . SGD is continued till the validation
accuracy plateaus or does not improve. Note that none of the parameters in Conv1 to Conv4 are
updated in the supervised stage.
For a given test image, overlapping patches (10% overlap) are obtained and evaluated on the
trained model. The density map predictions of the overlapping areas are averaged to obtain the final
density map. The crowd count is calculated by summing the density map.
5.3 Experiments
5.3.1 Shanghaitech Dataset

We compare performance of GWTA-CCNN with that of other methods in Table 5.1. First impor-
tant experiment is the random baseline where the unsupervised layers are not trained but randomly
initialized. Subsequent supervised training is done on the feature representation obtained from this
randomly initialized network. As expected, our GWTA based network achieves significantly higher
count accuracy than the randomly initialized network. This suggests that the unsupervised training
has resulted in learning of features useful for density estimation. Then we try end-to-end convolu-
tional autoencoders [71], where the CCNN is trained to predict the input image. This is followed by
supervised training of last two layers to map features learned by Conv4 (in Figure 5.1) to crowd den-
sity. Denoising autoencoder [72] is also evaluated where the objective is to reconstruct clean image
61
1603 1605.8 1521.1
212 201.3 175.2
Input Image Ground Truth CCNN Prediction GWTA-CCNN Prediction

(fully supervised) (almost unsupervised)
Figure 5.4: Sample predictions given by GWTA-CCNN on images from Shanghaitech dataset. The
predicted density maps closely resemble that of the supervised CCNN model, emphasizing the ability
of our unsupervised approach to learn useful features.
from noisy input. Clearly, the proposed GWTA-CCNN achieves better MAE and MSE than these
end-to-end autoencoders.
Another important baseline is with fully supervised training of the CCNN. The network is same
as that in Figure 5.3 (Conv1 to Conv6). Obviously, the MAE for fully supervised CCNN is lower
than that of GWTA training, but is reasonably close, the difference in MAE being just 30.1. Further,
we ablate our model by training without GWTA. The results evidence the significant improvement
in performance contributed by the GWTA regularizer. Similarly, GWTA autoencoder with an untied
decoder having larger filters as in [79] performs worse, justifying our design choice.
Figure 5.4 presents density maps regressed by GWTA-CCNN and supervised network along with
the corresponding ground truths. It is interesting to note that the density maps by GWTA-CCNN
closely resemble the predictions by the supervised model. This emphasizes the ability of our approach
Method MAE MSE

CCNN Supervised 124.6 186.9
CCNN Random 367.6 510.1
Autoencoder 162.1 233.3
Denoising Autoencoder 181.9 254.1
CCNN without WTA 193.0 280.9
GWTA-CCNN without tied decoder 195.6 277.0
GWTA-CCNN 154.7 229.4
Table 5.1: Performance of GWTA-CCNN on Part A of Shanghaitech dataset.
62
Method MAE MSE
CCNN Supervised 367.2 551.3
Autoencoder 1272.8 1562.3
Denoising Autoencoder 1080.9 1391.1
CCNN without WTA 448.3 633.7
GWTA-CCNN without tied decoder 500.3 697.8
GWTA-CCNN 433.7 583.3
Table 5.2: Comparison of GWTA-CCNN with other methods on UCF CC 50 dataset [2]. Our model
delivers superior performance than other unsupervised methods.
to learn better features for crowd density estimation.
5.3.2 UCF CC 50 Dataset

Again we see similar trend on UCF CC 50 as with Shanghaitech dataset. In Table 5.2, GWTA-CCNN
has better accuracy than other unsupervised baselines and is also close to the supervised baseline. We
see that the end-to-end autoencoder methods have completely failed to learn useful feature for density
regression. This may be possibly due to less training data (just 40 images) available from the dataset.
But note that, despite having very less training images, our GWTA based model has significantly
better results.
5.4.1 Supervised Vs Unsupervised Features

It is important to compare features obtained through unsupervised learning to that of its supervised
counterpart. This would give valuable insights on how GWTA model works as well as help future
researches to bridge the performance gap between the two training paradigms. Figures 5.5 and 5.6
display features maps from supervised CCNN model, autoencoder with and without GWTA. Only
some of the feature maps are shown due to space constraints, but the sum maps in Figure 5.5, which
are sum of all the feature maps gives a general idea about all the feature maps in a particular layer.
It is clear from Conv1 maps (Figure 5.6) that supervised and GWTA unsupervised are close in terms
of the features learned, subject to different value ranges. Moreover, the sum maps of the features are
also close indicating that most of the filters are similar as that of supervised. Note that the feature
maps of autoencoder without GWTA are significantly different from the maps of supervised and are
63
INPUT IMAGE CONV1 CONV2 CONV3 CONV4
FULLY
SUPERVISED
CCNN
GWTA-CCNN
CCNN
WITHOUT
GWTA
Figure 5.5: Qualitative comparison of features learned by GWTA autoencoder with that of the fully
supervised CCNN. The images are sum maps of all the features in a layer.
mostly passing the input with minimal transformation or are dead filters (blank output). Similarly for
Conv2, the GWTA features look close to supervised than normal autoencoder and are more related
to abstracting various types of edges to form compound patterns like shoulders, head etc. It starts
to diverge at Conv3, where the supervised features show more aggregation to become like density
maps. But the GWTA unsupervised feature maps, though not visually very different, still combines
previous layer features to form more abstractions. This is due to the absence of any task oriented
supervisory signal. Coming to Conv4, the supervised layer activations almost look like density maps.
In contrast, the Conv4 unsupervised features look very different and still creates many abstractions
which may or may not be useful for the task of crowd counting. This observation that initial layers
of neural network have general features, with deeper layers tuned for task specific features is in line
with existing findings in the literature. Also note that, many feature maps of the autoencoder without
GWTA still have dense information about the input in order to reduce the reconstruction loss and
hence differ significantly from that of the supervised. This proves that the GWTA stacked autoencoder
learns features close to that of the supervised model compared to other competing models.
64
Figure 5.6: Some of the individual feature maps of Conv1 for GWTA and supervised CCNN.
5.4.2 Comparison with Self-Supervised Methods

In this section, we compare the performance of our model with some self-supervised methods, where
the features are learned by training the model to predict pseudo labels that are computed from the
input image. For example, in self-supervision with colorization, the CCNN model is trained as an
autoencoder to regress colored image from its gray scale version. We see from Table 5.3 that the
proposed GWTA-CCNN works better than self-supervision with colorization task. Inpainting [76] is
another task for self-supervision. A rectangular portion of the input image is removed and filled with
the mean value. The surrounding context image is used to train CCNN with the task of predicting the
missing region of the input. This task also does not surpass the performance of GWTA unsupervised
learning. Further, to suite the end task of density regression, we employ the count ranking loss formu-
lation of [27]. We train CCNN to be consistent by enforcing the count estimation of the interior region
of a crowd image to be less than the overall count of the crowd. Though the ranking loss provide count
consistency, it seems to be incapable of providing enough good features. This might be because of the
fact that the ranking loss could be satisfied without learning any crowd discriminative features. This
indicates one drawback of self-supervised methods, the need for certain objectives (like colorization
etc.) suitable for the end task to be defined. If the self-supervisory objective is not compatible with
65
Method MAE MSE
Colorization 168.4 244.5
Inpainting 166.3 252.8
Count Consistency 188.8 282.3
GWTA-CCNN 154.7 229.4
Table 5.3: Performance of GWTA-CCNN on Part A of Shanghaitech dataset [1] compared with self-
supervised methods.
the end task, performance might suffer.
5.4.3 Effect of labeled data on performance

It is important to examine the dependence of count estimation quality on the amount of labeled data
used for final supervision. Figure 5.7 shows performance of our GWTA counting model with different
280 CCCN fully supervised

GWTA-CCNN almost supervised
260
240
220
MAE
200
180
160
0 20 40 60 80 100 120
Number of labeled images for training
Figure 5.7: Amount of labeled data vs MAE. CCNN is trained in fully and almost supervised fash-
ion with different amounts of labeled data of Part A Shanghaitech dataset. We see that at less data
scenarios our almost unsupervised approach performs better than fully supervised.
66
levels of supervision compared against fully supervised CCNN. On Part A Shanghaitech dataset, we
vary the number of labeled training images from 50% of the entire dataset to the extremity of just one
image. We repeat the experiments eight times with different randomly drawn subset of the labeled data
and report the average MAE. Interestingly, we see that the performance of GWTA-CNN at extreme
less data case is clearly superior to fully supervised model. Some amount of training data is required
for satisfactory accuracy for fully supervised case and outperforms GWTA-CCNN at around 40%
data. With more data, though the accuracy of both approaches increases, the MAE for unsupervised
method drastically decreases than fully supervised and saturates near the 100% data performance with
less data (50%). This is primarily because the few parameters being updated with supervision require
only limited data for training. Hence the suitability of our approach at extremely less labeled data
scenario is well emphasized.
5.5 Conclusion
Our proposed architecture attempts to train a crowd counting CNN in an almost unsupervised manner.
Since it is difficult to obtain large-scale annotated data for dense crowds, this problem deserves prime
attention. We develop Grid Winner-Take-All (GWTA) autoencoder to learn useful features from un-
labeled images. The basic idea is to restrict weight update of neurons in convolutional output maps to
the maximally activated neuron in a fixed spatial cell. Almost 99.9% of the parameters of the network
are trained as stacked WTA autoencoders using unlabeled crowd images, while remaining parameters
are updated with supervision. We evaluate our model on standard benchmark datasets and demon-
strate better performance compared to other unsupervised methods. In fact, the count performance is
reasonably close to the supervised baseline, with a performance gap of 25%. Future works should ad-
dress this performance gap. Additional analysis reveals that our unsupervised approach outperforms
fully supervised training when available labeled data is less.
Though GWTA-CNN mitigates the Limited Annotations issue (Section 1.1), some labeled images
are necessary for training, which itself is cumbersome to obtain for dense crowds. We further solve
this challenge first by significantly reducing the annotation difficulty through a binary labeling scheme
(Chapter 6) and then with the complete self-supervision paradigm (Chapter 7) that do not require any
instance-level labels.
67
Chapter 6
Binary Supervision for Density Regression
ENSE crowd counting is one of the challenging problems where creating large labeled datasets
D turns out to be difficult. Typical crowd images have thousands of people positioned close to
each other and annotating the locations of every person is tedious. Add to these the growing need to
include crowds from as many diverse scenarios as possible for better generalization. In this context,
labeling every head for various settings under consideration is not scalable and directly affects the
performance of deep models on account of limited data.
We notice that the major difficulty occurs as the amount of annotations directly depends on the
crowd count. Since typical counts are of the order of hundreds or thousands, so is the labeling require-
Head Annotations
Typical Density
Regressor
Our Density
Binary Labels Regressor
SD
D
Figure 6.1: Contrasting the proposed binary labeling paradigm with existing head annotation frame-
work. Our method requires only one binary label per crowd image to train a density regressor as
opposed to annotating all the heads.
68
ment. Even annotating a few hundred images would require millions of human head annotations. This
can be mitigated if the labeling is done at an image-level instead of for individual persons. For ex-
ample, it is easy to identify whether any given crowd image is highly dense or relatively sparse.
One could easily perform this binary labeling for several thousands of crowd images as opposed to
few hundred images in the person-level annotation scheme. This paradigm is depicted in Figure 6.1.
Large datasets covering sufficient diversity could be created with a fraction of the cost of the current
prevalent approach.
However, obtaining density maps for training crowd counting models from any kind of image-
level labels is inherently cumbersome. Here we propose to use noisy signals to generate approximate
ground truth density maps. Edge density is used as proxy for crowd concentration, which might not
always follow the actual crowd density and cannot be used directly (Section 6.1.2). A set of rectifier
networks are trained to enhance the noisy ground truth in an unsupervised fashion (Section 6.1.4). The
characteristics of the crowd features are drastically different for highly dense and sparse categories.
Hence, depending on the binary label, the rectifier network specific to the density class is employed
to get the enhanced ground truth. The final density regressor is trained using the enhanced noisy
density maps (Section 6.1.5). Extensive experiments are done to evaluate the model performance,
which clearly shows the superiority of our paradigm (Section 6.2). It delivers significant counting
performance at the lowest annotation cost.
In summary, our work contributes the following:
• A binary labeling scheme for annotating crowd images easily as opposed to the existing person-
level approach.
• A novel architecture to generate ground truth density maps from noisy signal using the binary
crowd-level label.
• A crowd counting model that can be trained using the binary labels, but delivers competitive
performance at very low annotation cost.
6.1 Our Approach
6.1.1 Binary Labeling Scheme

As motivated already, we only use a binary annotation for any given crowd image. The image is either
categorized to highly dense or to sparse. It is easy to perform this labeling with a quick glance on
the crowd scene. Highly dense crowds have people appearing as mostly blobs, with almost no facial
69
Figure 6.2: Samples of noisy density maps extracted from edge details of crowd images are shown.
They seem to roughly correlate with the crowd density.
features available to discriminate. In contrast, relatively sparse scenes have people with more visible
human features. The category split is not strict and can have confusing samples at the threshold bound-
ary. Interestingly, our approach can tolerate a fair amount of noise in the labels (see Section 6.3.2 for
experimental validations).
Note that the current standard crowd datasets do not have the binary labels available with them.
In order to demonstrate the power of our approach, we obtain the binary labels from the ground truth
count. We choose a threshold CT based on the visual appearance criteria that defines highly dense to
have majority of the people looks like blobs. Any image with crowd count greater than or equal to
CT is taken to be dense, while lower one falls in the sparse category. Next we explain the pipeline to
generate useful density maps for training from the binary annotations.
6.1.2 Noisy Ground Truths

Since we adopt a binary labeling scheme for annotating crowd images, inclusion of additional infor-
mation is necessary to train density regressors. Especially, some supervisory signals should exist to
a get a coarse ground truth density maps. Once at least an approximate version of density map is
available, it greatly helps in training with the weak binary supervision. Hence, we search for a signal
that roughly corresponds to spatial crowd density.
70
Sparse Dense
Noisy Count Distribution Noisy Count Distribution by Category
Figure 6.3: The distribution counts computed from the noisy density maps, evaluated on crowd images
from Shanghaitech Part A [1].
Edge information from images is natural candidate that roughly correlates with the concentration
of crowds. The regions which are crowded tend to have relatively more edges than less densely
populated areas. But this signal is definitely noisy and might not work in certain cases. A higher
density of edges could simply arise due to non-crowd objects like background clutter or patterns that
inherently have more edges. Even then, at least for dense crowd scenario, we observe that this weak
signal could serve as proxy for density. Note that the edge density values do not correspond to crowd
density values and cannot be directly used for regressor training. But it approximately follows the
ordinal relationship that a patch with more people tend to have higher edge concentration compared
to a sparse patch.
Coming to the actual implementation, we apply the standard Canny edge detector [95] on any
given crowd image to obtain the edge map. Simple Gaussian smoothing is done to remove noise
content from the map, followed by a down-sample operation to resize to the required dimensions.
Figure 6.2 displays some of the noisy density maps created for crowd images. They coarsely resemble
actual density maps, with comparatively higher activation in dense areas than at sparse ones. The
absolute values in the noisy maps do not match with the ground truth, but the ordinal consistency
seems to be maintained in many cases. Now we need a mechanism to correct these noisy density
maps in order to make it usable for regressor training. For that, one requires the noisy and actual
ground truth pair to learn the rectification function. We solve this issue by modeling the noisy density
maps and then easily creating synthetic training pairs.
71
6.1.3 Modeling Noisy Density Maps
In Figure 6.3, the distribution of the counts from noisy density maps computed over a dataset of
crowd images is shown. Patches of fixed size (256 × 256) are randomly extracted from images and
their noisy density maps are obtained as described in Section 6.1.2. Summing these noisy maps gives
the counts, which are collected to form the distribution. We clearly note the normal nature of the noisy
counts. Interestingly, if we separate out the counts based on the ground truth sparse-dense categories,
then a multimodal normal distribution becomes evident in Figure 6.3. This means that under a broad
binary classification based on count, the noisy counts tend to follow an ordinal consistency. The noisy
counts of highly dense patches have significantly more likelihood of having larger values (green points
in Figure 6.3). There are dense samples with low noisy counts, but very less in number. Similarly,
very sparse patches have usually lower noisy counts and rarely have large values. However, this
ordinal relationship might not hold for nearby samples within the same density category or around
the threshold point that demarcates the categories. But it works fairly well samples across the highly
dense or sparse group.
We approximate and parameterize the statistics of the noisy density map counts using a bimodal
normal distribution. If Psp = N(µsp , σsp ) describes the distribution for sparse and Pdn = N(µdn , σdn )
stands for the dense, then the noisy count values could be thought as sampled from either c ∼ Psp or
c ∼ Pdn . On analyzing the dispersion of noisy counts over several crowd datasets, we find consistency
in the distribution parameters. The means are fixed as µsp = 4 and µdn = 4∗µsp , where as the standard
deviation is set same for both categories as σsp = σdn = σ = 4. This setting works for all dense crowd
datasets and has limited sensitivity to small variations on these values in terms of performance (see
ablations in Section 6.3.3). Now noisy counts could be generated from sampling from Psp and Pdn .
6.1.4 Noise Rectifier Network (NRN)

We have binary density labels for every crowd image and their corresponding noisy density maps. A
rectifier network is needed to correct the values in the noisy map and make it closer to the ground
truth. This is trivial if noisy and actual density map pairs are available. But the head annotations
are not available in our scenario to generate the real density maps for crowds. Here we rely on the
synthetic samples generated from the distributions Psp and Pdn . The main idea is to generate noisy
and actual ground truth density map pairs and use them to train rectifier networks.
Figure 6.4 depicts the training pipeline of the Noise Rectifier Network or NRN. NRN consists of
a set of five convolution layers with ReLU nonlinearities. It takes a noisy density map as input, but
outputs a scale factor matrix. Let Dins stand for the noisy map corresponding to the ith image and
Mi0 be the output scale factors. Due to the max-pooling operations in NRN, M 0 has dimensions
72
Noisy Count Noisy Density Map Noise Rectifier Network Correction Factor Rectified Density Map
Distribution
Figure 6.4: The training pipeline and architecture of the Noise Rectifier Network (NRN) is shown.
NRN takes noisy density maps and outputs scale factors to improve the maps. It is trained on the
synthetic data sampled from the parametric distribution.
one-fourth that of D ns . So M 0 is up-scaled to required size and is denoted as M . The scale factor is
obtained from the last linear layer of NRN and each factor correspond to a spatial region in the input.
NRN learns the larger spatial distribution of the noisy maps and generates the correction factors.
The rectification is multiplicative in nature as it helps to quickly scale over large count values. The
improved density map D is computed as,
D i = Mi Dins , (6.1)
where signifies the operation of element-wise multiplication.

For training NRN, the network is first randomly initialized. In order to create the ith noisy density
map input for training, a crowd count value ci is first chosen at random. Then ci locations within the
map are randomly picked as head locations for generating the the ground truth density map DiGT .
Density maps are formed by placing a Gaussian kernel with a fixed variance at the head locations
(see [1, 2, 70] for details on density map generation). Now the noisy density map D ns is obtained by
sampling a count value from either Psp or Pdn and then scaling the values of DiGT so that it sums to
the sampled count. Hence, we get pairs of DiGT and D ns for training NRN. The training loss is the
Euclidean distance between the rectified map and the ground truth, specified as,
N
1 X
LN RN = ||Di − DiGT ||2 , (6.2)
N i=1
where N is number of images in the mini batch. LN RN is backpropagated to update the parameters
of NRN.
We train two NRN models, one for the sparse category NRN-Sparse and the other NRN-Dense
73
NRN-Sparse
D
S
D
NRN-Dense
Labels Noisy Density
Switch
Map Rectified Density
Map
L2 Loss
Images
Images Predicted Density

Backbone Network Density Regressor
Map
Figure 6.5: Overall architecture of the Binary Supervised Density Regressor (BSDR) is depicted.
Binary labels of the crowd images are used to select appropriate NRN to generate density maps.
These noise rectified maps act as ground truths to train the density regressor, completely avoiding the
need to have head annotations.
for dense. This is because of the difference in the noise characteristics across the density groups
(Section 6.1.3). NRN-Sparse is updated only using samples from Psp , while NRN-Dense relies on
Pdn . Though trained on sampled data, these rectifiers can correct noisy density maps from real crowd
images.
6.1.5 Regressor Training

Now we train our counting model, named the Binary Supervised Density Regressor (BSDR), using
the noisy ground truths and the rectifier networks. Since the density maps for training are obtained
from noisy signals, the regressor is first pretrained with self-supervision to have a good initialization
to prevent optimization reaching local minimas. The pretraining learns filters to extract prominent
features in crowd images, which could aid the final task of density regression. For self-supervision,
we use the pretext task of predicting the angle of rotation of input image [96]. The backbone network
employed has three sets of VGG style convolutional blocks, followed by classifier head for rotation
classification. Patches taken from crowd images are randomly rotated to one of the four predefined
angles. Now the backbone network parameters are updated to predict the correct rotation label. Con-
sequently, the model learns to detect specific edges or even high-level patterns pertaining to crowds
74
ST PartA JHU UCF-QNRF UCF-CC-50
Input
340.0 140.0 564.0 2104.0
Ground
Truth
280.4 116.2 495.1 2433
BSDR
Figure 6.6: Density maps regressed by the proposed BSDR model. Though our model is trained on
weak labeled and noisy data, the density predictions closely follow the ground truths.
that are sensitive to the orientation. These features are shown to be generic [96] and also empirically
seem to work for density regression as well.
Once the backbone is trained with self-supervision, the network parameters are frozen and the
classifier head is replaced with additional convolutional layers to output density map. Figure 6.5
shows the regressor training pipeline in details. For each training image, its noisy density map is
computed (Section 6.1.2) and the associated binary annotation selects the right rectifier network.
NRN-Sparse is applied to correct the noisy map for sparse images, while NRN-Dense rectifies for
the dense ones. The rectified maps are then used to estimate the l2 loss with the predicted density
map from the regressor. This loss provides supervisory signal to update the weights of the density
regressor. Although our method relies on binary annotation and noisy signals, we evidence in the
following sections that it delivers significant counting performance at a low annotation footprint.
6.2 Experiments
6.2.1 Evaluation Metrics and Baselines

Density regression models are evaluated mainly using two counting metrics: the Mean Absolute Error
(MAE) and the Mean Squared Error (MSE). These performance metrics, however, do not reflect the
annotation difficulty. One could get better MAE or MSE by using more and more labeled images.
75
Consequently, we need a metric that directly exposes the efficiency of the model to deliver minimum
MAE at the lowest annotations cost. For this, the Joint Labeling and MAE Cost or JLMC is formu-
lated. It is the logarithm of the product of the model MAE with the number of annotations used for
training, written as JLMC = log10 (MAE ∗ Nann ). Here Nann is the number of human given annota-
tions, either as head locations or binary labels. Both the MAE and annotation cost has to be lower for
getting better JLMC.
Since our approach uses only binary supervision in training, it is not directly comparable to other
works that use dense annotations. So we adopt a set of additional baselines to compare the perfor-
mance of our BSDR model. To show that the regressor training using NRNs works, DR Random
baseline indicate the random MAE just with the self-supervision. DR Head-Annot experiments trains
the density regressor using different amounts of the standard head annotations. This is to analyse
how the counting performance and JLMC metrics vary with the number of labeled images. If the
binary labels are randomly assigned, the resultant performance is evaluated with the BSDR Random
experiment. Now we assess the ability of our method to give better MAE at slightly larger annotation
cost. The BSDR LPI series of evaluations annotate more number of patches from a single image.
Labels per image or LPI equals 1 for the default BSDR setting, but more crops from images are given
binary labels for higher LPI values. In contrast, BSDR Labels baselines evaluate the performance
with reduction of the number of training samples and directly indicate the effect of size of the binary
labeled dataset on the metrics. In all these experiments, the annotation cost is specifically monitored
through JLMC metric. The following sections evaluate our model on different datasets. The perfor-
mance metrics are computed on test set with the ground truth data after the training of BSDR. We
use the count threshold CT for the binary labeling as 1250. Unless otherwise stated, we use the same
hyper-parameters as specified in Section 6.1.

We evaluate BSDR on the Shanghaitech Part A [1] (STPA) dense crowd counting dataset. We com-
pare the performance of BSDR model with the baselines described in Section 6.2.1 in Table 6.1. The
performance of BSDR is obviously better than the random baselines. Note that the state-of-the-art
approaches uses significantly more number of annotations and hence cannot be directly examined in
terms of MAE alone. Clearly, JLMC scores for these methods are higher than BSDR, showing poor
efficiency in leveraging available annotations. To reduce the labeling cost, one could perform less
number of head annotations and train. However, it is clear from these DR Head-Annot experiments
that BSDR has better MAE (and JLMC) compared to normal density regressors when the amount of
labeled data is quite less. This indicate how BSDR can effectively translate the available annotations
76
Method MAE MSE JLMC
TDF-CNN (Chapter 2) 97.5 145.1 7.12
SCNN (Chapter 3) 90.4 135.0 7.09
IG-CNN (Chapter 4) 72.5 118.2 6.99
DR Random 431.1 559.0 -
DR Head-Annot 5% 248.6 372.4 6.23
DR Head-Annot 10% 194.5 295.1 6.42
DR Head-Annot 100% 118.4 195.3 7.21
BSDR Random 764.5 945.7 -
BSDR LPI=1 (ours) 170.7 257.4 4.61
BSDR LPI=5 165.3 264.1 5.30
BSDR LPI=10 165.1 264.0 5.60
BSDR Labels 75% 183.2 265.6 4.52
BSDR Labels 50% 195.3 288.1 4.37
BSDR Labels 25% 210.6 291.1 4.10
Table 6.1: Comparison of BSDR against other methods on Shanghaitech PartA [1]. Our model deliv-
ers significant counting performance at the lowest annotation cost evident from JLMC.
to performance. Also, the MAE improves with the rich labeled samples in the BSDR LPI experiments,
but incurs a 21% increase in the JLMC metric. Our base approach (LPI = 1) strikes a balance between
the MAE and JLMC, with the MAE remaining within 4% that for higher LPI. On varying the number
of labeled images for BSDR, we see that the MAE remains within 12% even with large reduction as
50%, and has better JLMC thereby increasing the practical utility of the approach. Figure 6.6 displays
some of the predictions made by the BSDR model. Visually, the quality density estimation closely
follows the ground truth, even though only binary labeling is used for training.
6.2.3 UCF-QNRF Dataset

We observe similar performance trends on UCF-QNRF as seen on Part A dataset. BSDR outperforms
all the baselines in terms of JLMC. Though the MAE of state-of-the-art methods are better, they incur
larger annotation cost and reducing labeling cost by using smaller training dataset results in sharp
increase in MAE significantly more than BSDR. Also, MAE of bare-bone BSDR (LPI=1) is within
5% of that with LPI=3, but has better JLMC. Interestingly, note the relative stability in MAE even with
steady decline in the amount of training samples, indicating usefulness of our method in extremely
low label scenarios.
77
Method MAE MSE JLMC
SCNN (Chapter 3) 228 445 8.27
IG-CNN (Chapter 4) 128.6 227.8 8.03
DR Random 718.7 1036.3 -
DR Head-Annot 5% 508.1 758.1 7.32
DR Head-Annot 10% 302.5 485.4 7.40
DR Head-Annot 100% 159.0 248.0 8.11
BSDR Random 720.7 874.5 -
BSDR LPI=1 (ours) 285.8 448.4 5.44
BSDR LPI=2 277.0 430.5 5.73
BSDR LPI=3 274.7 420.1 5.90
BSDR Labels 75% 293.6 461.6 5.32
BSDR Labels 50% 298.0 455.40 5.16
BSDR Labels 25% 310.6 490.1 4.87
Table 6.2: Benchmarking BSDR on UCF-QNRF dataset [3]. Our approach beats other methods in
JLMC metric.
Method MAE MSE JLMC

TDF-CNN (Chapter 2) 354.7 491.4 7.96
SCNN (Chapter 3) 318.1 439.2 7.91
IG-CNN (Chapter 4) 291.4 349.4 7.87
DR Random 1279.3 1567.9 -
DR Head-Annot 100% 320.6 455.1 7.91
BSDR Random 1889.3 2174.8 -
BSDR LPI=1 (ours) 459.2 622.4 4.23
BSDR LPI=2 414.2 539.1 4.52
BSDR LPI=5 411.6 553.1 4.91
BSDR Labels 50% 448.2 588.6 3.95
Table 6.3: Evaluation BSDR on UCF-CC-50 dataset [2]. Despite a challenging dataset, BSDR stands
better in JLMC.
6.2.4 UCF-CC-50 Dataset

In contrast to other datasets, UCF CC 50 [2] has just 50 images with extreme variation in crowd
counts ranging from 94 to 4543. The small size and drastic density variation makes this dataset quite
cumbersome to train models that generalize well. Considering the skewed count distribution range, we
set the binary labeling threshold CT = 2000. The evaluation results are available in Table 6.3. Despite
being very challenging collection images, the BSDR still delivers significant counting performance
78
and stands better than other methods in JLMC.
6.3.1 Cross Dataset Generalization

We evaluate the model in a cross dataset setting, where the BSDR trained on images from one dataset
is evaluated on another dataset. Table 6.4 reports the resultant MAE scores. The counting perfor-
mances are evidence for generalization to unseen data. Though BSDR is trained with weak binary
supervision, it does not seem to affect learning quality features relevant to density estimation. This
raises the utility of the proposed framework in practical scenarios where test scenario is slightly dif-
ferent from the training.
6.3.2 Noisy Labels

In this section, we examine the effect of perturbing the binary labels for the crowd images. This is im-
portant as errors happen during annotation process either due to the inherent confusion in the density
category for borderline images or due to human negligence. Since we use a weak form of labeling
to enable training, it is important to show the sensitivity of model performance on the correctness of
the annotations. A certain random fraction of the labels in the dataset is flipped to corrupt the binary
label. Table 6.5 lists the counting performance of BSDR under different amount of corruption. It is
clear that the MAE changed only slightly with up to fairly reasonable 5% corruption, emphasizing the
robustness of the framework. As expected, higher corruption leads to worse performance.
6.3.3 Architectural Ablations

Here we perform a series of ablative experiments on the architectural parameter settings of our frame-
work. First, we check the sensitivity of BSDR towards the count threshold CT used for binary label-
ing. The results for varying CT is available in Table 6.5. Clearly, changing CT by large values does
Train ↓ / Test → STPA UCF-QNRF

STPA 170.7 372.4
UCF-QNRF 165.0 285.8
Table 6.4: Cross dataset performance of our model; the reported entries are the MAEs obtained for
BSDR
79
Method MAE MSE
Label Corruption 1% 174.4 263.7
CT = 1000 190.6 233.4
CT = 1500 202.3 238.1
NRN µsp = 9, µdn = 15, σ = 4 199.9 280.0
NRN µsp = 4, µdn = 16, σ = 2 172.4 263.2
With 3 categories 212.9 294.3
BSDR (ours) 170.7 257.4
Table 6.5: The performance of BSDR under changes in various hyper-parameters to validate the
architectural choices. BSDR seems to be robust under reasonable amount label corruption.
not seem to affect the MAE drastically, though the default setting of CT = 1250 has better perfor-
mance. This shows that the BSDR is robust to changes in the count threshold, making it suitable for
settings where this value is not well defined. Next the parameters of the noisy count distribution (Sec-
tion 6.1.3) are varied. Especially, the Normal distributions for both density categories are made further
away by changing their means. They are made narrower with reduced variance as well. Among all the
parameter settings, we find the default values deliver the best performance as it models the underlying
distribution better. Furthermore, the binary labeling scheme is modified to include more categories.
For 3-way labeling, crowd images are classified in to three density categories (sparse, medium and
dense). Not only that the labeling process becomes difficult with more than two categories, the perfor-
mance also drops. This is because of the increased confusion for samples across categories, causing
larger overlap of modes in the noisy count distribution (Section 6.1.3). The overlap leads to poor
training of the rectifier networks and finally resulting in higher MAE. The binary labeling scheme is
easy to annotate and results better performance.
6.4 Conclusions
This chapter introduces an alternate way of annotating crowd images to train density regressors. In
the proposed framework, images are only given binary labels as opposed to annotating every person.
A crowd is either categorized as sparse or dense, which along with noisy density map extracted from
the image, is used to train a density regressor. Various experimental validations evidence the ability
of our model to deliver significant counting performance at the lowest annotation cost. Since this
80
labeling scheme is easy to perform, we expect creating large datasets for crowd counting is now a
practical possibility.
The binary labeling scheme drastically reduces annotation difficulty and supports addressing the
Limited Annotations issue (Section 1.1). However, it still heavily depends on the labels given by
humans, which might be infeasible in many cases. This naturally motivates a framework where abso-
lutely no instance-level labels are required. Next chapter proposes such a complete self-supervision
approach.
81
Chapter 7
Complete Self-Supervision via Distribution

Matching
HE ability to estimate head counts of dense crowds effectively and efficiently serves several prac-
T tical applications. This has motivated deeper research in the field and resulted in a plethora of
crowd density regressors. These CNN based models deliver excellent counting performance almost
entirely on the support of fully supervised training. Such a data hungry paradigm is limiting the further
development of the field as it is practically infeasible to annotate thousands of people in dense crowds
for every kind of setting under consideration. The fact that current datasets are relatively small and
cover only limited scenarios, accentuates the necessity of a better training regime. Hence, developing
methods to leverage the easily available unlabeled data, has gained attention in recent times.
The classic way of performing unsupervised learning revolves around autoencoders [71, 72, 79,
80]. Autoencoders or its variants are optimized to predict back their inputs, usually through a rep-
resentational bottleneck. By doing so, the acquired features are generic enough that they could be
employed for solving other tasks of interest. These methods have graduated to the more recent frame-
work of self-supervision, where useful representations are learned by performing some alternate task
for which pseudo labels can be easily obtained. For example, in self-supervision with colorization ap-
proach [73, 87, 88], a model is trained to predict the color image given its grayscale version. One can
easily generate grayscale inputs from RGB images. Similarly, there are lots of tasks for which labels
are freely available like predicting angle of rotation from an image [97, 98], solving jumbled scenes
[78], inpainting [76] etc. Though self-supervision is effective in learning useful representations, they
require a final mapping from the features to the end task of interest. This is thought to be essentially
unavoidable as some supervisory signal is necessary to aid the final task. For this, typically a linear
layer or a classifier is trained on top of the learned features using supervision from labeled data, de-
82
Self-Supervision Complete Self-Supervision
Input Labels Input Labels
Prior Distribution
CNN CNN
Self-supervised Supervised Self-supervised Distribution Supervision
Figure 7.1: Self-Supervision Vs Complete Self-Supervision: Normal self-supervision techniques has

a mandatory labeled training stage to map the learned features to the end task of interest (in blue).
But the proposed complete self-supervision is devoid of such an instance-wise labeled supervision,
instead relies on matching the statistics of the predictions to a prior distribution (in green).
feating the true purpose of self-supervision. In the case of crowd counting, one requires training with
annotated data for converting the features to a density map. To reiterate, the current unsupervised
approaches might capture the majority of its features from unlabeled data, but demand supervision at
the end should they be made useful for any practical applications.
Our work emerges precisely from the above limitation of the standard self-supervision methods,
but narrowed down to the case of crowd density estimation. The objective is to eliminate the manda-
tory final labeled supervision needed for mapping the learned self-supervised features to a density
map output. In other words, we mandate developing a model that can be trained without using any
labeled data. Such a problem statement is not only challenging, but also ill-posed. Without providing
a supervisory signal, the model cannot recognize the task of interest and how to properly guide the
training stands as the prime issue. We solve this in a novel manner by carefully aiding the model to
regress crowd density on the back of making some crucial assumptions. The idea relies on the obser-
vation that natural crowds tend to follow certain long tailed statistics and could be approximated to an
appropriate parametric prior distribution (Section 7.2.1). If a network trained with a self-supervised
task is available (Section 7.2.2), its features can be faithfully mapped to crowd density by enforcing
the predictions to match the prior distribution (Section 7.2.3). The matching is measured in terms of
Sinkhorn distance [99], which is differentiated to derive error signals for supervision. This proposed
framework is contrasted against the normal self-supervision regime in Figure 7.1, with the central
difference being the replacement of the essential labeled training at the end by supervision through
distribution matching. We show that the proposed approach results in effective learning of crowd
features and delivers good performance in terms of counting metrics (Section 7.3).
83
In summary, this chapter contributes the following:
• The first completely self-supervised training paradigm which does not require instance-wise
annotations, but works by matching statistics of the distribution of labels.
• The first crowd counting model that can be trained without using a single annotated image, but
delivers significant regression performance.
• A detailed analysis on the distribution of persons in dense crowds to reveal the power law nature
and enable the use of optimal transport framework.
• A novel extension of the proposed approach to semi-supervised setting that can effectively
exploit unlabeled data and achieve significant gains.
• An efficient way to improve the Sinkhorn loss by leveraging edge information from crowd
images.
7.1 Related Works

Interestingly, almost all crowd counting works are fully supervised and leverage annotated data to
achieve good performance. The issue of annotation has drawn attention of a few works in the field and
is mitigated via multiple means. A count ranking loss on unlabeled images is employed in a multi-task
formulation along with labeled data by [31]. Wang et al. [100] train using labeled synthetic data and
adapt to real crowd scenario. The autoencoder method proposed in Chapter 5 optimizes almost 99%
of the model parameters with unlabeled data. However, all of these models require some annotated
data (either given by humans or obtained through synthetic means) for training, which we aim to
eliminate.
Our approach is not only new to crowd counting, but also kindles alternate avenues in the area of
unsupervised learning as well. Though initial works on the subject employ autoencoders or its variants
[71, 72, 79, 80] for learning useful features, the paradigm of self-supervision with pseudo labels stands
out to be superior in many aspects. Works like [73, 87, 88], learn representations through colourising
a grayscale image. Apart from these, pseudo labels for supervision are computed from motion cues
[74, 89, 90], temporal information in videos [75, 91], learning to inpaint [76], co-occurrence [92],
spatial context [77, 78, 101], cross-channel prediction [93], spotting artifacts [102], predicting object
rotation [97, 98] etc. The recent work of Zhang et al. [103] introduce the idea of auto-encoding
transformations rather than data. An extensive and rigorous comparison of all major self-supervised
methods is available in [96]. All these approaches focus on learning generic features and not the final
task. But we extend the self-supervision paradigm further directly to the downstream task of interest.
84
Power Law with
Exponential Cutoff Pprior
Actual Distribution HGT
Figure 7.2: Computing the distribution of natural crowds: crops from dense crowd images are framed
to a spatial grid of cells and crowd counts of all the cells are aggregated to a histogram (obtained on
Shanghaitech Part A dataset [1]). The distribution is certainly long tailed and could be approximated
to a power law.
7.2 Our Approach
7.2.1 Natural Crowds and Density Distribution

As mentioned in earlier, our objective of training a density regressor without using any annotated data
is somewhat ill-posed. The main reason being the absence of any supervisory signal to guide the
model towards the task of interest, which is the density estimation of crowd images. But this issue
could be circumvented by effectively exploiting certain structure or pattern specific to the problem.
In the case of crowd images, restricting to only dense ones, we deduce an interesting pattern on the
density distribution. They seem to spread out following a power law. To see this, we sample fixed
size crops from lots of dense crowd images and divide each crop into a grid of cells as shown in
Figure 7.2. Then the number people in every cell is computed and accumulated to a histogram. The
distribution of these cell counts is quite clearly seen to be long tailed, with regions having low counts
forming the head and high counts joining the tail. The number of cell regions with no people has
the highest frequency, which then rapidly decays as the crowd density increases. This resembles
the way natural crowds are arranged with sparse regions occurring more often than rarely forming
highly dense neighborhoods. Coincidentally, it has been shown that many natural phenomena obey a
similar power law and is being studied heavily [104]. The dense crowds also, interestingly, appears
conforming to this pattern as evident from multiple works [105, 106, 107, 108] on the dynamics of
pedestrian gatherings.
Moving to a more formal description, if D represents the density map for the input image I, then
85
P
the crowd count is given by C = xy Dxy regarding creation of density maps). D is framed into a
grid of M × N (typically set as M = N = 3) cells, with Cmn denoting the crowd count in the cell
indexed by (m, n). Now let H GT be the histogram computed by collecting the cell counts (Cmn s) from
all the images. We try to find a parametric distribution that approximately follows H GT with special
focus to the long tailed region. The power law with exponential cut-off seems to be better suited (see
Figure 7.2). Consequently, the crowd counts in cells Cmn could be thought as being generated by the
following relation,
Cmn ∼ Pprior (c) ∝ cα exp(−λc), (7.1)
where Pprior is the substitute power law distribution. There are two parameters to Pprior with α
controlling the shape and λ setting the tail length.
Our approach is to fix a prior distribution so that it can be enforced on the model predictions.
Studies like [105, 106] simulate crowd behaviour dynamics and estimate the exponent of the power
law to be around 2. Empirically, we also find that α = 2 works in most cases of dense crowds,
with the only remaining parameter to fix is the λ. Observe that λ affects the length of the tail and
directly determines the maximum number of people in any given cell. If the maximum count C max is
specified for the given set of crowd images, then λ could be fixed such that the cumulative probability
density (the value of CDF) of Pprior at C max is very close to 1. We assume 1/S as the probability of
finding a cell with count C max out of S images in the given set. Now the CDF value at C max could
be set to 1 − 1/S, simply the probability for getting values less than the maximum. Note that C max
need not be exact as small variations do not change Pprior significantly. This makes it practical as
the accurate maximum count might not be available in real-world scenarios. Since C max is for the
cells, the maximum crowd count of the full image C f max is related as C max = C f max /(M N Scrop ),
where Scrop denotes the average number of crops that make up a full image (and is typically set as 4).
Thus, for a given a set of highly dense images, only one parameter, the C f max is required to fix an
appropriate prior distribution.
We make a small modification to the prior distribution Pprior as its value range starts from 1. H GT
has values from zero with large probability mass concentrated near the low count region. Roughly
30% of the mass is seen to be distributed for counts less than or around 1. So, that much probability
mass near the head region of Pprior is redistributed to [0, 1] range in a uniform manner. This is found
to be better for both training stability and performance.
In short, now we have a prior distribution representing how the crowd density is being allocated
among the given set of images. Suppose there exists a CNN model that can output density maps,
then one could try to generate error signals for updating the parameters of the model by matching
the statistics of the predictions with that of the prior. But that could be a very weak signal for proper
86
STAGE 1: SELF-SUPERVISION
FEN C1
165
Rotation Classes
STAGE 2: SINKHORN TRAINING
Sinkhorn
FEN C2 HCS Matching
Pprior ~ HGT
Figure 7.3: The architecture of CSS-CCNN is shown. CSS-CCNN has two stages of training: the
first trains the base feature extraction network in a self-supervised manner with rotation task and the
second stage optimizes the model for matching the statistics of the density predictions to that of the
prior distribution using optimal transport.
training of the model. It would be helpful if the model has a good initialization to start the supervision
by distribution matching, which is precisely what we do by self-supervision in the next section.
7.2.2 Stage 1: Learning Features with Self-Supervision

We rely on training the model with self-supervision to learn effective and generic features that could
be useful for the end task of density estimation. That means the model has to be trained in stages, with
87
the first stage acquiring patterns frequently occurring in the input images. Since only dense crowd
images are fed, we hope to learn mostly features relevant to crowds. These could be peculiar edges
discriminating head-shoulder patterns formed by people to fairly high-level semantics pertaining to
crowds. Note that the model is not signaled to pick up representations explicitly pertinent to density
estimation, but implicitly culminate in learning crowd patterns as those are the most prominent part of
the input data distribution. Hence, the features acquired by self-supervision could serve as a faithful
initialization for the second stage of distribution matching.
Regarding self-supervision, there are numerous ways to generate pseudo labels for training mod-
els. The task of predicting image rotations is a simple, but highly effective for learning good represen-
tations [96]. The basic idea is to randomly rotate an image and train the model to predict the angle of
rotation. By doing so, the network learns to detect characteristic edges or even fairly high-level pat-
terns of the objects relevant for determining the orientation. These features are observed to be generic
enough for diverse downstream tasks [96] and hence we choose self-supervision through rotation as
our method.
Figure 7.3 shows the architecture of our density regressor, named the CSS-CCNN (for Completely
Self-Supervised Counting CNN). It has a base Feature Extraction Network (FEN), which is composed
of three VGG [7] style convolutional blocks with max poolings in-between. This is followed by two
task heads: C1 for the first training stage of self-supervision, and C2 for regressing crowd density at
second stage. The first stage branch has two more convolutions and a fully connected layer to finally
classify the input image to one of the rotation classes. We take 112 × 112 crops from crowd images
and randomly rotate the crop by one of the four predefined angles (0, 90, 180, 270 degrees). The
model is trained with cross-entropy loss between the predicted and the actual rotation labels. The
optimization runs till saturation as evaluated on a validation set of images.
Once the training is complete, the FEN has learned useful features for density estimation and the
rotation classification head is removed. Now the parameters of FEN are frozen and is ready to be used
in the second stage of training through matching distributions.
7.2.3 Stage 2: Sinkhorn Training

After the self-supervised training stage, FEN is extended to a density regressor by adding two con-
volutional layers as shown in Figure 7.3. We take features from both second and third convolution
blocks for effectively mapping to crowd density. This aggregates features from slightly different re-
ceptive fields and is seen to deliver better performance. The layers of FEN are frozen and only a few
parameters in the freshly added layers are open for training in the second stage of distribution match-
ing. This particularly helps to prevent over-fitting as the training signal generated could be weak for
88
updating large number of parameters. Now we describe the details of the exact matching process.
The core idea is to compute the distribution of crowd density predicted by CSS-CCNN and op-
timize the network to match that closely with the prior Pprior . For this, a suitable distance metric
between the two distributions should be defined with differentiability as a key necessity. Note that
the predicted distribution is in the form of an empirical measure (an array of cell count values) and
hence it is difficult to formulate an easy analytical expression for the computing similarity. The classic
Earth Mover’s Distance (EMD) measures the amount of probability mass that needs to be moved if
one tries to transform between the distributions (also described as the optimal transport cost). But this
is not a differentiable operation per se and cannot be used directly in our case. Hence, we choose the
Sinkhorn distance formulation proposed in [99]. Sinkhorn distance between two empirical measures
is proven to be an upper bound for EMD and has a differentiable implementation. Moreover, this
method performs favorable in terms of efficiency and speed as well.
Let D CS represent the density map output by CSS-CCNN and C CS hold the cells extracted from
the predictions. To make the distribution matching statistically significant, a batch of images are
CS
evaluated to get the cell counts (Cmn s), which are then formed into an array H CS . We also sam-
ple the prior Pprior and create another empirical measure H GT to act as the ground truth. Now the
Sinkhorn loss Lsink is computed between H GT and H CS . It is basically a regularized version of
optimal transport (OT) distance for the two sample sets. Designate hGT and hCS as the probability
vectors (summing to 1) associated with the empirical measures H GT and H CS respectively. Now a
transport plan P could be conceived as the joint likelihood of shifting the probability mass from hGT
to hCS . Define U to be the set of all such valid candidate plans as,
d×d
U = {P ∈ R+ | P 1 = hGT , P T 1 = hCS }. (7.2)
There is a cost M associated with any given transport plan, where Mij is the squared difference
between the counts of ith sample of H GT and jth of H CS . Closer the two distribution, lower would
be the cost for transport. Hence, the Sinkhorn loss Lsink is defined as the cost pertinent to the optimal
transportation plan with an additional regularization term. Mathematically,
1
Lsink (H GT , H CS ) = arg min hP , M iF − E(P ), (7.3)
P ∈U β
where <>F stands for the Frobenius inner product, E(P ) is the entropy of the joint distribution P
and β is a regularization constant (see [99] for more details). It is evident that minimizing Lsink brings
the two distributions closer in terms of how counts are allotted.
The network parameters are updated to optimize Lsink , thereby bringing the distribution of pre-
89
dictions close to that of the prior. At every iteration of the training, a batch of crowd images are
sampled from the dataset and empirical measures for the predictions as well as prior are constructed
to backpropagate the Sinkhorn loss. The value of Lsink on a validation set of images is monitored for
convergence and the training is stopped if the average loss does not improve over a certain number of
epochs. Note that we do not use any annotated data even for validation. The counting performance is
evaluated at the end with the model chosen based on the best mean validation Sinkhorn loss.
Thus, our Sinkhorn training procedure does not rely on instance-level supervision, but exploits
matching the statistics computed from a set of inputs to that of the prior. One criticism regarding this
method could be that the model need not learn the task of crowd density estimation by optimizing
the Sinkhorn loss. It could learn any other arbitrary task that follows a similar distribution. The
counter-argument stems from the semantics of the features learned by the base network. Since the
initial training mostly captures features related to dense crowds (see Section 7.2.2), the Sinkhorn
optimization has only limited flexibility in what it can do other than map them through a fairly simple
function to crowd density. This is especially true as there is only a small set of parameters being
trained with Sinkhorn. It is highly likely and straightforward to map the frequent crowd features
to its density values, whose distribution is signaled through the prior. Moreover, we show through
extensive experiments in Section 7.3 and 7.4 that CSS-CCNN actually ends up learning crowd density
estimation.
7.2.4 Improving Sinkhorn Matching

As described already, the Sinkhorn training updates the network parameters by backpropagating
Sinkhorn loss Lsink , which brings the distribution of the density predictions closer to that of the
prior. But computing Lsink relies on estimating the optimal transport plan P ∗ (the solution to opti-
mization in equation 7.3) through the Sinkhorn iterations (see [99] for more details). The quality of
estimation of P ∗ directly affects the performance of the model. Hence, it is quite beneficial to aid
the computation of P ∗ by providing additional information. Any signal that can potentially boost the
transport assignments is helpful. For example, even simply grouping the prediction measures H CS to
a coarse sparse-dense categories and then restricting assignments within the groups from that of the
prior, leads to improved performance. This is because the restricted assignments make sure that the
dense samples from prediction are always mapped to dense points in the prior (similarly for sparse
ones), reducing costly errors of connecting dense ones to sparse and vice versa. However, one needs
to have the density category information to supplement the Sinkhorn assignments and that should be
obtained in an unsupervised fashion as well.
We observe that the edge details of crowd images could serve as an indicator of density. Highly
90
crowded regions seem to have more density of edges, while it is low for relatively sparse or non-
crowds. But this is a weak signal and can have lots of false positives. The higher density of edges
could arise from non-crowds such as background clutter or other patterns with more edges. Interest-
ingly for dense crowds, we find that this weak supervisory signal is good enough for grouping regions
into potential dense or very sparse categories. For any given crowd image, the standard Canny edge
detector [95] is applied to extract the edge map. The map is then blurred and down-sampled to look
like density maps. These pseudo maps resemble the actual ground truth crowd density in many cases,
having a relatively higher response in dense region than at sparse ones. Note that the absolute values
from the pseudo maps do not follow the actual crowd density and hence cannot be directly used for
supervision. However, given a set of crowd patches, the relative density values are sufficient to faith-
fully categorize regions into two broad density groups. This is done by first sorting pseudo counts of
the patches and then dividing the samples at a predetermined percentile. Crowd regions with pseudo
count values above this threshold are considered to be dense while those below goes to the highly
sparse or non-crowd. By employing a percentile threshold, accurate count values are not required and
the pseudo counts should only need to be relatively correct across the given set of images. Since any
random set of crowd patches should follow the prior distribution (as per the assumptions and approx-
imations in Section 7.2.1), the percentile threshold is fixed on the prior. We fix the threshold to be
30th percentile as there are roughly 30% samples that are non-crowds or with very low counts in the
range of 1.
We modify the Sinkhorn training to incorporate the pseudo density information in the following
manner: first, we compute the pseudo counts H CSP corresponding to the prediction samples H CS .
Using pseudo counts H CSP , H CS is split to the sparse part H0CS and the dense H1CS . The prior
samples are also grouped with the same threshold to get H0GT and H1GT . Now the Sinkhorn loss is
separately found for both the categories and added. The exact loss being backpropagated is,
L++
sink (H
GT
, H CS ) = Lsink (H0GT , H0CS ) + Lsink (H1GT , H1CS ) (7.4)
By separating out the assignment of sparse and dense samples, the counting performance of the model
increases as evident from the experiments in Section 7.3. Note that the the Sinkhorn training is
complete on its own without the auxiliary density information. It is a simple addendum to the method
that can improve the performance.
91
ST PartA JHU UCF-QNRF UCF-CC-50 U
Input
602.0 70.0 499.0 700.0
Ground
Truth
578.4 126.8 374.4 922.3
Fully
Supervised
CCNN
550.6 124.6 297.9 632.4
Self-
Supervised
CCNN
514..9 92.6 563..6 965..9
Completely
Self-
Supervised
CCNN
518.4 79.2 526.1 863.7
Completely
Self-
Supervised
CCNN++
Figure 7.4: Density maps estimated by CSS-CCNN along with that of baseline methods. Despite
being trained without a single annotated image, CSS-CCNN is seen to be quite good at discriminating
the crowd regions as well as regressing the density values.
7.3 Experiments
7.3.1 Baselines
Our completely self-supervised framework is unique in many ways that the baseline comparisons
should be different from the typical supervised methods. It is not fair to compare CSS-CCNN with
other approaches as they use the full annotated data for training. Hence, we take a set of solid baselines
92
for our model to demonstrate its performance. The CCNN Random experiment refers to the results
one would get if only Stage 1 self-supervision is done without the subsequent Sinkhorn training.
This is the random accuracy for our setting and helpful in showing whether the proposed complete
self-supervision works. Since our approach takes one parameter, the maximum count value of the
dataset (C f max ) as input, CCNN Mean baseline indicates the counting performance if the regressor
blindly predicts the given value for all the images. We choose mean value as it makes for sense in
this setting than the maximum (which anyway has worse performance than mean). Another important
validation for our proposed paradigm is the CCNN Pprior experiment, where the model gives out a
value randomly drawn from the prior distribution as its prediction for a given image. The counting
performance of this baseline tells us with certainty whether the Stage 2 training does anything more
than that by chance. Apart from these, the CCNN Fully Supervised trains the entire regressor with
the ground truth annotations from scratch. Note that we do not initialize CCNN with any pretrained
weights as is typically done for supervised counting models. CCNN Self-Supervised with Labels,
on the other hand, runs the Stage 1 training to learn the FEN parameters and is followed by labeled
optimization for updating the regressor layers. These are not directly comparable to our approach as
we do not use any annotated data for training, but are shown for completeness.
We evaluate our model on different datasets in the following sections. The results for the naive
version of the Sinkhorn loss Lsink (Section 7.2.3) is labeled as CSS-CCNN, whereas CSS-CCNN++
represents the one with the improved L++ sink (Section 7.2.4). Note that only the train/validation set
images are used for optimizing CSS-CCNN and the ground truth annotations are never used. The
counting metrics are computed with the labeled data from the test after the full training. Unless
otherwise stated, we use the same hyper-parameters as specified in Section 7.2.

We evaluate on the Shanghaitech Part A [1] dataset. The hyper-parameter used for is C f max = 3000.
We compare the performance of CSS-CCNN with the baselines listed earlier and other competing
methods in Table 7.1. The metrics for our method is evaluated for three independent runs with dif-
ferent initialization and the mean along with variance is reported. It is clear that CSS-CCNN outper-
forms all the baselines by a significant margin. This shows that the proposed method works better than
any naive strategies that do not consider the input images. With the improved loss, CSS-CCNN++
achieves around 5% less counting error than the naive version due to the more faithful Sinkhorn
matching process. Moreover, the CCNN network with rotation self-supervision also beats the model
developed in [109]. It is worthwhile to note that the performance of CSS-CCNN is reminiscent of
the results of early fully supervised methods with the MAE being better than a few of them as well
93
Method MAE MSE
CCNN Fully Supervised 118.9 196.6
Sam et al. [109] 154.7 229.4
CCNN Self-Supervised with Labels 121.2 197.5
C-CNN Random 431.1 559.0
C-CNN Mean 282.8 359.9
C-CNN Pprior 272.2 372.5
CSS-CCNN (ours) 207.3 ± 5.9 310.1 ± 7.7
CSS-CCNN++ (ours) 195.6 ± 5.8 293.2 ± 9.3
Table 7.1: Performance comparison of CSS-CCNN with other methods on Shanghaitech PartA [1].
Our model outperforms all the baselines.
(see Table 2 of [1]). Figure 7.4 visually compares density predictions made by CSS-CCNN and other
models. The predictions of our approach are mostly on crowd regions and closely follows the ground
truth, emphasizing its ability to discriminate crowds well.

The count hyper-parameter provided to the model for UCF-QNRF [3] is C f max = 12000. We achieve
similar performance trends on UCF-QNRF dataset as well. CSS-CCNN outperforms all the unsuper-
vised baselines in terms of MAE and MSE as evident from Table 7.2. Since the dataset has extreme
diversity in terms of crowd density, it is important to improve the Sinkhorn matching process and
faithfully assign appropriate counts across density categories. Owing to the better distribution match-
ing, CSS-CCNN++ achieves around 9% less counting error than CSS-CCNN, despite the dataset
being quite challenging.
Method MAE MSE

CCNN Mean 567.1 752.8
CCNN Pprior 535.6 765.9
CSS-CCNN (ours) 442.4 ± 4.2 721.6 ± 13.9
CSS-CCNN++ (ours) 414.0 ± 16.3 652.1 ± 15.6
Table 7.2: Benchmarking CSS-CCNN on UCF-QNRF dataset [3]. Our approach beats the baseline
methods in counting performance.
94
Method MAE MSE
Sam et al. [109] 433.7 583.3
CCNN Mean 771.2 898.4
CSS-CCNN (ours) 564.9 959.4
CSS-CCNN++ (ours) 557.0 737.9
Table 7.3: Performance CSS-CCNN on UCF-CC-50 [2]. Despite being very challenging dataset,
CSS-CCNN achieves better MAE than baselines.
7.3.4 UCF-CC-50 Dataset

UCF CC 50 dataset [2] has just 50 images with extreme variation in crowd density ranging from 94
to 4543. The small size and diversity together makes this dataset the most challenging. Since the
number of images is quite small, the assumption taken for setting the prior distribution gets invalid
to certain extent. But a slightly different parameter to the prior distribution works. We set α = 1
and C f max = 4000. Despite being a small and highly diverse dataset, CSS-CCNN is able to beat
all the baselines as clear from Table 7.3. The self-supervised MAE is also better than the method
in [109]. These results evidence the effectiveness of our method. CSS-CCNN++ improves upon the
result significantly in terms of MSE, indicating improved performance on highly dense crowds.
7.3.5 JHU-CROWD++ Dataset

JHU-CROWD++ [11, 12] is a new comprehensive dataset with 1.51 million head annotations span-
ning 4372 images. The crowd scenes are obtained under various scenarios and weather conditions,
making it one of the challenging dataset in terms of diversity. Furthermore, JHU-CROWD++ has
a richer set of annotations at head level as well as image level. The maximum count is fixed to
C f max = 8000. The performance trends are quite similar to other datasets, with our approach deliver-
ing better MAE than the baselines as evident from Table 7.4. This indicates the generalization ability
of CSS-CCNN across different types of crowd datasets.
7.3.6 Cross Data Performance and Generalization

In this section, we evaluate our proposed model in a cross dataset setting. CSS-CCNN is trained
in a completely self-supervised manner on one of the dataset, but tested on other datasets. Table
95
Method MAE MSE
CCNN Mean 316.3 732.3
CSS-CCNN (ours) 243.6 ± 9.1 672.4 ± 17.1
CSS-CCNN++ (ours) 197.9 ± 2.2 611.9 ± 12.0
Table 7.4: Evaluation of CSS-CCNN on JHU-CROWD++ [11, 12] dataset.
Train ↓ / Test → ST PartA UCF-QNRF JHU-CROWD++

ST PartA 207.3, 195.6 468.1, 472.4 254.0, 251.3
UCF-QNRF 251.2, 235.7 442.4, 414.0 236.5, 220.6
JHU-CROWD++ 290.2, 266.3 446.2, 417.4 243.6, 197.9
Table 7.5: Cross dataset performance of our model; the reported entries are the MAEs obtained for
CSS-CCNN and CSS-CCNN++ respectively.
7.5 reports the MAEs for the cross dataset evaluation. It is evident that the features learned from
one dataset are generic enough to achieve reasonable scores on the other datasets, increasing the
practical utility of CSS-CCNN. The difference in performance mainly stems from the changes in the
distribution of crowd density across the datasets. This domain shift is drastic in the case of UCF-CC-
50 [2], especially since the dataset has only a few images.
7.3.7 CSS-CCNN in True Practical Setting

The complete self-supervised setting is motivated for scenarios where no labeled images are available
for training. But till now we have been using images from crowd datasets with the annotations being
intentionally ignored. Now consider crawling lots of crowd images from the Internet and employing
these unlabeled data for training CSS-CCNN. For this, we use textual tags related to dense crowds
and similarity matching with dataset images to collect approximately 5000 dense crowd images. No
manual pruning of undesirable images with motion blur, perspective distortion or other artifacts is
done. CSS-CCNN is trained on these images with the same hyper-parameters as that of Shanghaitech
Part A and the performance metrics are computed on the datasets with annotations. From Table 7.6,
it is evident that our model is able to achieve almost similar or better MAE on the standard crowd
datasets, despite not using images from those datasets for training. This further demonstrates the
generalization ability of CSS-CCNN to learn from less curated data, emphasizing the practical utility
96
Train on CSS-CCNN CSS-CCNN++
web images MAE MSE MAE MSE
Test on ST PartA 208.8 309.5 184.2 268.8
Test on UCF-QNRF 450.7 755.9 422.1 699.9
Test on JHU-CROWD++ 241.2 706.8 231.0 660.1
Table 7.6: Evaluating CSS-CCNN in a true practical setting: the model is trained on images crawled
from the web, but evaluated on crowd datasets. The counting performance appears similar to that of
training on the dataset.
it could facilitate.
7.3.8 Performance with Limited Data

Here we explore the proposed algorithm along with fully supervised and self-supervised approaches
when few annotated images are available for training. The analysis is performed by varying the num-
ber of labeled samples and the resultant counting metrics are presented in Figure 7.5. For training
CSS-CCNN with data, we utilise the available annotated data to compute the optimal Sinkhorn as-
signments P ∗ and then optimize the Lsink loss. This way both the labeled as well as unlabeled data
can be leveraged for training by alternating respective batches (in a 5:1 ratio). It is clear that, at very
low data, scenarios CSS-CCNN beats the supervised as well as self-supervised baselines by a signif-
icant margin. The Sinkhorn training shows 13% boost in MAE (for Shanghaitech Part A) by using
just one labeled sample as opposed to no samples. This indicates that CSS-CCNN can perform well
in extremely low data regimes. It takes about 20K head annotations for the supervised model to per-
form as well as CSS-CCNN. Also, CSS-CCNN has significantly less number of parameters to learn
using the labeled samples as compared to a fully supervised network. These results suggests that our
complete self-supervision is the right paradigm to employ for crowd counting when the amount of
available annotated data is less.
7.4.1 Ablations on Architectural Choices

In Table 7.7, we validate our architectural choices taken in designing CSS-CCNN. The first set of
experiments ablates the Stage 1 self-supervised training. We perform Sinkhorn training on a ran-
domly initialized FEN (labeled as Without Stage 1) and receive a worse MAE. In the chosen setting
of self-supervision with rotation, the input image is randomly rotated by one of the four predefined
97
ST_PartA UCF-QNRF
Figure 7.5: Comparing our completely self-supervised method to fully supervised and self-supervised
approaches under a limited amount of labeled training data. The x-axis denotes the number of training
images along with the count (in thousands) of head annotations available for training, while the y-axis
represents the MAE thus obtained. At low data scenarios, CSS-CCNN has significantly superior
performance than others.
angles for creating pseudo labels. Now we analyse the effect of the number of rotation classes on the
final counting metrics. As evident from the table, four angles stands to be the best in agreement with
previous research on the same [96]. Self-supervision via colorization is another popular strategy for
learning useful representations. The model is trained to predict the a-b color space values for a given
gray-scale image (the L channel). The end performance is observed to be inferior in comparison with
that of the rotation task. Another option is to load FEN with ImageNet trained weights (as this is
a typical way of transfer learning) and then employ Stage 2. The result (With ImageNet weights) is
worse than that of CSS-CCNN, suggesting that the self-supervised training is crucial to learn crowd
features necessary for density estimation. Furthermore, the base feature extraction network (FEN)
(see Figure 7.3) is changed to ResNet blocks and CSS-CCNN is trained as well as evaluated (with
ResNet based FEN). Simple VGG style architecture appears to be better for density regression. We
also run experiments with different types of prior distributions and see that the power law with ex-
ponential cutoff works better, justifying our design choice. The without skip connection experiment
trains CSS-CCNN devoid of the features from the second convolutional block in FEN being directly
fed to C2 (see Figure 7.3 and Section 7.2.3). As expected, the feature aggregation from multiple
layers improves the counting performance. The cell sizes used for computing count histograms (see
Section 7.2.1) are varied (labeled Cell Size) to understand the effect on MAE. The metrics seem to
be better with our default setting of 8 × 8. CSS-CCNN employs a prior parametric distribution to
98
Method MAE MSE
Without Stage 1 257.5 397.7
Rotation with 2 class 233.5 344.1
Rotation with 8 class 232.2 341.5
Colorization 242.5 363.0
With ImageNet weights 257.8 370.8
ResNet based FEN 244.8 332.4
Uniform Prior 261.8 406.0
Pareto Prior 248.3 386.2
Lognormal Prior 239.5 345.8
Without skip connection 226.8 329.1
Cell Size 2 × 2 243.9 374.6
Cell Size 4 × 4 251.6 389.4
With GT distribution 202.7 300.3
Percentile threshold 10 191.5 288.7
Percentile threshold 50 189.1 286.8
CSS-CCNN 197.3 295.9
CSS-CCNN++ 187.7 280.21
Table 7.7: Validating different architectural design choices made for CSS-CCNN evaluated on the
Shanghaitech Part A [1] (computed on single run).
facilitate the unlabeled training. We investigate the case where the prior is directly given in the form
of an empirical measure derived from the ground truth annotations. For the Sinkhorn training, this GT
distribution is sampled to get H GT (see Section 7.2.3) instead of Pprior . The resultant MAE is very
similar to the standard CSS-CCNN setting, indicating that our chosen prior approximates the ground
truth distribution well. Lastly, we ablate the percentile threshold used to extract of pseudo density
category for the CSS-CCNN++ model (Section 7.2.4) and find that the default setting helps in better
density differentiation.
7.4.2 Analysis of the Prior Distribution

The proposed Sinkhorn training requires a prior distribution of crowd counts to be defined and the
choice of an appropriate prior is essential for the best model performance as seen from Table 7.7.
Here we analyze the crowd data more carefully to see why the truncated power law is the right choice
of prior. For this, the counts from crowd images are extracted as described in Section 7.2.1 and a
maximum likelihood fit over various parametric distributions is performed. The double logarithmic
visualization of the probability distribution of both the data and the priors are available in Figure 7.6.
99
ST_PartA UCF-QNRF
Figure 7.6: Double logarithmic representation of maximum likelihood fit for the crowd counts from
Shanghaitech Part A [1] and UCF-QNRF [3].
Note that the data curve is almost a straight line in the logarithmic plot, a clear marker for power law
characteristic. Both truncated power law and lognormal tightly follow the distribution. But on close
inspection of the tail regions, we find truncated power law to best represent the prior. This further
validates our choice of the prior distribution.
7.4.3 Sensitivity Analysis for the Crowd Parameter

As described in Section 7.2.1, CSS-CCNN requires the maximum crowd count (C f max ) for the given
set of images as an input. This is necessary to fix the prior distribution parameter λ. One might
not have the exact max value for the crowds in a true practical setting; an approximate estimate is a
more reasonable assumption. Hence, we vary C f max around the actual value and train CSS-CCNN on
Shanghaitech Part A [1] and UCF-QNRF [3]. The performance metrics in Table 7.8 show that chang-
ing C f max to certain extent does not alter the performance significantly. The MAE remained roughly
within the same range, even though the max parameter is being changed in the order of 500. Note that
the results are computed with single runs. These findings indicate that the our approach is insensitive
to the exact crowd hyper-parameter value, increasing its practical utility. We also check the sensitivity
of our approach on the power law exponent α. Varying α around 2 results in similar performances, in
agreement with the findings of existing works and our design choice (see Section 7.2.1).
100
ST PartA UCF-QNRF
Param MAE MSE Param MAE MSE
f max f max
C = 2000 204.2 316.4 C = 10000 443.9 749.7
f max f max
C = 2500 197.9 304.6 C = 11000 446.9 757.5
f max f max
C = 3000 197.3 295.9 C = 12000 437.0 722.3
C f max = 3500 191.9 288.5 C f max = 14000 446.1 697.5
α = 1.9 202.9 303.3 α = 1.9 438.3 700.6
α = 2.0 197.3 295.9 α = 2.0 437.0 722.3
α = 2.1 200.7 305.6 α = 2.1 446.4 756.3
Table 7.8: Sensitivity analysis for the hyper-parameters on CSS-CCNN. Our model is robust to fairly
large change in the max count parameter.
7.4.4 Analysis of Features

To further understand the exact learning process of CSS-CCNN, the acquired features can be com-
pared against that of a supervised model. Figure 7.7 displays the mean feature map for the outputs
at various convolutional blocks of CSS-CCNN along with that of the supervised baseline (see Sec-
tion 7.3) evaluated on a given crowd image. Note that Conv4 stands for the regressor block that is
trained with the Sinkhorn loss in the case of CSS-CCNN. It is clear that the self-supervised features
closely follow the supervised representations, especially at the initial blocks in extracting low-level
crowd details. Towards the end blocks, features are seen to diverge, with fully supervised Conv4
outputs appearing like density maps. But notice that the corresponding completely self-supervised
Input Image Conv1 Block Conv2 Block Conv3 Block Conv4 Block
Fully
Supervised
CSS-CCNN
Figure 7.7: Visualization of mean features extracted from different convolutional blocks of CSS-
CCNN and the supervised baseline.
101
outputs have higher activations on heads of people, which is relevant for the end task of density esti-
mation. This clearly shows that CSS-CCNN indeed learns to extract crowd features and detect heads,
rather than falling in a degenerate case of matching the density distribution without actually counting
persons.
7.5 Conclusions
We show for the first time that a density regressor can be fully trained from scratch without using
a single annotated image. This new paradigm of complete self-supervision relies on optimizing the
model by matching the statistics of the distribution of predictions to that of a predefined prior. Though
the counting performance of the model stands better than other baselines, there is a performance gap
compared to fully supervised methods. Addressing this issue could be the prime focus of future works.
For now, our work can be considered as a proof of concept that models could be trained directly for
solving the downstream task of interest, without providing any instance-level annotated data.
102
Part III
Addressing Person Localization
103
Chapter 8
Spot-on Dot Prediction for Dense Crowds
ECENT works have made huge progress in devising new architectures and algorithms to improve
R performance of density regression based approaches. The major metric for performance evalu-
ation of counting models only considers overall count estimation and does not account localization of
prediction on to individual humans. Though these methods deliver good count accuracy for a given
crowd scene, the localization seems poor for further downstream applications. This is because the
density map describes the people count in local regions and hence the focus is not to accurately locate
each person. Moreover, the notion of density makes more sense when people are relatively closer as
in highly dense crowds. The density surface in sparser crowds have frequent discontinuities and the
values over human heads are mostly near zero. This is evident in Figure 8.1, where the density peaks
on large faces are spread out, indicating practically almost no detection in sparse region. Also note
that one could not consistently find local peaks in these regions to localize persons. This is largely true
irrespective of Gaussian kernel parameters used for ground truth density map creation. Furthermore,
the local peaks might not accurately correspond to the location of persons (except in certain density
ranges) as it is trained for regressing density in a local region rather than to pinpoint people. Hence,
any simple method to post-process density maps for better localization might not scale equally across
the entire density range (see Section 8.3.2). In contrast, ideally one would expect spot-on predictions
on people at all scales. Such a system facilitates applications other than computing mere counts.
From accurate dot detections, faces and features can be extracted for other purposes, which is cum-
bersome with density maps. Above all, individual detection of people facilitates a more explainable
and practical AI system.
Hence, in this work, we try to break the ‘traditional’ paradigm of training for density regression
and replace it with accurate dot detection framework. We define the problem statement as to predict
localized dots over the head of any person irrespective of the scale, pose or other variations. Addi-
tionally, this has to be done without any bounding box annotations, but only with point annotations
104
Figure 8.1: Dot Detection Vs Density Regression. The top row shows crowds with dot predictions
from the proposed DD-CNN, while bottom row has corresponding density maps. The dot detection
has better localization of individuals across density ranges.
available with crowd datasets. There are many challenges in achieving such a goal; the major one
being the extreme scale and density variation in crowd scenes. In normal detection scenarios, this is
trivially done using a multi-scale architecture, where images are fed to the model at different scales
and trained. However, such a naive approach is not possible in our case since there is no ground
truth scale information (through bounding boxes) available with the crowd dataset, instead only point
annotation are present. Furthermore, the multi-scale architecture has to deal with large variation in
appearance of people across scale. A lower scale person simply is not a rescaled version of a large
face, but looks drastically different. In sparse crowd, facial features may be visible, but in highly
dense crowd people are only seen as blobs. These pose certain unique issues in formulating a dense
dot prediction system.
We devise a Dot Detection CNN model, named DD-CNN for the proposed challenging problem.
The basic idea is to train the CNN model for pixel-wise binary classification task of detecting people.
Cross entropy loss is used instead of l2 regression employed in density estimation. DD-CNN is
optimized in a multi-scale architecture which does not require ground truth scale information, but
uses only point supervision.
105
In summary, this chapter contributes:
• A new training paradigm of dot detection for crowd counting, dropping the prevalent density
regression.
• A unique multi-scale fusion architecture that facilitates highly localized detection of people in
dense crowds.
• A novel training regime that only requires point supervision, but delivers significant perfor-
mance.
8.1 Our Approach

In the previous section, we have motivated the paradigm shift from density regression to dot detec-
tion. The basic objective is to predict highly localized points on heads of people as close to the ground
truth annotation. At a high-level view, this is a dense classification task, where at each pixel the model
has to predict the presence of a person irrespective of the scale, pose or other variations. Figure 8.2
illustrates our proposed solution, the dot detection framework DD-CNN. DD-CNN is composed of
four functional modules; the first Crowd Feature Extraction network converts the input crowd scene
to rich features at multiple resolutions. Then this feature set is processed by Multi-Scale Feedback
CROWD FEATURE EXTRACTION MULTI-SCALE FEEDBACK ADAPTIVE SCALE FUSION DOT PREDICTION
⅛ SCALE
C | 256
C | 128
P BRANCH
C | 64
C | 32
C|1
3C | 256 3C | 512 D⅛
P
Threshold
P
C | 256
T
T | 256
TOP-DOWN
2C | 128 FEEDBACK
P
C | 256
C | 512
C | 256
C | 128
DOT MAP
C | 64
C | 32
C|1
2C | 64 3C | 256 D¼
¼ SCALE
BRANCH
Input VGG-16 Block P 2x2 Pooling 3x3 Conv with stride 1 3x3 DeConv with stride 2
Figure 8.2: The architecture of the proposed dot detection network. DD-CNN has a multi-scale
architecture with dot predictions at different resolutions, which are combined through Adaptive Scale
Fusion. The networks are trained with pixel-wise binary cross-entropy loss.
106
module, which correlates multi-scale information to generate predictions at multiple resolutions. Sub-
sequently, the novel Adaptive Scale Fusion module combines the multi-scale predictions into single
map, where each value indicates the confidence of person detection. A threshold is applied on this
map to generate the final accurate dot predictions. The following sections describes in detail each
functional modules as well as the training regime for DD-CNN.
8.1.1 Crowd Feature Extraction

Good features form backbone of any vision systems. It has been recently shown that VGG-16 [7]
based networks work well for crowd feature extraction and achieve state-of-the-art performance [29].
Following the trend, we employ the first four 3 × 3 convolutional blocks from VGG-16, which are
initialized with ImageNet trained weights. The input to the network is a three channel image of fixed
size 224 × 224. Due to max-pooling, the resolution of feature maps halves every block. After the
second max-pooling, the network branches into two, with the third block being replicated in both.
The third block is copied so that the two branches specialize by sharing low-level features without
any conflict. The two branches give out feature map sets at different resolutions. One set has size
one-fourth that of the input image and is meant to resolve relatively dense crowd features. The other
one-eighth resolution feature maps are for discriminating sparse crowd and large faces as they have
higher receptive field. These multi-scale feature sets are used by the subsequent modules to make dot
prediction.
8.1.2 Multi-Scale Module

The feature extraction blocks are followed by two columns of CNN for processing the multi-scale
feature maps. As shown in Figure 8.2, each feature set is passed through a block of 3 × 3 convolution
layers to finally make per-pixel binary classification for presence of a person. These layers have
ReLU non-linearity, except for the last, which has Sigmoid to predict pixel-wise confidence. Since
the one-eighth scale feature set is computed with a larger receptive field, it could have global context
information regarding crowd regions in the image. The one-fourth counterpart, though has a higher
resolution, its predictions are based on limited global context and could result in false detection on
crowd like patterns. Hence, we leverage the context information from one-eighth scale set through
a top-down feedback connection. Basically, a transpose convolution layer is used to upsample the
one-eighth feature maps followed by a normal 3 × 3 convolution to extract feedback feature maps.
The feedback maps are then concatenated with the one-fourth scale column features. This helps the
scale column block to receive high-level context information and achieve better prediction at higher
resolution.
107
Apart from handling drastic variation in scale of appearance of people, such a multi-scale architec-
ture is also motivated from the need to predict at the exact location in the output map. Note that there
is inherent inconsistency in ground truth annotation of heads. The location of annotations vary widely
in sparse crowds, where the point could be any where on the face or head. This issue is relatively less
for dense regions owing to small heads, but requires prediction at smaller resolution for sparse crowds
like the one-eighth. At this size, there is a high chance that the predicted and ground truth location
closely match. But predicting at 1/8th resolution causes one pixel in the output to represent multiple
people in a dense region. This calls for progressive prediction at increasing resolutions for better
performance at all densities. However, we empirically find that two scales are sufficient to capture
this variability for existing benchmark datasets. Now the challenge is to combine the multi-resolution
predictions, which can have overlapping predictions with no scale information being available.
8.1.3 Multi-Scale Pretraining

The training of DD-CNN is done in two stages; the first is the Multi-Scale Pretraining and the second
is Adaptive Scale Training (Section 8.1.5). Here we discuss the pretraining of the multi-scale network.
The multi-scale module outputs per-pixel confidences at two different resolutions and we train each
scale with per-pixel binary cross entropy loss. The loss is defined as,
1 X λY 0 [x,y]logX[x,y]+
L(X, Y, λ) = (1−Y 0 [x,y])log(1−X[x,y]) (8.1)
N
x,y
where X is network prediction for a given input image and Y is the point ground truth map. Y 0 [x, y] =
min(Y [x, y], 1) simply represents the binarized version of Y , where value 1 at pixel (x, y) indicates the
presence of a person and 0 for background. Note that the summation runs over the spatial dimensions
of the output, making the objective per-pixel. Since there are significantly less points with persons
than without in training images, class imbalance might arise. So while training, we weigh the person
class more by a factor λ (typically 2 or 4) and is observed to improve the performance.
Let D 1 and D 1 be respectively the one-fourth and one-eighth scale prediction maps. We train the
4 8
individual scale columns with ground truth binary maps of same resolution. These maps are created
from the head annotations available with crowd datasets. If DGT 1 and DGT
1 represents the ground truth
4 8
maps, they are generated as,
X
DGT
1 [x, y] = 1(x,y)=(b x0 c,b y0 c) (8.2)
s s s
x0 ,y 0
where (x0 , y 0 ) are the annotated locations of people and s is either 4 or 8 for the two scales. Note that 1
is indicator function and b.c denotes floor operation. The expression evaluates to the number of people
108
being annotated at any location (x, y) in the downsampled resolution. For pretraining, we optimize
parameters of one-eighth scale branch by minimizing the loss L(D 1 , DGT 1 , λ 1 ). Standard mini-batch
8 8 8
gradient descent with momentum is employed (learning rate is fixed to 1e-3). Once the training
is saturated, the weights updated for one-eighth branch are frozen and then remaining one-fourth
network blocks are optimized. This is done by backpropagating one-fourth loss L(D 1 , DGT 1 , λ 1 ).
4 4 4
Note that this scale is trained with the top-down feedback features and outputs dot map with a higher
resolution. Thus, we have dot predictions at two different resolutions for the same crowd scene and
can have inconsistent or inconclusive detections which needs to be faithfully combined.
8.1.4 Adaptive Scale Fusion and Dot Detection

A multi-scale architecture in the dot detection framework offers some unique challenges. The impor-
tant one is the absence of scale information of the crowd. For a given person in a crowd image, there
is no information regarding the size of the person in order to train with the correct scale. Hence we
propose a novel Adaptive Scale Fusion (ASF) strategy, which does not require bounding box annota-
tion, but delivers accurate dot prediction across drastic scale and density variations. ASF essentially
combines the predictions from multi-scale module and forms one output at the higher one-fourth res-
olution. For any given point in one-eighth prediction map (D 1 ), corresponding region in the next
8
higher resolution scale is taken (2 × 2 region in one-fourth scale D 1 ) and the scale in which the
4
maximum response occurs is the winning candidate. This is conceptually similar to scale pyramids,
but adapted for resolving dot detections from multi-resolution predictions. To be more precise, let
p(x) = b x2 c evaluates to the coordinate in D 1 for a pixel at location x of D 1 . Now for every pixel in
8 4
ASF output T, we compute an indicator variable I[x, y] to identify the scale and correct detections are
filtered out. The ASF operation is expressed mathematically as,

1 if D 1 [p(x), p(y)] ≥ max D 1 [x0 , y 0 ]


8 (p(x0 ),p(y 0 )) 4
I[x, y] = =(p(x),p(y)) (8.3)

0

otherwise,


 D 1 [x, y] if I[x, y] = 0
 4

T[x, y] = D 1 [p(x), p(y)] if ( x2 , y2 ) = (p(x), p(y)) (8.4)
 8


0 otherwise,
where T has one-fourth resolution. Note that the max operation is applied over all (x0 , y 0 ) pairs that
maps to the same coordinates in D 1 as that of point (x, y).
8
109
In a nutshell, ASF merges the dot maps from multiple scales; a point in one scale is selected if
it is maximum in its scale neighbourhood. This framework helps to select the scale which is giving
higher prediction confidence. A threshold is applied on the output of ASF to generate the final highly
localized binary dot map.
8.1.5 Adaptive Scale Training

After the Multi-Scale Pretraining of individual scale branches, we perform joint training to fine-tune
the columns on two specialties. Ideally, we would like the D 1 network to specialize on sparse crowds
8
(or people appearing large) and D 1 in dense crowds corresponding to their receptive fields. Such
4
a division is enforced with the ASF architecture through a special training procedure. Note that
straightforward training of ASF is not trivial due to absence of any scale information. For example,
a person may have detections in all the scales. One cannot simply take the scale with maximum
confidence, because D 1 scale predictions are seen to dominate in confidence value as it aggregates
8
more information regarding a point than scale D 1 . Hence, we device Scale Adaptive Training which
4
fine-tunes the two scale columns such that each responds more to its own specialties and the ASF can
then be done faithfully at test time.
To aid better training with ASF architecture, we leverage on the observation that some scale in-
formation can be obtained from ground truth point annotation. For example, at one-eighth resolution
prediction, people in dense crowds would merge as one point (happens if there are multiple people in
a region of 8 × 8). This provides a clear signal that these people could not be resolved at one-eighth
scale and has to be in the other scale. So for the Adaptive Scale training, we incorporate this Overlap
Criteria (OLC) on top of ASF to selectively fine-tune scale columns and achieve better specialization.
For every point in D 1 map, a check for overlap of ground truth points is performed. If there is an
8
overlap in D 1 , it means that the point under consideration has to be trained in D 1 . This is done by
8 4
setting the loss for the location to be zero in one-eighth D 1 and allowing D 1 network branch to be
8 4
updated. Such an adaptive training causes the two scale networks to specialize on crowds of different
types. However, OLC does not indicate anything about the scale of the majority non-overlapping
points. For these points, the ASF module selects a scale, which is the scale corresponding to the point
having the highest confidence. This acts like promoting the “winner” and updating the selected scale
network. The exact loss formulation is:

1 if DGT
1 [p(x), p(y)] > 1
M 1 [x, y] = 8 (8.5)
4 1 − I[x, y] otherwise
110
M 1 [x, y] = 1 − M 1 [p(x), p(y)] (8.6)
8 4
LM = L(D 1 M 1 , DGT
1 M1 , λ1 )
4 4 4 4
4
(8.7)
+L(D 1 M 1 , DGT
1 M 1 , λ 1 ),
8 8 8 8 8
where LM is joint loss for Adaptive Scale training and M 1 represent the mask variables to indi-
s
cate selected points for backpropagation. We train DD-CNN by minimizing LM in same way as in
pretraining. The branches of DD-CNN progressively get specialized possibly for different crowd den-
sities. This results in the columns to respond more for its own specialties and facilitate ASF at test
time. Note that OLC is crucial for the optimization as it acts as tie breaker.
At test time, only ASF is performed (as in Figure 8.2) and the points are selected based on the
confidence. This adaptive architecture helps in predicting highly localized dots on people ranging in
sparse to dense crowds. Finally, the threshold value (typically ∼0.5) for dot detection is selected so
as to minimize the MAE over a validation set.
8.2 Experiments
Primarily two metrics (MAE and MSE) are employed to evaluate any crowd counting system. Though
we use the same metrics in the following sections, there are some severe drawbacks with them. The
major limitation is that the metrics do not consider localization of the predictions. The MAE only
measure the accuracy of overall count prediction and hence we evaluate our model on some addi-
tional localization metrics in Section 8.3.3. Note that except for UCF-QNRF dataset, for all other
experiments the weighting hyper-parameters are set as λ 1 = 2 and λ 1 = 1.
4 8
Method MAE MSE

Idrees et al. [2] 315 508
MCNN [1] 277 426
CMTL [24] 252 514
SCNN (Chapter 3) 228 445
IG-CNN (Chapter 4) 125.9 217.2
DD-CNN (Ours) 120.6 161.5
Table 8.1: Performance of DD-CNN along with other methods on UCF-QNRF dataset [2]. Our model
has better count estimation than all other methods.
111
112.0 1068.0 885.0 307.0
INPUT
IMAGE
112.0 1068.0 885.0 307.0
GROUND
TRUTH
111.0 1173.0 911.0 350
DOT MAP
(DD-CNN)
111.5 1027.6 967.7 373.4
DENSITY
MAP
(CSRNetA)
Figure 8.3: Predictions made by DD-CNN on images of Shanghaitech dataset [1]. The results em-
phasize the ability of our dot detection approach to localize people in crowds (zoom in the dot maps
to see the difference).

UCF-QNRF is introduced by [3] and is one of the largest dense crowd counting dataset. For this
dataset, the class weighting factors are set as λ 1 = 4 and λ 1 = 2. Table 8.1 benchmarks DD-CNN
4 8
with other regression models. DD-CNN obtains an MAE of 120.6, which is 12.6 lower than that of
[3]. This shows that our approach is quite adaptable to highly diverse crowd scenario with relatively
low MAE.
8.2.2 UCF CC 50 Dataset

UCF CC 50 dataset [2] is a dataset of 50 images of highly diverse and dense crowds. The dataset
poses a severe challenge to crowd counting models due to the small size and the drastic density varia-
tion, which ranges from 94 to 4543 people per image. A five fold cross-validation testing is performed
112
Input Image D1/8 Output D1/4 Output DD-CNN Prediction
Figure 8.4: Dot predictions made by individual scale columns of DD-CNN on Shanghaitech dataset
[1]. The outputs clearly shows that the multi-scale training improves significantly the dot prediction
quality. (zoom in to see the difference)
on the dataset for evaluation. From Table 8.2, it is seen that DD-CNN delivers an impressive MAE
of 215.4 and even beats the SA-Net [23] regression model by a margin of 43. Despite being a small
dataset with drastic diversity, the state-of-the-art counting performance of our model, well evidence
the effectiveness of dot detection and could be attributed to the cross-entropy loss used for training.
Method MAE MSE

Zhang et al. [9] 467.0 498.5
MCNN [1] 377.6 509.1
SCNN (Chapter 3) 318.1 439.2
CP-CNN [25] 295.8 320.9
Liu et al. [110] 279.6 388.9
IC-CNN [26] 260.9 365.5
CSR-Net [29] 266.1 397.5
SA-Net [23] 258.4 334.9
DD-CNN 215.4 295.6
Table 8.2: Comparison of DD-CNN performance on UCF CC 50 [2]. DD-CNN beats other models
in terms of MAE and MSE.
113
We train our DD-CNN on the dataset and Part A results are reported in Table 8.3. Note that all other
models in the table are based on density regression and is not exactly fair to compare DD-CNN with
just MAE. DD-CNN achieves a detection MAE of 71.9 in Part A, which is very close to the count
error of best regression methods, with the difference being just 4.9. For Part B, we use only one-
eighth branch of DD-CNN as the dataset is less dense and does not require a second scale branch.
Again, the MAE for Part B indicates that our approach has competitive performance along with all
the merits of being a detection model. Figure 8.3 displays some dot predictions results of DD-CNN.
ST Part A ST Part B
Model MAE MSE MAE MSE
Zhang et al. [9] 181.8 277.7 32.0 49.8
MCNN [1] 110.2 173.2 26.4 41.3
SCNN (Chapter 3) 90.4 135.0 21.6 33.4
CP-CNN [25] 73.6 106.4 20.1 30.1
IG-CNN (Chapter 4) 72.5 118.2 13.6 21.1
Liu et al. [110] 72.0 106.6 14.4 23.8
IC-CNN [26] 68.5 116.2 10.7 16.0
CSR-Net [29] 68.2 115.0 10.6 16.0
SA-Net [23] 67.0 104.5 8.4 13.6
DD-CNN 71.9 111.2 12.9 20.3
Table 8.3: Comparison of DD-CNN performance on Shanghaitech Part A and Part B dataset [1].
DD-CNN delivers very competitive count accuracy relative to other regression models.
8.3.1 Effect of Multi-Scale Architecture

As described in Section 8.1.2, the proposed DD-CNN employs a multi-scale architecture with dot
predictions at two different resolutions. This is motivated so as to address the drastic scale variation
across sparse to dense crowds. We require localized dot prediction for both large faces/heads as well
as for people in dense regions. A single network prediction would be biased to frequently appearing
crowd type and would give lower confidence for large faces and fail to cross detection threshold. This
problem of almost no response for people appearing large is severe with density regression (see Fig-
ure 8.3). To empirically establish the usefulness of the proposed DD-CNN architecture, we ablate our
114
model in Table 8.4. We train a regression model, CSRNet-A [29] (CSR-A-reg) which is similar to the
network used for one-eighth branch of DD-CNN. Though the performance of the two models in terms
of MAE is close, DD-CNN has better localized prediction across crowd densities (Section 8.3.3). The
count errors for individual scale columns are also listed in Table 8.4 and outputs are shown in Figure
8.4. As expected, the individual scale MAEs are higher than the combined multi-scale count error.
We also see that MAE drops significantly without the Overlap Criteria (OLC) for training. Further,
we run DD-CNN without the top-down feedback (TDF) connection. The performance with feedback
is higher than without, indicating a possible propagation of high-level context to the next scale. These
show that DD-CNN processes at different scales to improve prediction.
ST Part A UCF-QNRF
CSR-A-reg 73.65 120.06 173.45 203.27
CSR-A-reg-dot 84.61 142.03 198.34 248.43
CSR-A-thr 309.8 513.4 384.42 566.98
CSR-A-thr-dot 167.09 218.22 164.38 204.97
DD-CNN D 1 only 75.67 109.36 165.78 254.71
8
DD-CNN D 1 only 125.19 190.77 234.7 511.9
4
DD-CNN (no TDF) 81.34 136.21 341.3 422.4
DD-CNN (no OLC) 104.95 151.26 346.81 406.59
DD-CNN 71.9 111.2 120.6 161.5
Table 8.4: Results for DD-CNN model ablative experiments. The results evidence the effectiveness
of the design choices.
8.3.2 Dot vs Density Maps

We emphasize that the dot map framework is fundamentally different from density map in terms of
the approach, philosophy and benefits. Here we show that the dot maps cannot be easily obtained
by post-processing density maps. The CSR-A-reg model trained for regression in Section 8.3.1, is
evaluated and density predictions are converted to dot maps by thresholding. The threshold value is
selected over a validation set to minimize the detection MAE. However, we find it difficult to threshold
density maps without loss of counting performance. Lower the sigma (σ) of Gaussian used for density
map generation, lower is the MAE drop. Though CSR-A-reg is trained with sigma as small as 1.0
(at prediction resolution), the MAE after thresholding is above 300, labeled as CSR-A-thr entry in
Table 8.4. We even go to the extreme of dot map regression (σ = 0), for which the normal MAE is
reasonable (CSR-A-reg-dot). But again, the thresholded MAE is very high (CSR-A-thr-dot). Some
115
Figure 8.5: Detection by thresholding density maps of CSR-A-thr-dot net; results show almost no
detections in sparse regions.
outputs of this model are shown in Figure 8.5, which clearly indicates hardly any detection in sparse
regions and spurious or multiple predictions in remaining areas. Since no scale information regarding
the detected person (like bounding boxes) is available, simple non-maximal suppression techniques
do not work well across density ranges. Hence it is clear that these thresholding methods suffer from
poor detections and results in much higher MAE than DD-CNN.
8.3.3 Localization of Detections

In this section, we analyse the localization of dot detection framework through some additional met-
rics. The MAE metric popularly being used in crowd counting, does not take into account prediction
localization. It simply checks whether the overall crowd count of the scene matches with the ground
truth. In other words, it is not necessary to detect people to get good MAE scores, but spurious re-
sponses could be counted as well. Hence, we propose a new metric named Mean Offset Error (MOE).
MOE is defined as the distance in pixels between the predicted and ground truth dot averaged over
test set. This is evaluated at the model prediction resolution and directly accounts for dot localization.
A fixed penalty of 12 pixels is added for absent or spurious dot detections. Next, we follow [3] and
116
consider a detection correct if the prediction is within a threshold distance. The threshold is varied
to evaluate localization with average precision (L-AP), recall (L-AR) and Area under ROC (L-AuC).
Furthermore, the Grid Average Mean absolute Error or GAME [13] metric, which is indicative of
local count prediction accuracy, is also considered. GAME divides the prediction map into a grid of
cells and the cell crowd counts are averaged.
Table 8.5 lists the performance of our model relative to the regression baseline on the metrics
specified above. We use CSR-A-thr-dot model defined in Section 8.3.2 as baseline and compute
localization metrics on detections from thresholded density maps. Clearly, DD-CNN outperforms the
regression model in localization as evident from MOE and L-AUC scores. The same trend is observed
in different levels of GAME metric as well. These experiments demonstrate that the proposed dot
detection framework delivers superior localization, while still maintaining high count accuracy.
ST Part A UCF-QNRF
Metric CSR-A DD-CNN CSR-A DD-CNN
MOE ↓ 5.13 4.93 4.16 2.91
L-AP ↑ 0.61 0.65 0.72 0.82
L-AR ↑ 0.76 0.81 0.77 0.83
L-AuC ↑ 0.45 0.69 0.68 0.78
GAME(0) ↓ 167.09 71.9 176.43 120.6
GAME(1) ↓ 214.87 86.08 185.76 123.54
GAME(2) ↓ 241.44 91.05 194.36 134.79
GAME(3) ↓ 263.0 105.12 216.26 141.68
Table 8.5: Evaluation of DD-CNN and baseline regression on the localization metrics to analyse the
dot prediction performance. Our model seems to achieve better localization of predictions.
8.4 Conclusion
We propose a novel change to the framework of density regression employed for dense crowd count-
ing. The density maps typically generated by existing regression models suffer from poor localization
among other limitations. We address these issues by reformulating the counting task as a localized
dot prediction problem. The proposed model, DD-CNN, is trained for per-pixel binary classification
task of predicting a person. DD-CNN employs a multi-column multi-scale architecture to handle the
drastic scale variations. Extensive evaluations indicate that the model achieves better or competitive
performance compared to the state-of-the-art methods, despite providing the merits of a dot detection
system. In the next chapter, we extend the approach to a full-fledged detection framework that puts
bounding boxes on all the persons in dense crowds.
117
Chapter 9
Dense Detection to accurately resolve People

in Crowds
HERE exists a huge body of works on crowd counting. They range from initial detection based
T methods [14, 15, 16, 17] to later models regressing crowd density [1, 2, 9, 21, 25, 29, 70]. The
detection approaches, in general, seem to scale poorly across the entire spectrum of diversity evident
in typical crowd scenes. Note the crucial difference between the normal face detection problem with
crowd counting; faces may not be visible for people in all cases (see Figure 9.1). In fact, due to
extreme pose, scale and view point variations, learning a consistent feature set to discriminate people
seems difficult. Though faces might be largely visible in sparse assemblies, people become tiny blobs
in highly dense crowds. This makes it cumbersome to put bounding boxes in dense crowds, not to
mention the sheer number of people, in the order of thousands, that need to be annotated per image.
Consequently, the problem is more conveniently reduced to that of density regression (Section 1.2).
In density estimation, a model is trained to map an input image to its crowd density, where the
spatial values indicate the number of people per unit pixel. To facilitate this, the heads of people are
annotated, which is much easier than specifying bounding box for crowd images [2]. These point
annotations are converted to density map by convolving with a Gaussian kernel such that simple
spatial summation gives out the crowd count. Though regression is the dominant paradigm in crowd
analysis and delivers excellent count estimation, there are some serious limitations. The first being
the inability to pinpoint persons as these models predict crowd density, which is a regional feature
(see the density maps in Figure 9.7). Any simple post-processing of density maps to extract positions
of people, does not seem to scale across the density ranges and results in poor counting performance
(Section 9.3.2). Ideally, we expect the model to deliver accurate localization on every person in the
scene possibly with bounding box. Such a system paves way for downstream applications other than
118
Figure 9.1: Face detection vs. Crowd counting. Tiny Face detector [4], trained on face dataset [5]
with box annotations, is able to capture 731 out of the 1151 people in the first image [6], losing mainly
in highly dense regions. In contrast, despite being trained on crowd dataset [1] having only point head
annotations, our LSC-CNN detects 999 persons (second image) consistently across density ranges
and provides fairly accurate boxes.
predicting just the crowd distribution. With accurate bounding box for heads of people in dense
crowds, one could do person recognition, tracking etc., which are practically more valuable. Hence,
we try to go beyond the popular density regression framework and create a dense detection system for
119
crowd counting.
Basically, our objective is to locate and predict bounding boxes on heads of people, irrespective
of any kind of variations. Developing such a detection framework is a challenging task and cannot be
easily achieved with trivial changes to existing detection frameworks [4, 8, 28, 38, 39].
Hence, we try to tackle these challenges and develop a tailor-made detection framework for dense
crowd counting. Our objective is to Locate every person in the scene, Size each detection with bound-
ing box on the head and finally give the crowd Count. This LSC-CNN, at a functional view, is trained
for pixel-wise classification task and detects the presence of persons along with the size of the heads.
Cross entropy loss is used for training instead of the widely employed l2 regression loss in density
estimation. We devise novel solutions to each of the problems listed before, including a method to
dynamically estimate bounding box sizes from point annotations.
In summary, this chapter contributes:
• Dense detection as an alternative to the prevalent density regression paradigm for crowd count-
ing.
• A novel CNN framework, different from conventional object detectors, that provides fine-
grained localization of persons at very high resolution.
• A unique fusion configuration with top-down feature modulation that facilitates joint processing
of multi-scale information to better resolve people.
• A practical training regime that only requires point annotations, but can estimate boxes for
heads.
• A new winner-take-all based loss formulation for better training at higher resolutions.
• A benchmarked model that delivers impressive performance in localization, sizing and count-
ing.
9.1 Related Works

Object/face Detectors: Since our model is a detector tailor-made for dense crowds, we compare with
other detection works as well in Section 1.4. Our proposed architecture differs from these models in
many aspects as described in the previous section. Though it has some similarity with the SSH model
in terms of the single stage architecture, we output predictions at resolutions higher than any face
detector. This is to handle extremely small heads (of few pixels size) occurring very close to each
120
other, a typical characteristic of dense crowds. Moreover, bounding box annotation is not available
per se from crowd datasets and has to rely on pseudo data. Due to this approximated box data, we
prefer not to regress or adjust the template box sizes as the normal detectors do, instead just classifies
every person to one of the predefined boxes. Above all, dense crowd analysis is generally considered
a harder problem due to the large diversity.
A concurrent work: We note a recent paper [111] which proposes a detection framework, PSDNN,
for crowd counting. But this is a concurrent work which has appeared while this manuscript is under
preparation. PSDNN uses a Faster RCNN model trained on crowd dataset with pseudo ground truth
generated from point annotation. A locally constrained regression loss and an iterative ground truth
box updation scheme is employed to improve performance. Though the idea of generating pseudo
ground truth boxes is similar, we do not actually create (or store) the annotations, instead a box is
chosen from head locations dynamically (Section 9.2.2). We do not regress box location or size
as normal detectors and avoid any complicated ground truth updation schemes. Also, PSDNN em-
ploys Faster RCNN with minimal changes, but we use a custom completely end-to-end and single
stage architecture tailor-made for the nuances of dense crowd detection and outperforms in almost all
benchmarks.
WTA Architectures: Since LSC-CNN employs a winner-take-all (WTA) paradigm for training,
here we briefly compare with similar WTA works. WTA is a biologically inspired widely used case
of competitive learning in artificial neural networks. In the deep learning scenario, Makhzani et
al. [79] propose a WTA regularization for autoencoders, where the basic idea is to selectively update
only the maximally activated ‘winner’ neurons. This introduces sparsity in weight updates and is
shown to improve feature learning. The Grid WTA version from [109] extends the methodology for
large scale training and applies WTA sparsity on spatial cells in a convolutional output. We follow
[109] and repurpose the GWTA for supervised training, where the objective is to learn better features
by restricting gradient updates to the highest loss making spatial region (see Section 9.2.2.2).
9.2 Our Approach

As motivated in the beginning of the Chapter, we drop the prevalent density regression paradigm and
develop a dense detection model for dense crowd counting. Our model named, LSC-CNN, predicts
accurately localized boxes on heads of people in crowd images. Though it seems like a multi-stage
task of first locating and sizing the each person, we formulate it as an end-to-end single stage process.
Figure 9.2 depicts a high-level view of our architecture. LSC-CNN has three functional parts; the first
is to extract features at multiple resolution by the Feature Extractor. These feature maps are fed to a
set of Top-down Feature Modulator (TFM) networks, where information across the scales are fused
121
3
TFM(s=0) 4
1/16 SCALE
1 BRANCH
2
BOX
Prediction PREDICTION
1 2
3
Fusion
1/8 SCALE (testing phase)
2 TFM(s=1) 4
BRANCH
1
2
3
1
Feature
Extractor 4 3 1/4 SCALE LOSS
BRANCH
5 TFM(s=2) 4
1
2
1/2 SCALE GWTA
BRANCH 1 3
Loss
3
(training phase)
TFM(s=3) 4 2
1
2 POINT ANNOTATION
VGG-16 Block
Figure 9.2: The architecture of Dense Detector
the proposed LSC-CNN is shown. LSC-CNN jointly processes multi-
Multi-scale
scale Feedbackfrom
information Reasoning Prediction
the feature Fusion and provides predictions at multiple resolutions, which
extractor
are combined to form the final detections. The model is optimized for per-pixel classification of
pseudo ground truth boxes generated in the GWTA training phase (indicated with dotted lines).
and box predictions are made. Then Non-Maximum Suppression (NMS) selects valid detections from
multiple resolutions and is combined to generate the final output. For training of the model, the last
stage is replaced with the GWTA Loss module, where the winners-take-all (WTA) loss backpropaga-
tion and adaptive ground truth box selection are implemented. In the following sections, we elaborate
on each part of LSC-CNN.
9.2.1 Locate Heads

9.2.1.1 Feature Extractor
Almost all existing CNN object detectors operate on a backbone deep feature extractor network. The
quality of features seems to directly affect the detection performance. For crowd counting, VGG-16
[7] based networks are widely used in a variety of ways [24, 25, 70, 112] and delivers near state-
of-the-art performance [29]. In line with the trend, we also employ several of VGG-16 convolution
layers for better crowd feature extraction. But, as shown in Figure 9.3, some blocks are replicated and
manipulated to facilitate feature extraction at multiple resolutions. The first five 3 × 3 convolution
blocks of VGG-16, initialized with ImageNet [113] trained weights, form the backbone network. The
input to the network is RGB crowd image of fixed size (224 × 224) with the output at each block
being downsampled due to max-pooling. At every block, except for the last, the network branches
122
Feature
Extractor
3C | 512
P
2
3C | 512 3C | 512
P 3
P
1 4
3C | 256 3C | 256
P 5
P
2C | 128 2C | 128
P
P n convolutions with
nC | k
k filters of size 3x3
2C | 64 P 2x2 max pooling
Figure 9.3: The exact configuration of Feature Extractor, which is a modified version of VGG-16 [7]
and outputs feature maps at multiple scales.
with the next block being duplicated (weights are copied at initialization, not tied). We tap from
these copied blocks to create feature maps at one-half, one-fourth, one-eighth and one-sixteenth of
the input resolution. This is in slight contrast to typical hypercolumn features and helps to specialize
each scale branches by sharing low-level features without any conflict. The low-level features with
half the spatial size that of the input, could potentially capture and resolve highly dense crowds. The
other lower resolution scale branches have progressively higher receptive field and are suitable for
relatively less packed ones. In fact, people appearing large in very sparse crowds could be faithfully
discriminated by the one-sixteenth features.
The multi-scale architecture of feature extractor is motivated to solve many roadblocks in dense
detection. It could simultaneously address the appearance variety, scale diversity and resolution
issues mentioned in Section 1.1. The appearance variety aspect is taken care by having multiple
scale columns, so that each one can specialize to a different crowd type. Since the typical multi-scale
input paradigm is replaced with extraction of multi-resolution features, the scale issue is mitigated to
certain extent. Further, the increased resolution for branches helps to better resolve people appearing
very close to each other.
123
s-0
TFM(s) T | t 0, 2 C|m
s-1
T | t 1, 2 C|m
t0
t1 3
...
...
s branches
ts-1
channels 1
T | ts-1,2 C|m
f channels
1 C|m 2
C | 64 C | 128 C | 256
C | 32 C|4 4
3x3 DeConv with 3x3 Conv with k

T | k,r C|k Concat
k filters & stride r filters & stride 1
Figure 9.4: The implementation of the TFM module is depicted. TFM(s) processes the features from
scale s (terminal 1) along with s multi-scale inputs from higher branches (terminal 3) to output head
detections (terminal 4) and the features (terminal 2) for the next scale branch.
9.2.1.2 Top-down Feature Modulator
One major issue with the multi-scale representations from the feature extractor is that the higher
resolution feature maps have limited context to discriminate persons. More clearly, many patterns
in the image formed by leaves of trees, structures of buildings, cluttered backgrounds etc. resemble
formation of people in highly dense crowds [114]. As a result, these crowd like patterns could be
misclassified as people, especially at the higher resolution scales that have low receptive field for
making predictions. We cannot avoid these low-level representations as it is crucial for resolving
people in highly dense crowds. The problem mainly arises due to the absence of high-level context
information about the crowd regions in the scene. Hence, we evaluate global context from scales with
higher receptive fields and jointly process these top-down features to detect persons.
As shown in Figure 9.2, a set of Top-down Feature Modulator (TFM) modules feed on the output
by crowd feature extractor. There is one TFM network for each scale branch and acts as a person
detector at that scale. The TFM also have connections from all previous low resolution scale branches.
For example, in the case of one-fourth branch TFM, it receives connections from one-eighth as well
as one-sixteenth branches and generates features for one-half scale branch. If there are s feature
124
connections from high-level branches, then it uniquely identifies an TFM network as TFM(s). s is
also indicative of the scale and takes values from zero to nS − 1, where nS is the number of scale
branches. For instance, TFM with s = 0 is for the lowest resolution scale (one-sixteenth) and takes
no top-down features. Any TFM(s) with s > 0 receives connections from all TFM(i) modules where
0 ≤ i < s. At a functional view, the TFM predicts the presence of a person at every pixel for the
given scale branch by coalescing all the scale features. This multi-scale feature processing helps to
drive global context information to all the scale branches and suppress spurious detections, apart from
aiding scale specialization.
Figure 9.4 illustrates the internal implementation of the TFM module. Terminal 1 of any TFM
module takes one of the scale feature set from the feature extractor, which is then passed through a
3 × 3 convolution layer. We set the number of filters for this convolution, m, as one-half that of the
incoming scale branch (f channels from terminal 1). To be specific, m = b f2 c, where b.c denotes
floor operation. This reduction in feature maps is to accommodate the top-down aggregrated multi-
scale representations and decrease computational overhead for the final layers. Note that the output
Terminal 2 is also drawn from this convolution layer and acts as the top-down features for next TFM
module. Terminal 3 of TFM(s) takes s set of these top-down multi-scale feature maps. For the top-
down processing, each of the s feature inputs is operated by a set of two convolution layers. The first
layer is a transposed convolution (also know as deconvolution) to upsample top-down feature maps
to the same size as the scale branch. The upsampling is followed by a convolution with m filters.
Each processed feature set has the same number of channels (m) as that of the scale input, which
forces them to be weighed equally by the subsequent layers. All these feature maps are concatenated
along the channel dimension and fed to a series of 3 × 3 convolutions with progressive reduction in
number of filters to give the final prediction. These set of layers fuse the crowd features with top-
down features from other scales to improve discrimination of people. Terminal 4 delivers the output,
which basically classifies every pixel into either background or to one of the predefined bounding
boxes for the detected head. Softmax nonlinearity is applied on these output maps to generate per-
pixel confidences over the 1 + nB classes, where nB is the number of predefined boxes. nB is a hyper-
parameter to control the fineness of the sizes and is typically set as 3, making a total of nS × nB = 12
boxes for all the branches. The first channel of the prediction for scale s, Ds0 , is for background and
remaining {Ds1 , Ds2 , . . . , DsnB } maps are for the boxes (see Section 9.2.2.1).
The top-down feature processing architecture helps in fine-grained localization of persons spa-
tially as well as in the scale pyramid. The appearance variety, scale diversity and resolution bottle-
necks (Section 1.1) are further mitigated by the top-down mechanism, which could selectively identify
the appropriate scale branch for a person to resolve it more faithfully. This is further ensured through
the training regime we employ (Section 9.2.2.2). Scaling across the extreme head/box sizes is also
125
made possible to certain extent as each branch could focus on an appropriate subset of the box sizes.
9.2.2 Size Heads

9.2.2.1 Box classification
As described previously, LSC-CNN with the help of TFM modules locates people and has to put
appropriate bounding boxes on the detected heads. For this sizing, we choose a per-pixel classifi-
cation paradigm. Basically, a set of bounding boxes are fixed with predefined sizes and the model
simply classifies each head to one of the boxes or as background. This is in contrast to the anchor box
paradigm typically being employed in detectors [28, 38], where the parameters of the boxes are re-
gressed. Every scale branch (s) of the model outputs a set of maps, {Dsb }nb=0B
, indicating the per-pixel
confidence for the box classes (see Figure 9.4). Now we require the ground truth sizes of heads to
train the model, which is not available and not convenient to annotate for typical dense crowd datasets.
Hence, we devise a method to approximate the sizes of the heads.
For ground truth generation, we rely on the point annotations available with crowd datasets. These
point annotations specify the locations of heads of people. The location is approximately at the center
of head, but can vary significantly for sparse crowds (where the point could be any where on the
large face or head). Apart from locating every person in the crowd, the point annotations also give
some scale information. For instance, Zhang et al. [1] use the mean of k nearest neighbours of any
head annotation to estimate the adaptive Gaussian kernel for creating the ground truth density maps.
Similarly, the distance between two adjacent persons could indicate the bounding box size for the
heads, under the assumption of a smoothly varying crowd density. Note that we consider only square
boxes. In short, the size for any head can simply be taken as the distance to the nearest neighbour.
While this approach makes sense in medium to dense crowds, it might result in incorrect box sizes
for people in sparse crowds, where the nearest neighbour is typically far. Nevertheless, empirically it
is found to work fairly well over a wide range of densities.
Here we mathematically explain the pseudo ground truth creation. Let P be the set of all annotated
(x, y) locations of people in the given image patch. Then for every point (x, y) in P, the box size is
defined as,
B[x, y] = 0 0 min
p
0 0
(x − x0 )2 + (y − y 0 )2 , (9.1)
(x ,y )P,(x ,y )6=(x,y)
the distance to the nearest neighbour. If there is only one person in the image patch, the box size is
taken as ∞. Now we discretize the B[x, y] values to predefined bins, which specifies the box sizes.
s
Let {β1s , β2s , . . . , βns B } be the predefined box sizes for scale s and Bb [x, y] denote a boolean value
indicating whether the location (x, y) belongs to box b. Then a person annotation is assigned a box
126
Figure 9.5: Samples of generated pseudo box ground truth. Boxes with same color belong to one
scale branch.
s
b (Bb [x, y] = 1) if its pseudo size B[x, y] is between βbs and βb+1 s
. Box sizes less than β1s are given
to b = 1 and those greater than βns B falls to b = nB . Note that non-people locations are assigned
b = 0 background class. A general philosophy is followed in choosing the box sizes βbs s for all the
scales. The size of the first box (b = 1) at the highest resolution scale (s = nS − 1) is always fixed to
one, which improves the resolving capacity for highly dense crowds (Resolution issue in Section 1.1).
We choose larger sizes for the remaining boxes in the same scale with a constant increment. This
increment is fine-grained in higher resolution branches, but the coarseness progressively increases for
low resolution scales. To be specific, if γ s represent the size increment for scale s, then box sizes are,

β s+1 + bγ s if s < nS − 1
nB
βbs = (9.2)
1 + (b − 1)γ s otherwise.
The typical values of the size increment for different scales are γ = {4, 2, 1, 1}. Note that the high
resolution branches (one-half & one-fourth) have boxes with finer sizes than the low resolution ones
(one-sixteenth & one-eighth), where coarse resolving capacity would suffice (see Figure 9.5).
There are many reasons to discretize the head sizes and classify the boxes instead of regressing
size values. The first is due to the use of pseudo ground truth. Since the size of heads itself is
approximate, tight estimation of sizes proves to be difficult (see Section 9.4.2). Similar sized heads
in two images could have different ground truths depending on the density. This might lead to some
inconsistencies in training and could result in suboptimal performance. Moreover, the sizes of heads
vary extremely across density ranges at a level not expected for normal detectors. This requires heavy
normalization of value ranges along with complex data balancing schemes. But our per-pixel box
127
classification paradigm effectively addresses these Extreme Head Sizes and Point Annotation issues
(Section 1.1).
9.2.2.2 GWTA Training
Loss: We train the LSC-CNN by back-propagating per-pixel cross entropy loss. The loss for a pixel
is defined as,
nB
X
nB nB nB
l({di }i=0 , {bi }i=0 , {αi }i=0 ) = − αi bi logdi , (9.3)
i=0
where {di }ni=0

B
is the set of nB + 1 probability values (softmax outputs) for the predefined box classes
nB
and {bi }i=0 refers to corresponding ground truth labels. All bi s take zero value except for the correct
class. The αi s are weights to class balance the training. Now the loss for the entire prediction of scale
branch s would be,
ws ,hs
s X l({Ds [x, y]}, {Bs [x, y]}, {αs })
L({Dsb }, {Bb }, {αbs }) = b b b
x,y
ws hs
s
where the inputs are the set of predictions {Dsb }nb=0
B
and pseudo ground truths {Bb }nb=0
B
(the set limits
might be dropped for convenience). Note that (ws , hs ) are the spatial sizes of these prediction maps
and the cross-entropy loss is averaged over it. The final loss for LSC-CNN after combining losses
from all the branches is,
nS
X s
Lcomb = L({Dsb }, {Bb }, {αbs }). (9.4)
s=1
Weighting: As mentioned in Section 1.1, the data imbalance issue is severe in the case of crowd
datasets. Class wise weighting assumes prime importance for effective backpropagation of Lcomb (see
Section 9.4.2). We follow a simple formulation to fix the α values. Once the box sizes are set, the
number of data points available for each class is computed from the entire train set. Let csb denote
this frequency count for the box b in scale s. Then for every scale branch, we sum the box counts
PnB s
as cssum = b=1 cb and the scale with minimum number of samples is identified. This minimum
value cmin = mins cssum , is used to balance the training data across the branches. Basically, we scale
down the weights for the branches with higher counts such that the minimum count branch has weight
one. Note that training points for all the branches as well as the classes within a branch need to be
balanced. Usually the data samples would be highly skewed towards the background class (b = 0)
in all the scales. To mitigate this, we scale up the weights of all box classes based on its ratio with
128
background frequency of the same branch. Numerically, the balancing is done jointly as,
cmin cs0
αbs = min( , 10). (9.5)
cssum csb
The term cs0 /csb can be large since the frequency of background to box is usually skewed. So we limit
the value to 10 for better training stability. Further note that for some box size settings, αbs values
itself could be very skewed, which depends on the distribution of dataset under consideration. Any
difference in the values more than an order of magnitude is found to be affecting the proper training.
Hence, the box size increments (γs) are chosen not only to roughly cover the density ranges in the
dataset, but also such that the αbs s are close within an order of magnitude.
GWTA: However, even after this balancing, training LSC-CNN by optimizing joint loss Lcomb
does not achieve acceptable performance (see Section 9.4.2). This is because the model predicts
at a high resolution than any typical crowd counting network and the loss is averaged over a rela-
tively larger spatial area. The weighing scheme only makes sure that the averaged loss values across
branches and classes is in similar range. But the scales with larger spatial area could have more in-
stances of one particular class than others. For instance in dense regions, the one-half resolution scale
(s = 3) would have more person instances and are typically very diverse. This causes the optimization
to focus on all instances equally and might lead to a local minima solution. A strategy is needed to
focus on a small region at a time, update the weights and repeat this for another region.
For solving this local minima issue (Section 1.1), we rely on the Grid Winner-Take-All (GWTA)
approach introduced in [109]. Though GWTA is originally used for unsupervised learning, we repur-
pose it to our loss formulation. The basic idea is to divide the prediction map into a grid of cells with
fixed size and compute the loss only for one cell. Since only a small region is included in the loss, this
acts as tie breaker and avoids the gradient averaging effect, reducing the chances of the optimization
reaching a local minima. Now the question is how to select the cells. The ‘winner’ cell is chosen as
the one which incurs the highest loss. At every iteration of training, we concentrate more on the high
loss making regions in the image and learn better features. This has slight resemblance to hard min-
ing approaches, where difficult instances are sampled more during training. In short, GWTA training
selects ‘hard’ regions and try to improve the prediction (see Section 9.4.2 for ablations).
Figure 9.6 shows the implementation of GWTA training. For each scale, we apply GWTA non-
linearity on the loss. The cell size for all branches is taken as the dimensions of the lowest resolution
prediction map (w0 , h0 ). There is only one cell for scale s = 0 (one-sixteenth branch), but the grows
by power of four (4s ) for subsequent branches as the spatial dimensions consecutively doubles. Now
129
Prediction maps
3
GWTA Ground truth maps
Loss
(training phase) Loss
Pseudo GT
1
...
...
Creator
...
...
2
Figure 9.6: Illustration of the operations in GWTA training. GWTA only selects the highest loss
making cell in every scale. The per-pixel cross-entropy loss is computed between the prediction and
pseudo ground truth maps.
we compute the cross-entropy loss for any cell at location (x, y) (top-left corner) in the grid as,
X s
s
lwta [x, y] = l({Dsb [x0 , y 0 ]}, {Bb [x0 , y 0 ]}, {ᾱbs }),
x0 0
(b c,b y0 c)=(x,y)
w0 h
where the summation of losses runs over for all pixels in the cell under consideration. Also note that ᾱbs is
computed using equation 9.5 with cssum = 4−s nb=1
P B s
cb , in order to account for the change in spatial size of the
predictions. The winner cell is the one with the highest loss and the location is given by,
(xswta , ywta
s
)= argmax s
lwta [x, y]. (9.6)
(x,y)=(w0 i,h0 j),i∈Z,j∈Z
Note that the argmax operator finds an (x, y) pair that identifies the top-left corner of the cell. The combined
loss becomes,
nS
1 X
Lwta = s
lwta [xswta , ywta
s
]. (9.7)
w 0 h0
s=1
We optimize the parameters of LSC-CNN by backpropagating Lwta using standard mini-batch gradient
descent with momentum. Batch size is typically 4. Momentum parameter is set as 0.9 and a fixed learning rate
schedule of 10−3 is used. The training is continued till the counting performance on a validation set saturates.
130
Input Image Pseudo GT LSC-CNN Prediction CSRNet-A Prediction
355.0 317.0 307.1
ST
PartA
179.0 174.0 160.7
ST
PartB
436.0 425.0 467.6
UCF-
QNRF
1566.9 1539.9 1480.9
UCF
-CC
Figure 9.7: Predictions made by LSC-CNN on images from Shanghaitech, UCF-QNRF and UCF-
CC-50 datasets. The results emphasize the ability of our approach to pinpoint people consistently
across crowds of different types than the baseline density regression method.
9.2.3 Count Heads

9.2.3.1 Prediction Fusion
For testing the model, the GWTA training module is replaced with the prediction fusion operation as shown in
Figure 9.2. The input image is evaluated by all the branches and results in predictions at multiple resolutions.
Box locations are extracted from these prediction maps and are linearly scaled to the input resolution. Then
standard Non-Maximum Suppression (NMS) is applied to remove boxes with overlap more than a threshold.
The boxes after the NMS form the final prediction of the model and are enumerated to output the crowd count.
Note that, in order to facilitate intermediate evaluations during training, the NMS threshold is set to 0.3 (30%
area overlap). But for the best model after training, we run a threshold search to minimize the counting error
over the validation set (typical value ranges from 0.2 to 0.3).
131
9.3 Performance Evaluation
9.3.1 Experimental Setup and Datasets
We evaluate LSC-CNN for localization and counting performance on all major crowd datasets. Since these
datasets have only point head annotations, sizing capability cannot be benchmarked. Hence, we use one face
detection dataset where bounding box ground truth is available. Further, LSC-CNN is trained on vehicle count-
ing dataset to show generalization. Figure 9.7 displays some of the box detections by our model on all datasets.
Note that unless otherwise specified, we use the same architecture and hyper-parameters given in Section 9.2.
The remaining part of this section introduces the new datasets along with the hyper-parameters if there is any
change.
TRANCOS: The vehicle counting dataset, TRANCOS [13], has 1244 images captured by various traffic
surveillance cameras. In total, there are 46,796 vehicle point annotations. Also, RoIs are specified on every
image for evaluation. We use the same architecture and box sizes as that of WorldExpo’10.
WIDERFACE: WIDERFACE [5] is a face detection dataset with more than 0.3 million bounding box
annotations, spanning 32,203 images. The images, in general, have sparse crowds having variations in pose
and scale with some level of occlusions. We remove the one-half scale branch for this dataset as highly dense
images are not present. To compare with existing methods on fitness of bounding box predictions, the fineness
of the box sizes are increased by using five boxes per scale (nB = 5). The γ is set as {4, 2, 2} and learning rate
is made lower to 10−4 . Note that for fair comparison, we train LSC-CNN without using the actual ground truth
bounding boxes. Instead, point face annotations are created by taking centers of the boxes, from which pseudo
ground truth is generated as per the training regime of LSC-CNN. But the performance is evaluated with the
actual ground truth.
9.3.2 Evaluation of Localization

The widely used metric for crowd counting is the Mean Absolute Error or MAE. MAE is simply the absolute
difference between the predicted and actual crowd counts averaged over all the images in the test set. The
counting performance of a model is directly evident from the MAE value. Further to estimate the variance and
hence robustness of the count prediction, Mean Squared Error or MSE is used. Though these metrics measure
the accuracy of overall count prediction, localization of the predictions is not very evident. Hence, apart from
standard MAE, we evaluate the ability of LSC-CNN to accurately pinpoint individual persons. An existing
metric called Grid Average Mean absolute Error or GAME [13], can roughly indicate coarse localization of
count predictions. To compute GAME, the prediction map is divided into a grid of cells and the absolute count
errors within each cell are averaged over grid. Table 9.1 compares the GAME values of LSC-CNN against
a regression baseline model for different grid sizes. Note that GAME with only one cell, GAME(0), is same
as MAE. We take the baseline as CSRNet-A [29] (labeled CSR-A) model as it has similarity to the Feature
Extractor and delivers near state-of-the-art results. Clearly, LSC-CNN has superior count localization than the
132
Metric GAME(0) ↓ GAME(1) ↓ GAME(2) ↓ GAME(3) ↓
Dataset ↓ / Model→ CSR-A LSC CSR-A LSC CSR-A LSC CSR-A LSC
ST Part A 72.6 66.4 75.5 70.2 112.9 94.6 149.2 136.5
ST Part B 11.5 8.1 13.1 9.6 21.0 17.4 28.9 26.5
UCF QNRF 155.8 120.5 157.2 125.8 186.7 159.9 219.3 206.0
UCF CC 50 282.9 225.6 326.3 227.4 369.0 306.8 425.8 390.0
Table 9.1: Comparison of LSC-CNN on localization metrics against the baseline regression method.
Our model seems to pinpoint persons more accurately.
Metric MLE ↓
Dataset ↓ / Method→ CSR-A-thr LSC-CNN
ST Part A 16.8 9.6
ST Part B 12.28 9.0
UCF QNRF 14.2 8.6
UCF CC 50 14.3 9.7
Table 9.2: Comparison of LSC-CNN on localization metrics against the baseline regression method.
Our model seems to pinpoint persons more accurately.
density regression based CSR-A.

One could also measure localization in terms of how close the prediction matches with ground truth point
annotation. For this, we define a metric named Mean Localization Error (MLE), which computes the distance
in pixels between the predicted person location to its ground truth averaged over test set. The predictions are
matched to head annotations in a one-to-one fashion and a fixed penalty of 16 pixels is added for absent or
spurious detections. Since CSR-A or any other density regression based counting models do not individually
locate persons, we apply threshold on the density map to get detections (CSR-A-thr). But it is difficult to
threshold density maps without loss of counting accuracy. We choose a threshold such that the resultant MAE is
minimum over validation set. For CSR-A, the best thresholded MAE comes to be 167.1, instead of the original
72.6. As expected, MLE scores for LSC-CNN is significantly better than CSR-A (Table 9.2), indicating sharp
localization capacity.
9.3.3 Evaluation of Sizing

We follow other face detection works [4, 8] and use the standard mean Average Precision or mAP metric to
assess the sizing ability of our model. For this, LSC-CNN is trained on WIDERFACE face dataset without
the actual box ground truth as mentioned in Section 9.3.1. Table 9.3 reports the comparison of mAP scores
obtained by our model against other works. Despite using pseudo ground truth for training, LSC-CNN achieves
a competitive performance, especially on Hard and Medium test sets, against the methods that use full box
133
Method Easy Medium Hard
Faceness [115] 71.3 53.4 34.5
Two Stage CNN [5] 68.1 61.4 32.3
TinyFace [4] 92.5 91.0 80.6
SSH [8] 93.1 92.1 84.5
CSR-A-thr (baseline) 30.2 41.9 33.5
PSDNN [111] 60.5 60.5 39.6
LSC-CNN (Pseudo GT) 40.5 62.1 46.2
LSC-CNN (Actual GT) 57.31 70.10 68.9
Table 9.3: Evaluation of LSC-CNN box prediction on WIDERFACE [5]. Our model and PSDNN are
trained on pseudo ground truths, while others use full supervision. LSC-CNN has impressive mAP in
Medium and Hard sets.
supervision. For baseline, we consider the CSR-A-thr model (Section 9.3.2) where the density outputs are
processed to get head locations. These are subsequently converted to bounding boxes using the pseudo box
algorithm of LSC-CNN and mAP scores are computed (CSR-A-thr (baseline)). LSC-CNN beats the baseline
by a strong margin, evidencing the superiority of the proposed box classification training. We also compare
with PSDNN model [111] which trains on pseudo box ground truth similar to our model. Interestingly, LSC-
CNN has higher mAP in the two difficult sets than that of PSDNN. Note that the images in Easy set are mostly
of very sparse crowds with faces appearing large. We lose out in mAP mainly due to the high discretization of
box sizes on large faces. This is not unexpected as LSC-CNN is designed for dense crowds without bounding
box annotations. But the fact that it works well on the relatively denser other two test sets, clearly shows
the effectiveness of our proposed framework. For completeness, we train LSC-CNN with boxes generated
from actual box annotations instead of the head locations (LSC-CNN (Actual GT)). As expected LSC-CNN
performance improved with the use of real box size data.
We also compute the average classification accuracy of boxes with respect to the pseudo ground truth on test
set. LSC-CNN has an accuracy of around 94.56% for ST PartA dataset and 93.97% for UCF QNRF, indicative
of proper data fitting.
9.3.4 Evaluation of Counting

Here we compare LSC-CNN with other crowd counting models on the standard MAE and MSE metrics. Ta-
ble 9.4 lists the evaluation results on UCF-QNRF dataset. Our model achieves an MAE of 120.5, which is lower
than that of [3] by a significant margin of 12.5. Evaluation on the next set of datasets is available in Table 9.5
and Table 9.6. On Part A of Shanghaitech, LSC-CNN performs better than all the other density regression
methods and has very competitive MAE to that of PSDNN [111], with the difference being just 0.5. But note
that PSDNN is trained with a curriculum learning strategy and the MAE without it seems to be significantly
134
Method MAE MSE
MCNN [1] 277 426
CMTL [24] 252 514
SCNN (Chapter 3) 228 445
DD-CNN (Chapter 8) 120.6 161.5
LSC-CNN 120.5 218.2
Table 9.4: Counting performance comparison of LSC-CNN on UCF-QNRF [2].
Method MAE MSE

Zhang et al. [9] 467.0 498.5
MCNN [1] 377.6 509.1
SCNN (Chapter 3) 318.1 439.2
CP-CNN [25] 295.8 320.9
Liu et al. [110] 279.6 388.9
IC-CNN [26] 260.9 365.5
CSR-Net [29] 266.1 397.5
SA-Net [23] 258.4 334.9
PSDNN[111] 359.4 514.8
DD-CNN (Chapter 8) 215.4 295.6
LSC-CNN 225.6 302.7
Table 9.5: LSC-CNN on UCF CC 50 [2] dataset. LSC-CNN stands state-of-the-art in UCF CC 50,
except to DD-CNN.
higher (above 80). This along with the fact that LSC-CNN has lower count error than PSDNN in all other
datasets, indicates the strength of our proposed architecture. In fact, state-of-the-art performance is obtained in
both Shanghaitech Part B and UCF CC 50 datasets. Despite having just 50 images with extreme diversity in
the UCF CC 50, our model delivers a substantial decrease of 33 points in MAE. A similar trend is observed in
WorldExpo dataset as well, with LSC-CNN acheiving lower MAE than existing methods (Table 9.7). Further
to explore the generalization of LSC-CNN, we evaluate on a vehicle counting dataset TRANCOS. The results
from Table 9.8 evidence a lower MAE than PSDNN, and is highly competitive with the best method. These
experiments evidence the top-notch crowd counting ability of LSC-CNN compared to other density regressors,
with all the merits of a detection model.
135
ST Part A ST Part B
Models MAE MSE MAE MSE
Zhang et al. [9] 181.8 277.7 32.0 49.8
MCNN [1] 110.2 173.2 26.4 41.3
SCNN (Chapter 3) 90.4 135.0 21.6 33.4
CP-CNN [25] 73.6 106.4 20.1 30.1
IG-CNN (Chapter 4) 72.5 118.2 13.6 21.1
Liu et al. [31] 72.0 106.6 14.4 23.8
IC-CNN [26] 68.5 116.2 10.7 16.0
CSR-Net [29] 68.2 115.0 10.6 16.0
SA-Net [23] 67.0 104.5 8.4 13.6
PSDNN[111] 65.9 112.3 9.1 14.2
DD-CNN (Chapter 8) 71.9 111.2 12.9 20.3
LSC-CNN 66.4 117.0 8.1 12.7
Table 9.6: Benchmarking LSC-CNN counting accuracy on Shanghaitech [1] datasets. LSC-CNN
stands state-of-the-art in ST PartB, with very competitive MAE on ST PartA.
Method Scene1 Scene2 Scene3 Scene4 Scene5 Average

Zhang et al. [9] 9.8 14.1 14.3 22.2 3.7 12.9
MCNN [1] 3.4 20.6 12.9 13.0 8.1 11.6
SCNN (Chapter 3) 4.4 15.7 10.0 11.0 5.9 9.4
CP-CNN [25] 2.9 14.7 10.5 10.4 5.8 8.8
Liu et al. [31] 2.0 13.1 8.9 17.4 4.8 9.2
IC-CNN [26] 17.0 12.3 9.2 8.1 4.7 10.3
CSR-Net [29] 2.9 11.5 8.6 16.6 3.4 8.6
SA-Net [23] 2.6 13.2 9.0 13.3 3.0 8.2
LSC-CNN (Ours) 2.9 11.3 9.4 12.3 4.3 8.0
Table 9.7: LSC-CNN on WorldExpo’10 [2] beats other methods in average MAE.
Method GAME(0) GAME(1) GAME(2) GAME(3)

Guerrero et al. [13] 14.0 18.1 23.7 28.4
Hydra CNN [21] 10.9 13.8 16.0 19.3
Li et al. [29] 3.7 5.5 8.6 15.0
PSDNN [111] 4.8 5.4 6.7 8.4
LSC-CNN (Ours) 4.6 5.4 6.9 8.3
Table 9.8: Evaluation of LSC-CNN on TRANCOS [13] vehicle counting dataset.
136
9.4.1 Effect of Multi-Scale Box Classification
As mentioned in Section 9.2.2, in general, we use 3 box sizes (nB = 3) for each scale branch and employ 4
scales (nS = 4). Here we ablate over the choice of nB and nS . The results of the experiments are presented in
Table 9.9. It is intuitive to expect higher counting accuracy with more number of scale branches (from nS = 1
to nS = 4) as people at all the scales are resolved better. Although this is true in theory, as the number of
scales increase, so do the number of trainable parameters for the same amount of data. This might be the cause
for slight increase in counting error for nS = 5. Regarding the ablations on the number of boxes, we train
LSC-CNN for nB = 1 to nB = 4 (maintaining the same size increments γ as specified in Section 9.2.2 for all).
Initially, we observe a progressive gain in the counting accuracy till nB = 3, but seems to saturate after that.
This could be attributed to the decrease in training samples per box class as nB increases.
nS nB ST PartA [1] UCF-QNRF [3]

1 3 155.2 197.8
2 3 113.9 142.1
3 3 75.3 134.5
5 3 69.3 124.8
4 1 104.7 145.6
4 2 72.6 132.3
4 4 74.3 125.4
4 3 66.4 120.5
Table 9.9: MAE obtained by LSC-CNN with different hyper-parameter settings.
9.4.2 Architectural Ablations

In this section, the advantage of various architectural choices made for our model is established through ex-
periments. LSC-CNN employs multi-scale top-down modulation through the TFM modules (Section 9.2.1.2).
We train LSC-CNN without these top-down connections (terminal 3 in Figure 9.4 is removed for all TFM
networks) and the resultant MAE is labeled as No TFM in Table 9.10. We also ablate with a sequential TFM
(Seq TFM), in which every branch gets only one top-down connection from its previous scale as opposed to
features from all lower resolution scales in LSC-CNN. The results evidence that having top-down modulation
is effective in leveraging high-level scene context and helps improve count accuracy. But the improvement is
drastic with the proposed multiple top-down connections and seems to aid better extraction of context infor-
mation. The top-down modulation can be incorporated in many ways, with LSC-CNN using concatenation of
top-down features with that of bottom-up. Following [114], we generate features to gate the bottom-up feature
137
ST PartA UCF-QNRF WIDERFACE
Method MAE MAE Easy Med Hard
No TFM 94.5 149.7 30.1 45.2 31.5
Seq TFM 73.4 135.2 31.4 47.3 39.8
Mult TFM 67.6 124.1 37.8 54.2 45.1
No GWTA 79.2 130.2 31.7 49.9 37.2
No Weighing 360.1 675.5 0.1 0.1 1.2
No Replication 79.3 173.1 30.4 44.3 35.9
Box Regression 77.9 140.6 29.8 47.8 35.2
LSC-CNN 66.4 120.5 40.5 62.1 56.2
Table 9.10: Validating various architectural design choices of LSC-CNN.
maps (Mult TFM). Specifically, we modify the second convolution layer for top-down processing in Figure 9.4
with Sigmoid activation. The Sigmoid output from each top-down connection is element-wise multiplied to the
incoming scale feature maps. A slight performance drop is observed with this setup, but the MAE is close to
that of LSC-CNN, stressing that top-down modulation in any form could be useful.
Now we ablate the training regime of LSC-CNN. The experiment labeled No GWTA in Table 9.10 cor-
responds to LSC-CNN trained with just the Lcomb loss (equation 9.4). Figure 9.8 clearly shows that without
GWTA, LSC-CNN completely fails in the high resolution scale (one-half), where the gradient averaging effect
is prominent. A significant drop in MAE is observed as well, validating the hypothesis that GWTA aids better
Ground Truth With GWTA Without GWTA

309.0 252.0 132.0
584.0 179.0
617.0 458.0
Figure 9.8: Demonstrating the effectiveness of GWTA in proper training of high resolution scale
branches (notice the highlighted region).
138
optimization of the model. Another important aspect of the training is the class balancing scheme employed.
LSC-CNN is trained with no weighting, essentially with all αbs s set to 1. As expected, the counting error reaches
an unacceptable level, mainly due to the skewness in the distribution of persons across scales. We also validate
the usefulness of replicating certain VGG blocks in the feature extractor (Section 9.2.1.1) through an experi-
ment without it, labeled as No Replication. Lastly, instead of our per-pixel box classification framework, we
train LSC-CNN to regress the box sizes. Box regression is done for all the branches by replicating the last
five convolutional layers of the TFM (Figure 9.4) into two arms; one for the per-pixel binary classification to
locate person and the other for estimating the corresponding head sizes (the sizes are normalized to 0-1 for all
scales). However, this setting could not achieve good MAE, possibly due to class imbalance across box sizes
(Section 9.2.2).
9.4.3 Comparison with Object/Face Detectors

To further demonstrate the utility of our framework beyond any doubt, we train existing detectors like FR-
CNN [28], SSH [8] and TinyFaces [4] on dense crowd datasets. The anchors for these models are adjusted to
match the box sizes (βs) of LSC-CNN for fair comparison. The models are optimized with the pseudo box
ground truth generated from point annotations. For these, we compute counting metrics MAE and MSE along
with point localization measure MLE in Table 9.11. Note that the SSH and TinyFaces face detectors are also
trained with the default anchor box setting as specified by their authors (labeled as def ). The evaluation points
to the poor counting performance of the detectors, which incur high MAE scores. This is mainly due to the
inability to capture dense crowds as evident from Figure 9.9. LSC-CNN, on the other hand, works well across
density ranges, with quite convincing detections even on sparse crowd images from WIDERFACE [5]. In ad-
dition, we compare the detectors for per image inference time (averaged over ST Part A [1] test set, evaluated
on a NVIDIA V100 GPU) and model size in Table 9.12. The results reiterate the suitability of LSC-CNN for
practical applications.
ST Part A UCF-QNRF
Method MAE MSE MLE MAE MSE MLE
FRCNN [28] 241.0 431.6 43.7 320.1 697.6 43.9
SSH (def) 387.5 513.4 96.2 564.8 924.4 126.5
SSH [8] 328.2 479.6 89.9 441.1 796.6 103.7
TinyFace (def) 288.1 457.4 37.4 397.2 786.6 50.7
TinyFace [4] 237.8 422.8 29.6 336.8 741.6 41.2
LSC-CNN 66.4 117.0 9.6 120.5 218.3 8.6
Table 9.11: LSC-CNN compared with existing detectors trained on crowd datasets.
139
Ground Truth LSC-CNN Prediction SSH Tiny Faces
355.0 317.0 78.0 122.0
ST
PartA
436.0 179.0
425.0 134.0 171.0
UCF-
QNRF
16.0 16.0 14.0 16.0
WIDER
FACE
Figure 9.9: Comparison of predictions made by face detectors SSH [8] and TinyFaces [4] against
LSC-CNN. Note that the Ground Truth shown for WIDERFACE dataset is the actual and not the
pseudo box ground truth. Normal face detectors are seen to fail on dense crowds.
Method Inference Time (ms) Parameters (in millions)

FRCNN [28] 231.4 41.5
SSH [8] 48.1 19.8
TinyFace [4] 348.6 30.0
LSC-CNN nS = 1 29.4 12.9
LSC-CNN nS = 2 32.3 18.3
LSC-CNN nS = 3 50.6 20.6
LSC-CNN nS = 4 69.0 21.9
Table 9.12: Efficiency of detectors in terms of inference speed and model size.
9.5 Conclusion
This chapter introduces a dense detection framework for crowd counting and renders the prevalent paradigm of
density regression obsolete. The proposed LSC-CNN model uses a multi-column architecture with top-down
modulation to resolve people in dense crowds. Though only point head annotations are available for training,
LSC-CNN puts bounding box on every located person. Experiments indicate that the model achieves not only
better crowd counting performance than existing regression methods, but also has superior localization with
all the merits of a detection system. Given these, we hope that the community would switch from the current
regression approach to more practical dense detection. Future research could address spurious detections and
make sizing of heads further accurate.
140
Chapter 10
Summary, Conclusion and Future Directions
E now sum up the thesis, present conclusions, and speculate on some fascinating paths for
W future research.
10.1 Summary and Conclusions

In this thesis, we have tried to address some of the major bottlenecks pertaining to dense crowd
analysis. Automated processing of huge crowd images to extract useful statistics has become quite
important necessity in the modern day. However, computing even seemingly simpler details like the
crowd density distribution or overall counts is found to be difficult. Specifically, three broad issues
need to be mitigated for developing any crowd analysis algorithm.
The very first part of the thesis deals with the drastic Diversity in crowd scenes. It gets manifested
in the form of appearance variety of people and scale variations (Chapter 1). The discriminatory pat-
terns is so different between sparse and dense crowd scenes that one require density-aware processing.
We show that using global context as feedback the counting models could correct errors due to the
diversity arising from density variations (Chapter 2). To adapt further across the diversity spectrum, a
mixture of experts approach is taken. The proposed differential training regime is proven to be benefi-
cial in creating finer expert regressors for different densities (Chapter 3). This architecture is extended
to a growing network that can adapt depending on the diversity of the crowd dataset (Chapter 4). Our
results from Part 1 demonstrates that significant performance improvement can be obtained by easing
the Diversity aspect of the crowd analysis problem.
Training for dense crowd analysis requires large datasets with millions of person annotations.
However, creating such datasets covering all possible diverse scenario is an expensive process and
the second part of the thesis focuses on mitigating this. It is demonstrated that almost 99% of the
141
parameters of a given density regressor can be trained without any annotated data (Chapter 5). The
practical advantage is that the proposed approach delivers superior performance in scenarios where
only few labeled images are available. An alternate approach to tackle the annotation difficulty is to
transform the labeling scheme itself. Binary crowd density categorization at an image-level is easy
to perform than annotating persons in scenes (Chapter 6). We leverage weak crowd density signals
along with the binary labels to train a density regressor. This framework is shown to deliver good
counting performance at a very low labeling cost. Since obtaining few annotated crowd images itself
might be difficult in many scenarios, a complete self-supervision paradigm is developed (Chapter 7).
We show for the first time that a density regressor can be fully trained from scratch without using a
single annotated image. The model delivers significant counting performance and beats other training
methodologies in less data setting as well. These findings not only alleviate to some extent the Data
Scarcity in crowd analysis, but kindles a new avenue where models could be trained directly for
solving the downstream task of interest, without providing any instance-level annotated data.
Third and the last part of the thesis is dedicated to the Localization in crowds. The widely used
approach of density regression yields good regression and counting performance. But the accurate
localization of persons in the crowd stands poor and cannot support further downstream applications
like detection, recognition etc. To enhance the localization, we propose the dot detection framework,
which pinpoints every person in the scene (Chapter 8). The model is validated to have significantly
better localization, which scales across the entire spectrum of density variations. Subsequently, a
dense detection architecture is designed to put bounding boxes on heads of people (Chapter 9). We
show that the approach gives accurate localization and sizing of the heads in diverse crowds, with the
practical benefit of not requiring any bounding box annotations for training. The results indicate that
the accurate localization of people is possible even in highly dense crowds and hence can be employed
in more real-world applications.
10.2 Future Directions

Based on the thesis, following are a list of promising directions for future research:
• Addressing Diversity in crowds is definitely still one of the major area where significant im-
provements can be made. Future works could try to enhance the efficiency of the classifiers in
the mixture of experts approaches (Chapter 2 and 3). A more ambitious direction could com-
bine the expert regressors with iterative bottom-up and top-down processing to refine density
predictions at diverse confusing contexts.
• As far as the Data Scarcity in crowd analysis is concerned, the proposed paradigm complete self-
142
supervision (Chapter 7) is promising and delivers significant counting performance. However,
the performance gap compared to state-of-the-art density regressors is large and needs to be
the addressed in future research. One possible direction is to add more crowd specific priors
so that the Sinkhorn training become well guided and result in better distribution matching.
Another research path would be to extend the method for dense detection (Chapter 9), where the
model learns to put bounding boxes on persons without using any instance-level annotations.
Even further, it is interesting to modify complete self-supervision for tasks other than crowd
counting like classification, segmentation etc. For example, one could form a prior distribution
on the image label space using language embeddings and apply Sinkhorn training for image
classification task.
• Research on Localization aspect of crowd analysis is important and directly helps to support
applications other than counting. Though the localization has dramatically improved with the
dense detection approach (Chapter 9), the bounding box sizing performance needs to be im-
proved. Moreover, studies should also include other useful tasks like segmentation of persons.
In fact, a framework that can simultaneously locate, detect and segment every single person
from highly dense assemblies should be the holy grail in crowd analysis.
143
Bibliography
[1] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-
column convolutional neural network,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016. xiii, xiv, xv, xvi, xvii, xviii, xix, xx, 1, 2, 6, 9,
16, 20, 21, 22, 23, 25, 26, 27, 32, 33, 34, 35, 36, 37, 38, 40, 41, 45, 48, 49, 50, 51, 52, 59, 66,
71, 73, 76, 77, 85, 93, 94, 99, 100, 111, 112, 113, 114, 118, 119, 126, 135, 136, 137, 139
[2] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely
dense crowd images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2013. xiii, xviii, xix, xx, 3, 4, 5, 6, 8, 21, 25, 26, 33, 49, 63, 73, 78, 95,
96, 111, 112, 113, 118, 135, 136
[3] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, “Com-
position loss for counting, density map estimation and localization in dense crowds,” in Pro-
ceedings of the European Conference on Computer Vision (ECCV), 2018. xvi, xix, 7, 9, 78, 94,
100, 111, 112, 116, 134, 135, 137
[4] P. Hu and D. Ramanan, “Finding tiny faces,” in Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, 2017. xvii, 9, 119, 120, 133, 134, 139, 140
[5] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A face detection benchmark,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016. xvii, xx, 119, 132, 134, 139
[6] “World’s largest selfie,” https://www.gsmarena.com/nokia lumia 730 captures worlds largest
selfie-news-10285.php, accessed: 2019-05-31. xvii, 119
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv preprint arXiv:1409.1556, 2014. xvii, 28, 30, 45, 88, 107, 122, 123
144
BIBLIOGRAPHY
[8] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, “SSH: Single stage headless face
detector,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2017. xvii, 9, 120, 133, 134, 139, 140
[9] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional
neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2015. xviii, xix, 7, 8, 20, 21, 22, 25, 26, 33, 34, 35, 49, 50, 113, 114,
118, 135, 136
[10] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Count-
ing people without people models or tracking,” in IEEE Conference on Computer Vision and
Pattern Recognition, 2008. xviii, 34
[11] V. A. Sindagi, R. Yasarla, and V. M. Patel, “Pushing the frontiers of unconstrained crowd
counting: New dataset and benchmark method,” in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2019. xix, 7, 95, 96
[12] ——, “JHU-CROWD++: Large-scale crowd counting dataset and a benchmark method,” Tech-
nical Report, 2020. xix, 7, 95, 96
[13] R. Guerrero-Gmez-Olmedo, B. Torre-Jimnez, R. Lpez-Sastre, S. M. Bascn, and D. Ooro-

Rubio, “Extremely overlapping vehicle counting,” in Proceedings of the Iberian Conference
on Pattern Recognition and Image Analysis (IbPRIA), 2015. xx, 117, 132, 136
[14] B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single image
by bayesian combination of edgelet part detectors,” in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2005. 8, 118
[15] P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and ap-
pearance,” International Journal of Computer Vision (IJCV), 2005. 8, 118
[16] M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific
traffic scene,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2011. 8, 118
[17] H. Idrees, K. Soomro, and M. Shah, “Detecting humans in dense crowds using locally-
consistent scale prior and global occlusion reasoning,” IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI), 2015. 8, 118
145
BIBLIOGRAPHY
[18] R. Stewart and M. Andriluka, “End-to-end people detection in crowded scenes,” arXiv preprint
arXiv:1506.04878, 2015. 8
[19] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao, “Deep people counting in extremely dense
crowds,” in Proceedings of the ACM International Conference on Multimedia (ACMMM),
2015. 8
[20] E. Walach and L. Wolf, “Learning to count with CNN boosting,” in Proceedings of the Euro-
pean Conference on Computer Vision, 2016. 8
[21] D. Onoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep
learning,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016. 8,
21, 25, 33, 34, 40, 41, 49, 118, 136
[22] L. Boominathan, S. S. Kruthiventi, and R. V. Babu, “CrowdNet: A deep convolutional network

for dense crowd counting,” in Proceedings of the ACM International Conference on Multimedia
(ACMMM), 2016. 8, 21, 33, 40, 41, 49
[23] X. Cao, Z. Wang, Y. Zhao, and F. Su, “Scale aggregation network for accurate and efficient
crowd counting,” in Proceedings of the European Conference on Computer Vision (ECCV),
2018. 9, 113, 114, 135, 136
[24] V. A. Sindagi and V. M. Patel, “CNN-based cascaded multi-task learning of high-level prior and
density estimation for crowd counting,” in Proceedings of the IEEE International Conference
on Advanced Video and Signal Based Surveillance (AVSS), 2017. 9, 42, 49, 111, 122, 135
[25] ——, “Generating high-quality crowd density maps using contextual pyramid CNNs,” in Pro-
ceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. 9, 40, 42,
49, 50, 113, 114, 118, 122, 135, 136
[26] V. Ranjan, H. Le, and M. Hoai, “Iterative crowd counting,” in Proceedings of the European
Conference on Computer Vision, 2018. 9, 113, 114, 135, 136
[27] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, “DecideNet: Counting varying density crowds
through attention guided detection and density estimation,” in Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2018. 9, 65
[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection
with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS),
2015. 9, 120, 126, 139, 140
146
BIBLIOGRAPHY
[29] Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for under-
standing the highly congested scenes,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2018. 9, 107, 113, 114, 115, 118, 122, 132, 135, 136
[30] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convo-
lutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017. 9
[31] X. Liu, J. Van De Weijer, and A. D. Bagdanov, “Exploiting unlabeled data in CNNs by self-
supervised learning to rank,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2019. 9, 84, 136
[32] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng, “Crowd counting with
deep negative correlation learning,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2018. 9
[33] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-
scale consistency pursuit,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2018. 9
[34] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal
of Computer Vision (IJCV), 2004. 9
[35] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object
detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2014. 9
[36] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks
for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
9
[37] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer
Vision (CVPR), 2015. 9
[38] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 9, 120, 126
[39] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Sin-
gle shot multibox detector,” in Proceedings of the European Conference on Computer Vision
(ECCV), 2016. 9, 120
147
BIBLIOGRAPHY
[40] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, “CMS-RCNN: Contextual multi-scale region-
based CNN for unconstrained face detection,” in Deep Learning for Biometrics. Springer,
2017. 9
[41] V. A. Sindagi and V. M. Patel, “DAFE-FD: Density aware feature enrichment for face de-
tection,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision
(WACV), 2019. 10
[42] V. A. Lamme, H. Super, and H. Spekreijse, “Feedforward, horizontal, and feedback processing
in the visual cortex,” Current opinion in neurobiology, vol. 8, no. 4, pp. 529–535, 1998. 14, 15
[43] C. D. Gilbert and M. Sigman, “Brain states: top-down influences in sensory processing,” Neu-
ron, vol. 54, no. 5, pp. 677–696, 2007. 14
[44] V. Piëch, W. Li, G. N. Reeke, and C. D. Gilbert, “Network model of top-down influences on
local gain and contextual interactions in visual cortex,” Proceedings of the National Academy
of Sciences, vol. 110, no. 43, pp. E4108–E4117, 2013. 14
[45] R. Desimone, “Visual attention mediated by biased competition in extrastriate visual cortex.”
Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 353, no. 1373, p.
1245, 1998. 15
[46] D. M. Beck and S. Kastner, “Top-down and bottom-up mechanisms in biasing competition in
the human brain,” Vision research, vol. 49, no. 10, pp. 1154–1165, 2009. 15
[47] C. Gatta, A. Romero, and J. van de Veijer, “Unrolling loopy top-down semantic feedback in
convolutional deep networks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, 2014, pp. 498–505. 15
[48] A. Shrivastava and A. Gupta, “Contextual priming and feedback for faster r-cnn,” in European
Conference on Computer Vision. Springer, 2016, pp. 330–348. 15
[49] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” arXiv
preprint arXiv:1611.00850, 2016. 15
[50] K. Li, B. Hariharan, and J. Malik, “Iterative instance segmentation,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 3659–3667. 15
[51] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in
European Conference on Computer Vision. Springer, 2016, pp. 75–91. 15
148
BIBLIOGRAPHY
[52] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, “Beyond skip connections: Top-down
modulation for object detection,” arXiv preprint arXiv:1612.06851, 2016. 15
[53] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber, “Deep networks with internal selec-
tive attention through feedback connections,” in Advances in Neural Information Processing
Systems, 2014, pp. 3545–3553. 15
[54] Q. Wang, J. Zhang, S. Song, and Z. Zhang, “Attentional neural network: Feature selection
using cognitive feedback,” in Advances in Neural Information Processing Systems, 2014, pp.
2033–2041. 15
[55] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu
et al., “Look and think twice: Capturing top-down visual attention with feedback convolutional
neural networks,” in Proceedings of the IEEE International Conference on Computer Vision,
2015, pp. 2956–2964. 15
[56] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in Advances in Neural
Information Processing Systems, 2010. 21, 33, 49
[57] R. K. Sarvadevabhatla, S. Surya, and S. S. Kruthiventi, “Swiden: convolutional neural networks

for depiction invariant object recognition,” in Proceedings of the 24th ACM international con-
ference on Multimedia, 2016, pp. 187–191. 25
[58] S. An, W. Liu, and S. Venkatesh, “Face recognition using kernel ridge regression,” in IEEE
Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7. 34
[59] K. Chen, C. C. Loy, S. Gong, and T. Xiang, “Feature mining for localised crowd counting.” in
BMVC, vol. 1, no. 2, 2012, p. 3. 34
[60] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,”
Neural computation, vol. 3, no. 1, pp. 79–87, 1991. 38, 41
[61] S. Kumagai, K. Hotta, and T. Kurita, “Mixture of counting CNNs: Adaptive integra-
tion of CNNs specialized to specific appearance for crowd counting,” arXiv preprint
arXiv:1703.09393, 2017. 41, 42
[62] X. Qiang, G. Cheng, and Z. Wang, “An overview of some classical growing neural networks
and new developments,” in International Conference on Education Technology and Computer
(ICETC), vol. 3. IEEE, 2010, pp. V3–351. 42
149
BIBLIOGRAPHY
[63] V. Chaudhary, A. K. Ahlawat, and R. Bhatia, “Growing neural networks using soft competitive
learning,” International Journal of Computer Applications (0975-8887) Volume, 2011. 42
[64] I. Mrazova and M. Kukacka, “Image classification with growing neural networks,” Interna-
tional Journal of Computer Theory and Engineering, vol. 5, no. 3, p. 422, 2013. 42
[65] Y.-X. Wang, D. Ramanan, and M. Hebert, “Growing a brain: Fine-tuning by increasing model
capacity,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2017, pp. 2471–2480. 42
[66] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu, “HD-CNN: hier-
archical deep convolutional neural networks for large scale visual recognition,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp. 2740–2748. 42
[67] K. Ahmed, M. H. Baig, and L. Torresani, “Network of experts for large-scale image catego-
rization,” in European Conference on Computer Vision. Springer, 2016, pp. 516–532. 42
[68] Z. Wang, X. Wang, and G. Wang, “Learning fine-grained features via a CNN tree for large-scale
classification,” Neurocomputing, 2017. 42
[69] S. Lee, S. P. S. Prakash, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra, “Stochastic mul-
tiple choice learning for training diverse deep ensembles,” in Advances in Neural Information
Processing Systems, 2016, pp. 2119–2127. 43
[70] D. Babu Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd
counting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017. 43, 73, 118, 122
[71] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural net-
works,” Science, 2006. 55, 56, 57, 61, 82, 84
[72] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust
features with denoising autoencoders,” in Proceedings of the International Conference on Ma-
chine Learning (ICML), 2008. 55, 56, 57, 61, 82, 84
[73] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual under-
standing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017. 55, 57, 82, 84
150
BIBLIOGRAPHY
[74] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in Proceedings of the IEEE
International Conference on Computer Vision (ICCV), 2015. 55, 57, 84
[75] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in
Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 55, 57,
84
[76] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Fea-
ture learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2016. 55, 57, 65, 82, 84
[77] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context
prediction,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2015. 55, 57, 84
[78] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw
puzzles,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016. 55,
57, 82, 84
[79] A. Makhzani and B. J. Frey, “Winner-take-all autoencoders,” in Advances in Neural Informa-

tion Processing Systems (NIPS), 2015. 55, 57, 58, 59, 62, 82, 84, 121
[80] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proceedings of the Inter-
national Conference on Learning Representations (ICLR), 2013. 56, 82, 84
[81] P. Smolensky, “Information processing in dynamical systems: foundations of harmony theory,”

in Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT
Press, 1986, pp. 194–281. 57
[82] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in international conference

on artificial intelligence and statistics, 2009. 57
[83] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in

International Conference on Machine Learning, 2016, pp. 1747–1756. 57
[84] A. v. d. Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image

generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems,
2016. 57
151
BIBLIOGRAPHY
[85] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in International

Conference on Learning Representations, 2017. 57
[86] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville,

“Adversarially learned inference,” in International Conference on Learning Representations,
2017. 57
[87] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Proceedings of the Euro-
pean Conference on Computer Vision (ECCV), 2016. 57, 82, 84
[88] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic col-
orization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016. 57,
82, 84
[89] D. Jayaraman and K. Grauman, “Learning image representations tied to ego-motion,” in Pro-
ceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 57, 84
[90] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watch-
ing objects move,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017. 57, 84
[91] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsupervised learning using tem-
poral order verification,” in Proceedings of the European Conference on Computer Vision
(ECCV), 2016. 57, 84
[92] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson, “Learning visual groups from co-
occurrences in space and time,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016. 57, 84
[93] R. Zhang, P. Isola, and A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-
channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
[94] S. Jenni and P. Favaro, “Self-supervised feature learning by learning to spot artifacts,” in IEEE
Conference on Computer Vision and Pattern Recognition, June 2018. 57
[95] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis
and machine intelligence (TPAMI), 1986. 71, 91
152
BIBLIOGRAPHY
[96] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learn-
ing,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 74, 75, 84, 88, 98
[97] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting

image rotations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
[98] Z. Feng, C. Xu, and D. Tao, “Self-supervised representation learning by rotation feature decou-
pling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 82, 84
[99] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Advances in

Neural Information Processing Systems (NIPS), 2013. 83, 89, 90
[100] Q. Wang, J. Gao, W. Lin, and Y. Yuan, “Learning from synthetic data for crowd counting in
the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2019. 84
[101] T. Nathan Mundhenk, D. Ho, and B. Y. Chen, “Improvements to context based self-supervised
learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2018. 84
[102] S. Jenni and P. Favaro, “Self-supervised feature learning by learning to spot artifacts,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
84
[103] L. Zhang, G.-J. Qi, L. Wang, and J. Luo, “AET vs. AED: Unsupervised representation learning
by auto-encoding transformations rather than data,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2019. 84
[104] A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distributions in empirical data,”

SIAM review, 2009. 85
[105] D. Helbing, A. Johansson, and H. Z. Al-Abideen, “Dynamics of crowd disasters: An empirical

study,” Physical review E, 2007. 85, 86
[106] M. Moussaı̈d, D. Helbing, and G. Theraulaz, “How simple rules determine pedestrian behavior
and crowd disasters,” Proceedings of the National Academy of Sciences, 2011. 85, 86
153
BIBLIOGRAPHY
[107] I. Karamouzas, B. Skinner, and S. J. Guy, “A universal power law governing pedestrian inter-
actions,” APS, 2015. 85
[108] D. Helbing, C. Kühnert, S. Lämmer, A. Johansson, B. Gehlsen, H. Ammoser, and G. B. West,

“Power laws in urban supply networks, social systems, and dense pedestrian crowds,” in Com-
plexity Perspectives in Innovation and Social Change. Springer, 2009, pp. 433–450. 85
[109] D. Babu Sam, N. N. Sajjan, H. Maurya, and R. V. Babu, “Almost unsupervised learning for
dense crowd counting,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
93, 94, 95, 121, 129
[110] X. Liu, J. van de Weijer, and A. D. Bagdanov, “Leveraging unlabeled data for crowd counting
by learning to rank,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018. 113, 114, 135
[111] Y. Liu, M. Shi, Q. Zhao, and X. Wang, “Point in, box out: Beyond counting persons in crowds,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2019. 121, 134, 135, 136
[112] D. Babu Sam, N. N. Sajjan, R. V. Babu, and M. Srinivasan, “Divide and grow: Capturing
huge diversity in crowd images with incrementally growing CNN,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 122
[113] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,

A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recogni-
tion Challenge,” International Journal of Computer Vision (IJCV), 2015. 122
[114] D. Babu Sam and R. V. Babu, “Top-down feedback for crowd counting convolutional neural
network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 124, 137
[115] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “From facial parts responses to face detection: A deep
learning approach,” in Proceedings of the IEEE International Conference on Computer Vision,
2015. 134
154

Thesis

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Thesis

Uploaded by

Copyright:

Available Formats

A study on Deep Learning Approaches, Architectures and

Training Methods for Crowd Analysis

Deepak Babu Sam

Computational and Data Sciences

Signature of the Thesis Supervisor: .............................................

Publications from the Thesis iv

List of Figures xiii

List of Tables xviii

I Addressing Crowd Diversity 12

2 Top-Down Feedback to correct Errors 13

3 Switching CNN to capture Crowd Diversity 25

4 Incrementally Growing CNN to adapt with larger Crowd Varieties 40

4.3.2 UCF CC 50 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

II Addressing Data Scarcity 54

5 Almost Unsupervised Learning 55

6 Binary Supervision for Density Regression 68

6.2.4 UCF-CC-50 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

7 Complete Self-Supervision via Distribution Matching 82

III Addressing Person Localization 103

8 Spot-on Dot Prediction for Dense Crowds 104

8.1.2 Multi-Scale Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

9 Dense Detection to accurately resolve People in Crowds 118

10 Summary, Conclusion and Future Directions 141

5.4 Sample predictions given by GWTA-CCNN on images from Shanghaitech dataset.

7.1 Self-Supervision Vs Complete Self-Supervision: Normal self-supervision techniques

2.1 Comparison of TDF-CNN to other methods on Part A and Part B of Shanghaitech

3.1 Comparison of Switch-CNN with other state-of-the-art crowd counting methods on

5.1 Performance of GWTA-CCNN on Part A of Shanghaitech dataset. . . . . . . . . . . 62

7.1 Performance comparison of CSS-CCNN with other methods on Shanghaitech PartA

9.9 MAE obtained by LSC-CNN with different hyper-parameter settings. . . . . . . . . 137

1.1 Challenges in Dense Crowd Counting

1.2 Density Regression Paradigm

1.3 Datasets and Evaluation for Dense Crowd Analysis

1.4 Related Works in Dense Crowd Counting

1.5 Organization of the Thesis

Addressing Crowd Diversity

Top-Down Feedback to correct Errors

2.1 Related Works for Top-Down Architectures

2.2 Our Approach

2.2.1 Feedback as a Correcting Signal

Ce 1x1 | 1 Ce 1x1 | 1 Ce 1x1 | 1

MP3a 2x2 MP3a 2x2

C3c 3x3|16 C3d 3x3|24 C3c 3x3|16 C3d 3x3|24

C : Convolution : Bottom-Up CNN

(a) Bottom-Up CNN (b) Feedback Generation (c) Applying Feedback

2.2.2 Top-Down Feedback CNN

2.2.3 Training of Bottom-Up CNN

with parameters Θ, the l2 loss function is given by

2.2.4 Training of Top-Down CNN

2.3.1 Shanghaitech dataset

2.3.2 UCF CC 50 dataset

2.3.3 WorldExpo’10 dataset

2.4 Ablations and Analysis

2.4.1 Effectiveness of Feedback

Switching CNN to capture Crowd Diversity

3.1 Our Approach

3.1.3 Differential Training

Initialize Θk ∀ k with random Gaussian weights

Pretrain {Rk }3k=1 for Tp epochs : Rk ← fk (·; Θk ) ;

/*Differential Training for Td epochs*/

from ground truth be CiGT = GT

ECi (k) = |Cik − CiGT |, (3.2)

The count error for ith sample is

3.1.4 Switch Training

/Differential Training for Td epochs/