IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 19, NO. 5, MAY 2018

Image-Based Learning to Measure Traffic Density Using a Deep Convolutional Neural Network
Jiyong Chung and Keemin Sohn

Abstract— Existing methodologies to count vehicles from a road image have depended upon both hand-crafted feature engineering and rule-based algorithms. These require many predefined thresholds to detect and track vehicles. This paper provides a supervised learning methodology that requires no such feature engineering. A deep convolutional neural network was devised to count the number of vehicles on a road segment based solely on video images. The present methodology does not regard an individual vehicle as an object to be detected separately; rather, it collectively counts the number of vehicles as a human would. The test results show that the proposed methodology outperforms existing schemes.

Index Terms— Deep convolutional neural network (CNN), machine learning, traffic density, vehicle counting.

Manuscript received September 16, 2016; revised February 23, 2017; accepted July 22, 2017. Date of publication August 16, 2017; date of current version May 2, 2018. This work was supported in part by the Chung-Ang University Research Scholarship Grants in 2016 and in part by the National Research Council of Science and Technology Grant by the Korea Government (MSIP) under Grant CRC-15-05-ETRI. The Associate Editor for this paper was K. Wang. (Corresponding author: Keemin Sohn.) The authors are with the Laboratory of Big-data Applications in Public Sectors, Chung-Ang University, Seoul 06974, South Korea (e-mail: jiyong369@hanmail.net; kmsohn@cau.ac.kr). Digital Object Identifier 10.1109/TITS.2017.2732029

I. INTRODUCTION

THE traffic state is represented by three conventional parameters: traffic volume, speed, and density. Whereas existing surveillance systems such as loop detectors can easily measure the former two parameters in the field, measuring the density is difficult, even though the density is a decisive parameter by which the service level is determined [1]. Recently, as computer vision technology has advanced, many researchers have focused on detecting, tracking, and classifying vehicles from video images, which has reaped very promising results [2]–[6].

According to the taxonomy in [7], existing computer vision technologies to detect and track moving objects can be broken down into three branches: temporal difference, optical flow, and background subtraction. The temporal difference method utilizes the differences between two consecutive images to detect objects [8], [9]. This method is very vulnerable where unexpected noise occurs. The optical-flow method depends on obtaining an effective background image as a baseline to detect objects [10]. Background subtraction is the most prevalent method in the field [11], [12]. This method utilizes a static background image that is prepared in advance, and then regards the image as a baseline against which to compare other images that include objects. Namely, silhouettes are drawn by black pixels with an intensity that far surpasses that of the background.

The aforementioned description covers only drawing objects separately from the background. Even when localizing the pixels of objects succeeds, counting the number of objects is difficult. In order to count objects, another segmentation process is necessary. There have been many different methods used to recognize a blob, ranging from drawing a bounding box based on convex hull theory [11] to utilizing an edge detector [13]. Of course, machine learning technologies can also be applied [11], [14]. Consequently, all existing approaches, at least in part, depend on several arbitrarily chosen rules and hand-crafted engineering to extract features.

The present study provides a simple approach to count vehicles on a road segment based solely on video images with no hand-crafted feature engineering. Quite a few studies have dealt with collectively counting crowds using regression approaches [15]–[17]. However, studies have rarely utilized a regression approach to collectively count vehicles for the purpose of measuring the traffic density on a road segment, even though this is much simpler than detecting, tracking, and classifying vehicles on an individual basis. Of course, once tracking an individual vehicle succeeds, the counting of vehicles becomes trivial. However, such elaborate technology is redundant where the measured traffic density is utilized simply to evaluate the traffic state (e.g., the level of service) based on the Highway Capacity Manual (HCM), along with traffic volumes and speeds collected from the existing spot detectors. In the study field of traffic-flow theory, measuring traffic density has long been regarded as impossible. Traditionally, the density had to be approximated by the occupancy rate measured from spot detectors. Thus, success in counting vehicles simply from road images would be a significant breakthrough in advancing existing traffic control and management.

Unlike previous studies wherein a conventional feed-forward neural network was employed for crowd counting [15]–[17], the present study adopted a deep convolutional neural network (CNN) to estimate the number of vehicles from a video shoot. CNNs have recently recorded great success in recognizing medical CT images and human faces in the field of computer vision [18]–[20]. The present study began with the expectation that a CNN must perform well in counting vehicles, which is much simpler than recognizing medical CT images or human faces.

The next section describes the entire framework of the present vehicle-counting scheme and expounds on the principle of a CNN. How to collect data to train and test the CNN is described in the third section. The counted results and comparisons with those from the most prevalent methodologies, as well as with those from other previous studies adopting various methodologies, are shown in the fourth section. The fifth section draws conclusions and provides possible extensions for the present study.
II. MODELING FRAMEWORK

Preparing data to feed a CNN is the starting point of the present study. The input features of a CNN are the RGB values of an image at each pixel level. Whereas a CNN requires no preprocessing to extract input features, each input image should have a label, since a CNN belongs to the category of supervised machine learning. In the present study, vehicles within each input image were counted manually in order to tag a label to the image. This labeling task is easier than that performed by existing CNNs to detect objects, which requires drawing a bounding box for each target object. Nonetheless, it may take great effort to manually count vehicles in all input images.


Fig. 1. CNN model structure.

An efficient way to circumvent this difficulty will be suggested in the fourth section.

Input images to train and test a CNN model were obtained from video shoots taken at the approach of an actual intersection. Video shoots for every single second were chosen to prepare the input images. Most machine learning models are susceptible to being over-fitted to the training data. To avoid over-fitting, the model after training should be validated against a new dataset that has never been used in the training stage. Thus, it is important to divide the available input images into a training set and a test set. After dividing the input data, the training set was augmented using various filters, so that the CNN model could accommodate different situations that the original training data did not account for.

Unfortunately, at the present time, there is no systematic way to determine the best model structure for a CNN within a practical computing time. A plausible model structure must be selected by trial and error. While finding the best model structure, hyper-parameters should be determined using a third dataset other than the training and test datasets. To establish a model structure, 5% of the training images were selected. After training, the model performance was evaluated and compared based on the test data that had been separated from the training data. The background subtraction method was chosen as a baseline to verify the utility of the present model. Finally, the comparison was conducted based on three performance indices: the mean absolute error (MAE), the correlation coefficient with the observed numbers, and the percent root mean squared error (%RMSE).
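For concreteness, the three indices can be computed from paired observed and estimated counts as in the short Numpy sketch below. This is our illustration rather than the authors' code: the function name is ours, and normalizing the %RMSE by the mean observed count is an assumption, since the paper does not write the formula out.

```python
import numpy as np

def evaluate_counts(y_true, y_pred):
    """Three performance indices for vehicle-count estimates.

    y_true: observed vehicle counts; y_pred: counts estimated by the CNN.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))            # mean absolute error
    corr = np.corrcoef(y_true, y_pred)[0, 1]          # correlation with observed numbers
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    pct_rmse = 100.0 * rmse / y_true.mean()           # %RMSE; mean-normalization assumed
    return mae, corr, pct_rmse
```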
A CNN model within the entire modeling framework played a key role in counting vehicles using a video image. However, it is difficult to explain what mechanism makes it possible to count vehicles. The most plausible way to guess the mechanics is to investigate the high-level features that the CNN extracts via filters. The features that the present model extracted from the traffic images are shown and discussed in the fourth section.

Fig. 1 shows the structure of the CNN model that was adopted in the present study. The original high-resolution (90 × 600) input images were downsized into a tractable dimension (30 × 200). Since each input image was in color with RGB values, the dimensions of the input image were 3 × 30 × 200. The first convolution layer was created using 40 3 × 3 × 3 filters, each of which slid through an input image with a stride value of 1. Each cell value of the convolutional hidden layer was computed by the linear combination of all weights of a filter and the values of the portion of a target image that the filter covered, and then was activated by a rectified linear unit (ReLU). The ReLU outperformed the conventional sigmoid function, which is one of the recent breakthroughs for deep learning [21]. At this stage, each filter captured its own basic feature regardless of the feature location within an image. In addition, using filters had the advantage of reducing the number of weight parameters to be estimated, since each filter shared weight parameters wherever it resided within an image. To avoid a layer-by-layer dimensionality reduction, prior to convolving the filters, target images were padded with null columns and rows that consisted of 0s. After convolution, a new layer was created by pooling each of the 5 × 5 cells of the convoluted layer with average values, which had a smoothing effect on the images.

At the next stage, a second convolution layer was created by allowing 80 2 × 2 × 40 filters to slide through the previous pooled layer. The second-level convolution filters extracted more complex features than those elicited by the first-level filters. After average pooling again, the second convolutional hidden layer was flattened to facilitate connection to a generic hidden layer. The connection between the flattened layer and the next fully connected layer was the same as that between two consecutive hidden layers of a feed-forward neural network. The second fully connected layer linearly fed the final output layer, a single node that represented the observed number of vehicles.
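Read literally, Fig. 1 and the description above translate into a few lines of Keras. The sketch below is our rendering, not the authors' released code: the size of the second pooling window, the width of the fully connected layer, and the number of fully connected layers before the output are not reported, so the values used here are placeholders.

```python
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

model = Sequential([
    # 40 filters of size 3x3 sliding over the 30x200x3 input with stride 1;
    # 'same' padding supplies the null (zero) rows and columns described above
    Conv2D(40, (3, 3), strides=1, padding='same', activation='relu',
           input_shape=(30, 200, 3)),
    AveragePooling2D(pool_size=(5, 5)),    # average pooling over 5x5 cells
    # 80 filters of size 2x2 over the 40 first-level feature maps
    Conv2D(80, (2, 2), padding='same', activation='relu'),
    AveragePooling2D(pool_size=(2, 2)),    # second pooling; window size assumed
    Flatten(),                             # flatten for the generic hidden layer
    Dense(512, activation='relu'),         # fully connected layer; width assumed
    Dense(1, activation='linear'),         # single output node: the vehicle count
])
```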
A CNN is known to recognize objects irrespective of scale, location, or orientation. In particular, one of the main motivations of the present study was to confirm whether a CNN can count partially occluded vehicles. Also, real-world traffic images may contain either a few instances of vehicles or a very large number of them. Whether a CNN can count vehicles regardless of the congestion level was another issue that the present study tried to resolve. Answers to these questions will be clearly presented in the fourth section.

The training method of a CNN is no different from that of a feed-forward neural network. The basic theory is to derive the weight parameters that minimize the sum of squared errors between the observed and estimated output values, which is formulated as a loss function. A back-propagation algorithm is used to derive the gradient of the loss function with respect to each weight parameter. The algorithm, however, has a fatal drawback: the error derivatives are likely to be lessened as they are propagated from the top to the bottom layers, a phenomenon referred to as the vanishing gradient problem. Owing to adopting a ReLU for activating the node values instead of the conventional sigmoid function, the back-propagation algorithm successfully trained the proposed CNN model. A ReLU maps node values greater than 0 to themselves and values less than 0 to 0, which prevents the vanishing gradient problem. Readers who are interested in the details of CNNs can refer to [22].
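In Keras terms, the whole training procedure therefore collapses into a compile and a fit call with a squared-error loss, as sketched below. The optimizer, batch size, and array names are assumptions, since the paper reports none of them; only the 5,000-epoch cap is taken from the paper (Section IV).

```python
# Mean squared error equals the sum of squared errors up to a constant factor;
# gradients are back-propagated through the ReLU activations defined above.
model.compile(optimizer='adam', loss='mean_squared_error',
              metrics=['mean_absolute_error'])
model.fit(x_train, y_train, epochs=5000, batch_size=128,
          validation_data=(x_test, y_test))   # x/y arrays: hypothetical names
```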
Another advantage of a CNN is that the number of weight parameters to be estimated can be reduced considerably compared with adopting a conventional feed-forward neural network. A feed-forward neural network has a large number of weight parameters because each cell of an input image should connect to all hidden nodes of the second hidden layer. A CNN, however, takes only filter parameters into account, which makes it possible to recognize a large-dimension image.
III. DATA COLLECTION

An actual intersection in Seoul that has 5 legs was chosen as the test bed (see Fig. 2). A daytime video stream (6 hours and 26 minutes) during a weekday was available. A 150 m-long approach to the intersection was selected to collect input images. Fig. 2 shows the configuration of the entire intersection and the selected approach. The video stream was converted to .png images by selecting snapshots for every second, which resulted in a total of 23,164 images.
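The paper does not say which tool carved the stream into one snapshot per second; one plausible implementation, using OpenCV, is sketched below (the file name and output pattern are hypothetical).

```python
import cv2  # OpenCV: our choice; the paper does not name its extraction tool

cap = cv2.VideoCapture('approach.mp4')         # hypothetical video file
fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:                                 # end of the stream
        break
    if idx % fps == 0:                         # keep one frame per second
        cv2.imwrite('frame_%05d.png' % saved, frame)
        saved += 1
    idx += 1
cap.release()
```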


TABLE I
TRAINING AND TESTING RESULTS OF CNN

Fig. 2. Configuration of the testbed. The upper photo shows a video shoot of the entire intersection, and the area shaded green represents the selected approach. The small photo inset in the upper right corner is an aerial view of the intersection. The bottom photo shows the approach after adjusting the orientation and viewpoint. The resolution of the final rectangle is 30 × 200 after resizing.

Among the saved images, 4,632 (= 20%) were randomly chosen to test the trained model at a later time. The remaining 18,532 images were left for training, and these were then augmented to consider situations that the original images did not convey. The size of the augmented training set increased to 111,192. For every original training image, three more blurry images and two sharper images were newly synthesized by using 5 different filters provided by the Python image processing toolbox. There was no augmentation of the test images. The procedure above was repeated three times to consider sampling variances. Two different shares of test data (30% and 40%) were also tested.
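A minimal sketch of this augmentation is given below, assuming the "Python image processing toolbox" refers to Pillow (PIL); the specific blur radii and the two sharpening filters are our guesses, since the paper states only that each original yields three blurrier and two sharper variants.

```python
from PIL import Image, ImageFilter

def augment(path):
    """Return the original plus three blurrier and two sharper variants."""
    img = Image.open(path)
    blurry = [img.filter(ImageFilter.GaussianBlur(radius=r)) for r in (1, 2, 3)]
    sharper = [img.filter(ImageFilter.SHARPEN),
               img.filter(ImageFilter.UnsharpMask(radius=2))]
    # 6 images per original, consistent with 18,532 x 6 = 111,192
    return [img] + blurry + sharper
```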
Fig. 3. Distribution of vehicle number.

The distribution of the number of vehicles on the selected approach is shown in Fig. 3. The two modes illustrate a typical pattern of daily traffic demand in urban areas. The mean number of vehicles on the selected approach was 53.2, and the minimum and maximum numbers were 5 and 82, respectively.

IV. RESULTS AND COMPARISONS

A. Computing Environment

Prior to discussing the results and their comparisons with those from the existing methods, the computing environment must be clarified. To train the proposed CNN model, a graphics processing unit (GPU) was needed to compute the matrices of weight parameters in parallel. The computing time to train the proposed CNN using only a central processing unit (CPU) with 12 cores was 1,160-fold the GPU-based computing time. This shows that a deep CNN cannot be learned without a GPU. The GPU utilized in the present study was an NVIDIA Quadro K5200. The GPU-based training took only 42 seconds to run a single epoch. The maximum number of epochs needed to reach convergence was set at 5,000, so that the total computing time would be within a practical range (about two days).

Several software libraries were mobilized to train a deep CNN model on a GPU. Keras (https://github.com/fchollet/keras) is a library that provides high-level deep-learning functions. Keras should be implemented with TensorFlow (http://tensorflow.org/) running at the back end to ensure that it can utilize TensorFlow's low-level functions on a GPU. The Numpy (http://www.numpy.org/) library is also needed to handle tensor arrays, so that Python, the main programming language, can handle tensor-type data with Keras functions. These open-source libraries were downloaded from the Internet and were used for training and testing the proposed CNN model.

B. Results

The testing results are shown in Table I. As expected, there was little difference between the training and testing errors, although the testing results were slightly worse, which was evidence that the model was not overfitted. There was also little difference over the 9 trials. None of the MAEs exceeded 1.6. This implies that the average difference between the estimated and observed counts was less than 1.6 vehicles, which was very promising when one considers that the average number of vehicles within an image was 53.2. In what follows, the performance of the proposed CNN will be discussed based on the first trial with the 20% test examples, which was the worst case.

The comparative concept of precision vs. recall is useful for assessing the performance of a classification model. Although our model is a regression model, the regression result can easily be converted to a classification result. That is, if an estimated value fell within a certain range (±1) around the ground truth, the classification was regarded as a success; otherwise, it was regarded as a failure. The average precision and recall values were 60% and 57%, respectively. These similar, but seemingly unsatisfactory, results were acceptable for the purpose of the present study. It should be noted again that the purpose of counting vehicles in the present study was not to detect and identify an individual vehicle, but to collectively count vehicles and thus to compute the traffic density for traffic control and management at the HCM level. Fig. 4 shows the confusion matrix by which the average precision and recall values were computed; the stronger the color, the larger the count in a cell. The observed number of cars ranged widely, from 5 to 82, along both the vertical and horizontal axes.
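Our reconstruction of this conversion is sketched below: the regression output is rounded to an integer count, a prediction within ±1 of the ground truth counts as a success, and precision and recall are averaged over the count classes. The exact averaging is not spelled out in the paper, so treat this as one plausible reading.

```python
import numpy as np

def avg_precision_recall(y_true, y_pred, tol=1):
    """Average precision/recall after converting count regression to classification."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.rint(np.asarray(y_pred)).astype(int)   # round estimates to counts
    hit = np.abs(y_pred - y_true) <= tol               # success: within +/-1 of truth
    precision = np.mean([hit[y_pred == c].mean() for c in np.unique(y_pred)])
    recall = np.mean([hit[y_true == c].mean() for c in np.unique(y_true)])
    return precision, recall
```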


Generally, it is difficult to account for how a CNN can count the number of vehicles exactly. However, the filters were expected to abstract the features of an image, objects were recognized from the features, and the objects were then counted via the last fully connected layer. Each filter's role in extracting a specific feature could be visualized through its weight parameters.

Each 3 × 3 filter of the first convolutional hidden layer had 9 weights, each of which corresponded to a cell within the filter. For example, if a filter cell weight is positive (red) or negative (blue), the filter captures, to the extent of that weight value, the pixel of the input image that is covered by the cell. On the other hand, if a filter cell value is 0 (white), the filter discards the pixel of the input image that is covered by the filter cell. Fig. 5 depicts 3 sets of 40 different features abstracted by the 3 × 3 filters of the first convolution layer, in RGB order. Unfortunately, the features at this stage were not intuitively recognizable enough to be linked to the known characteristics of a vehicle or any other object within the input images.

Fig. 4. Confusion matrix for the result of the present model.

Fig. 5. 40 features abstracted by the filters of the first convolution layer.

C. Comparisons With Other Methods

TABLE II
COMPARISON BETWEEN BACKGROUND SUBTRACTION AND CNN

Fig. 6. Example of a real image (top), the background image (middle), and the silhouettes after background subtraction (bottom).

The testing result was compared with those from a prevalent method that is used to count objects in the field of computer vision. Background subtraction was chosen as the baseline reference. Averaging all RGB images of the selected approach at the pixel level led to a background image. Each image was then compared with this background. That is, a pixel of the target image was assigned 1 if the difference between the maximum of the pixel's RGB values and the corresponding pixel value in the background exceeded a threshold value (= 30); otherwise, it was assigned 0. Fig. 6 shows the background and an example of an image after the background subtraction.

The next step was to find the edges of the blobs of black pixels. There are many hand-crafted technologies that can be used to box blobs within an image. The present study adopted several naïve rules. First, an object was recognized if its blob size was larger than a predefined threshold (= 18). Second, larger blobs due to occlusion were separated into two or more objects when the blob size exceeded a multiple of the threshold. In addition, small blobs were merged together if the distance between their center points was less than a threshold value (= 4). The proposed model was superior to this naïve method, as shown in Table II.
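The baseline can be reproduced roughly as follows, using scipy.ndimage for blob labeling. This is a simplified sketch under stated assumptions: the pixel-difference test is our reading of the rule above, the occlusion rule is applied as an integer division by the size threshold, and the centroid-merging step (distance < 4) is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

DIFF_THR = 30   # pixel-difference threshold against the background
MIN_BLOB = 18   # minimum blob size recognized as one vehicle

def count_by_subtraction(images):
    """images: float array (n, H, W, 3) of RGB frames of the selected approach."""
    background = images.mean(axis=0)               # pixel-level average image
    counts = []
    for img in images:
        diff = np.abs(img - background).max(axis=2)
        mask = diff > DIFF_THR                     # 1 where a vehicle pixel is likely
        labels, n = ndimage.label(mask)            # connected-component blobs
        sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
        # one vehicle per qualifying blob; blobs spanning several multiples of
        # MIN_BLOB are split into that many vehicles (occlusion rule)
        counts.append(int(sum(max(1, s // MIN_BLOB) for s in sizes if s >= MIN_BLOB)))
    return counts
```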


Fig. 7. Capability of the CNN to count hidden vehicles. The proposed CNN matched the exact number of vehicles within the upper image, whereas background subtraction did not count the vehicles hidden behind trees and traffic signs, as shown in the upper image.

Fig. 7 shows that the proposed CNN exactly counted the vehicles occluded behind trees and mounted traffic signs, whereas the background subtraction model was unable to detect those vehicles. It was confirmed that the CNN model was able to count occluded vehicles as humans do, as long as a large amount of correctly labeled data could inform the model that there were vehicles hidden behind some obstacles.

Although the proposed CNN outperformed the naïve background subtraction model, stronger evidence was necessary to guarantee its applicability. Unfortunately, learning-based methods to collectively count vehicles on a road segment are rare in the literature. Rather, several studies have counted pedestrians using regression-type learning algorithms [23]–[27]. One of these studies had already successfully adopted a CNN model to count pedestrians in a crowded environment [27]. Even though the pedestrian-counting methods cannot be directly applied to counting vehicles, their performance could be a good benchmark against which to evaluate the performance of the proposed CNN model.

TABLE III
COMPARISON WITH OTHER STUDIES

Table III lists the MAEs and %RMSEs excerpted from 5 different pedestrian-counting algorithms for a common dataset, and shows that a CNN-based model outperformed the other regression-based learning algorithms in counting pedestrians. The MAE value of the proposed CNN model was superior to that of the best model that employed a CNN to count pedestrians. Furthermore, since the present dataset ranged more widely, the %RMSE of the proposed model was much less than that of the previous pedestrian-counting model.

D. Reducing the Effort to Tag Labels

The present study was totally dependent upon supervised learning. This means that the proposed model requires a large amount of labeled data. In the present study, 23,164 images were labeled manually, which required an exhaustive amount of time and effort. Actually, it is impractical to tag labels manually on all training and testing images whenever the vehicles on a new road approach must be counted.

An innovative way to tackle this problem was suggested, and it is described in this sub-section. After pre-training a CNN model using images tagged with an approximately counted number of vehicles, the pre-trained model was fine-tuned based on only a small number of exactly labeled images. A naïve background subtraction method could be the best fit for providing the approximated number of vehicles.

TABLE IV
PERFORMANCE OF THE PRE-TRAINED MODEL

Table IV shows the performance of the doubly trained model. The performance of the model, after pre-training with labels from the naïve background subtraction method, did not differ significantly from that of the original model that was trained based on fully labeled images. Only 3,706 images were sufficient to fine-tune the model, which saved a considerable amount of time and effort in acquiring a robust model.

The dual training scheme considerably reduced the computing time for training. Table IV shows that implementing the new approach took only 13 seconds per epoch, whereas training with fully labeled images required 42 seconds per epoch.
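Schematically, the dual scheme is two rounds of fitting on the same network: pre-train on the cheap, approximate labels, then fine-tune on the small, exactly labeled subset. The sketch below reflects that structure only; the epoch counts, batch size, reduced fine-tuning learning rate, and array names are all our assumptions.

```python
from keras.optimizers import Adam

# Stage 1: pre-train on labels approximated by background subtraction
model.compile(optimizer=Adam(lr=1e-3), loss='mean_squared_error')
model.fit(x_all, y_approx, epochs=50, batch_size=128)      # approximate labels

# Stage 2: fine-tune on the exactly labeled subset (3,706 images in the paper);
# a smaller step size for fine-tuning is assumed, not reported
model.compile(optimizer=Adam(lr=1e-4), loss='mean_squared_error')
model.fit(x_exact, y_exact, epochs=50, batch_size=128)     # exact labels
```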

V. CONCLUSIONS

The present study was a demonstration of a novel approach to counting vehicles on a road segment in order to accurately quantify the traffic density at an aggregate level for traffic control and management. The approach succeeded in counting vehicles with an acceptable accuracy that was comparable to the results of existing methodologies. It was concluded that the proposed CNN model is applicable to measuring the traffic density at the HCM level.

However, further studies will be necessary to tackle several difficulties regarding the proposed approach. Even though the proposed model required no hand-crafted feature engineering, how to determine the hyper-parameters of a CNN was not broached. A CNN model contains several hyper-parameters, such as the number of hidden layers, the number of hidden nodes within each hidden layer, the filter size for each convolutional hidden layer, and the number of filters used for each hidden layer. A systematic way to determine the optimal values of these hyper-parameters will govern the performance of counting vehicles.

In addition, vehicle details were ignored when counting vehicles in the present CNN model. Namely, the CNN model counted vehicles regardless of size, make, and type. Of course, the purpose of counting vehicles was confined to evaluating traffic flows at the aggregate level in the present study. However, in the future, advanced counting technology should recognize the details of each vehicle. In particular, distinguishing between moving and stopped vehicles is very important for traffic control and management. A CNN model based on consecutive images is now under construction to count vehicles while discerning whether each vehicle is moving or not. If this succeeds, the next version of the present study will measure the space mean speed as well as the traffic density. The space mean speed is another important parameter in traffic engineering, and it cannot be measured directly with the existing surveillance systems.

REFERENCES

[1] P. Ryus, M. Vandehey, L. Elefteriadou, R. G. Dowling, and B. K. Ostrom, Highway Capacity Manual. Washington, DC, USA: Transportation Research Board, 2010.
[2] S. Messelodi, C. M. Modena, and M. Zanin, "A computer vision system for the detection and classification of vehicles at urban road intersections," Pattern Anal. Appl., vol. 8, nos. 1–2, pp. 17–31, Sep. 2005.
[3] N. Buch, J. Orwell, and S. A. Velastin, "Urban road user detection and classification using 3-D wireframe models," IET Comput. Vis. J., vol. 4, no. 2, pp. 105–116, Jun. 2010.
[4] H. Veeraraghavan, O. Masoud, and N. Papanikolopoulos, "Vision-based monitoring of intersections," in Proc. IEEE 5th Int. Conf. Intell. Transp. Syst., Singapore, Sep. 2002, pp. 7–12.
[5] K. Park, D. Lee, and Y. Park, "Video-based detection of street-parking violation," in Proc. Int. Conf. Image Process., Comput. Vis., Pattern Recognit., vol. 1, Las Vegas, NV, USA, 2007, pp. 152–156.
[6] S. Atev, H. Arumugam, O. Masoud, R. Janardan, and N. P. Papanikolopoulos, "A vision-based approach to collision prediction at traffic intersections," IEEE Trans. Intell. Transp. Syst., vol. 6, no. 4, pp. 416–423, Dec. 2005.
[7] C. Ozkurt and F. Camci, "Automatic traffic density estimation and vehicle classification for traffic surveillance systems using neural networks," Math. Comput. Appl., vol. 14, no. 3, pp. 187–196, Dec. 2009.


[8] M. T. López, A. Fernández-Caballero, J. Mira, A. E. Delgado, and M. A. Fernández, "Algorithmic lateral inhibition method in dynamic and selective visual attention task: Application to moving objects detection and labelling," Expert Syst. Appl., vol. 31, no. 3, pp. 570–594, Oct. 2006.
[9] M. T. López, A. Fernández-Caballero, M. A. Fernández, J. Mira, and A. E. Delgado, "Visual surveillance by dynamic visual attention method," Pattern Recognit., vol. 39, no. 11, pp. 2194–2211, Nov. 2006.
[10] X. Ji, Z. Wei, and Y. Feng, "Effective vehicle detection technique for traffic surveillance systems," J. Vis. Commun. Image Represent., vol. 17, no. 3, pp. 647–658, Jun. 2006.
[11] J. Zhou, D. Gao, and D. Zhang, "Moving vehicle detection for automatic traffic monitoring," IEEE Trans. Veh. Technol., vol. 56, no. 1, pp. 51–59, Jan. 2007.
[12] X. Niu, "A semi-automatic framework for highway extraction and vehicle detection based on a geometric deformable model," ISPRS J. Photogram. Remote Sens., vol. 61, nos. 3–4, pp. 170–186, Dec. 2006.
[13] Z. Zhu and G. Xu, "VISATRAM: A real-time vision system for automatic traffic monitoring," Image Vis. Comput., vol. 18, no. 10, pp. 781–794, Jul. 2000.
[14] J.-W. Hsieh, S.-H. Yu, Y.-S. Chen, and W.-F. Hu, "Automatic traffic surveillance system for vehicle tracking and classification," IEEE Trans. Intell. Transp. Syst., vol. 7, no. 2, pp. 175–187, Jun. 2006.
[15] S.-Y. Cho, T. W. S. Chow, and C.-T. Leung, "A neural-based crowd estimation by hybrid global learning algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 4, pp. 535–541, Aug. 1999.
[16] D. Kong, D. Gray, and H. Tao, "A viewpoint invariant approach for crowd counting," in Proc. Int. Conf. Pattern Recognit., Santa Cruz, CA, USA, 2006, pp. 1187–1190.
[17] A. N. Marana, S. A. Velastin, L. F. Costa, and R. A. Lotufo, "Estimation of crowd density using image processing," in Proc. IEE Colloq. Image Process. Secur. Appl., Mar. 1997, pp. 11/1–11/8.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[20] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9.
[21] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), Haifa, Israel, Jun. 2010, pp. 807–814.
[22] CS231n Convolutional Neural Networks for Visual Recognition. Accessed: Aug. 5, 2017. [Online]. Available: http://cs231n.github.io/convolutional-networks/
[23] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Minneapolis, MN, USA, Jun. 2007, pp. 1–7.
[24] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in Proc. Brit. Mach. Vis. Conf., Wales, U.K., 2012, vol. 1, no. 2, pp. 3–14.
[25] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Anchorage, AK, USA, Jun. 2008, pp. 1–7.
[26] K. Chen, S. Gong, T. Xiang, and C. C. Loy, "Cumulative attribute space for age and crowd density estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Portland, OR, USA, Jul. 2013, pp. 2467–2474.
[27] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 833–841.
