Image-Based Learning To Measure Traffic Density Using A Deep Convolutional Neural Network
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 19, NO. 5, MAY 2018
Abstract— Existing methodologies to count vehicles from a road image have depended upon both hand-crafted feature engineering and rule-based algorithms. These require many predefined thresholds to detect and track vehicles. This paper provides a supervised learning methodology that requires no such feature engineering. A deep convolutional neural network was devised to count the number of vehicles on a road segment based solely on video images. The present methodology does not regard an individual vehicle as an object to be detected separately; rather, it collectively counts the number of vehicles as a human would. The test results show that the proposed methodology outperforms existing schemes.

Index Terms— Deep convolutional neural network (CNN), machine learning, traffic density, vehicle counting.

Manuscript received September 16, 2016; revised February 23, 2017; accepted July 22, 2017. Date of publication August 16, 2017; date of current version May 2, 2018. This work was supported in part by the Chung-Ang University Research Scholarship Grants in 2016 and in part by the National Research Council of Science and Technology Grant by the Korea Government (MSIP) under Grant CRC-15-05-ETRI. The Associate Editor for this paper was K. Wang. (Corresponding author: Keemin Sohn.) The authors are with the Laboratory of Big-data Applications in Public Sectors, Chung-Ang University, Seoul 06974, South Korea (e-mail: jiyong369@hanmail.net; kmsohn@cau.ac.kr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITS.2017.2732029

I. INTRODUCTION

The traffic state is represented by three conventional parameters: traffic volume, speed, and density. Whereas existing surveillance systems such as loop detectors can easily measure the former two parameters in the field, measuring the density is difficult, even though the density is a decisive parameter by which the service level is determined [1]. Recently, as computer vision technology has advanced, many researchers have focused on detecting, tracking, and classifying vehicles from video images, which has reaped very promising results [2]–[6].

According to the taxonomy [7], existing computer vision technologies to detect and track moving objects can be broken down into three branches: temporal difference, optical flow, and background subtraction. The temporal difference method utilizes the differences in two consecutive images to detect objects [8], [9]. This method is very vulnerable where unexpected noise occurs. The optical-flow method depends on obtaining an effective background image as a baseline to detect objects [10]. Background subtraction is the most prevalent method in the field [11], [12]. This method utilizes a static background image that is prepared in advance, and then regards the image as a baseline against which to compare other images that include objects. Namely, silhouettes are drawn by black pixels with an intensity that far surpasses that of the background.

The aforementioned description is only for drawing objects separately from the background. Even though localizing pixels of objects is successful, counting the number of objects is difficult. In order to count objects, another segmentation process is necessary. There have been many different methods used to recognize a blob, ranging from drawing a bounding box based on the convex hull theory [11] to utilizing an edge detector [13]. Of course, machine learning technologies can be applied [11], [14]. Consequently, all existing approaches, at least in part, are dependent on several arbitrarily chosen rules and hand-crafted engineering to extract features.

The present study provides a simple approach to count vehicles on a road segment based solely on video images with no hand-crafted feature engineering. Quite a few studies have dealt collectively with crowd counting using regression approaches [15]–[17]. However, studies have rarely utilized a regression approach to collectively count vehicles for the purpose of measuring the traffic density on a road segment, even though this is much simpler than detecting, tracking, and classifying vehicles on an individual basis. Of course, once tracking an individual vehicle succeeds, the counting of vehicles becomes trivial. However, such elaborate technology is redundant where the measured traffic density is utilized simply to evaluate the traffic state (e.g., the level of service) based on the Highway Capacity Manual (HCM) along with traffic volumes and speeds collected from the existing spot detectors. In the field of traffic-flow theory, measuring traffic density has long been regarded as impossible. Traditionally, the density had to be approximated by the occupancy rate measured from spot detectors. Thus, success in counting vehicles simply from road images would be a significant breakthrough in advancing existing traffic control and management.

Unlike previous studies wherein a conventional feed-forward neural network was employed for crowd counting [15]–[17], the present study adopted a deep convolutional neural network (CNN) to estimate the number of vehicles from a video shoot. CNNs have recently recorded great success in recognizing medical CT images and human faces in the field of computer vision [18]–[20]. The present study began with the expectation that a CNN would perform well in counting vehicles, which is a much simpler task than recognizing medical CT images or human faces.

The next section describes the entire framework of the present vehicle-counting scheme and expounds on the principle of a CNN. How to collect data to train and test a CNN is described in the third section. The counted results and comparisons with those from the most prevalent methodologies, as well as with those from other previous studies adopting various methodologies, are shown in the fourth section. The fifth section draws conclusions and provides possible extensions for the present study.

II. MODELING FRAMEWORK

Preparing data to feed a CNN is the starting point of the present study. The input features of a CNN are the RGB values of an image at each pixel level. Whereas a CNN requires no preprocessing to extract input features, each input image should have a label, since a CNN belongs to the category of supervised machine learning. In the present study, vehicles within each input image were counted manually in order to tag a label to the image. This labeling task is easier than that performed by existing CNNs to detect objects, which requires drawing a bounding box for each target object. Nonetheless, it may take great effort to manually count vehicles in all input images.
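For illustration, the following is a minimal sketch of how such a manually labeled data set could be assembled in Python. The directory layout, file names, and helper function are hypothetical and are not taken from the present study; only the 30×200 input size and the manual count labels follow the description in the paper.

```python
# Hypothetical sketch: pair each video frame with its manually counted label.
# Assumes frames stored in ./frames/ and a CSV "labels.csv" with columns: filename,count
import csv
import numpy as np
from PIL import Image

def load_labeled_frames(frame_dir="frames", label_file="labels.csv"):
    images, counts = [], []
    with open(label_file) as f:
        for row in csv.DictReader(f):
            img = Image.open(f"{frame_dir}/{row['filename']}").convert("RGB")
            img = img.resize((200, 30))          # paper reports a 30x200 rectangle after resizing
            images.append(np.asarray(img, dtype=np.float32) / 255.0)
            counts.append(float(row["count"]))   # manually counted number of vehicles
    return np.stack(images), np.array(counts)

X, y = load_labeled_frames()
print(X.shape, y.shape)   # e.g. (23164, 30, 200, 3) and (23164,)
```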
TABLE I
TRAINING AND TESTING RESULTS OF CNN

Fig. 2. Configuration of testbed. The upper photo shows a video shoot for the entire intersection, and the shaded area with green represents the selected approach. The small photo inset in the upper right corner is an aerial view of the intersection. The bottom photo shows the approach after adjusting orientation and viewpoint. The resolution of the final rectangle is 30×200 after resizing.

Fig. 3. Distribution of vehicle number.

were newly synthesized by using 5 different filters provided by the Python image processing toolbox. There was no augmentation for the test images. The procedure above was repeated three times to consider sampling variances. Two different shares of test data (30% and 40%) were also tested.

The distribution of the number of vehicles for the selected approach is shown in Fig. 3. Two modes illustrate a typical pattern of daily traffic demand in urban areas. The mean number of vehicles on the selected approach was 53.2, and the minimum and maximum numbers were 5 and 82, respectively.

IV. RESULTS AND COMPARISONS

A. Computing Environment

Prior to discussing the results and their comparisons with those from the existing methods, the computing environment must be clarified. To train the proposed CNN model, a graphics processing unit (GPU) was needed to compute the matrices of weight parameters in parallel. The computing time to train the proposed CNN using only a central processing unit (CPU) with 12 cores was 1,160-fold the GPU-based computing time. This shows that a deep CNN cannot practically be learned without a GPU. The GPU utilized in the present study was an NVIDIA Quadro K5200. The GPU-based training took only 42 seconds to run a single epoch. The maximum number of epochs needed to reach convergence was set at 5,000, so that the total computing time would be within a practical range (about two days).

Several software libraries were mobilized to train a deep CNN model on a GPU. Keras (https://github.com/fchollet/keras) is a library that provides high-level deep-learning functions. Keras should be implemented with TensorFlow (http://tensorflow.org/) running at the back-end to ensure that it can utilize TensorFlow's low-level functions on a GPU. The Numpy (http://www.numpy.org/) library is also needed to handle tensor arrays, so that Python, the main programming language, can handle tensor-type data with Keras functions. These open-source libraries were downloaded from the Internet, and were used for training and testing the proposed CNN model.
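For illustration, a minimal Keras/TensorFlow sketch of a CNN regressor of this kind is given below. It maps a 30×200 RGB image to a single estimated vehicle count and is trained with an MAE loss, matching the error metric reported in Table I. The layer depths, filter counts, and optimizer are assumptions made for the sketch; they do not reproduce the exact architecture or hyper-parameters of the proposed model.

```python
# Minimal sketch (not the paper's exact architecture): a CNN that regresses
# a vehicle count from a 30x200 RGB image, trained with an MAE loss.
from tensorflow import keras
from tensorflow.keras import layers

def build_count_cnn(input_shape=(30, 200, 3)):
    model = keras.Sequential([
        layers.Conv2D(40, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),                        # 40 3x3 filters, as in the first layer described later
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),  # assumed second stage
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                          # assumed fully connected width
        layers.Dense(1),                                               # single output: estimated vehicle count
    ])
    model.compile(optimizer="adam", loss="mae")                        # MAE matches the reported error metric
    return model

model = build_count_cnn()
# model.fit(X_train, y_train, epochs=5000, batch_size=64, validation_data=(X_test, y_test))
```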
B. Results

The testing results are shown in Table I. As expected, there was little difference between training and testing errors, although the testing results were slightly worse, which was evidence that the model was not overfitted. There was also little difference over the 9 trials. None of the MAEs exceeded 1.6. This implies that the average difference between the estimated and observed counts was less than 1.6 vehicles, which was very promising when one considers that the average number of vehicles within an image was 53.2. In what follows, the performance of the proposed CNN will be discussed based on the first trial for the 20% test examples, which was the worst case.

The comparative concept of precision vs. recall is useful for assessing the performance of a classification model. Although our model is a regression model, the regression result can easily be converted to a classification result. That is, if an estimated value fell in a certain range (±1) around the ground truth, the classification was regarded as a success; otherwise it was regarded as a failure. The average precision and recall values were 60% and 57%, respectively. These similar, but seemingly unsatisfactory, results were acceptable for the purpose of the present study. It should be noted again that the purpose of counting vehicles in the present study was not to detect and identify an individual vehicle, but to collectively count vehicles and thus to compute the traffic density for traffic control and management at the HCM level. Fig. 4 shows the confusion matrix by which the average precision and recall values were computed. The stronger the color, the larger the count in a cell. The observed number of cars ranged widely, from 5 to 82, along both the vertical and horizontal axes.
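A small sketch of this conversion is given below; the rounding, tolerance handling, and example numbers are illustrative only. Computing class-wise precision and recall as in Fig. 4 would additionally require treating each observed count as a class, which is omitted here.

```python
# Hypothetical sketch: score count estimates the way described above,
# treating an estimate within +/-1 of the ground truth as a "success".
import numpy as np

def tolerance_report(y_true, y_pred, tol=1.0):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))               # mean absolute error in vehicles
    success = np.abs(np.round(y_pred) - y_true) <= tol   # +/-1 vehicle counts as correct
    return mae, success.mean()

mae, hit_rate = tolerance_report([53, 12, 80], [52.4, 14.1, 79.8])
print(f"MAE={mae:.2f} vehicles, within-1 rate={hit_rate:.0%}")
```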
Generally, it is difficult to account for how a CNN can count the number of vehicles exactly. However, the filters were expected to abstract the features of an image, objects were recognized by the features, and the objects were then counted via the last fully connected layer. Each filter's role in extracting a specific feature could be visualized by its weight parameters.

Each 3×3 filter of the first convolutional hidden layer had 9 weights, each of which corresponded to a cell within the filter. For example, if a filter cell weight is positive (red) or negative (blue), the filter captures, to the extent of that weight value, the pixel of the input image that is covered by the cell. On the other hand, if a filter cell value is 0 (white), the filter discards the pixel of the input image that is covered by that filter cell. Fig. 5 depicts 3 sets of 40 different features abstracted by the 3×3 filters of the first convolution layer, in RGB order. Unfortunately, the features at this stage were not intuitively recognizable enough to be linked to the known characteristics of a vehicle or any other object within the input images.
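An inspection of this kind could be performed along the following lines, assuming a trained Keras model such as the sketch given earlier in this section; the colormap and layout are illustrative choices that mirror the red/blue/white reading described above.

```python
# Hypothetical sketch: pull the 3x3 filter weights of the first convolutional
# layer and plot them, one small panel per filter and input channel.
import matplotlib.pyplot as plt

first_conv = model.layers[0]                      # first Conv2D layer of the sketch model
weights, _bias = first_conv.get_weights()         # kernel shape: (3, 3, 3, 40) = (h, w, rgb, filters)
n_filters = weights.shape[-1]

fig, axes = plt.subplots(3, n_filters, figsize=(n_filters, 3))
for c in range(3):                                # one row per input channel (R, G, B)
    for f in range(n_filters):
        w = weights[:, :, c, f]
        axes[c, f].imshow(w, cmap="bwr", vmin=-abs(w).max(), vmax=abs(w).max())
        axes[c, f].axis("off")                    # red = positive, blue = negative, white ~ 0
plt.show()
```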
TABLE II
COMPARISON BETWEEN BACKGROUND SUBTRACTION AND CNN

Fig. 6. Example of a real image (top), the background image (middle), and silhouettes after background subtraction (bottom).

C. Comparisons With Other Methods

The testing result was compared with those from a prevalent method that is used to count objects in the field of computer vision. Background subtraction was chosen as the baseline reference. Averaging all RGB images of the selected approach at the pixel level led to a background image. Each image was then compared with this background. That is, a pixel of the target image was assigned 1 if the difference between the maximum of the pixel's RGB values and the corresponding pixel value in the background exceeded a threshold value (= 30); otherwise, it was assigned 0. Fig. 6 shows the background and an example of an image after the background subtraction.

The next step was to find the edges of the blobs of black pixels. There are many hand-crafted technologies that can be used to box blobs within an image. The present study adopted several naïve rules. First, an object was recognized if its blob size was larger than a predefined threshold (= 18). Second, larger blobs due to occlusion were separated into two or more objects when the blob size exceeded a multiple of the threshold. In addition, small blobs were merged together if the distance between their center points was less than a threshold value (= 4). The proposed model was superior to this naïve method, as shown in Table II.
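A compact sketch of this baseline is given below, following the rules quoted above. The use of scipy for connected-component labeling, the 2-D (grayscale) background array, and the simplified occlusion-splitting rule are assumptions made for illustration; the center-distance merging rule is omitted for brevity.

```python
# Hypothetical sketch of the naive background-subtraction baseline described above.
import numpy as np
from scipy import ndimage

def count_by_background_subtraction(image, background, pix_thresh=30, blob_thresh=18):
    # Foreground mask: difference between the max RGB value per pixel and the
    # corresponding pixel of a (2-D) background image exceeds the threshold.
    diff = image.max(axis=2).astype(int) - background.astype(int)
    mask = diff > pix_thresh                      # 1 = foreground pixel, 0 = background

    labels, n_blobs = ndimage.label(mask)         # connected components as candidate blobs
    count = 0
    for blob_id in range(1, n_blobs + 1):
        size = int((labels == blob_id).sum())
        if size < blob_thresh:                    # too small: ignore (merging rule omitted here)
            continue
        # Occlusion rule (simplified): split a large blob into multiple vehicles
        count += max(1, size // blob_thresh)
    return count
```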
Fig. 7. Capability of the CNN to count hidden vehicles. The proposed CNN matched the exact number of vehicles within the upper image, whereas background subtraction did not count the vehicles hidden behind trees and traffic signs shown in the upper image.

Fig. 7 shows that the proposed CNN exactly counted the occluded vehicles behind trees and mounted traffic signs, whereas the background subtraction model was unable to detect those vehicles. It was confirmed that the CNN model was able to count occluded vehicles as humans do, as long as a large amount of correctly labeled data could inform the model that there were vehicles hidden behind some obstacles.

Although the proposed CNN outperformed the naïve background subtraction model, stronger evidence was necessary to guarantee its applicability. Unfortunately, learning-based methods to collectively count vehicles on a road segment are rare in the literature. Rather, several studies have counted pedestrians using regression-type learning algorithms [23]–[27]. One of these studies has already successfully adopted a CNN model to count pedestrians in a crowded environment [27]. Even though the pedestrian counting methods cannot be directly applied to counting vehicles, the performance of those methods could be a good benchmark against which to evaluate the performance of the proposed CNN model.

TABLE III
COMPARISON WITH OTHER STUDIES

Table III lists the MAEs and %RMSEs excerpted from 5 different pedestrian-counting algorithms for a common dataset, and shows that a CNN-based model outperformed other regression-based learning algorithms in counting pedestrians. The MAE value of the proposed CNN model was superior to that of the best model that employed a CNN to count pedestrians. Furthermore, since the present dataset ranged more widely, the %RMSE of the proposed model was much less than that of the previous pedestrian-counting model.

D. Reducing the Effort to Tag Labels

The present study was totally dependent upon supervised learning. This means that the proposed model requires a large amount of labeled data. In the present study, 23,164 images were labeled manually, which required an exhaustive amount of time and effort. Actually, it is impractical to tag labels manually to all training and testing images whenever vehicles on a new road approach must be counted.

An innovative way to tackle the problem was suggested, and it is described in this sub-section. After pre-training a CNN model using images tagged with an approximately counted number of vehicles, the pre-trained model was fine-tuned based on only a small number of exactly labeled images. A naïve background subtraction method could be the best fit for providing the approximated number of vehicles.

TABLE IV
PERFORMANCE OF THE PRE-TRAINED MODEL

Table IV shows the performance of the doubly trained model. The performance of the model, after pre-training with labels from the naïve background subtraction method, did not differ significantly from that of the original model that was trained based on fully labeled images. Only 3,706 images were sufficient to fine-tune the model, which saved a considerable amount of time and effort in acquiring a robust model.

The dual training scheme considerably reduced the computing time for training. Table IV shows that implementing the new approach took only 13 seconds for an epoch, whereas the computing time for training with fully labeled images required 42 seconds per epoch.
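A minimal sketch of this dual training scheme is given below, assuming the CNN regressor and the background-subtraction counter sketched earlier in this section. The epoch counts, batch size, and learning rates are illustrative; only the overall pre-train/fine-tune structure follows the description above.

```python
# Hypothetical sketch of the dual training scheme: pre-train on approximate labels
# produced by the background-subtraction counter, then fine-tune on exact labels.
from tensorflow import keras
import numpy as np

def dual_train(unlabeled_images, background, exact_images, exact_counts):
    # 1) Cheap, approximate labels: no manual counting effort
    approx = np.array([count_by_background_subtraction(img, background)
                       for img in unlabeled_images], dtype=float)

    # 2) Pre-train the CNN regressor on the approximately labeled images
    model = build_count_cnn()
    model.fit(unlabeled_images, approx, epochs=200, batch_size=64)

    # 3) Fine-tune on the small exactly labeled set (the paper reports 3,706 images sufficed),
    #    with a smaller learning rate so the pre-trained weights are only nudged
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mae")
    model.fit(exact_images, exact_counts, epochs=50, batch_size=64)
    return model
```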
V. CONCLUSIONS

The present study was a demonstration of a novel approach to counting vehicles on a road segment in order to accurately quantify traffic density at an aggregate level for traffic control and management. The approach succeeded in counting vehicles with an acceptable accuracy that was comparable to the results using existing methodologies. It was concluded that the proposed CNN model is applicable to measuring the traffic density at the HCM level.

However, further studies will be necessary to tackle several difficulties regarding the proposed approach. Even though the proposed model required no hand-crafted feature engineering, how to determine the hyper-parameters of a CNN was not broached. A CNN model contains several hyper-parameters, such as the number of hidden layers, the number of hidden nodes within each hidden layer, the filter size for each convolutional hidden layer, and the number of filters used for each hidden layer. A systematic way to determine the optimal values of these hyper-parameters will govern the performance of counting vehicles.

In addition, vehicle details were ignored when counting vehicles in the present CNN model. Namely, the CNN model counted vehicles regardless of size, make, and type. Of course, the purpose of counting vehicles was confined to evaluating traffic flows at the aggregate level in the present study. However, in the future, advanced counting technology should recognize the details of each vehicle. In particular, distinguishing between moving and stopped vehicles is very important for traffic control and management. A CNN model based on consecutive images is now under construction to count vehicles while discerning whether each vehicle is moving or not. If this succeeds, the next version of the present study will measure the space mean speed as well as the traffic density. The space mean speed is another important parameter in traffic engineering, and it cannot be measured directly with the existing surveillance systems.
REFERENCES

[1] P. Ryus, M. Vandehey, L. Elefteriadou, R. G. Dowling, and B. K. Ostrom, Highway Capacity Manual. Washington, DC, USA: Transportation Research Board, 2010.
[2] S. Messelodi, C. M. Modena, and M. Zanin, "A computer vision system for the detection and classification of vehicles at urban road intersections," Pattern Anal. Appl., vol. 8, nos. 1–2, pp. 17–31, Sep. 2005.
[3] N. Buch, J. Orwell, and S. A. Velastin, "Urban road user detection and classification using 3-D wireframe models," IET Comput. Vis. J., vol. 4, no. 2, pp. 105–116, Jun. 2010.
[4] H. Veeraraghavan, O. Masoud, and N. Papanikolopoulos, "Vision-based monitoring of intersections," in Proc. IEEE 5th Int. Conf. Intell. Transp. Syst., Singapore, Sep. 2002, pp. 7–12.
[5] K. Park, D. Lee, and Y. Park, "Video-based detection of street-parking violation," in Proc. Int. Conf. Image Process., Comput. Vis., Pattern Recognit., vol. 1, Las Vegas, NV, USA, 2007, pp. 152–156.
[6] S. Atev, H. Arumugam, O. Masoud, R. Janardan, and N. P. Papanikolopoulos, "A vision-based approach to collision prediction at traffic intersections," IEEE Trans. Intell. Transp. Syst., vol. 6, no. 4, pp. 416–423, Dec. 2005.
[7] C. Ozkurt and F. Camci, "Automatic traffic density estimation and vehicle classification for traffic surveillance systems using neural networks," Math. Comput. Appl., vol. 14, no. 3, pp. 187–196, Dec. 2009.
[8] M. T. López, A. Fernández-Caballero, J. Mira, A. E. Delgado, and M. A. Fernández, "Algorithmic lateral inhibition method in dynamic and selective visual attention task: Application to moving objects detection and labelling," Expert Syst. Appl., vol. 31, no. 3, pp. 570–594, Oct. 2006.
[9] M. T. López, A. Fernández-Caballero, M. A. Fernández, J. Mira, and A. E. Delgado, "Visual surveillance by dynamic visual attention method," Pattern Recognit., vol. 39, no. 11, pp. 2194–2211, Nov. 2006.
[10] X. Ji, Z. Wei, and Y. Feng, "Effective vehicle detection technique for traffic surveillance systems," J. Vis. Commun. Image Represent., vol. 17, no. 3, pp. 647–658, Jun. 2006.
[11] J. Zhou, D. Gao, and D. Zhang, "Moving vehicle detection for automatic traffic monitoring," IEEE Trans. Veh. Technol., vol. 56, no. 1, pp. 51–59, Jan. 2007.
[12] X. Niu, "A semi-automatic framework for highway extraction and vehicle detection based on a geometric deformable model," ISPRS J. Photogram. Remote Sens., vol. 61, nos. 3–4, pp. 170–186, Dec. 2006.
[13] Z. Zhu and G. Xu, "VISATRAM: A real-time vision system for automatic traffic monitoring," Image Vis. Comput., vol. 18, no. 10, pp. 781–794, Jul. 2000.
[14] J.-W. Hsieh, S.-H. Yu, Y.-S. Chen, and W.-F. Hu, "Automatic traffic surveillance system for vehicle tracking and classification," IEEE Trans. Intell. Transp. Syst., vol. 7, no. 2, pp. 175–187, Jun. 2006.
[15] S.-Y. Cho, T. W. S. Chow, and C.-T. Leung, "A neural-based crowd estimation by hybrid global learning algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 4, pp. 535–541, Aug. 1999.
[16] D. Kong, D. Gray, and H. Tao, "A viewpoint invariant approach for crowd counting," in Proc. Int. Conf. Pattern Recognit., Santa Cruz, CA, USA, 2006, pp. 1187–1190.
[17] A. N. Marana, S. A. Velastin, L. F. Costa, and R. A. Lotufo, "Estimation of crowd density using image processing," in Proc. IEE Colloq. Image Process. Secur. Appl., Mar. 1997, pp. 11/1–11/8.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[20] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9.
[21] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), Haifa, Israel, Jun. 2010, pp. 807–814.
[22] CS231n Convolutional Neural Networks for Visual Recognition. Accessed: Aug. 5, 2017. [Online]. Available: http://cs231n.github.io/convolutional-networks/
[23] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Minneapolis, MN, USA, Jun. 2007, pp. 1–7.
[24] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in Proc. Brit. Mach. Vis. Conf., Wales, U.K., 2012, vol. 1, no. 2, pp. 3–14.
[25] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Anchorage, AK, USA, Jun. 2008, pp. 1–7.
[26] K. Chen, S. Gong, T. Xiang, and C. C. Loy, "Cumulative attribute space for age and crowd density estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Portland, OR, USA, Jul. 2013, pp. 2467–2474.
[27] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 833–841.