

Residual-Network-Leveraged Vehicle-Thrown-Waste
Identification in Real-Time Traffic
Surveillance Videos
Pengjiang Qian, Senior Member, IEEE, Kai Yuan, Jian Yao, Chao Fan, Hua Zhang, Yuan Liu, and Xianling Lu

Abstract— We attempt to intelligently identify violations of throwing waste from vehicles (TWV) in real-time traffic surveillance videos. In addition to polluting the environment, TWV easily causes injury to sanitation workers, responsible for cleaning roads, by passing vehicles. However, manual inspection is still the commonest way to recognize such uncivilized behavior in videos, and it is highly time- and labor-consuming. In answer to these challenges, we design a novel 20-layer residual network (Nov-ResNet-20) for training the vehicle-thrown-waste identification model (VTWIM). Then, incorporating Nov-ResNet-20, Selective Search, and Non-Maximum Suppression (NMS), we propose the deep-residual-network-leveraged vehicle-thrown-waste identification method (DRN-VTWI). Our method first splits one video frame into several regions matching suspected objects marked with location boxes via Selective Search. Then, in terms of the VTWIM trained by Nov-ResNet-20, our method identifies the regions containing TWV. Last, our method removes the redundant location boxes for each recognized vehicle-thrown waste and only keeps the best one. The significance of our work is three-fold: 1) Nov-ResNet-20 has a moderate depth: 6 convolutional layers, 7 residual layers, and in total 20 weight layers. Due to the joint contribution of the residual mechanism, batch normalization, dropout, and cross-entropy loss, it is eligible to identify TWV using a small quantity of manually-annotated training samples. 2) Selective Search diversely marks all possible suspected objects in video frames, whereas NMS keeps the best location box for each recognized vehicle-thrown waste, removing all redundancies. In this way, DRN-VTWI finds as many potential violations of TWV as possible and optimally annotates the vehicle-thrown wastes in frames as well. 3) Combining the power of Nov-ResNet-20, Selective Search, and NMS, DRN-VTWI well solves the challenging, intelligent identification of vehicle-thrown wastes for real-time traffic surveillance. Experimental studies conducted on real-time traffic surveillance videos demonstrate the effectiveness as well as superiority of our efforts.

Index Terms— Throwing waste from vehicles (TWV), deep learning, smart city, ResNet, waste inspection, intelligent traffic.

Manuscript received February 18, 2020; revised July 9, 2020; accepted July 30, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61772241 and Grant 61702225, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20160187, and in part by the Science and Technology Demonstration Project of Social Development of Wuxi under Grant WX18IVJN002. The Associate Editor for this article was H. Gao. (Corresponding author: Xianling Lu.)

Pengjiang Qian, Kai Yuan, Chao Fan, Hua Zhang, and Yuan Liu are with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China, and also with the Jiangsu Key Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi 214122, China (e-mail: qianpjiang@jiangnan.edu.cn; 1070523578@qq.com; fanchao@jiangnan.edu.cn; a_go@jiangnan.edu.cn; lyuan1800@sina.com).

Jian Yao is with the Blockchain Sub-Center, Wuxi IoT Innovation Center Company Ltd., Wuxi, China (e-mail: 1786779000@qq.com).

Xianling Lu is with the School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China (e-mail: jnluxl@jiangnan.edu.cn).

Digital Object Identifier 10.1109/TITS.2020.3015530

I. INTRODUCTION

Nowadays, throwing waste from vehicles (TWV) has become one of the intractable challenges for smart traffic management in cities. In addition to polluting the environment, TWV easily causes traffic accidents, particularly for sanitation workers who are responsible for cleaning roads, as TWV increases their workloads and work difficulty and thereby raises their probability of being injured by passing vehicles [1]. Governments have issued many relevant laws and traffic regulations to prohibit and punish such uncivilized behavior. Accordingly, many traffic video surveillance systems [2], [3], including both hardware facilities (e.g., cameras and transmission and storage devices) and software systems (e.g., video transmission and management programs), have been deployed to monitor vehicle movement in real time. Most of the current traffic video surveillance systems in use, however, only have the functions to record as well as save traffic videos or pictures, without the desirable, intelligent analysis and explanation of vehicle behavior. Besides, manual inspection is still the commonest way to affirm traffic violations, despite the fact that it is quite time- and labor-consuming and even inefficient [4], [5]. Therefore, methods capable of effectively and intelligently identifying the TWV behavior are of great significance in real-time traffic surveillance.

Compared with other topics of intelligent traffic, such as vehicle license-plate recognition [6], [7], vehicle route optimization [8], [9], and traffic flow forecasting [10], [11], studies on waste identification in videos are relatively few. We briefly review some relevant studies as follows. Aziz et al. [12] jointly used the Support Vector Machine (SVM) [13], [51]–[55] and the Hidden Markov Model (HMM) [14] to design a collection schedule for multiple waste bins, in which SVM is used to classify the waste level in bins and HMM to determine the number of days remaining before the waste is collected. Liu et al. [15] proposed an automatic decoration garbage detection system based on the improved YOLOv2 network and the narrowband Internet of Things (NBIoT) [16]. Niu et al. [5] designed an automated river trash monitoring system based on the YOLOv3 model that aims to run faster than conventional Convolutional Neural Networks (CNNs) [17], [18].

Rad et al. [19] presented a fully automated computer vision application for littering quantification based on images taken from the streets and sidewalks. They enlisted the GoogLeNet [18] framework to localize and classify different types of wastes. In addition, Wang and Zhang [20] modified the Faster R-CNN [21], [22] object detection framework by incorporating the residual network (ResNet) [23]–[25] to automatically detect garbage in urban images for intelligent urban management. Surveying the small amount of existing literature, we have two observations. One is that a few investigators have been studying waste inspection in videos, but direct research on TWV identification is rarely reported. The other relates to deep learning [23], [26], [27]. That is, some well-established deep learning models, e.g., VGGNet [37], [38], YOLO [5], [15], ResNet, etc., have been used, owing to the well-accepted superiority of deep learning in image and video processing. Nevertheless, for TWV identification, we realize that a novel deep network is needed, one with an appropriate network depth as well as good discrimination and generalization abilities under the condition of a small number of training examples, because manually annotating TWV examples for training the network is fairly time- and labor-consuming.

Therefore, to address the challenging identification of TWV in real-time traffic videos, derived from VGG-16 [37], we first design a novel, dedicated 20-layer ResNet model referred to as Nov-ResNet-20, and then, incorporating Selective Search [28], [29] and Non-Maximum Suppression (NMS) [30], [31] into Nov-ResNet-20, we eventually propose the deep-residual-network-leveraged vehicle-thrown-waste identification method (DRN-VTWI for short). Our DRN-VTWI method trains the deep Nov-ResNet-20 using given traffic waste samples, such as bottle, can, paper, and fruit peel, and thus achieves the vehicle-thrown-waste identification model (VTWIM). Then, our method inspects the real-time traffic video stream frame by frame. Specifically, we first split one video frame into multiple small images corresponding to the location boxes of all suspected objects obtained using the Selective Search algorithm. Second, we input all of these small images into the obtained VTWIM and acquire those recognized as TWV. Third, regarding each identified thrown waste in the video frame, by means of the NMS algorithm we merely keep one location box that surrounds it and has the highest confidence value, i.e., the highest probability of belonging to TWV predicted using VTWIM. In this way, our DRN-VTWI method is able to identify the TWV scenes in the whole video stream.

In summary, our contributions in this article lie primarily in the following three points:

1) Nov-ResNet-20 is composed of 6 convolutional layers as well as 7 residual layers, and in total 20 weight layers. With such an appropriate depth, and benefiting from the organic incorporation of ResNet, batch normalization [39], dropout [43], and the cross-entropy loss measurement [42], the dedicated Nov-ResNet-20 is qualified to identify TWV using a small quantity of manually-annotated training samples.

2) The greedy Selective Search diversely marks all possible suspected objects in video frames, whereas NMS keeps the best location box for each recognized vehicle-thrown-waste object while removing all redundancies. As such, our DRN-VTWI is able not only to inspect as many potential TWV violations as possible but also to annotate the vehicle-thrown wastes ideally in frames.

3) Combining the strength of Nov-ResNet-20, Selective Search, and NMS, our proposed DRN-VTWI method is competent in the challenging identification of vehicle-thrown wastes in real-time traffic surveillance videos.

The rest of this manuscript is organized as follows. Related work, such as Selective Search, NMS, and CNN & ResNet, is briefly introduced in Section II. The proposed Nov-ResNet-20 model as well as the DRN-VTWI method are introduced in detail in Section III. Our experimental studies as well as result analyses are presented in Section IV. The conclusion is given in Section V.

II. RELATED WORK

A. Selective Search

Uijlings et al. [28], [29] proposed the Selective Search algorithm to address the problem of generating possible object locations for use in image object recognition. Selective Search indeed combines the strength of both graph segmentation and exhaustive search. Namely, it utilizes the graph-based segmentation algorithm [32] to create initial regions and then uses hierarchical clustering to iteratively group regions together. Specifically, first the similarities between all neighboring regions are calculated. Then the two most similar regions are merged, and new similarities are calculated between the newly-merged region and its neighbors. Operations of merging the most similar regions are repeated until the whole image becomes a single region. The overall procedure is detailed in Algorithm 1.

Algorithm 1 Object Location Extraction in Selective Search
Input: Image data
Output: Object location boxes
Obtain initial regions R = {r1, …, rn} using the graph-based segmentation algorithm;
Initialize similarity set S = Ø;
foreach neighboring region pair (rk, rl) do
    Calculate similarity s(rk, rl);
    S = S ∪ {s(rk, rl)};
while S ≠ Ø do
    Get the highest similarity s(ri, rj) = max(S);
    Merge the corresponding regions: rt = ri ∪ rj;
    Remove similarities regarding ri: S = S \ s(ri, r∗);
    Remove similarities regarding rj: S = S \ s(r∗, rj);
    Remove ri and rj from R: R = R \ {ri, rj};
    Calculate the similarity set St between rt and its neighbors;
    S = S ∪ St;
    R = R ∪ {rt};
Extract object location boxes from all regions in R.

Selective Search attempts to capture all possible object locations. To this end, instead of a single pathway to generate possible object locations, the Selective Search algorithm diversifies the search and uses a variety of complementary image partitions to deal with as many image conditions as possible. Selective Search diversifies the search: 1) by using a variety of color spaces with different invariance properties; 2) by using different similarity measures s(rk, rl); and 3) by varying the starting regions. One can refer to [28] for the details. As such, Selective Search obtains a small set of data-driven, class-independent, high-quality locations.

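To make the greedy merging loop of Algorithm 1 concrete, the following is a minimal Python sketch under stated assumptions: the initial regions come from the graph-based segmentation of [32], and `similarity`, `neighboring`, `Region.merge`, and `.bbox` are illustrative placeholders supplied by the caller, not the authors' code.

```python
# Minimal sketch of the merging loop of Algorithm 1 (illustrative, not the
# authors' implementation). Initial regions are assumed to come from the
# graph-based segmentation algorithm [32].
def selective_search_boxes(regions, similarity, neighboring):
    """Greedily merge the most similar neighboring regions until none remain.

    regions:     list of region objects, each exposing a .bbox attribute and
                 a .merge(other) method that returns the union region.
    similarity:  callable s(r_i, r_j) -> float; larger means more similar.
    neighboring: callable (r_a, r_b) -> bool testing spatial adjacency.
    Returns the location boxes of every region formed (initial and merged).
    """
    # S: similarity set over all neighboring region pairs
    S = {(ri, rj): similarity(ri, rj)
         for i, ri in enumerate(regions)
         for rj in regions[i + 1:] if neighboring(ri, rj)}
    boxes = [r.bbox for r in regions]

    while S:                                     # while S != Ø
        ri, rj = max(S, key=S.get)               # highest-similarity pair
        rt = ri.merge(rj)                        # r_t = r_i ∪ r_j
        # Remove every similarity involving r_i or r_j
        S = {p: v for p, v in S.items() if ri not in p and rj not in p}
        regions = [r for r in regions if r is not ri and r is not rj]
        # Similarities between r_t and its new neighbors
        for rn in regions:
            if neighboring(rt, rn):
                S[(rt, rn)] = similarity(rt, rn)
        regions.append(rt)                       # R = R ∪ {r_t}
        boxes.append(rt.bbox)                    # record the merged region's box
    return boxes
```

Because every intermediate region's box is recorded, the proposals cover objects at all scales of the merging hierarchy, which is exactly why Selective Search yields diverse candidate locations.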

B. Non-Maximum Suppression (NMS)

Given a color image as input, the Selective Search algorithm would output several candidate location boxes for the same suspected object, owing to its exhaustive search. Usually, these boxes are likely to overlap, so it is expected to keep only the best box. The NMS algorithm [30], [31] is enlisted for this purpose, and its primary procedure is listed in Algorithm 2.

Algorithm 2 NMS
Input: Candidate object location boxes, threshold ε
Output: The best location box(es)
Suppose there are N boxes, each of which is measured and assigned a score Si (1 ≤ i ≤ N).
Step 1: Construct the set H of candidate boxes to be processed, and initialize it to contain all N boxes; build the set M to store the optimal boxes and initialize it as an empty set;
Step 2: Sort all boxes in H, select the box m having the highest score, and move it from H to M;
Step 3: Calculate the value of Intersection over Union (IoU) [33] between box m and each box in H. If the value is higher than the threshold ε, the box is considered to overlap with box m, and it is removed from set H;
Step 4: Go back to Step 2 and iterate until set H is empty. Finally, the boxes in set M are what we want.
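As a concrete illustration, a minimal NumPy sketch of Algorithm 2 follows; boxes are assumed to be given as (x1, y1, x2, y2) corners, each with one confidence score.

```python
import numpy as np

def nms(boxes, scores, eps=0.5):
    """Greedy NMS of Algorithm 2: keep one best box per object.

    boxes:  (N, 4) array of [x1, y1, x2, y2] corners.
    scores: (N,) array of confidence values S_i.
    eps:    IoU threshold ε above which an overlapping box is suppressed.
    Returns the indices of the kept boxes (set M).
    """
    H = list(np.argsort(scores))       # candidate set H, ascending by score
    M = []                             # optimal boxes M
    while H:
        m = H.pop()                    # box with the highest remaining score
        M.append(m)
        keep = []
        for i in H:
            # Intersection over Union (IoU) [33] between box m and box i
            x1 = max(boxes[m][0], boxes[i][0])
            y1 = max(boxes[m][1], boxes[i][1])
            x2 = min(boxes[m][2], boxes[i][2])
            y2 = min(boxes[m][3], boxes[i][3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_m = (boxes[m][2] - boxes[m][0]) * (boxes[m][3] - boxes[m][1])
            area_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            iou = inter / (area_m + area_i - inter + 1e-9)
            if iou <= eps:             # only boxes overlapping above ε are removed
                keep.append(i)
        H = keep
    return M
```

Note that suppression is performed per detection round, so two genuinely distinct objects whose boxes overlap less than ε are both retained.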
C. Convolutional Neural Network (CNN)

As one of the bases of deep learning, CNNs [34]–[36], [50] belong to the feedforward neural networks with deep structures and convolution calculations. A CNN captures the complex nonlinear mapping from input to output. For example, for image segmentation, a CNN amounts to a complicated classification model that learns the underlying principles to partition the input image into several groups regarding different semantic categories. As a type of well-established CNN, VGGNet [37], [38], which ranked No. 1 and No. 2 for the tasks of positioning and classification, respectively, in the 2014 ImageNet Large Scale Visual Recognition Competition (ILSVRC), has successfully exploited the inherent connection between network performance and network depth. There exist two notable VGGNets: VGG-16 and VGG-19, which stack multiple 3 × 3 convolutional kernels as well as 2 × 2 maximum pooling layers in different ways, as shown in Figs. 1 and 2 and Table II. VGG-16 is composed of 13 convolutional layers and 3 fully-connected layers, and VGG-19 has three more convolutional layers than VGG-16.

Fig. 1. Structure of VGG-19.
Fig. 2. Structure of VGG-16.

Increasing the network width or depth usually facilitates improving its performance, and deep networks are generally better than shallow networks. Nevertheless, when the network is aimlessly deepened, the phenomenon of performance degradation occurs. That is, with the increase of network layers, the prediction accuracy on the training set gets saturated and even decreases. It is proven that introducing the residual block (RB) into the deep network, i.e., the residual network (ResNet) [24], [50], can effectively suppress the performance degradation as well as avoid gradient diffusion and gradient explosion. Fig. 3 illustrates an example of an RB that consists of three convolutional layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1, respectively. Also, every convolutional layer is commonly followed by batch normalization (BN) [39] and ReLU [40] layers, which are not shown in Fig. 3 for brevity.

Fig. 3. Example of RB.

One well-designed residual network, ResNet-50 [24], [25], is worthy of being mentioned herein; it is composed of 49 convolutional layers and one fully-connected (FC) layer. As listed in Table I, the structure of ResNet-50 can be divided into five parts: Conv1 has one convolutional layer whose kernel size, kernel number, and stride are 7 × 7, 64, and 2, respectively. Conv2_x, Conv3_x, Conv4_x, and Conv5_x denote four different residual modules, which have 3, 4, 6, and 3 RBs, respectively. One can refer to [24] for the details.
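For reference, a bottleneck RB like the one in Fig. 3 can be written in a few lines. The sketch below is our reading of the block using the tf.keras API (the experiments in this article used TensorFlow), and it assumes the input already has `filters` channels so the identity shortcut matches; it is not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_rb(x, filters):
    """Bottleneck residual block of Fig. 3: 1x1 -> 3x3 -> 1x1 convolutions.

    Each convolution is followed by batch normalization (BN) [39] and
    ReLU [40]; the input x is added back through the identity shortcut
    before the final ReLU. Assumes x already has `filters` channels.
    """
    shortcut = x
    y = layers.Conv2D(filters // 4, 1, padding='same')(x)   # 1x1, reduce channels
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters // 4, 3, padding='same')(y)   # 3x3
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 1, padding='same')(y)        # 1x1, restore channels
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                         # residual addition
    return layers.ReLU()(y)
```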


III. THE PROPOSED METHOD

A. Novel 20-Layer Residual Network for DRN-VTWI

TABLE I: STRUCTURE OF RESNET-50
TABLE II: STRUCTURE COMPARISONS AMONG VGG-16, VGG-19, AND NOV-RESNET-20

Although ResNet proves its outstanding superiority in image analysis [24], it has been observed that well-established ResNets, such as ResNet-50, require considerable resources, in terms of both training time and training examples, to achieve acceptable performance. Therefore, for the objective of identifying TWV in real-time videos, we would like to first present our novel 20-layer residual network, Nov-ResNet-20, considering the balance between network performance and training costs. For this purpose, we first introduce our newly-designed residual block, denoted as Res3, which has two 3 × 3 convolutional layers and uses the ReLU activation function and identity mapping, as illustrated in Fig. 4. Afterwards, derived from the structure of VGG-16, we put forward our Nov-ResNet-20 model, as shown in Fig. 5.

Fig. 4. New residual block Res3 used in Nov-ResNet-20.
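As described above, Res3 stacks two 3 × 3 convolutions with an identity shortcut. A minimal tf.keras sketch of such a block follows; it is our interpretation of Fig. 4 under the assumption that the input already carries `filters` channels, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res3_block(x, filters):
    """Res3 residual block: two 3x3 convolutions plus identity mapping.

    BN [39] follows each convolution and ReLU is the activation; the
    identity shortcut assumes x already has `filters` channels.
    """
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same')(x)   # first 3x3 convolution
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)   # second 3x3 convolution
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                    # identity mapping
    return layers.ReLU()(y)
```

Compared with the bottleneck RB of Fig. 3, Res3 trades the channel-reduction trick for a plain two-convolution stack, which keeps the block shallow and cheap, in line with the small-training-set objective stated above.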
As indicated in Table II, by replacing the convolutional calculation with our designed residual block Res3 in seven of the original convolutional layers of VGG-16, we are able to establish our Nov-ResNet-20, which overall contains 6 convolutional layers as well as 7 residual layers, and in total 20 weight layers. The structure of Nov-ResNet-20 can be divided into 5 parts. Each part is connected to a pooling layer, and maximum pooling [41] is recruited to gradually reduce the image sizes (i.e., 224 × 224 → 112 × 112 → 56 × 56 → 28 × 28 → 14 × 14 → 7 × 7), capturing the image essence step by step. Besides, the cross-entropy loss function [42] is used, and the BN [39] and dropout [43] mechanisms are employed to further alleviate gradient explosion and gradient vanishing [50] as well as the overfitting problem. The last part consists of three fully-connected layers, and the softmax function is utilized to output the probabilities of objects belonging to different waste classes.

B. The Proposed DRN-VTWI Method

So far, based on the Nov-ResNet-20 deep network, we are able to propose our DRN-VTWI method. As depicted in Fig. 6, the proposed DRN-VTWI contains three modules. Module I trains the deep Nov-ResNet-20 using given training samples and generates the vehicle-thrown-waste identification model, VTWIM, via Nov-ResNet-20. Module II runs the Selective Search algorithm on each video frame and obtains several small, identification-needed image regions that contain all suspected objects, both waste-containing and non-waste-containing. Module III identifies all of the waste-containing regions via VTWIM and, eventually, in the video frame it helps to keep the best location box for each recognized vehicle-thrown waste and remove all redundancies using NMS. Next, we detail each module as follows.

1) Module I: Generate the Waste Identification Model Using Nov-ResNet-20: As previously mentioned, Module I aims to train the Nov-ResNet-20 model using given training data and thus figure out the desirable vehicle-thrown-waste identification model (VTWIM). For this purpose, a certain number of samples of every type of vehicle-thrown waste which needs to be identified in surveillance videos are required. Usually, the procedure of Module I, including both preparing the waste samples and training the network, is fairly time-consuming, so it must be performed offline, and all the work in this module needs to be completed in advance.
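To illustrate Module I, the sketch below trains a classifier of this kind with the cross-entropy loss [42]. Here `build_nov_resnet20` is a placeholder standing in for the network of Fig. 5 (assumed to return a tf.keras model), and the 224 × 224 images and 5-class labels are assumed to be prepared as described in Section IV; the optimizer choice is ours, not the paper's.

```python
def train_vtwim(build_nov_resnet20, train_images, train_labels,
                batch_size=32, training_steps=1000):
    """Module I (offline): fit Nov-ResNet-20 on labeled waste samples.

    train_images: float array of shape (n, 224, 224, 3).
    train_labels: int array of shape (n,) holding the class labels of Table III.
    Returns the trained VTWIM whose softmax outputs are class probabilities.
    """
    model = build_nov_resnet20(input_shape=(224, 224, 3), num_classes=5)
    model.compile(optimizer='adam',                        # optimizer choice is ours
                  loss='sparse_categorical_crossentropy',  # cross-entropy loss [42]
                  metrics=['accuracy'])
    # Convert a step budget into whole epochs over the training set
    epochs = max(1, training_steps * batch_size // len(train_images))
    model.fit(train_images, train_labels,
              batch_size=batch_size, epochs=epochs)
    return model
```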


Fig. 5. Structure of Nov-ResNet-20.

Fig. 6. Sketch of framework of proposed DRN-VTWI method (three modules).

2) Module II: Run Selective Search on Video Frames and Obtain Identification-Needed Image Regions That Contain Suspected Objects: Differing from Module I, both Modules II and III are online. In Module II, first we extract the video frames from the surveillance video according to the sampling sequence (denoted as vfss). Second, on each frame, we run the Selective Search algorithm to locate all suspected objects and accordingly split the video frame into several small, identification-needed images in terms of the obtained location boxes. Third, we resize these identification-needed images into the same size (denoted as ini_size). Last, we input them into the VTWIM that was already generated by Nov-ResNet-20 in Module I.
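Concretely, Module II can be prototyped with OpenCV as below. The frame-sampling parameter `vfss` and resize target `ini_size` mirror the quantities just described, while `propose_boxes` is a stand-in for the Selective Search of Algorithm 1 (e.g., the implementation shipped in opencv-contrib); the sketch is illustrative, not the authors' pipeline.

```python
import cv2

def extract_candidate_regions(video_path, propose_boxes, vfss=10, ini_size=224):
    """Module II (online): sample frames and crop suspected-object regions.

    propose_boxes: callable frame -> iterable of (x, y, w, h) location boxes,
                   standing in for the Selective Search of Algorithm 1.
    vfss:          keep every vfss-th frame of the video stream.
    ini_size:      side length to which every cropped region is resized.
    Yields (frame_index, box, region) triples ready to feed into the VTWIM.
    """
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break                                  # end of the video stream
        if index % vfss == 0:
            for (x, y, w, h) in propose_boxes(frame):
                region = frame[y:y + h, x:x + w]   # crop the location box
                region = cv2.resize(region, (ini_size, ini_size))
                yield index, (x, y, w, h), region
        index += 1
    cap.release()
```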
3) Module III: Identify the Images Containing TWV via VTWIM and Remove All Redundant Boxes Using NMS: In Module III, first we determine which identification-needed images contain waste objects via the VTWIM. Each recognized image has a confidence value, i.e., the probability that the contained waste object belongs to a certain waste type. Then all waste objects in this frame, together with their location boxes, are filtered through the NMS algorithm. Consequently, for each waste object, only the one location box having the highest confidence value is kept, and all of the other, redundant ones are removed.

As such, by repeating Modules II & III, our proposed DRN-VTWI method is capable of inspecting the TWV violations in the surveillance video, and all recognized vehicle-thrown wastes are optimally marked with surrounding location boxes in the video stream.
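Putting Modules II and III together, one frame can be processed roughly as follows. The `vtwim` model, the cropping of the previous sketch, and the `nms` routine sketched in Section II-B are assumed; the confidence cutoff and the mapping of output indices to the waste categories of Table III are illustrative choices, not values from the paper.

```python
import numpy as np

def identify_twv_in_frame(regions, boxes, vtwim, waste_classes=(0, 1, 2, 3),
                          min_confidence=0.8, eps=0.5):
    """Module III (online): classify candidate regions and prune boxes via NMS.

    regions: resized candidate crops from Module II, shape (N, 224, 224, 3).
    boxes:   their (x1, y1, x2, y2) location boxes in the frame.
    vtwim:   trained model whose softmax output gives class probabilities;
             waste_classes lists the output indices assumed to correspond to
             the four waste categories of Table III.
    Returns one best (box, confidence) pair per recognized vehicle-thrown waste.
    """
    probs = vtwim.predict(np.asarray(regions))   # class probabilities per region
    labels = probs.argmax(axis=1)
    scores = probs.max(axis=1)                   # confidence values
    keep = [i for i in range(len(boxes))
            if labels[i] in waste_classes and scores[i] >= min_confidence]
    if not keep:
        return []
    kept_boxes = np.asarray([boxes[i] for i in keep], dtype=float)
    kept_scores = np.asarray([scores[i] for i in keep], dtype=float)
    best = nms(kept_boxes, kept_scores, eps)     # Algorithm 2, Section II-B
    return [(tuple(kept_boxes[i]), float(kept_scores[i])) for i in best]
```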


IV. THE EXPERIMENTAL STUDIES

A. Setup

TABLE III: TRAINING SAMPLE CATEGORIES INVOLVED IN OUR EXPERIMENTS
TABLE IV: TIME CONSUMPTION OF DIFFERENT MODULES OF DRN-VTWI (ON AVERAGE)
Fig. 7. Testing video of throwing a plastic bottle from a car.

Due to the fact that, currently, no open repository of traffic videos containing TWV is available, we had to prepare the traffic waste samples as well as the testing videos ourselves for the experimental studies. To this end, two types of training samples were prepared by manually photographing, capturing from traffic videos, downloading from the Internet (e.g., the VOC data set [44]), etc. One type includes four categories of common traffic waste: bottle, can, paper, and fruit peel. The other is associated with non-waste images commonly appearing in traffic videos, such as walker, car, and scenery. We gathered 200 samples for each category, both waste and non-waste, thus obtaining 1000 labeled samples of traffic objects in total. All of these traffic images were pre-processed into the same size, 224 × 224, and were labeled from 1 to 5, respectively, as shown in Table III. In addition, we downloaded four traffic videos from the Internet as the testing videos, describing the scenes of throwing the bottle, can, fruit peel, and paper from vehicles, respectively, for verifying the actual identification capability of TWV. As an example, the video of throwing a plastic bottle from a car is shown in Fig. 7.

In addition to our designed Nov-ResNet-20, four well-established classification techniques, including two CNNs (i.e., VGG-16 and VGG-19), one ResNet (i.e., ResNet-50), and one state-of-the-art non-deep classification method (i.e., the Extreme Learning Machine (ELM) [45]–[49]), were enlisted to train the VTWIM (see Fig. 6, Module I), respectively, for the performance comparisons.

The core system parameters involved in the proposed DRN-VTWI method include: the parameters scale, sigma, and min_size used in Selective Search; the parameters vfss for sampling videos and ini_size for resizing training images in Module II; the parameter IoU in NMS; and the parameters dropout_ratio, batch_size, and training_step used for training the Nov-ResNet-20 model. The trial intervals or recommended values of these parameters are additionally listed in Table VI. Also, the primary parameters used in the other four methods are given therein.
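For clarity, the core parameters enumerated above can be grouped into a single configuration; the values below are only placeholders illustrating each parameter's role, since the actual trial intervals and recommended values are those listed in Table VI.

```python
# Illustrative grouping of the core DRN-VTWI parameters; the values are
# placeholders, not the recommended settings of Table VI.
DRN_VTWI_CONFIG = {
    # Selective Search (Module II region proposals)
    "scale": 500,           # preference for larger initial segments
    "sigma": 0.9,           # Gaussian smoothing before segmentation
    "min_size": 10,         # minimum size of initial regions
    # Video handling (Module II)
    "vfss": 10,             # frame-sampling sequence: keep every vfss-th frame
    "ini_size": 224,        # side length for resizing candidate regions
    # NMS (Module III)
    "IoU": 0.5,             # overlap threshold ε of Algorithm 2
    # Nov-ResNet-20 training (Module I)
    "dropout_ratio": 0.5,   # dropout [43] rate
    "batch_size": 32,       # mini-batch size
    "training_step": 1000,  # number of training iterations
}
```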
Our experimental studies were carried out on a workstation with an Intel i7-6850K 3.60 GHz CPU, 128 GB of RAM, an NVIDIA TITAN Xp (12 GB) GPU, Ubuntu 16.04 (64 bit), Python 2.7, and TensorFlow 1.12.0 (GPU).

B. Experiments and Analyses

First, we would like to validate the TWV identification performance of our designed Nov-ResNet-20 model. For this purpose, we randomly chose 800 labeled traffic samples from the total 1000 to constitute the training set, and the training set was fed to ELM, VGG-16, VGG-19, ResNet-50, and Nov-ResNet-20 to train their VTWIM models, respectively. The leftover 200 samples were used to test their detection accuracies regarding TWV. We repeated this procedure 10 times with different, randomly-selected training samples. The average prediction accuracies of these classification techniques are listed in Table V.

We also recorded the running time of the proposed DRN-VTWI method in terms of its three modules, as listed in Table IV. Specifically, on average we spent 1284 seconds in training Nov-ResNet-20 with the suggested system parameters, 42 seconds in annotating the location boxes on every video frame, and 26 seconds in identifying TWV and optimizing the location annotation via NMS.
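The repeated random-split protocol above (ten random 800/200 splits, averaged test accuracy) can be written compactly as follows; `train_fn` is a placeholder for any of the five compared training methods, and the seed handling is an illustrative choice.

```python
import numpy as np

def repeated_holdout_accuracy(images, labels, train_fn, runs=10,
                              train_size=800, seed=0):
    """Average test accuracy over repeated random train/test splits.

    train_fn: callable (train_x, train_y) -> model exposing
              .predict(x) -> predicted labels, standing in for any of
              ELM, VGG-16, VGG-19, ResNet-50, or Nov-ResNet-20.
    """
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(runs):
        order = rng.permutation(len(images))        # fresh random split
        tr, te = order[:train_size], order[train_size:]
        model = train_fn(images[tr], labels[tr])
        pred = model.predict(images[te])
        accs.append(np.mean(pred == labels[te]))    # per-run test accuracy
    return float(np.mean(accs))
```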


TABLE V: AVERAGE PREDICTION ACCURACIES OF VTWIMS OBTAINED USING FIVE MODELS
TABLE VI: CORE SYSTEM PARAMETERS AND THEIR SETTINGS INVOLVED IN THE FIVE METHODS USED

Table V reveals that, for the TWV detection, 800 training samples are generally insufficient for ELM, a traditional classification method featuring a low computation burden, to figure out satisfactory TWV classifiers. In this context, deep-learning-oriented technologies exhibit intrinsic advantages. As evidence, the four deep models surpass the non-deep ELM by nearly a factor of two. As far as the deep learning models are concerned, the accuracies of VGG-16 and VGG-19 are the same, which implies that, when training TWV classifiers with 800 labeled samples, merely deepening the network, such as from 16 layers to 19 layers, does not contribute to improving the network performance. Instead, by replacing the seven conventional layers of VGG-16 with the Res3-based residual layers (see Table II), we have not only deepened the network (20 layers) but also enhanced the network performance because of the residual-associated mechanism. As for ResNet-50, conversely, it did not perform well, as only 800 labeled samples are impractical for fully optimizing the weights distributed across 50 layers.

Next, we test the effectiveness of our proposed DRN-VTWI method on detecting TWV in traffic surveillance videos. Taking the video of throwing the bottle from vehicles as the example, we detail our experiments as follows. As illustrated in Fig. 7, according to the sampling sequence vfss, we first divided the video stream into numerous frames. Second, on each video frame, we ran the Selective Search algorithm to obtain the suspected objects marked with location boxes, as shown in Fig. 8. Third, we got the small images according to all of the location boxes annotated in Fig. 8. Fourth, we resized these images into the same size, 224 × 224, and then input them into the VTWIM learned by Nov-ResNet-20. Fifth, the images belonging to the category of bottle waste were recognized by the VTWIM, as shown in Fig. 9 (a). Last, we ran the NMS algorithm on the recognized waste boxes to remove the redundant ones, as indicated in Fig. 9 (b). Likewise, we repeated this procedure on all of the video frames and completed the whole TWV inspection (i.e., throwing a plastic bottle from a car) on this video, as shown in Fig. 10.

Fig. 8. Suspected objects with location boxes obtained by Selective Search.
Fig. 9. Waste objects recognized via Nov-ResNet-20's VTWIM and optimized by NMS. (a) All recognized waste objects marked with location boxes; (b) Only the best location box is kept for each recognized waste object via NMS.

Table IV indicates the time consumption of our DRN-VTWI method. As is revealed, Module I needs quite a few minutes to train Nov-ResNet-20 and to figure out the VTWIM model, but this procedure is offline and does not actually influence the real-time TWV detection of DRN-VTWI.

Fig. 10. Illustration of complete TWV identification via DRN-VTWI on a surveillance video.

Generally, Module II spends 42 seconds in marking all of the suspected location boxes by means of Selective Search, and Module III costs 26 seconds in recognizing the TWV regions using the obtained VTWIM as well as optimally annotating their location boxes via NMS. Such a time burden for Modules II & III is acceptable from the perspective of academic study.

These results demonstrate that our proposed Nov-ResNet-20 deep learning model as well as the overall framework of the DRN-VTWI method are feasible for TWV detection in real-time traffic surveillance videos.

V. CONCLUSION

For the purpose of effectively detecting the uncivilized behavior of TWV for intelligent traffic management, we first propose the dedicated deep network, Nov-ResNet-20. Then, combining Nov-ResNet-20, Selective Search, and NMS, we put forward the desirable DRN-VTWI method for identifying TWV in real-time traffic videos. The framework of our DRN-VTWI method includes three modules. Module I generates the waste identification model, VTWIM, via Nov-ResNet-20. Module II runs the Selective Search algorithm on one video frame and obtains several small, identification-needed image regions that contain all suspected objects, both waste-containing and non-waste-containing. Module III identifies all of the waste-containing regions via VTWIM and eventually, in the video frame, it helps to keep the best location box for each recognized vehicle-thrown waste, removing all redundancies using NMS. Our experimental studies verified the effectiveness as well as superiority of our proposed method.

Last, we would like to mention the limitation of our proposed DRN-VTWI method. That is, the embedded Selective Search is somewhat time-consuming, and its running time depends on the specific image size of the video frames. This actually affects the practicability of our DRN-VTWI to a certain extent. Therefore, to reduce the time consumption of Selective Search, or to propose a more efficient alternative, is our next study.

REFERENCES

[1] V. F. Carvalho, M. D. Silva, L. M. S. Silva, C. J. Borges, L. A. Silva, and M. L. C. C. Robazzi, "Occupational risks and work accidents: Perceptions of garbage collectors," J. Nursing UFPE Online, vol. 10, no. 4, pp. 1185–1193, 2016.
[2] C. Kim and J.-N. Hwang, "Object-based video abstraction for video surveillance systems," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1128–1138, Dec. 2002.
[3] Y. Meng and H. Wu, "Highway visibility detection method based on surveillance video," in Proc. IEEE 4th Int. Conf. Image, Vis. Comput. (ICIVC), Jul. 2019, pp. 197–202.
[4] G. P. Arya, D. P. Chauhan, and V. Garg, "Design & implementation of traffic violation monitoring system," Int. J. Comput. Sci. Inf. Technol., vol. 6, no. 3, pp. 2384–2386, 2015.
[5] G. Niu, J. Li, S. Guo, M.-O. Pun, L. Hou, and L. Yang, "SuperDock: A deep learning-based automated floating trash monitoring system," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 1035–1040.
[6] N. Wang, X. Zhu, and J. Zhang, "License plate segmentation and recognition of Chinese vehicle based on BPNN," in Proc. 12th Int. Conf. Comput. Intell. Secur. (CIS), Dec. 2016, pp. 403–406.
[7] A. H. Ashtari, M. J. Nordin, and M. Fathy, "An Iranian license plate recognition system based on color features," IEEE Trans. Intell. Transp. Syst., vol. 15, no. 4, pp. 1690–1705, Aug. 2014.
[8] D. Kosmanos et al., "Route optimization of electric vehicles based on dynamic wireless charging," IEEE Access, vol. 6, pp. 42551–42565, 2018.
[9] J. J. Q. Yu, W. Yu, and J. Gu, "Online vehicle routing with neural combinatorial optimization and deep reinforcement learning," IEEE Trans. Intell. Transp. Syst., vol. 20, no. 10, pp. 3806–3817, Oct. 2019.
[10] H.-F. Yang, T. S. Dillon, and Y.-P.-P. Chen, "Optimized structure of the traffic flow forecasting model with a deep learning approach," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2371–2381, Oct. 2017.


[11] J. S. Zhu, "An ensemble learning short-term traffic flow forecasting with transient traffic regimes," Appl. Mech. Mater., vols. 97–98, pp. 849–853, Sep. 2011.
[12] F. Aziz et al., "Waste level detection and HMM based collection scheduling of multiple bins," PLoS ONE, vol. 13, no. 8, Aug. 2018, Art. no. e0202092.
[13] P. Qian et al., "SSC-EKE: Semi-supervised classification with extensive knowledge exploitation," Inf. Sci., vol. 422, pp. 51–76, Jan. 2018.
[14] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2009, pp. 123–249.
[15] Y. Liu, Z. Ge, G. Lv, and S. Wang, "Research on automatic garbage detection system based on deep learning and narrowband Internet of Things," J. Phys., Conf. Ser., vol. 1069, Aug. 2018, Art. no. 012032.
[16] S. Popli, R. K. Jha, and S. Jain, "A survey on energy efficient narrowband Internet of Things (NBIoT): Architecture, application and challenges," IEEE Access, vol. 7, pp. 16739–16776, 2019.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 60, no. 6, Dec. 2012, pp. 84–90.
[18] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[19] M. Saeed Rad et al., "A computer vision system to localize and classify wastes on the streets," 2017, arXiv:1710.11374. [Online]. Available: http://arxiv.org/abs/1710.11374
[20] Y. Wang and X. Zhang, "Autonomous garbage detection for intelligent urban management," in Proc. MATEC Web Conf., vol. 232, 2018, Art. no. 01056.
[21] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[22] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[23] K. Xu et al., "Multichannel residual conditional GAN-leveraged abdominal pseudo-CT generation via Dixon MR images," IEEE Access, vol. 7, pp. 163823–163830, 2019.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[25] E. Rezende, G. Ruppert, T. Carvalho, F. Ramos, and P. de Geus, "Malicious software classification using transfer learning of ResNet-50 deep neural network," in Proc. 16th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2017, pp. 1011–1014.
[26] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[27] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016, pp. 327–399.
[28] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[29] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders, "Segmentation as selective search for object recognition," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 1879–1886.
[30] S. Qiu, G. Wen, Z. Deng, J. Liu, and Y. Fan, "Accurate non-maximum suppression for object detection in high-resolution remote sensing images," Remote Sens. Lett., vol. 9, no. 3, pp. 237–246, Mar. 2018.
[31] A. Neubeck and L. Van Gool, "Efficient non-maximum suppression," in Proc. 18th Int. Conf. Pattern Recognit. (ICPR), Aug. 2006, pp. 850–855.
[32] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, Sep. 2004.
[33] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[34] J. Gu et al., "Recent advances in convolutional neural networks," 2015, arXiv:1512.07108. [Online]. Available: http://arxiv.org/abs/1512.07108
[35] N. Passalis and A. Tefas, "Training lightweight deep convolutional neural networks using Bag-of-Features pooling," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1705–1715, Jun. 2019.
[36] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. IEEE Int. Symp. Circuits Syst., Jun. 2010, pp. 253–256.
[37] Y. Tang and X. Wu, "Scene text detection and segmentation based on cascaded convolution neural networks," IEEE Trans. Image Process., vol. 26, no. 3, pp. 1509–1520, Mar. 2017.
[38] T. Sun, Y. Wang, J. Yang, and X. Hu, "Convolution neural networks with two pathways for image style recognition," IEEE Trans. Image Process., vol. 26, no. 9, pp. 4102–4113, Sep. 2017.
[39] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., Jul. 2015, pp. 448–456.
[40] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," 2015, arXiv:1505.00853. [Online]. Available: http://arxiv.org/abs/1505.00853
[41] D. Yu, H. Wang, P. Chen, and Z. Wei, "Mixed pooling for convolutional neural networks," in Proc. Int. Conf. Rough Sets Knowl. Technol., Shanghai, China, Oct. 2014, pp. 364–375.
[42] P. Golik, P. Doetsch, and H. Ney, "Cross-entropy vs. squared error training: A theoretical and experimental comparison," in Proc. 14th Annu. Conf. Int. Speech Commun. Assoc., Lyon, France, Aug. 2013, pp. 1756–1760.
[43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[44] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[45] G. Huang, G.-B. Huang, S. Song, and K. You, "Trends in extreme learning machines: A review," Neural Netw., vol. 61, pp. 32–48, Jan. 2015.
[46] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, Dec. 2006.
[47] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809–821, Apr. 2016.
[48] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. IEEE Int. Joint Conf. Neural Netw., Jul. 2004, pp. 985–990.
[49] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, "Semi-supervised and unsupervised extreme learning machines," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405–2417, Dec. 2014.
[50] T. Zia and S. Razzaq, "Residual recurrent highway networks for learning deep sequence prediction models," J. Grid Comput., vol. 18, no. 1, pp. 169–176, Mar. 2020, doi: 10.1007/s10723-018-9444-4.
[51] V. Christianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[52] K. I. Kim, K. Jung, S. H. Park, and H. J. Kim, "Support vector machines for texture classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, pp. 1542–1550, Nov. 2002.
[53] C.-F. Lin and S.-D. Wang, "Fuzzy support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 464–471, Mar. 2002.
[54] H. Qu, Y. Oussar, G. Dreyfus, and W. Xu, "Regularized recurrent least squares support vector machines," in Proc. Int. Joint Conf. Bioinf., Syst. Biol. Intell. Comput., Aug. 2009, pp. 508–511.
[55] X. Wang, F.-L. Chung, and S. Wang, "On minimum class locality preserving variance support vector machine," Pattern Recognit., vol. 43, no. 8, pp. 2753–2762, Aug. 2010.

Pengjiang Qian (Senior Member, IEEE) received the Ph.D. degree from Jiangnan University, Wuxi, Jiangsu, China, in March 2011. He is currently a Full Professor with the School of Artificial Intelligence and Computer Science, Jiangnan University. He has authored or coauthored more than 80 papers published in international/national journals and conferences, e.g., the IEEE Transactions on Neural Networks and Learning Systems (TNNLS), the IEEE Transactions on Systems, Man, and Cybernetics, Part B (TSMC-B), the IEEE Transactions on Cybernetics, the IEEE Transactions on Fuzzy Systems (TFS), PR, InS, and KBS. His research interests include data mining, pattern recognition, bioinformatics, and their applications, such as analysis and processing for medical imaging, intelligent traffic dispatching, and advanced business intelligence in logistics.


Kai Yuan is currently pursuing the M.S. degree with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, China. His research interests include intelligent algorithms and their applications.

Jian Yao is a Senior Expert and the Deputy Director of the Blockchain Sub-Center, Wuxi IoT Innovation Center Company Ltd. He once worked as a Research and Development Expert with Tencent Cloud and China Mobile, and has served as the Technical Director of the TIPS Intelligent Resident System in Dongfang Credit Union, a NASDAQ-listed company. He is currently a Special Consultant of Wuxi Telecom and Wuxi Mobile. He has participated in the design of the overall plan of the Jiangsu provincial government external network, the trusted data model of blockchain, and the trusted data exchange scheme of the Wuxi new district. He holds the national certifications of CDCS data center expert, senior information analyst, communication engineer, and senior network planner.

Chao Fan received the Ph.D. degree from the University of Tokyo in 2018. He is currently a Lecturer with the School of Artificial Intelligence and Computer Science, Jiangnan University. His research interests include artificial intelligence and complex network science.

Hua Zhang received the M.S. degree in public administration from Tongji University in July 2008. His research interests include informatization construction, computer science, and artificial intelligence. He is currently the Supreme Leader with the School of Artificial Intelligence and Computer Science, Jiangnan University. He also has the honor to serve as the Deputy Secretary-General of the Educational Informatization Branch of the China Association of Higher Education.

Yuan Liu received the M.S. degree from the Wuxi University of Light Industry in 1998. He is currently the Dean as well as a Full Professor of the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu, China. His main research focuses on the software development of network information systems, network security, and digital media applications. His current research interests include network traffic measurement, social networks, and digital media. He has published more than 100 academic articles in authoritative and core journals. He is also a member of the 863 Expert Panel in the information security technology domain of the Ministry of Science and Technology, a Senior Member of the China Computer Federation (CCF), and a member of the CyberSecurity Association of China (CSAC).

Xianling Lu received the B.S. and M.S. degrees in computer science and applications from the Nanjing University of Aeronautics and Astronautics (NUAA) in 1999 and 2004, respectively, and the Ph.D. degree from the Nanjing University of Science and Technology (NUST) in 2009. He is currently a Professor with the Department of Computer Science and Technology, Jiangnan University, China. His research interests focus on deep learning, data mining, and wireless sensor network systems.
