
Signal, Image and Video Processing (2022) 16:1913–1923

https://doi.org/10.1007/s11760-022-02151-0

ORIGINAL PAPER

Human and object detection using Hybrid Deep Convolutional Neural Network
P. Mukilan1 · Wogderess Semunigus1

Received: 19 March 2021 / Revised: 9 December 2021 / Accepted: 16 January 2022 / Published online: 1 March 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
In recent years, human and object detection has attracted increasing research interest in different real-time applications. Owing to improvements in the field of deep learning, various methods have been designed for human and object detection and recognition. Here, a Hybrid Deep Convolutional Neural Network (HDCNN) is developed for human and object detection from video frames. The HDCNN is a combination of a Convolutional Neural Network (CNN) and Emperor Penguin Optimization (EPO), where the EPO is utilized to tune the parameters of the CNN structure. Initially, pre-processing is applied to eliminate the noise present in the image and to enhance image quality; a Gaussian filter is used for background subtraction in the images. Three different databases are considered to validate the proposed methodology. The proposed HDCNN method is tested in MATLAB and compared with existing methods, namely the Deep Neural Network (DNN), CNN and CNN-Firefly Algorithm (FA). The proposed method is evaluated with statistical measurements such as accuracy, precision, recall and F-Measure.

Keywords Object detection · Deep learning · Emperor · Kernel parameters · CNN · Firefly algorithm

Abbreviations
ALO Ant-Lion Optimization
ANN Artificial Neural Networks
CNN Convolutional Neural Network
DNN Deep Neural Network
DL Deep Learning
FA Firefly Algorithm
IoT Internet of Things
GA Genetic Algorithms
HDCNN Hybrid Deep Convolutional Neural Network
IMFF Improved Multi-scale Feature Fusion
MODT Multi-Object Detection and Tracking
PSO Particle Swarm Optimization
RNN Recurrent Neural Network
SSD Single Shot Multi-box Detector
SVM Support Vector Machine
UAV Unmanned Aerial Vehicles
VSP-GMN Visual Semantic Pose Graph Matrix Network

1 Introduction

Recently, Unmanned Aerial Vehicles (UAVs) have proliferated in the modern world and are able to carry out different kinds of missions that are considered inaccessible or dangerous for humans, such as forest surveillance and nuclear power plant inspection. Hence, visual detection of human instances, together with object detection and the computation of human actions, has been considered the main aim of such systems [1]. Human and object detection emphasizes objects of interest in a video, and it has emerged as an active area of research because of its use in different applications such as action recognition, object segmentation and object tracking. It is already known that the data from a moving object are important and required for supporting human and object video analysis [2, 3]. Human and object detection computes the object motion by considering the optical flow, which captures the most minute variations of movement and illumination at a specified location [4]. Recognizing human-object interaction is a class of relationship in visual detection where the operation is related to a localized combination of object and human; in addition, it should infer the relation between them, for example, "driving a car" or "eating an apple."

✉ P. Mukilan
venmukilan@bhu.edu.et

1 Department of Electrical and Computer Engineering, College of Engineering and Technology, Bule Hora University, Bule Hora, Ethiopia


These relations are computed from the video for human and object detection, and this is a main challenging task [5]. Human-object detection is a critical task because the collected image may contain the same human continuously interacting with different objects ("type on laptop" and "sit on a couch"), fine-grained interactions ("jump horse," "feed horse," "walk horse"), multiple humans performing the same interaction with objects, or multiple humans doing the same interaction with the same object ("throw and catch ball"). The abovementioned situations make the detection of humans and objects from images difficult [6]. These problems must be considered in order to develop an optimal method for object detection in video. Additionally, spatial data analysis is important for human and object detection, and it is also difficult due to variations of the humans and objects in the video. To resolve the moving parts in the video, scale-invariant methods are utilized, which identify the moving parts of the human as well as the object [7]. The object shape is also important for fine-grained object detection, which is achieved with the help of various features, for example, shape features. Different kinds of features are available to recognize the human and object from the video, but separate features increase the difficulty of the work [8].

Numerous approaches have been developed by researchers for detecting humans and objects in video. These approaches include machine learning, DL and optimization algorithms [9]. Various machine learning technologies are applied to detect objects and humans from images obtained from video, but the large volume of incoming data is too much for machine learning algorithms to handle. DL algorithms are being studied to recognize objects and individuals from video footage in order to manage the massive amount of input data [10]. Many alternative DL approaches, for example, the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Deep Reinforcement Learning, are available to manage the massive input data as well as person and object recognition.

The most significant advantage of DL is that it can perform feature engineering on its own. In a DL technique, an algorithm scans the data to detect features that correlate and then combines them to facilitate rapid learning. The main advantages of DL are listed below:

• Feature engineering is no longer required
• Data labeling is no longer necessary
• High-quality solutions can be offered without incurring unneeded resources
• Feature generation automation
• Works well with unstructured data
• Better self-learning capabilities
• Supports parallel and distributed algorithms
• Cost-effectiveness
• Advanced analytics
• Scalability

Among the deep learning methods, the CNN is the most suitable for image-based research and attains the best detection accuracy in various applications. Hence, the CNN is considered in this research, but designing the CNN structure is difficult. To achieve a better CNN design, meta-heuristic methods have been used by many researchers, such as Genetic Algorithms (GA), the Firefly Algorithm (FA), Particle Swarm Optimization (PSO) and Ant-Lion Optimization (ALO).

Here, a hybrid approach is developed for perceiving human and object images. The proposed hybrid approach is a combination of the Convolutional Neural Network (CNN) and Emperor Penguin Optimization (EPO). To improve the performance of the CNN, the EPO is utilized for selecting optimal network parameters.

The main objective of the work is to validate gait recognition as a binary classification problem over the persons and objects.

The proposed methodology works in four phases: a pre-processing phase, a feature extraction phase, an activity recognition phase and an authentication phase. In the pre-processing phase, the unwanted noise is eliminated with the help of an average smoothing filter.

In the feature extraction phase, twelve different features are extracted: entropy, energy, absolute latency to amplitude ratio, peak-to-peak slope, peak-to-peak time, peak-to-peak signal variation, skewness, kurtosis, variance, mean, minimum amplitude and maximum amplitude. In the activity recognition phase, the proposed approach is utilized to recognize the activities.

The organization of the paper is as follows: the recent literature is reviewed in Sect. 2, and the proposed methodology is presented in Sect. 3. The performance analysis is given in Sect. 4, and the conclusion is presented in Sect. 5.


network with its target prediction layer. After that, a feature-combined prediction design enhanced the semantics of the target features without varying the spatial resolution of the prediction layer in the SSD. The complete SSD network configuration enhanced the accuracy of human movement recognition.

Pérez-Hernández et al. [12] described an improved CNN to identify objects in images. The CNN was utilized to identify small objects in video clips. In the presented method, a two-level operation was considered, working by deep learning as an object detector and a binary classifier. The first level selected the optimal regions from the input frame; the next level applied a binarization method based on a CNN classifier working in one-versus-one and one-versus-all modes. The presented method was tested on detecting weapons and objects from video surveillance cameras, and it was validated with six different types of objects: card, purse, bill, smartphone, knife and pistol.

A visual semantic posture graph matrix network (VSP-GMN) for human and object interaction was demonstrated by Liang et al. [13]. Their method was powered by a two-stream network that successfully identifies a variety of contextual cues, such as spatial and visual cues, from both main and secondary subject-object relationships. The human pose features were also calculated using a modular network.

Elhoseny et al. [14] carried out multi-object detection and tracking (MODT) for identifying and detecting objects from video frames. Matveev et al. [15] proposed a set of dimension-based features for quick object recognition in public spaces. The extraction and classification of characteristics was a straightforward operation that allowed Internet of Things (IoT) devices to run on little power. The overall analysis of the literature review is presented in Table 1.

3 Design of proposed system

Deep learning (DL) has become increasingly popular in the detection of humans and objects in recent years. In many applications, such as speech recognition, computer vision and natural language processing, neural networks based on deep learning technologies are presently many times superior to standard machine learning approaches. As the number of errors decreases, the level of precision increases [16, 17]. The ability of a DL technique to execute feature engineering on its own is one of the most significant advantages of using it. In this strategy, an algorithm looks for traits that correlate in the data and then combines them to encourage faster learning without being explicitly told to do so [18]. The proposed technique is built and proven to detect both humans and objects. To determine the findings, deep learning is employed to handle a significant amount of data. The proposed method detects both the human and the object from the image collection in the most efficient manner feasible. Figure 1 depicts the entire framework of the suggested approach. The data were acquired through an open-source system at first. Following that, the video is split into individual frames that can be utilized to distinguish people and detect things. The image may contain noise as well as unwanted information. After the pre-processing step is finished, the CNN is used to start the classification phase. The CNN parameters are improved with the help of the EPO. These modeling elements are described in depth in the following sections.

3.1 Video to frame conversion

The incoming video is split up into multiple frames for people and objects. When the frame count is taken into account, person and object detection is done from the video. A single frame is rarely enough to recognize a person or an object in a video. It can also be used to build a detection backdrop frame [19]. As seen in Fig. 1, these converted frames are used to recognize humans and objects. The working flow of the diagram is shown in Fig. 2b, along with the pseudocode, after which the collected images are employed in further detection and pre-processing processes.

3.2 Noise removal

The pre-processing model removes noise from video frames while also improving image contrast [20]. Every frame of video changes the object and human movement, and as a result, detection may not obtain optimum results. A Gaussian filter is used in the proposed model to remove undesired features and noise from the images during the extraction phase. The Gaussian model can be expressed in the following way:

M(f) = (1 / √(2πσ²)) e^(−f² / (2σ²))  (1)

where σ is the standard deviation and f denotes the underframe distance [19]. The Gaussian smoothing operation is a type of two-dimensional convolution that is used to blur images. The images are then cleaned up to remove any undesirable information or noise.

3.3 Background subtraction

In most videos or images, the region of interest is evaluated based on the changes presented by the individuals and items. As a result, object and human detection necessitates


Table 1 Comparison of literature papers

Gong et al. [11] (2020)
Methodology: Single-Shot Multi-Box Detector (SSD)-based Multi-scale Feature Fusion (IMFF-SSD) detection method
Objective: IMFF-SSD is used to solve challenges in real-time human moving-target detection, monitoring, location and recognition
Advantages: High positioning and motion recognition accuracy were achieved for human body moving-object identification
Disadvantages: At various scales, the SSD network failed to detect human moving targets; it is unsatisfactory for various weather conditions, illumination and filming angles

Pérez-Hernández et al. [12] (2020)
Methodology: ODBC-based deep learning methodology named CNN classifier with One-Versus-All (OVA) and One-Versus-One (OVO)
Objective: Six small objects (pistol, knife, smartphone, bill, pocketbook and card) are handled similarly to weapons in test surveillance footage to improve the baseline detection
Advantages: Achieved robustness, accuracy and reliability in identifying small handled objects in detection videos
Disadvantages: Noisy instances are not clearly removed

Elhoseny et al. [14] (2020)
Methodology: MODT with probability-based grasshopper algorithm (PGA) and Kalman filtering
Objective: To analyze multi-object (MO) detection of objects from video surveillance
Advantages: Achieved maximum detection rate and tracking rate
Disadvantages: Due to occlusions and distractions in the object's surroundings, like image backdrops and background noise, tracking moving targets is difficult; moreover, different motion estimation approaches are required for MODT analysis to attain a higher detection rate

Matveev et al. [15] (2020)
Methodology: A fast detection and classification method with low power consumption
Objective: Recognize moving objects in urban situations, such as pedestrians, cyclists and cars traveling at speeds up to 60 km/h, at ranges of up to 25 m
Advantages: At low performance and low energy consumption, background subtraction evaluates moving objects' distance, width and height
Disadvantages: The detection range is limited, the detection error rate is high, and the price is exorbitant

extracting and detecting them from a continual background. The background subtraction is expressed mathematically as below:

S.F = A.F − background image  (2)

Background subtraction = { S.F, if S.F ≥ t; 0, if S.F < t }  (3)

where t denotes the reference value in the subtraction procedure. The goal of the background removal process is to determine the object's frontal areas in noiseless video frames.

3.4 CNN model design

The feature extractor and the classifier are the two main components of the CNN architecture. In the feature extraction layers, every layer in the network takes the output of its immediate previous layer as input and sends its output as input to the upcoming layer. This CNN design is depicted in Fig. 2 and is made up of three different sorts of layers: convolution, max-pooling and classification. Convolution and max-pooling layers are found in the middle and lower network levels; odd layers are utilized for max-pooling, and even layers are used for convolution. Every layer plane is usually created by combining one or more planes from the layer before it. The node planes are linked to a minor part of every plane in the previous level. Every node of a convolutional layer extracts elements from the input images. Higher-level features come from features that expand


Fig. 1 Structure of the proposed method

to lower-level layers. As features are propagated to the highest layer or level, their sizes are reduced to match the size of the convolution kernel or max-pooling operations. Nonetheless, the number of feature maps is usually enlarged to represent the best input image properties and assure the classification accuracy [21]. The output of the last CNN layer is used as the input of the classification layer, which is a fully connected network. Feed-forward neural networks perform best at the classification layer, and the isolated properties, whose number indicates the size of the weight matrix of the outcome, are utilized as its input; hence, they were used as the classification layer. The various CNN layers are explained in the next sections.

Preceding layers are convolved with learnable kernels. Every output feature map can be a combination of more than one input feature map:

x_j^l = f( Σ_{i ∈ M_j} x_i^(l−1) * k_ij^l + b_j^l )  (4)

Here, x_j^l is the current layer output, x_i^(l−1) is the output of the previous layer, k_ij^l is the kernel for the present layer and b_j^l is the bias for the current layer. M_j is the input map selection, and for every output map an additive bias b is provided.

3.5 Sub-sampling layer

The sub-sampling layer reduces the resolution of the input feature maps and is generally recognized as the pooling layer. The number of input and output feature maps cannot change in this layer: if there are N input maps, then there are exactly N output maps. Because of the sample reduction process, every dimension of the output maps is decreased according to the downsampling mask. The following equation expresses that every dimension of the output is half of every input dimension for entire images when, for example, a 2 × 2 sample reduction kernel is used:

x_j^l = down(x_j^(l−1))  (5)

Here, down(x_j^(l−1)) is the sub-sampling function. In this layer, mostly two operations are used: average pooling and max-pooling.

3.6 Classification layer

This is a fully connected layer that calculates the outcome for each class using the properties taken from the convolutional layers in the preceding steps. The EPO algorithm is used to select the best kernel parameters in the CNN structure. The following is a detailed description of the EPO.

Pseudocode:
Input: video frames.
Output: classified shots.
Steps:

(i) Isolate the video frames into equal-size blocks with dimension p*q


Fig. 2 Analysis of a CNN model for human and object detection and b detailed steps

(ii) Apply the pre-processing process on the blocks and estimate the Euclidean distance between the blocks of adjacent frames.
(iii) Apply the feature extraction phase and extract the features from the input video and images.
(iv) if distance < thresh then
        Frame belongs to Human shot
     Else
        Frame belongs to Object shot
     End if

3.7 Emperor penguin optimization

In the HDCNN network, the optimal design parameters of the CNN are selected with the help of the EPO algorithm. The emperor penguin is the largest and most massive of all penguin species. Based on plumage and size, female and male emperor penguins are very similar. Throughout the breeding season, emperor penguins are found in large colonies, including hundreds of thousands of emperor penguins. To arrive at the ocean for hunting, the female emperor penguin travels up to 50 miles. At sea, the emperor penguin can dive to a depth of


Fig. 3 Preprocessing and detection results of a PETS09-S2L1, b car and c forensic

1,900 feet and remain underwater for more than 25 min [22]. Emperor penguins regularly take a stand during huddling at the limits of the polygon network. To determine the boundary of the huddle around the polygon, the flow of wind around the huddle is defined. Complex-variable concepts are utilized to represent the randomly produced huddle boundary of the emperor penguins:

ψ = ∇η  (6)

Here, η is the velocity of the wind and ψ is the gradient of η. To make the complex potential, the vector ψ is combined with η:

F = η + iψ  (7)

Here, i is the imaginary constant and F is an analytic function on the polygon plane.
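Equations (6) and (7) can be illustrated numerically. The sketch below is illustrative only: the wind-velocity field η is a hypothetical stand-in (the paper's implementation is in MATLAB and does not specify the field; NumPy is used here purely for demonstration).

```python
import numpy as np

# Hypothetical 1-D wind-velocity field eta sampled on a grid.
x = np.linspace(0.0, 1.0, 101)
eta = x ** 2

# Eq. (6): psi is the gradient of the wind velocity eta.
psi = np.gradient(eta, x)

# Eq. (7): complex potential on the polygon plane, F = eta + i*psi.
F = eta + 1j * psi
```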


Table 2 Design factors of the proposed method

S. No  Method  Description           Value
1      HDCNN   Batch size (maximum)  500
2              Momentum              0.9
3              Initial population    50
4              Iteration (maximum)   100
5              Initial learn rate    0.05
6              Inertia factor        0.7298
7      DNN     Input node            100
8              Hidden layer          50
9              Weight decay          0.0001
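For reproducibility, the design factors of Table 2 can be kept in a single configuration structure. The grouping below is a sketch only; the field names are assumptions, as the paper does not publish code.

```python
# Hypothetical grouping of the design factors reported in Table 2.
HDCNN_CONFIG = {
    "batch_size_max": 500,      # maximum batch size
    "momentum": 0.9,
    "initial_population": 50,   # EPO initial population
    "max_iterations": 100,      # maximum EPO iterations
    "initial_learn_rate": 0.05,
    "inertia_factor": 0.7298,
}

DNN_CONFIG = {
    "input_nodes": 100,
    "hidden_layer_nodes": 50,
    "weight_decay": 0.0001,
}
```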

Table 3 Formulas for statistical measurements

S. No  Description  Formula
1      Accuracy     (TP + TN) / (TP + TN + FP + FN)
2      Precision    TP / (TP + FP)
3      Recall       TP / (TP + FN)
4      F-Measure    2·TP / (2·TP + FP + FN)
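The measurements of Table 3 follow directly from the confusion-matrix counts. A minimal sketch (the counts used in the example are hypothetical):

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-Measure as defined in Table 3."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f_measure

# Example with hypothetical confusion-matrix counts:
acc, prec, rec, f1 = detection_metrics(tp=90, tn=85, fp=10, fn=15)
```

Note that the F-Measure formula in Table 3 is algebraically the harmonic mean of precision and recall.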

3.8 Temperature profile around the huddle

The mathematical representation assumes that the temperature T = 0 while the polygon radius R > 1, and T = 1 while the radius R < 1. The temperature profile around the huddle, T′, is calculated as below:

T′ = T − Maxiteration / (x − Maxiteration)  (8)

T = { 0, if R > 1; 1, if R < 1 }  (9)

Here, x is the current iteration, Maxiteration is the maximum number of iterations, R is the polygon radius and T′ is the temperature profile used for evaluating the best optimal solution in the search space.
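Equations (8) and (9) can be transcribed directly. The sketch below assumes the degenerate cases x = Maxiteration and R = 1 do not occur, since the paper does not define them:

```python
def temperature_profile(x, max_iteration, radius):
    """T' around the huddle: Eq. (9) selects T from the polygon
    radius, and Eq. (8) combines it with the iteration counter."""
    t = 0.0 if radius > 1 else 1.0                  # Eq. (9)
    return t - max_iteration / (x - max_iteration)  # Eq. (8)

# Halfway through a 100-iteration run, inside the polygon (R < 1):
t_prime = temperature_profile(x=50, max_iteration=100, radius=0.5)  # 3.0
```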

Fig. 4 Sample input video of a PETS09-S2L1, b car and c forensic


3.9 Relocate the mover

According to the positions of the emperor penguins, the best obtained optimal solution is updated [23]. To update the emperor penguin's following position, the equation below is used:

Pep(x + 1) = P(x) − A · Dep  (10)

where Pep, P, A and Dep are vectors. Based on the EPO algorithm, the CNN design factors are optimally chosen, which improves the detection accuracy of the object detection. The performance of the proposed method is presented in the section below.

4 Results and discussion

Here, the proposed methodology is validated; it is implemented in MATLAB on a computer with 4 GB RAM and an Intel i7 processor. The proposed method is designed to recognize human and object interaction, and it is validated with three dataset collections: a car database, PETS09-S2L1 and surveillance videos from a forensic database.
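The mover relocation of Eq. (10) is a simple vector update over the CNN design parameters being tuned. In the sketch below, the coefficient vector A and the distance vector Dep are treated as given inputs, since their computation belongs to the wider EPO algorithm [22, 23]; the numeric values are hypothetical.

```python
import numpy as np

def relocate_mover(p, a, d_ep):
    """Eq. (10): next position Pep(x+1) = P(x) - A * Dep,
    applied element-wise over the tuned design parameters."""
    return np.asarray(p) - np.asarray(a) * np.asarray(d_ep)

# Example: updating two hypothetical CNN design factors.
new_pos = relocate_mover(p=[0.05, 0.9], a=[0.5, 0.5], d_ep=[0.02, 0.1])
```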


These databases are collected from [24]. From the PETS09-S2L1 dataset, the input video has a resolution of 768 × 576 and a length of 795 frames (01:54); it has 19 tracks and 4476 boxes. The databases are utilized to validate the proposed methodology for identifying human and object interaction and object detection from the video frames. Additionally, the proposed methodology is validated with the help of a comparison analysis: the proposed method is compared with existing methods such as the DNN, CNN and CNN-FA. The design factors of the proposed method are tabulated in Table 2. To validate the proposed methodology, three databases are used in this paper. Sample input frames of the video are illustrated in Fig. 3. The sample input images depicted in Fig. 4 are extracted from the video. The statistical measurements are computed based on confusion matrix computations; the mathematical formulations of the statistical measurements are presented in Table 3.

While utilizing the proposed method, the testing and training measurements are depicted in Fig. 5. Based on that, the statistical measurements of the proposed as well as existing methods are illustrated in Fig. 6. Figure 6a presents the comparison analysis of accuracy: the proposed method achieved an accuracy between 0.92 and 0.95, the DNN between 0.89 and 0.92, the CNN between 0.78 and 0.82, and the CNN-FA between 0.72 and 0.89. Figure 6b presents the comparison analysis of precision: the proposed method achieved a precision between 0.91 and 0.97. Similarly,

Fig. 5 Proposed training and testing measurements

123
1922 Signal, Image and Video Processing (2022) 16:1913–1923

the DNN achieved a precision between 0.85 and 0.89, the CNN between 0.72 and 0.79, and the CNN-FA between 0.89 and 0.95. Figure 6c presents the comparison analysis of recall: the proposed method achieved a recall between 0.95 and 0.97, the DNN between 0.82 and 0.89, the CNN between 0.79 and 0.85, and the CNN-FA between 0.82 and 0.89. Figure 6d presents the comparison analysis of F-Measure: the proposed method achieved an F-Measure between 0.95 and 0.96, the DNN between 0.82 and 0.90, the CNN between 0.79 and 0.85, and the CNN-FA between 0.79 and 0.91. From the overall analysis, we can conclude that the proposed method achieves better results than the existing methods.

5 Conclusion

In this paper, an HDCNN has been developed to detect objects and humans in videos optimally. The proposed method is a combination of the CNN and EPO: the CNN design parameters are optimally chosen with the help of the EPO algorithm. The proposed algorithm has the ability to identify objects as well as humans from videos, and it detects the object and human without noise even under very low levels of illumination. Statistical metrics such as accuracy, precision, recall and F-Measure have all been used to validate the suggested technique. The proposed approach was compared to the DNN, CNN and CNN-FA, which are all existing methods. According to the comparative study, the proposed strategy produces the best results in terms of statistical measurements. In the future, various datasets will be analyzed and compared with different methods, and the statistical performance of various methods will need to be analyzed. In future work, video will be captured from indoor and outdoor environments and the results analyzed.

Fig. 6 Comparative analysis of proposed method a accuracy, b precision, c recall and d F-Measure

References

1. Dey, L., Chakraborty, S., Mukhopadhyay, A.: Machine learning techniques for sequence-based prediction of viral–host interactions between SARS-CoV-2 and human proteins. Biomed. J. 43(5), 438–450 (2020)
2. Randhawa, P., Shanthagiri, V., Kumar, A., Yadav, V.: Human activity detection using machine learning methods from wearable sensors. Sensor Rev. (2020)


3. Minor, E.N., Howard, S.D., Green, A.A.S., Glaser, M.A., Park, C.S., Clark, N.A.: End-to-end machine learning for experimental physics: using simulated data to train a neural network for object detection in video microscopy. Soft Matter 16(7), 1751–1759 (2020)
4. Kousik, N.V., Natarajan, Y., Raja, R.A., Kallam, S., Patan, R., Gandomi, A.H.: Improved salient object detection using hybrid Convolution Recurrent Neural Network. Expert Syst. Appl. 166, 114064 (2020)
5. Jin, L., Yang, J., Kuang, K., Ni, B., Gao, Y., Sun, Y., Gao, P., et al.: Deep-learning-assisted detection and segmentation of rib fractures from CT scans: development and validation of FracNet. Biomedicine 62, 103106 (2020)
6. Jung, M., Chi, S.: Human activity classification based on sound recognition and residual convolutional neural network. Autom. Constr. 114, 103177 (2020)
7. Fu, K., Zhang, T., Zhang, Y., Sun, X.: OSCD: a one-shot conditional object detection framework. Neurocomputing (2020)
8. Freire-Obregón, D., Castrillón-Santana, M., Barra, P., Bisogni, C., Nappi, M.: An attention recurrent model for human cooperation detection. Comput. Vis. Image Underst. 102991 (2020)
9. Xiong, Q., Zhang, J., Wang, P., Liu, D., Gao, R.X.: Transferable two-stream convolutional neural network for human action recognition. J. Manuf. Syst. 56, 605–614 (2020)
10. Deguchi, M.: Simple and low-cost object detection method based on observation of effective permittivity change. Microelectron. J. 95, 104678 (2020)
11. Gong, M., Shu, Y.: Real-time detection and motion recognition of human moving objects based on deep learning and multi-scale feature fusion in video. IEEE Access 8, 25811–25822 (2020)
12. Pérez-Hernández, F., Tabik, S., Lamas, A., Olmos, R., Fujita, H., Herrera, F.: Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: application in video surveillance. Knowl.-Based Syst. 194, 105590 (2020)
13. Liang, Z., Rojas, J., Liu, J., Guan, Y.: Visual-semantic-pose graph mixture networks for human-object interaction detection (2020)
14. Elhoseny, M.: Multi-object detection and tracking (MODT) machine learning model for real-time video surveillance systems. Circuits Syst. Signal Process. 39(2), 611–630 (2020)
15. Matveev, I., Karpov, K., Chmielewski, I., Siemens, E., Yurchenko, A.: Fast object detection using dimensional based features for public street environments. Smart Cities 3(1), 93–111 (2020)
16. O'Shea, T., Hoydis, J.: An introduction to deep learning for the physical layer. IEEE Trans. Cognit. Commun. Netw. 3(4), 563–575 (2017)
17. Aceto, G., Ciuonzo, D., Montieri, A., Pescapé, A.: Mobile encrypted traffic classification using deep learning: experimental evaluation, lessons learned, and challenges. IEEE Trans. Netw. Serv. Manage. 16(2), 445–458 (2019)
18. Aceto, G., Ciuonzo, D., Montieri, A., Pescapé, A.: Toward effective mobile encrypted traffic classification through deep learning. Neurocomputing 409, 306–315 (2020)
19. Mukilan, P., Semunigus, W.: Human object detection: an enhanced black widow optimization algorithm with deep convolution neural network. Neural Comput. Appl. (2021)
20. Chaudhary, V., Kumar, V.: Fusion of multi-exposure images using recursive and Gaussian filter. Multidimension. Syst. Signal Process. 31(1), 157–172 (2020)
21. Fu, K., Chang, Z., Zhang, Y., Xu, G., Zhang, K., Sun, X.: Rotation-aware and multi-scale convolutional neural network for object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 161, 294–308 (2020)
22. Harifi, S., Khalilian, M., Mohammadzadeh, J., Ebrahimnejad, S.: Optimizing a neuro-fuzzy system based on nature-inspired emperor penguins colony optimization algorithm. IEEE Trans. Fuzzy Syst. (2020)
23. Harifi, S., Khalilian, M., Mohammadzadeh, J., Ebrahimnejad, S.: Optimization in solving inventory control problem using nature inspired Emperor Penguins Colony algorithm. J. Intell. Manuf. 1–15 (2020)
24. https://motchallenge.net/vis/PETS09-S2L1

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
