2022 14th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS)

CNN-LSTM Based Smart Real-time Video Surveillance System
Waqas Iqrar
Department of Electrical and Computer Engineering
COMSATS University Islamabad, Abbottabad Campus
Abbottabad, Pakistan
waqasiqrar99@gmail.com

Malik ZainUl Abidien
Department of Electrical and Computer Engineering
COMSATS University Islamabad, Abbottabad Campus
Abbottabad, Pakistan
malikzn1234@gmail.com

Waqas Hameed
Department of Electrical and Computer Engineering
COMSATS University Islamabad, Abbottabad Campus
Abbottabad, Pakistan
waqash1998@gmail.com

Dr. Aamir Shahzad
Department of Electrical and Computer Engineering
COMSATS University Islamabad, Abbottabad Campus
Abbottabad, Pakistan
ashahzad@cuiatd.edu.pk

Abstract— Modern surveillance technologies are intended to be precise, cost-effective, and run without the need for human interaction. In this regard, a few topologies were employed in the previous decade which treated human movement as traditional object detection. Because of this approach, suspicious behavior was also tackled as an object, which made the existing human activity recognition (HAR) systems slow and inefficient when it came to classification for real-time applications. In recent years, machine learning algorithms have made significant advancements in the time-dependent classification of events. This research presents an HAR system that employs a Convolutional Neural Network (CNN) to extract spatial information along with a Long Short-Term Memory (LSTM) approach for the rapid and precise sequential tracking of an identified object. This CNN-LSTM technique not only lowers the model's complexity but also improves its accuracy, which allows it to be executed in real time. The proposed CNN-LSTM approach can detect suspicious activities in real time at 10-13 FPS and obtains the best tracking performance in any circumstance while implemented on a Raspberry Pi, which works as a standalone system.

Keywords— CNN-LSTM, deep learning, fight detection, human activity recognition, intelligent surveillance system, sequence-based activity detection.

I. INTRODUCTION

Throughout human history, surveillance has been an integral component of society. Humans have been attempting to develop surveillance measures for a safer way of life since ancient times. Due to current technologies such as cameras, computers, and GPUs, surveillance has been vastly improved in recent decades. Previously, metal detection, x-ray scanning, and manual monitoring of security cameras were utilized to conduct surveillance. As the primary function of surveillance is to detect and monitor actions, surveillance systems in the era of artificial intelligence can be extremely precise and operate without human intervention.

One of the most significant developments in the Internet of Things (IoT) is object detection, which can process objects in still images and moving videos. The object detection process utilizes two distinct techniques: the first involves image classification, while the second is object localization. At the algorithm level, not only is the object detected, but a tagged bounding box is also drawn to indicate the object's position. Consequently, object detection can detect several objects in a single image or video frame. With the progress in object detection algorithms, its applications broadened to multiple fields including medical treatment, civil engineering, defense, and agriculture [1]. However, human activity recognition (HAR) remained a challenge because it involves sophisticated classification of motion and gestures. In this research, using closed-circuit television (CCTV) cameras on a real-time operating surveillance system, we further explore HAR to detect criminal activities and fighting sequences. A smart surveillance system is presented for detecting and locating an unscrupulous act within a specific time frame. If the proposed HAR system detects suspicious activity, it will notify the authorities. To be more specific, criminal conduct will be detected if individuals are fighting or if one is trying to rob the other. The system will distinguish the action and identify the two individuals as objects engaging in it.

There is significant existing research related to surveillance systems. In the earliest days of object detection, background subtraction, statistical methods, or the frame difference method were used, and point tracking was typically employed for detection [2]. Later, motion detection was done with the help of histogram of oriented gradients (HOG) and support vector machine (SVM) algorithms [3]. HOG is an advancement on the scale-invariant feature transform and shape contexts that were available at the time it was developed. HOG was best known for its application in pedestrian detection, and it was the first multi-object detection method. The deformable part-based model (DPM) technique was introduced later [4]. Instead of using the HOG rule for detection, it employs the "divide and conquer" rule. Their most significant shortcoming is nonetheless that they do not address sequence-based detection. The primary issue with many other methods was that they lacked a reference to the

This project has been awarded sponsorship from the Federal Ministry of Information Technology in Pakistan under the IGNITE program (FYP-Code: NGIRI-2022-12025).

978-1-6654-6071-2/22/$31.00 ©2022 IEEE


movement of time, which means that they are unable to meaningfully differentiate between scenarios such as moving, fighting, or hugging. Consequently, there is room for advancement in several aspects of these procedures.

For this paper, we trained the model using an HAR technique that combines CNN and LSTM. Even though both architectures (CNN and LSTM) have individually been the subject of a significant amount of research, the authors of this paper have combined both methods in order to achieve the best possible outcome for HAR. It is pertinent to mention that, to the best of the authors' knowledge, this combined method has not previously been employed for real-time HAR. Therefore, the main contribution of this paper is to provide a smart real-time surveillance solution using CNN and LSTM.

The remainder of this paper is organized as follows: Section II contains a review of some previously published relevant research studies. In Section III, the proposed strategy to provide a smart real-time surveillance system is presented in detail. The experimental findings resulting from the implementation of the proposed work are given in Section IV. Finally, Section V concludes this paper.

II. RELATED WORK

Due to advancements in the field of video recognition, HAR research is extensively subdivided into numerous types. A two-stream network is the most prevalent type of action recognition in deep learning. This technique deals with two CNNs: the first is used for spatial feature extraction, dealing with a single frame of action and learning from it [5], whereas the other CNN uses a sequence of frames to extract features using the optical flow vector technique.

Another HAR system has been created employing sensors that require very little power and may be worn on the body. This architecture generates features by fast Fourier transform as well as discrete wavelet transform, using data obtained from an inertial measurement unit (IMU) sensor. A neural network is trained on the obtained data and, as an experimental result of this approach, six activities can be detected with 12.5 mW power consumption [6]. This technology provides a good approach to HAR, but it also comes with the expense of wearable sensors and energy consumption, both of which are not optimal when applied on a larger scale.

A different approach is discussed in an article [7], which referred to the time series approach of HAR using a multi-channel CNN as a framework. This article presents the methodology to convert data obtained from a 3D accelerometer into an image, which is fed into a CNN with three convolutional layers and one fully connected layer.

III. METHODOLOGY

In this part of the paper, the different approaches to feature extraction and to the categorization of video segments for violence detection are discussed. Fig. 1 shows the architecture of the CNN and LSTM that is used for this purpose.

A. Preprocessing

The videos in the dataset are represented by their number, number of frames per clip, width, and height. These videos are transformed into frames and resized to 128 by 128 by 3. The data is then divided into two sections: the first is for training, while the second is for validation. 75% of the data is chosen for training, while the remaining 25% is used for validation.

B. Feature Extraction

Various types of CNN architectures were tested for feature extraction, such as VGG16 [8] and MobileNetV2 [9]. Since we have a smaller dataset, we use transfer learning; therefore, our model is trained on the MobileNetV2 CNN architecture. For the implementation of this idea, this model was best suited for real-time applications. MobileNetV2 supports any input image size greater than 32 x 32; for this experiment, we used an input image shape of 128 x 128. MobileNetV2 was released to the public with bottleneck layers and shortcut connections updated from the previous edition [10].

Video segments are made uniform before being sent for feature extraction: 20 frames from every video are uniformly selected. These frames are then cropped to match the input size of the architecture.

Fig. 1. Architecture of proposed methodology.

C. Classification

For classification, an LSTM is used for activity detection, as it can learn on time sequences and thus learn patterns based on them [11]. As the name LSTM suggests, it is a sequential learning system that uses time as a reference to discover patterns. LSTM networks were created primarily to address the issue of long-term dependency. LSTMs have feedback connections that distinguish them from conventional feedforward neural networks. This trait enables LSTMs to process whole sequences of data without treating each point in the series individually. Instead, LSTMs retain important information about prior data in the sequence to assist in the processing of new data points. Consequently, LSTMs excel at processing sequences of data such as text, voice, and time series in general. LSTMs utilize a number of "gates" that regulate how information in a data sequence enters, is stored in, and exits the network. A standard LSTM cell has three gates: the forget gate, the input gate, and the output gate.

The first gate in the series is called the forget gate [12]. At this stage, it determines which bits of the cell state, which serves as the long-term memory of the network, are valuable in light of both the previous hidden state and the newly received input data. The second gate in the network is the "input gate" [13]. In this stage, it is decided what new information needs to be added to the long-term memory (cell state) of the network, taking into account the previous hidden state as well as the incoming input data. The third gate is the "output gate". This stage, which determines the new hidden state, takes place once the update of the long-term memory has been finished and it is ready to be used.
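The three gates described above can be written out directly. The following is a minimal single-step LSTM cell sketch in NumPy; it is illustrative only, and the toy dimensions, random initialization, and weight layout are assumptions for the example, not the trained model of this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: the forget, input, and output gates update
    the cell state c (long-term memory) and hidden state h."""
    z = np.concatenate([h_prev, x])      # previous hidden state joined with new input
    f = sigmoid(W["f"] @ z + b["f"])     # forget gate: which bits of c_prev to keep
    i = sigmoid(W["i"] @ z + b["i"])     # input gate: which new information to store
    g = np.tanh(W["g"] @ z + b["g"])     # candidate values for the cell state
    o = sigmoid(W["o"] @ z + b["o"])     # output gate: what to expose as the new h
    c = f * c_prev + i * g               # updated long-term memory
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# Toy dimensions: 4 input features, 3 hidden units (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
W = {k: rng.standard_normal((n_h, n_h + n_in)) * 0.1 for k in "figo"}
b = {k: np.zeros(n_h) for k in "figo"}
h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):                       # process a short sequence step by step
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape, c.shape)
```

Because the output gate is a sigmoid and the cell state passes through tanh, each component of the hidden state stays within (-1, 1), which is what makes the state usable as a bounded summary of the sequence so far.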
Fig. 2 provides an overview of the proposed HAR approach. In the first phase, a video is broken down into frames, and these frames are then run through a CNN, which can identify people in the video obtained directly from a Pi-Cam and processed through a Raspberry Pi. After that, this result is passed on to the LSTM so that it can be compared with the results of other CNN blocks for the detection of a fight scene. In the end, training losses and validation losses are calculated and plotted.

1) Validation losses

Validation loss is a metric used to evaluate how well a deep learning model performs on the validation set. The validation set is a section of the dataset that has been set aside specifically for the purpose of verifying the accuracy of the model. The validation loss is comparable to the training loss in that it is determined by summing the errors made on each example in the validation set.

In addition, the validation loss is evaluated after every epoch. This tells us whether the model requires further tuning or adjustments or whether it is already adequate. Typically, this is accomplished by drawing a learning curve representing the validation loss.

2) Training losses

The training loss is a metric used to evaluate how well a deep learning model fits the training data. In other words, it determines how inaccurate the model is on the data from the training set. Note that the training set is the subset of the dataset that is used to train the model. In computation, the training loss is determined by summing the errors made during the training process.

It is also essential that the training loss is calculated after every batch that is processed. In most cases, this is shown graphically by displaying a curve of the training loss.

Fig. 2. Overview diagram of proposed method.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

This section discusses the experiments and their results. First, the datasets are explained in detail; then the graphical results are obtained and analyzed to assess the performance of the proposed methodology.

A. Datasets

This subsection gives the details of the datasets used for this approach and the experimental setup; the experimental results are discussed afterwards. There is some background in all clips, and there is still a lot of background motion.

1) Hockey Dataset: This dataset includes short video clips that depict either fighting or non-fighting scenes, with a total of one thousand videos across both scenes.

2) Peliculas Dataset: This dataset includes 200 videos in total, with 100 videos of violent acts and 100 videos of non-violent acts. An accuracy of 90% was achieved by the authors of this dataset when they used it to train models on STIP and MoSIFT [14]. Each video in this dataset has a runtime of about two to four seconds.

3) Real Life Dataset: This is the largest collection, containing over 2000 video clips of real-life situations including physical conflict. The fact that this dataset is the best example of data augmentation is the primary benefit of utilizing it. Within this dataset, there is a total of 1000 examples available for each class. Previously, a VGG16 and LSTM model trained on this dataset achieved an accuracy of 88.2% [15].

The proposed architecture is applied to the above datasets. The accuracy of this architecture, with respect to the various datasets, is provided in Table 1.

Table 1: Comparison of proposed model with different approaches.

Approach         Dataset            Accuracy
VGG16 + LSTM     Hockey Dataset     87.05% [16]
STIP + MoSIFT    Hockey Dataset     90% [14]
VGG16 + LSTM     Real-Life Dataset  88.2% [15]
STIP + MoSIFT    Peliculas          90% [14]
CNN + LSTM       Real-Life Dataset  91% (this research)
CNN + LSTM       Peliculas          98% (this research)

After training the base model on the Real Life dataset and the Peliculas dataset, we obtain accuracies of 91% and 98%, respectively.

It should be mentioned that, while training, only 50% of the dataset (500 videos of each class) is used due to Colab memory limits; therefore, the performance is somewhat compromised. Fig. 3 shows the total accuracy and total validation accuracy obtained when training on the Real Life dataset, which is 91%.
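The accuracy and loss curves reported here follow the per-epoch bookkeeping described in Section III: the training loss is accumulated batch by batch, while the validation loss is evaluated once after each epoch. A minimal illustrative sketch with synthetic labels and random stand-in predictions (the data, batch size, and prediction values are assumptions, not the actual training run) could look like this:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean error over one batch of fight / no-fight predictions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

rng = np.random.default_rng(42)

# Synthetic stand-in for the 75% / 25% train-validation split.
labels = rng.integers(0, 2, size=200).astype(float)
split = int(0.75 * len(labels))
train_y, val_y = labels[:split], labels[split:]

train_curve, val_curve = [], []
for epoch in range(3):
    # Training loss: summed over batches, then averaged per epoch.
    batch_losses = []
    for start in range(0, len(train_y), 32):
        batch = train_y[start:start + 32]
        preds = rng.uniform(0.05, 0.95, size=len(batch))  # stand-in for model output
        batch_losses.append(binary_cross_entropy(batch, preds))
    train_curve.append(float(np.mean(batch_losses)))

    # Validation loss: evaluated once after each epoch.
    val_preds = rng.uniform(0.05, 0.95, size=len(val_y))
    val_curve.append(binary_cross_entropy(val_y, val_preds))

print(len(train_curve), len(val_curve))  # one point per epoch for each curve
```

Plotting `train_curve` against `val_curve` gives exactly the kind of learning curves shown in Figs. 3-6; a widening gap between the two curves is the usual sign that further tuning is needed.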
Fig. 3. Achieved total accuracy and total validation accuracy on the Real Life dataset.

The total loss vs total validation loss can be observed in Fig. 4. The model did perform well, but not as well as planned, because a reduced dataset was used. Still, it achieved a total accuracy of 91%, which is better than the other models previously presented in the literature.

Fig. 4. Achieved total loss vs validation loss on the Real Life dataset.

Another training of the model was performed on the Peliculas dataset using the same architecture. Following the training, the model was able to reach an accuracy of 98%, which is the highest achieved on this dataset so far.

Fig. 5. Achieved total loss vs total validation loss on the Peliculas dataset.

Fig. 6. Achieved total accuracy vs total validation accuracy on the Peliculas dataset.

On this dataset, the model performs best, with an accuracy of 98%. The total loss vs validation loss during training is presented in Fig. 5. Fig. 6 shows the graphical representation of the attained accuracy of the model on the Peliculas dataset.

It is noteworthy that the presented methodology resulted in an accuracy of 91%, which is significantly improved over previous research. Moreover, the trained model achieved this correctness with varying backgrounds that cause different illumination levels and add more objects to the video. Consequently, owing to the discussed advantages, the presented CNN-LSTM-based HAR system can be implemented as a real-time smart surveillance system.

V. CONCLUSION

The field of applications that deal with identifying fights and suspicious actions in video is expanding. This capability may prove to be of great use in situations involving video monitoring, such as in correctional institutions, mental hospitals, and senior care facilities. Techniques of HAR that have traditionally focused primarily on individual actors and fundamental events can be utilized within the context of this particular application. The fundamental objective of this research is to present a standalone HAR system that can identify potentially harmful video behavior speedily and accurately. The presented HAR approach combines CNN and LSTM to deliver optimal precision and speed, which is beneficial for real-time applications. The accuracy of the presented HAR technique is up to 98% for the Peliculas dataset and reduces to 91% for real-life datasets with varying backgrounds. However, it is notable that this accuracy of 91% is improved compared to its previous counterparts. In a nutshell, this paper introduces a real-time standalone HAR solution that not only detects but also classifies unscrupulous activities. In the future, a camera will be mounted on a moving automated platform so that the overall system may operate in close quarters and provide real-time HAR feedback to the control room.

ACKNOWLEDGMENTS

The majority of this work was accomplished in the Project Lab of the Department of Electrical and Computer Engineering at CUI Abbottabad. The authors wish to express gratitude to both the department's employees and
management for their unwavering support during the course of this research.

REFERENCES

[1] G. Swapna, S. Kp, and R. Vinayakumar, "Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals," Procedia Computer Science, vol. 132, pp. 1253-1262, 2018.
[2] P. K. Mishra and G. Saroha, "A study on video surveillance system for object detection and tracking," in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 221-226: IEEE.
[3] P. K. Roy and H. Om, "Suspicious and violent activity detection of humans using HOG features and SVM classifier in surveillance videos," in Advances in Soft Computing and Machine Learning in Image Processing: Springer, 2018, pp. 277-294.
[4] T. Cucliciu, C.-Y. Lin, and K. Muchtar, "A DPM based object detector using HOG-LBP features," in 2017 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW), 2017, pp. 315-316: IEEE.
[5] M. Jogin, M. Madhulika, G. Divya, R. Meghana, and S. Apoorva, "Feature extraction using convolution neural networks (CNN) and deep learning," in 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2018, pp. 2319-2323: IEEE.
[6] G. Bhat, R. Deb, V. V. Chaurasia, H. Shill, and U. Y. Ogras, "Online human activity recognition using low-power wearable devices," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1-8: IEEE.
[7] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, "Time series classification using multi-channels deep convolutional neural networks," in International Conference on Web-Age Information Management, 2014, pp. 298-310: Springer.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.
[10] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling, "Spherical CNNs," arXiv preprint arXiv:1801.10130, 2018.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] A. Kulshrestha, L. Chang, and A. Stein, "Use of LSTM for sinkhole-related anomaly detection and classification of InSAR deformation time series," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4559-4570, 2022.
[13] H. Abbasimehr and R. Paki, "Improving time series forecasting using LSTM and attention models," Journal of Ambient Intelligence and Humanized Computing, vol. 13, no. 1, pp. 673-691, 2022.
[14] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, and R. Sukthankar, "Violence detection in video using computer vision techniques," in International Conference on Computer Analysis of Images and Patterns, 2011, pp. 332-339: Springer.
[15] M. M. Soliman, M. H. Kamal, M. A. E.-M. Nashed, Y. M. Mostafa, B. S. Chawky, and D. Khattab, "Violence recognition from videos using deep learning techniques," in 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), 2019, pp. 80-85: IEEE.
[16] Ş. Aktı, G. A. Tataroğlu, and H. K. Ekenel, "Vision-based fight detection from surveillance cameras," in 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), 2019, pp. 1-6: IEEE.
