CNN LSTM
Abstract: Modern surveillance technologies are intended to be precise, cost-effective, and run without the need for human interaction. In this regard, a few topologies were employed in the previous decade which treated human movement as traditional object detection. Because of this approach, suspicious behavior was also tackled as an object, which made the existing human activity recognition (HAR) systems slow and inefficient when it came to classification for real-time applications. In recent years, machine learning algorithms have made significant advancements in the time-dependent classification of events. This research presents an HAR system that employs a Convolutional Neural Network (CNN) to extract spatial information along with a Long Short-Term Memory (LSTM) approach for the rapid and precise sequential tracking of an identified object. This CNN-LSTM technique not only lowers the model's complexity but also improves its accuracy, which allows it to be executed in real time. Therefore, the proposed CNN-LSTM approach can detect suspicious activities in real time at 10-13 FPS and obtain the best tracking performance in any circumstance while implemented on a Raspberry Pi, which works as a standalone system.

Keywords: CNN-LSTM, deep learning, fight detection, human activity recognition, intelligent surveillance system, sequence-based activity detection.

I. INTRODUCTION

Throughout human history, surveillance has been an integral component of society. Humans have been attempting to develop surveillance measures for a safer way of life since ancient times. Due to current technologies such as cameras, computers, GPUs, etc., surveillance has been vastly improved in recent decades. Previously, metal detection, x-ray scanning, and manual monitoring of security cameras were utilized to conduct surveillance. As the primary function of surveillance is to detect and monitor actions, surveillance systems in the era of artificial intelligence can be extremely precise and operate without human intervention.

One of the most significant developments in the Internet of Things (IoT) is object detection, which can process objects in still images and moving videos. The object detection process utilizes two distinct techniques: the first involves image classification, while the second is object localization. At the algorithm level, not only is the object detected, but a tagged bounding box is also drawn to point to the object's position. Consequently, object detection can detect several objects in a single image or video frame. With the progress in object detection algorithms, its applications have broadened to multiple fields including medical treatment, civil engineering, defense, agriculture, etc. [1]. However, human activity recognition (HAR) has remained a challenge because it involves sophisticated classification of motion and gestures.

In this research, using closed-circuit television (CCTV) cameras on a real-time operating surveillance system, we further explore HAR to detect criminal activities and fighting sequences. A smart surveillance system is presented for detecting and locating an unscrupulous act within a specific time frame. If the proposed HAR system detects suspicious activity, it will notify the authorities. To be more specific, criminal conduct will be detected if individuals are fighting or if one is trying to rob the other. The system will distinguish the action and identify the two individuals as objects engaging in it.

There is significant existing research related to surveillance systems. In the earliest days of object detection, background subtraction, the statistics method, or the frame difference method were used. Point tracking was typically employed for detection [2]. Later, motion detection was done with the help of a histogram of oriented gradients (HOG) and support vector machine (SVM) algorithms [3]. HOG is an advancement on the scale-invariant feature transform and shape contexts that were available at the time it was developed. HOG was best known for its application in the detection of pedestrians, and it was the first multi-object detection method. The deformable part-based model (DPM) technique was introduced next [4]. Instead of using the HOG rule for detection, it employs the "Divide and Conquer" rule. Their most significant shortcoming, however, is that they did not focus on sequence-based detection. The primary issue with the many other methods was that they lacked a reference to the

This project has been awarded sponsorship from the Federal Ministry of Information Technology in Pakistan under the IGNITE program (FYP-Code: NGIRI-2022-12025).
IV. EXPERIMENTAL RESULTS AND DISCUSSION

This section of the paper discusses the experiments and their results. First, the datasets are explained in detail, then the graphical results are obtained and analyzed to assess the performance of the proposed methodology.

A. Datasets

This subsection gives the details of the datasets used for this approach and the experimental setup. The experimental results are discussed afterwards. All clips contain some background, and many also contain a lot of background motion.

1) Hockey Dataset: This dataset includes short video clips taken from hockey games that depict either fighting or non-fighting scenes. It includes a total of one thousand videos across both kinds of scene.

2) Peliculas Dataset: This dataset includes 200 videos in total, with 100 videos relating to violent acts and 100 videos relating to non-violent acts. An accuracy of 90% was achieved by the authors of this dataset when they used it to train models on STIP and MoSIFT [14]. Each video of this dataset has about two to four seconds of runtime.

The proposed architecture is applied to the above datasets. The accuracy of this architecture, with respect to various datasets, is provided below in Table 1.

Table 1: Comparison of the proposed model with different approaches.
Approach        Dataset             Accuracy
VGG16 + LSTM    Hockey Dataset      87.05% [16]
STIP + MoSIFT   Hockey Dataset      90% [14]
VGG16 + LSTM    Real-Life Dataset   88.2% [15]
STIP + MoSIFT   Peliculas           90% [14]
CNN + LSTM      Real-Life Dataset   91% (this research)
CNN + LSTM      Peliculas           98% (this research)

After training the base model on the real-life dataset and the Peliculas dataset, we get accuracies of 91% and 98%, respectively.

It should be mentioned that only 50% of the dataset (500 videos of each) is used for training because of Colab memory limitations, so the performance is somewhat compromised. Fig. 3 shows the total accuracy and total validation accuracy obtained when training on the real-life dataset, which reaches 91%.
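The comparison in Table 1 can be made concrete with a small calculation. The sketch below is only illustrative: it copies the accuracy values from Table 1 and, for each dataset on which the proposed CNN + LSTM model was evaluated, subtracts the best previously reported accuracy. The row layout and the helper name `margin_over_prior` are assumptions introduced here and are not part of the original work.

```python
# Accuracy figures (percent) copied from Table 1. The tuple layout and the
# helper name `margin_over_prior` are illustrative, not from the paper.
TABLE_1 = [
    ("VGG16 + LSTM", "Hockey", 87.05, "[16]"),
    ("STIP + MoSIFT", "Hockey", 90.0, "[14]"),
    ("VGG16 + LSTM", "Real-Life", 88.2, "[15]"),
    ("STIP + MoSIFT", "Peliculas", 90.0, "[14]"),
    ("CNN + LSTM", "Real-Life", 91.0, "this research"),
    ("CNN + LSTM", "Peliculas", 98.0, "this research"),
]


def margin_over_prior(rows):
    """Return {dataset: proposed accuracy minus best prior accuracy}.

    Datasets without both a proposed and a prior result are skipped.
    """
    proposed, prior = {}, {}
    for _approach, dataset, accuracy, source in rows:
        if source == "this research":
            proposed[dataset] = accuracy
        else:
            # Keep only the strongest earlier result per dataset.
            prior[dataset] = max(accuracy, prior.get(dataset, 0.0))
    return {
        dataset: round(proposed[dataset] - prior[dataset], 2)
        for dataset in proposed
        if dataset in prior
    }


margins = margin_over_prior(TABLE_1)
for dataset, gain in margins.items():
    print(f"{dataset}: +{gain} percentage points over the best prior result")
```

With the values in Table 1, the margin is 2.8 percentage points on the real-life dataset and 8 percentage points on the Peliculas dataset. The Hockey dataset is skipped because Table 1 lists no result for the proposed model on it.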
Fig. 3. Achieved total accuracy and total validation accuracy on the real-life dataset

The total loss vs total validation loss for the real-life dataset can be observed in Fig. 4. The model performed well, but not as well as planned because only part of the dataset was used. It still achieved a total accuracy of 91%, which is better than the other models previously presented in the literature.

Fig. 4. Achieved total loss vs validation loss on the real-life dataset

Another training of the model was performed on the Peliculas dataset using the same architecture. Following the training, the model reached an accuracy of 98%, which is the highest it has achieved on this dataset so far.

Fig. 5. Achieved total loss vs total validation loss on the Peliculas dataset

Fig. 6. Achieved total accuracy vs total validation accuracy on the Peliculas dataset

On this dataset, the model performs best, with an accuracy of 98%. The total loss vs validation loss during training is presented in Fig. 5. Fig. 6 shows the graphical representation of the accuracy attained by the model on the Peliculas dataset.

It is noteworthy that the presented methodology reached an accuracy of 91% on the real-life dataset, an improvement over the 88.2% previously reported for VGG16 + LSTM [15] (Table 1). Moreover, the trained model achieved this accuracy with a varying background that causes different illumination levels and adds more objects to the video. Consequently, owing to the discussed advantages, the presented CNN-LSTM-based HAR system can be implemented as a real-time smart surveillance system.

V. CONCLUSION

The field of application that deals with identifying fights and suspicious actions in video is expanding. This capability may prove to be of great use in situations involving video monitoring, such as in correctional institutions, mental hospitals, and senior care facilities. Techniques of HAR that have traditionally focused primarily on individual actors and fundamental events can be utilized within the context of this particular application. The fundamental objective of this research is to present a standalone HAR system that can identify potentially harmful behavior in video speedily and accurately. The presented HAR approach combines CNN and LSTM to deliver optimal precision and speed, which is beneficial for real-time applications. The accuracy of the presented HAR technique is up to 98% for the Peliculas dataset and reduces to 91% for real-life datasets with varying backgrounds. However, it is notable that this accuracy of 91% is an improvement over its previous counterparts. In a nutshell, this paper introduces a real-time standalone HAR solution that not only detects but also classifies unscrupulous activities. In the future, a camera will be mounted on a moving automated platform such that the overall system may operate in closed quarters and provide real-time HAR feedback to the control room.

ACKNOWLEDGMENTS

The Project Lab in the Department of Electrical and Computer Engineering at CUI Abbottabad is where the majority of this work was accomplished. The author wishes to express gratitude to both the department's employees and management for their unwavering support during the course of this research.

REFERENCES

[1] G. Swapna, S. Kp, and R. Vinayakumar, "Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals," Procedia Computer Science, vol. 132, pp. 1253-1262, 2018.
[2] P. K. Mishra and G. Saroha, "A study on video surveillance system for object detection and tracking," in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 221-226: IEEE.
[3] P. K. Roy and H. Om, "Suspicious and violent activity detection of humans using HOG features and SVM classifier in surveillance videos," in Advances in Soft Computing and Machine Learning in Image Processing: Springer, 2018, pp. 277-294.
[4] T. Cucliciu, C.-Y. Lin, and K. Muchtar, "A DPM based object detector using HOG-LBP features," in 2017 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW), 2017, pp. 315-316: IEEE.
[5] M. Jogin, M. Madhulika, G. Divya, R. Meghana, and S. Apoorva, "Feature extraction using convolution neural networks (CNN) and deep learning," in 2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2018, pp. 2319-2323: IEEE.
[6] G. Bhat, R. Deb, V. V. Chaurasia, H. Shill, and U. Y. Ogras, "Online human activity recognition using low-power wearable devices," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1-8: IEEE.
[7] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, "Time series classification using multi-channels deep convolutional neural networks," in International Conference on Web-Age Information Management, 2014, pp. 298-310: Springer.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.
[10] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling, "Spherical CNNs," arXiv preprint arXiv:1801.10130, 2018.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] A. Kulshrestha, L. Chang, and A. Stein, "Use of LSTM for Sinkhole-Related Anomaly Detection and Classification of InSAR Deformation Time Series," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4559-4570, 2022.
[13] H. Abbasimehr and R. Paki, "Improving time series forecasting using LSTM and attention models," Journal of Ambient Intelligence and Humanized Computing, vol. 13, no. 1, pp. 673-691, 2022.
[14] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, and R. Sukthankar, "Violence detection in video using computer vision techniques," in International Conference on Computer Analysis of Images and Patterns, 2011, pp. 332-339: Springer.
[15] M. M. Soliman, M. H. Kamal, M. A. E.-M. Nashed, Y. M. Mostafa, B. S. Chawky, and D. Khattab, "Violence recognition from videos using deep learning techniques," in 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), 2019, pp. 80-85: IEEE.
[16] Ş. Aktı, G. A. Tataroğlu, and H. K. Ekenel, "Vision-based fight detection from surveillance cameras," in 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), 2019, pp. 1-6: IEEE.