
Computers, Materials & Continua

DOI:10.32604/cmc.202x.xxxxxx

Type: xxx

Violence Detection Using Computer Vision Approaches

Khalid Raihan Talha1 and Koushik Banerjee2


1Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka-1229, Bangladesh

*Corresponding Author: Author’s Name. Email: author@institute.xxx
Received: XX Month 202X; Accepted: XX Month 202X

Abstract: Violent crime has always been a major social problem. The rise of
violent behavior in public areas can be attributed to a variety of factors. Greed,
frustration, and hostility among individuals, as well as social and economic
anxieties, are the primary causes of increased violence. It is critical to protect our
possessions, as well as our lives, from threats such as robbery or homicide. It is
impossible to prevent crime and violent acts unless brain signals are studied and
a certain pattern deduced from criminal ideas is detected in real-time. Because this is not yet
technologically viable, such an approach remains unrealized. However, using deep learning-
based computer vision technologies, we can detect violent activities in public
areas. The goal of this project is to build a real-time violent activity monitoring
system that will be capable of detecting violence very quickly and efficiently.
The public of any city can benefit from it, as it will allow the people of the law
enforcement department to take necessary actions to prevent violent activities.
When the system is implemented, it will be able to use cameras to measure the speed of
people's movements and their distances from other people in public places. The system will
mainly detect the speed of the hand and leg movements of a person who is very close to another
person. If anyone is identified as committing violence, the server side of the system will notify
the people responsible for preventing violence within a very short period. The system has been
built using the concepts of computer vision and neural networks. The
system has been developed and tested initially on the personal computing
devices of the system developers. This system is very easy to design and
develop, making it very easy to use for any kind of public area surveillance. At
the same time, the system gives its desired output due to its high accuracy.

Keywords: Violence detection; convolutional neural networks; LSTM; computer vision.

1 Introduction

For a long time, one of the major issues has been the occurrence of violence in daily life. It can easily
destroy the peace and harmony of any society. Criminal activity declined considerably from 2014 to 2017;
however, starting in 2017, it began to rise again, with an increase of 6.79% from 2017 to 2018 [1].
Violent behavior in public areas occurs due to various factors. Individual greed,
frustration, and hatred, as well as social and economic insecurity, are the leading causes of violence. To
solve this issue, expected or unexpected violence should be detected at an early stage so that it can be
stopped as soon as possible.

Computer vision and deep learning have recently been used to investigate human actions and behavior.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

Although violence is among the most terrifying societal issues, few works automate action detection, violence detection,
or protest detection. In terms of social security and stability, this field of study is quite useful. It is
impossible to prevent crime and violent acts unless brain signals are studied and a specific pattern derived
from criminal thinking is discovered in real-time. Because this is not yet technologically feasible, it has yet to be
accomplished. Using deep learning-based computer vision, we can now easily detect aggressive activity
in public areas. Most public sites and private institutions already have surveillance cameras installed [2].
Effective violence detection techniques can assist the government or authorities in taking a quick and
systematic approach to identifying violence and preventing the loss of human life and property. As human
beings and members of society, we all desire to have secure streets, communities, and workplaces.
Because it does not involve any explicit feature engineering, deep learning outperforms machine learning.
There are some disadvantages, including high processing costs and large training datasets. These
technological considerations drive us to create a model that requires less training time and a smaller
number of training examples. Using deep learning methodologies, we offer approaches in our system that
will be able to spot violent threats and activities.

Previously, the presence of a body, the degree of action, and even aspects of the sound associated with
violent activities were used to distinguish between violent and non-violent activities. Surveillance
cameras are not very effective in recording sounds related to certain activities (Audio-visual content-
based violent scene characterization) [3]. Frame-based video analysis, on the other hand, is purely based
on a sequence of frames (that is, a picture) rather than sounds. There are various sorts of violence,
including one-on-one violence, mob violence, family violence, sports violence, gun violence, and many
others. Violence detection with C3D Convolutional Neural Network (3D-CNN) for detecting violent
scenes in a video stream was one of the previous works. The 3D-CNN is a deep supervised learning
approach that uses videos (sequences of image frames) to learn spatiotemporal discriminant features. Unlike
2D convolutions, this method applies 3D kernels to a series of image frames in their context, resulting in
3D activation maps that capture both spatial and temporal information. Three datasets were combined for
this task: hockey fights, movies, and crowd violence [4]. They were able to get an accuracy of 84.428% at
the 36th training epoch [5]. Another contribution was a work that uses the concept of convolutional neural
networks (CNNs) and the Google Object Detection API and uses these two new developments in
technology to retrain a pre-trained model to perform weapon detection in real-time surveillance. From one
of the latest contributions, we were able to know that this problem can also be solved by using
convolutional neural networks. By scanning the sequential flow of video frames, a bidirectional LSTM
model (CNN-BiLSTM) architecture is used to detect real-time violence. They achieved more than 98%
accuracy for their three different models [6]. But unlike their work, we will only create a single model,
which will not only decrease the server load but also increase the response time of applications where it
will be deployed.

This project aims to investigate the effect of training convolutional neural networks with one extra class
"non-weapon" based on two original classes "gun" and "knife". The knife was accurately identified as a
knife, while the phone was successfully identified as a phone by the Inception model non-weapon with
99% and 56% accuracy, respectively.

To predict violence in the sequential flow of frames, we will utilize the Convolutional Neural Network
Bidirectional LSTM model (CNN-BiLSTM) architecture. To begin with, we divide a video into numerous
frames. We pass each frame through a convolutional neural network to extract the information present in
the current frame. Then, to recognize any sequential flow of events, we utilize a bidirectional LSTM layer
to compare the information of the current frame once with the prior frames and once with the upcoming
frames. Finally, the classifier determines whether or not an action is violent.
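As a minimal sketch of this pipeline (assuming a trained Keras model saved as violence_cnn_bilstm.h5; the file name, the frame size of 100×100, and the convention that class 1 means "violent" are illustrative assumptions rather than fixed details of our implementation), the following Python code reads a video, buffers 10 consecutive frames, and classifies each window:

```python
import cv2
import numpy as np
from tensorflow import keras

MODEL_PATH = "violence_cnn_bilstm.h5"   # assumed file name of the trained model
SEQ_LEN, H, W = 10, 100, 100            # 10 frames of 100x100 pixels (see Section 2.2.1)

model = keras.models.load_model(MODEL_PATH)

def classify_video(video_path):
    """Slide a 10-frame window over the video and report the class of each window."""
    cap = cv2.VideoCapture(video_path)
    buffer = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (W, H)) / 255.0          # normalize pixel values
        buffer.append(frame)
        if len(buffer) == SEQ_LEN:
            batch = np.expand_dims(np.array(buffer), axis=0)  # shape (1, 10, 100, 100, 3)
            probs = model.predict(batch, verbose=0)[0]
            label = "violent" if np.argmax(probs) == 1 else "non-violent"  # assumed label order
            print(f"window ending at frame {int(cap.get(cv2.CAP_PROP_POS_FRAMES))}: {label}")
            buffer = []                                     # start the next non-overlapping window
    cap.release()

classify_video("sample_clip.mp4")  # illustrative input file
```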

After introducing our topic, we proceed to the methodology in Section 2, where we discuss the steps taken
to implement our system, including the experimental setup, data processing, and training methods. We then
discuss the results of our work with qualitative and quantitative data in Section 3, where we illustrate the
accuracy evaluation and accuracy comparison. Finally, we present the conclusion of our paper, including the
necessary figures and tables, and opportunities for future upgrades.

2 Methodology

In this section, we will discuss the proposed framework for creating our computer vision model and its
architecture.

2.1 Model Architecture


Our model must be able to predict sequences in successive frames, such as a pattern in the movement
of the individuals or the degree of their motion, to classify violent or non-violent activities. This is not
possible by considering only the spatial features (features belonging to a particular frame) of the frames.
While detecting sequences in frames, temporal or time-related factors must be taken into account. The
temporal features can be handled in either a forward or backward order. Our model processes the
temporal features in both directions in addition to the spatial features, which helps the model become
more accurate while consuming less computational time. Lightweight models are always preferred in
surveillance due to their low computational cost. The model consists of three sub-parts [7].

2.1.1 CNN
The Convolutional Neural Network (CNN) is the most common neural network in the field of
computer vision for detecting and classifying images. Our CNN comprises an input convolutional layer
followed by three layers of convolution and max pooling. The kernel size for each convolutional layer is
3×3, and 64 kernels are used in each convolutional layer. After passing through "ReLU" activation, the
output of each convolutional layer is max pooled to extract the features. Each max pooling operation uses
a filter size of 2×2. Finally, the features are flattened and sent to the next sub-model. TensorFlow's
(https://www.tensorflow.org/) and Keras' (https://keras.io/) APIs have been used to deploy the
convolutional neural network. The basic CNN structure is displayed in Fig. 1.

Figure 1: Convolutional Neural Network (CNN)
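A minimal Keras sketch of this per-frame feature extractor is given below. Only the 3×3 kernels, 64 filters per layer, ReLU activations, 2×2 max pooling, and final flattening are taken from the description above; the function name, padding choice, and the assumption of 3 input channels are illustrative.

```python
from tensorflow.keras import layers, models

def build_frame_encoder(height=100, width=100, channels=3):
    """Per-frame CNN: an input convolution followed by three convolution + max-pooling stages."""
    return models.Sequential([
        layers.Input(shape=(height, width, channels)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),  # input convolutional layer
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),  # flattened features are passed on to the BiLSTM sub-model
    ], name="frame_encoder")
```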

2.1.2 The Bidirectional LSTM Cells


xxxxCMC, 202x, vol.xx, no.xx

The basic LSTM cell appears in Fig. 2. Long short-term memory cells are frequently used to re-examine a
portion of previously processed features. The LSTM mimics the action of the human brain in remembering
previously processed events. The first layer in an LSTM cell is known as the forget gate layer, denoted by
f_t. The input is passed through a sigmoid function to produce an output of either 0 or 1, where 0 indicates
a forget state and 1 indicates a remember state.
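In the standard LSTM formulation, which we assume here, the forget gate layer is computed as

    f_t = σ(W_f · [h_(t-1), x_t] + b_f)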

Figure 2: Basic LSTM cell

Figure 3: BiLSTM cell

The next layer is called the input gate layer. In this layer, the remembered state information is updated with
the new features.

The output from the forget gate layer is multiplied with the cell state vector (c_(t-1)) of the previous LSTM
cell. The result is added to the output of the input gate layer to produce the cell state vector (c_t) for the
following LSTM cell. This cell state vector, upon passing through a "tanh" operation, is multiplied with the
output gate activation, which is obtained by passing the previous hidden state vector (h_(t-1)) and the current
input through a "sigmoid" function, to create the hidden state vector (h_t) for the following LSTM cell.
Consequently, in the final step, a portion of the features from the previous state and the newly admitted
features of the current cell are combined and passed to the next state.
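In the same standard formulation assumed here, the input gate, candidate cell state, cell state, output gate, and hidden state updates can be written as

    i_t = σ(W_i · [h_(t-1), x_t] + b_i)
    c̃_t = tanh(W_c · [h_(t-1), x_t] + b_c)
    c_t = f_t ⊙ c_(t-1) + i_t ⊙ c̃_t
    o_t = σ(W_o · [h_(t-1), x_t] + b_o)
    h_t = o_t ⊙ tanh(c_t)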

where x_t is the input vector to the LSTM unit and b_f, b_i, and b_o are the bias vectors for the forget gate
layer, input gate layer, and output gate layer, respectively. In the LSTM, the features are remembered and
passed from state 1 to state 2 through state n. The LSTM can also work in the reverse direction; the
features are remembered and passed from state n back through state 2 to state 1. By combining both these
mechanisms, we achieve a bidirectional LSTM layer as shown in Fig. 3. The bidirectional LSTM cells are
more accurate in storing data. For violence detection, a bidirectional LSTM will compare the sequence of
frames once in the forward direction and once in the reverse direction. This mechanism adds various cell
states and training features that add robustness to our model.

2.1.3 The Dense Layers


Dense layers are ubiquitous in deep learning. Here, the fully connected dense layers apply randomly
initialized weights (W_i) to the features (X_i) and, by passing the result through an activation function,
learn over a certain number of epochs which set of features gives the best accuracy. The entire architecture
of our proposed model is shown in Fig. 4.

Figure 4: Node Architecture
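As a rough sketch of how this architecture might be assembled in Keras: wrapping the per-frame CNN from Section 2.1.1 in a TimeDistributed layer is our reading of how the spatial and temporal parts are connected, and the 64 LSTM units, the 64-unit hidden dense layer, and the softmax output are illustrative choices rather than values stated in this paper.

```python
from tensorflow.keras import layers, models

def build_cnn_bilstm(seq_len=10, height=100, width=100, channels=3, num_classes=2):
    """CNN feature extractor applied to every frame, followed by a BiLSTM and dense layers."""
    frame_encoder = build_frame_encoder(height, width, channels)  # defined in the sketch above
    model = models.Sequential([
        layers.Input(shape=(seq_len, height, width, channels)),
        layers.TimeDistributed(frame_encoder),           # spatial features, one frame at a time
        layers.Bidirectional(layers.LSTM(64)),           # temporal features in both directions
        layers.Dense(64, activation="relu"),             # fully connected dense layers
        layers.Dense(num_classes, activation="softmax"), # violent / non-violent classifier
    ], name="cnn_bilstm")
    return model
```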


2.2 Experimental Setup
2.2.1 Data Processing
Frames have been extracted from the videos. The extracted frames are reshaped to 100×100 pixels
(denoted as x × y). The training data is a NumPy array, with each of its rows representing a sequence or
pattern in a video. A sequence might capture a degree of movement and actions, for example whether a
movement of the arm is a punch or a handshake. The minimum number of frames required to extract a
sequence is 2; however, we have used 10 consecutive frames (denoted as n) to extract the temporal (that is,
time-related) features. The total number of samples (denoted by N) is the number of such sequences in the
dataset ((total number of frames) / (number of frames per sequence)). For a simple implementation, NumPy
allows an arbitrary value of -1 to be used for this dimension. Hence, a structure containing sequences of 10
consecutive frames with their respective class labels is prepared. The shape of the training data is
(-1, N, x, y, c), where c represents the number of channels in each frame. A pictorial representation of the
training data is shown in Fig. 5.

Figure 5: Visualization of the training data
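A minimal NumPy/OpenCV sketch of this preparation step follows. The directory layout, the function names, and the label encoding of 0 for non-violent and 1 for violent are assumptions for illustration; only the frame size, the 10-frame sequence length, and the (sequences, frames, x, y, c) layout come from the description above.

```python
import os
import cv2
import numpy as np

SEQ_LEN, H, W = 10, 100, 100  # n consecutive frames of x-by-y pixels

def video_to_sequences(path, label):
    """Extract non-overlapping 10-frame sequences from one video, paired with its class label."""
    cap = cv2.VideoCapture(path)
    frames, sequences = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (W, H)))
        if len(frames) == SEQ_LEN:
            sequences.append((np.array(frames), label))
            frames = []
    cap.release()
    return sequences

def build_dataset(violent_dir, nonviolent_dir):
    """Build the stacked training array of sequences and the label vector."""
    samples = []
    for directory, label in [(nonviolent_dir, 0), (violent_dir, 1)]:  # assumed label encoding
        for name in os.listdir(directory):
            samples.extend(video_to_sequences(os.path.join(directory, name), label))
    X = np.array([seq for seq, _ in samples], dtype="float32") / 255.0
    y = np.array([lab for _, lab in samples], dtype="int64")
    return X.reshape(-1, SEQ_LEN, H, W, 3), y
```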


2.2.2 Data Frame Collection
These are discussed in Section 2.5.
2.2.3 Data Frame Separation
The video datasets are divided in a 90/10 ratio by random selection. 10% of the images and videos are
used for testing in the evaluation step, while 90% of the images are fed into the model for training
purposes; this can be done using a Python script. On the other hand, the weapon image dataset is
divided in an 80/20 ratio by random selection.
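A minimal sketch of such a split script, using scikit-learn's train_test_split (the 90/10 ratio follows the description above; X and y are assumed to come from the dataset-building sketch in Section 2.2.1):

```python
from sklearn.model_selection import train_test_split

# X and y as produced by build_dataset() in the earlier sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, shuffle=True)  # 90% training, 10% testing
```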
2.2.4 Model Training
For detecting human fights, groups of 10 consecutive frames of 100×100 dimensions were passed to the
model with the shape shown in Fig. 4 to extract the spatial and temporal features. Stochastic gradient
descent has been used as the optimizer with a learning rate of 0.01 and a decay rate of 1e-6. The loss
function used in this paper is "sparse categorical cross-entropy". In this multi-class classification problem,
we have used "0 or 1" as class labels, instead of one-hot encoding, with a batch size of 5 samples at a
time. For training and testing purposes, the datasets are divided in a 9:1 ratio. The entire model has
been built and trained from scratch for 25 epochs only, to maintain a lightweight computation cost.
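A hedged Keras sketch of this training configuration is shown below. The optimizer, learning rate, decay, loss function, batch size, and epoch count follow the values above; note that recent Keras versions no longer accept a decay argument on SGD and express learning-rate decay through schedules instead, which the fallback branch reflects.

```python
from tensorflow.keras.optimizers import SGD

model = build_cnn_bilstm()  # the architecture sketched in Section 2.1

try:
    optimizer = SGD(learning_rate=0.01, decay=1e-6)   # older Keras API with a learning-rate decay term
except (TypeError, ValueError):
    optimizer = SGD(learning_rate=0.01)               # newer API: decay handled via schedules

model.compile(
    optimizer=optimizer,
    loss="sparse_categorical_crossentropy",           # integer labels (0 or 1), no one-hot encoding
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    batch_size=5,
    epochs=25,
)
```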
2.2.5 Model Testing
Once the model has finished training, a test dataset is used to evaluate the model and output the average
precision and mean average precision (mAP). The script then outputs the result from the model at the
command prompt. The testing process can be run on the existing trained model.
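A minimal sketch of this evaluation step, reusing the held-out split from Section 2.2.3 (the saved-model file name is illustrative):

```python
# Evaluate the trained model on the held-out test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, batch_size=5, verbose=0)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_accuracy:.4f}")

# The trained model can be saved and re-loaded so testing can be re-run later
model.save("violence_cnn_bilstm.h5")  # illustrative file name
```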

Figure 6: System Layout


2.3 Requirement
2.3.1 Software
a. TensorFlow with GPU support - an open-source software library used for machine learning.
b. Python 3.9.x
c. Algorithms and techniques - CNN, RNN, LSTM, deep learning, computer vision; editor - VS Code.
d. Libraries used - Keras, NumPy, TensorFlow Object Detection.
2.3.2 Hardware
Processor - AMD Ryzen 5 2400G
RAM - 16.0 GB
GPU - Radeon Vega 11 Graphics
Operating System - Windows 10 Professional 64-bit
2.4 Resources
a. Movies Fight Detection Dataset - 200 video clips
b. Hockey Fight Detection Dataset - 1000 video clips
c. Crowd/Violent-Flows Fight Detection Dataset - 246 video clips
2.5 Dataset

The effectiveness of the CNN Bidirectional LSTM model architecture has been validated by running
on the standard datasets for violent and non-violent action detection, namely the Hockey Fights dataset
[8], the Movies dataset [8], the Violent Flows dataset [9], and the Weapons datasets for image
classification and object detection tasks.

The Hockey Fights Dataset: The Hockey Fights dataset contains clips from ice hockey matches. The
dataset has 500 violent clips and 500 non-violent clips with an average duration of 1 s. The clips have
similar backgrounds and subjects (Hockey Fight Detection Dataset, Academic Torrents).

The Movies Dataset: The Movies dataset contains fight clips taken from action sequences of different
movies, whereas the non-fight sequences consist of clips from action recognition datasets. The dataset has
100 violent clips and 100 non-violent clips with an average duration of 1 s. Unlike the Hockey Fights
dataset, the clips from movies have different backgrounds and subjects (Movies Fight Detection Dataset,
Academic Torrents).

The Violent Flows Dataset: The Violent Flows dataset deals with crowd violence. The dataset consists of
videos of human actions from the real world, CCTV footage of crowd violence, and YouTube videos,
properly maintaining the standard benchmark protocols. The dataset consists of 246 videos with properly
balanced samples (Crowd Violence / Non-violence Database, open.ac.il).

3 Results and Analysis

In this section, we will discuss the results of our proposed framework and analyze the strengths and
weaknesses of our model.

3.1 Accuracy Evaluation:

As we have used the CNN-BiLSTM model architecture, it can handle our specifically chosen datasets
very efficiently. Each of our datasets is divided into 9 parts for training the desired model and 1 part for
validation. From each training epoch, we can obtain the training accuracy, training loss, validation
accuracy, and validation loss.
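A small sketch of how these per-epoch curves can be produced from the Keras training history (assuming the history object returned by the fit() call shown in Section 2.2.4):

```python
import matplotlib.pyplot as plt

# history is the object returned by model.fit() in Section 2.2.4
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="training accuracy")
ax_acc.plot(history.history["val_accuracy"], label="validation accuracy")
ax_loss.plot(history.history["loss"], label="training loss")
ax_loss.plot(history.history["val_loss"], label="validation loss")
for ax in (ax_acc, ax_loss):
    ax.set_xlabel("epoch")
    ax.legend()
plt.show()
```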

3.1.1 Hockey Fight Dataset:

For training the model on the hockey dataset, we used 10 epochs. Every epoch had 55 steps with a batch
size of 16 frames. From Figs. 7 and 8, we can see that the maximum accuracy achieved was 94.9% for
training and 96.94% for validation.

Figure 7: Training and validation accuracy achieved from the hockey fight dataset

Figure 8: Training and validation loss achieved from the hockey fight dataset

3.1.2 Movie dataset:

After obtaining the first trained model from the hockey dataset, we fine-tuned it with the movie dataset
over 10 epochs. Each epoch went through 55 steps. From Figs. 9 and 10, we can see that the maximum
accuracy achieved was 92.92% for training and 96.94% for validation.

Figure 9: Training and validation accuracy achieved from the movie dataset

Figure 10: Training and validation loss achieved from the movie dataset

3.1.3 Violence Flow Dataset:

After obtaining the first trained model from the hockey dataset, we fine-tuned it with the violence flow
dataset over 10 epochs. Each epoch went through 55 steps. From Figs. 11 and 12, we can see that the
maximum accuracy achieved was 77.31% for training and 80% for validation.

Figure 11: Training and validation accuracy achieved from the violence flow dataset

Figure 12: Training and validation loss achieved from the violence flow dataset

3.2 Accuracy Comparison:

From the accuracy evaluation, we can see that our final model has achieved an accuracy of more than
77% for training and 80% for validation. A comparison of our model architecture with other existing
models and their architectures is given in Table 1.

Methods                        Hockey         Movie        Violence Flow
MoIWLD [10]                    96.8±1.04%     -            93.19±0.12%
ViF+OViF [11]                  87.5±1.7%      -            88±2.45%
Spatiotemporal Encoder [12]    98.1±0.58%     100±0%       93.87±2.58%
Conv 3D [13]                   98.3±0.81%     100±0%       97.17±0.95%
CNN-LSTM [6]                   97.1±0.55%     100±0%       94.57±2.34%
CNN-BiLSTM (our model)         94.9%          92.92%       77.31%

Table 1: Comparison between the accuracy of our model and the existing models

4. Conclusion:

Our proposed CNN-BiLSTM-based violence detection system can make society a more secure place for
peace-loving people. By using our proposed framework, we were able to achieve decent accuracy in the
final fine-tuning of our model. Despite the satisfactory performance of our proposed model, it needs to be
further validated on more standard datasets in which one-to-many or many-to-many violent activities must
be identified. In future work, we aim to increase the accuracy of our model while maintaining our model
architecture. Our model will have combined violence and weapon detection capabilities. We are also
planning to detect metal by using thermal vision cameras, which will allow us to differentiate between real
guns and fake guns [14]. We will also give our system the capability to determine whether a gun holder is a
member of a law enforcement team (police) or not. Soon, our system will also be capable of detecting
violent activities by using night vision [15] and thermal vision [16].

Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the
present study.

References

[1] “Bangladesh Crime Rate & Statistics 2000-2022.”
https://www.macrotrends.net/countries/BGD/bangladesh/crime-rate-statistics (accessed Jan. 13, 2022).
[2] M. Ramzan et al., “A Review on State-of-the-Art Violence Detection Techniques,” IEEE Access, vol. 7.
pp. 107560–107575, 2019. doi: 10.1109/access.2019.2932114.
[3] “Audio-visual content-based violent scene characterization.”
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=723496&isnumber=15617 (accessed Nov. 11, 2021).
[4] S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, and A. F. Dragoni, “Violence Detection in Videos
by Combining 3D Convolutional Neural Networks and Support Vector Machines,” Applied Artificial
Intelligence, vol. 34, no. 4. pp. 329–344, 2020. doi: 10.1080/08839514.2020.1723876.
[5] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, and S. W. Baik, “Violence Detection Using
Spatiotemporal Features with 3D Convolutional Neural Network,” Sensors , vol. 19, no. 11, May 2019, doi:
10.3390/s19112472.
[6] R. Halder and R. Chatterjee, “CNN-BiLSTM Model for Violence Detection in Smart Surveillance,” SN
Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[7] R. Halder and R. Chatterjee, “CNN-BiLSTM Model for Violence Detection in Smart Surveillance,” SN
Computer Science, vol. 1, no. 4. 2020. doi: 10.1007/s42979-020-00207-x.
[8] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, “Violence Detection in Video Using
Computer Vision Techniques,” Computer Analysis of Images and Patterns. pp. 332–339, 2011. doi:
10.1007/978-3-642-23678-5_39.
[9] T. Hassner, Y. Itcher, and O. Kliper-Gross, “Violent flows: Real-time detection of violent crowd behavior,”
2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2012.
doi: 10.1109/cvprw.2012.6239348.
[10] T. Zhang, W. Jia, X. He, and J. Yang, “Discriminative Dictionary Learning With Motion
Weber Local Descriptor for Violence Detection,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 27, no. 3. pp. 696–709, 2017. doi: 10.1109/tcsvt.2016.2589858.
[11] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, “Violence detection using Oriented
VIolent Flows,” Image and Vision Computing, vol. 48–49. pp. 37–41, 2016. doi:
10.1016/j.imavis.2016.01.006.
[12] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, “Bidirectional Convolutional LSTM
for the Detection of Violence in Videos,” Lecture Notes in Computer Science. pp. 280–295, 2019.
doi: 10.1007/978-3-030-11012-3_24.
[13] J. Li, X. Jiang, T. Sun, and K. Xu, “Efficient Violence Detection Using 3D Convolutional
Neural Networks,” 2019 16th IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS). 2019. doi: 10.1109/avss.2019.8909883.
[14] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness guided
preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning,”
Neurocomputing, vol. 330. pp. 151–161, 2019. doi: 10.1016/j.neucom.2018.10.076.
[15] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, “Brightness guided
preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning,”
Neurocomputing, vol. 330. pp. 151–161, 2019. doi: 10.1016/j.neucom.2018.10.076.
[16] R. Ippalapally, S. H. Mudumba, M. Adkay, and N. V. H. R., “Object Detection Using
Thermal Imaging,” 2020 IEEE 17th India Council International Conference (INDICON). 2020. doi:
10.1109/indicon49873.2020.9342179.
