
2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N)

Human Violence Detection Using LHOGF Algorithm and Deep Learning Model
2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N) | 978-1-6654-7436-8/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICAC3N56670.2022.10074425

Akash Chauhan
Computer Science and Engineering,
Chandigarh University, Gharuan
eakashchauhan087@gmail.com

Ruchika Gupta
Computer Science and Engineering,
Chandigarh University, Gharuan
ruchikae7396@cumail.in

Abstract—Every human action, however minor, is performed with some purpose. Automatically understanding human behavior and the way humans interact with their environment has attracted considerable research attention over the past few years because of its potential in diverse domains. Applications such as intelligent video surveillance, home environment monitoring, and video storage and retrieval all relate to human activity recognition. Human violence is detected in several stages: the data is pre-processed, features are extracted, and the data is classified. This research introduces a deep learning algorithm for predicting human violence. The algorithm is simulated in Python, and metrics such as precision and recall are used to analyze the outcomes. The proposed model shows significantly higher performance than the existing model.

Keywords—Human Violence, Deep Learning, LHOGF, Video Processing

I. INTRODUCTION
Violence is a kind of behavior in which humans act aggressively. Researchers focus on analyzing the diverse visual patterns of violent motion and on developing descriptors that represent such attributes. The performance of these descriptors is then computed on three classic benchmark data sets. At present, computer vision is used extensively, helped by growing computing power and the availability of very large datasets. Deep learning (DL) is widely adopted in computer vision and applied in diverse domains such as image classification and object detection. It also helps in tackling the problem of violence detection [1]. Unlike techniques based on manually designed attributes, DL techniques are more robust and accurate.

Nevertheless, difficulties arise when computing efficiency and recognition accuracy are both considered. Moreover, for practical applications, the main emphasis is on building an effective and stable DL framework. A violence detection system is run to recognize events in real time so that hazardous situations can be avoided. The challenging task is understanding the underlying principles. Figure 1 illustrates the main stages of video-based violence detection. The process has two stages: training and testing.

Fig. 1. Main stages of video-based violence detection

In the training stage, the first step is data acquisition from images or videos. This stage pre-processes every individual video frame to make it suitable for further processing. Several techniques are applied, and their impact on the classification procedure is measured. A Gaussian kernel is employed to reduce noise, histogram equalization is applied to spread the pixel intensities over a wide contrast range, and a MoG (Mixture of Gaussians) model is adopted to subtract the background so that objects unrelated to the actors in the scene are excluded. Furthermore, a constant factor is used to enlarge and reduce the frame dimensions, so that the video frames are processed at multiple scales [2]. Next, a suitable descriptor is selected to encode structural information and suppress detailed textural information. The descriptor vectors model the distribution of local structures and geometrical information. Moreover, different descriptors are useful for quantifying the structural attributes of the image. Thereafter, a higher-dimensional descriptor is created by combining all descriptors. The descriptor is used to extract feature vectors from the video frames in two different ways. First, the descriptor is computed over


the entire content of every frame. Afterward, every frame is divided into blocks through a grid of specific dimensions, and the attributes are processed to extract the features hidden in noisy input data. After this process, the data is passed to various machine learning algorithms. In the testing phase, data is acquired from surveillance cameras and smartphone images. Then, pre-processing is applied to split the data into frames. Finally, a decision is made on whether the action is violent or non-violent. Violence detection depends on two methods, which are as follows:

A. Action recognition
This process is effective for recognizing human activities. There are four kinds of human activity, distinguished by the complexity of the acts and the number of body parts involved in the action. A gesture is a series of motions performed by the hands, head, or other body parts to convey a certain message. An action of a single person is composed of several gestures. An interaction is a set of human actions involving two parties [3]; the second party may be another human or an object. A mixture of gestures, actions, or interactions in which more than two people are engaged is recognized as a group activity.

B. Violence detection
Violence detection has become a major field of interest within action recognition. It aims to determine whether violence occurs in the least time, automatically and effectively. At present, applications such as video surveillance and human–computer interaction have become popular uses of automatic human action recognition in video. One difficulty in detecting violence is that the notion of violence is subjective. Attributes are therefore employed to differentiate violence from generic acts. Detecting violence remains a major problem at both the application and the research level.

In daily life, violence is regarded as a suspicious occurrence. The main problem in action recognition is recognizing such actions in surveillance camera footage using computer vision. Because crime rates are rising, the main goal of researchers is to introduce techniques and mechanisms that detect violence or unusual occurrences. A number of techniques have been created to detect violence. By classification algorithm, they fall into three categories: techniques that employ machine learning (ML), techniques that use a Support Vector Machine (SVM), and techniques that employ DL. Because SVM and DL are deployed extensively in computer vision, these categories are discussed as follows:

1) Violence detection using ML techniques: Action recognition is a crucial research domain in computer vision. However, most research targets relatively basic activities. Recognizing specific events with immediate practical application, such as fighting or general violent conduct, is becoming popular. Video surveillance is effective in various circumstances. Fighting events are differentiated from normal ones using attributes generated from motion blobs [4]. Aggressive behaviors are not easily detected in video surveillance settings. Earlier methods aimed at extracting descriptors of spatio-temporal or statistical attributes in motion areas. However, their potential for recognizing violent activities in video is limited. Thus, a novel technique based on SVM is presented to overcome this issue. At first, the optical flow field is used to partition the motion regions into segments. Thereafter, two kinds of low-level attributes are extracted to describe the appearance and dynamics of violent behaviors in the motion regions: LHOG, extracted from color images, and LHOF, extracted from optical flow images. At last, the BoW algorithm is employed to encode the gathered attributes, generating a fixed-length vector for every video clip so that duplicate information is eliminated. Moreover, the video-level vectors are classified through a Support Vector Machine. This algorithm performed well in comparison with other algorithms on three complex datasets.

2) Violence detection techniques using SVM: This section describes the deployment of the Support Vector Machine for detecting violence. This algorithm is capable of dealing with image classification problems. It represents the data in an attribute space and separates two groups within it. It is a significant method in computer vision owing to its robustness and its ability to handle sensitive information. The Support Vector Machine is built on a kernel, which maps the data into a high-dimensional space to tackle the problem. Its major limitation is that the findings are not transparent. An innovative technique for recognizing school violence applies the K-Nearest Neighbor algorithm to determine the foreground moving objects. After that, morphological processing is adopted to pre-process the detected objects. The circumscribed rectangular frame of the moving objects is optimized using a circumscribed rectangular frame integrating method. Diverse attributes are then extracted to differentiate school violence from daily actions [5].

II. LITERATURE SURVEY
M.-S. Kang et al. (2021) proposed a new violence detection framework that can be combined with classic 2D ConvNets [7]. This work contributed to the learning of spatio-temporal representations in video footage with a new idea of frame grouping for 2D ConvNets. The Motion Saliency Map (MSM), a spatial attention unit, was able to capture the main regions of the feature maps. The T-SE (temporal squeeze-and-excitation) unit, a temporal attention unit, was capable of naturally highlighting the time frames associated with a specified incident. The designed structure proved more efficient than the benchmark models and made the computing task less complex. Notably, MobileNetV3 and EfficientNet-B0 obtained high-quality results on six diverse datasets with the proposed frameworks.

X. Hu et al. (2022) introduced a new ALCM (angle-level co-occurrence matrix) architecture. This architecture


recorded the co-occurrence of two particular quantized angle levels among fibers and their nearby fibers [8]. This work computed three ALCMs, one for each of the three perpendicular planes, and formed a TOP-ALCM so as to fully represent the violence in the measurements. This work additionally presented both traditional and DNN-oriented architectures to identify violence in videos: the former took advantage of several attributes such as energy, entropy, and symmetry measured from the computed TOP-ALCM for classification, while the latter directly classified the TOP-ALCM with a ConvNet. The outcomes of the conducted trials validated the superiority of the new framework over standard methodologies of violence detection in video sequences.

A. Mehmood et al. (2021) detected suspicious activities in crowded scenes through a new, computationally inexpensive methodology [9]. The introduced mechanism adopted a pre-trained 2D ConvNet to capture motion information and to compute optical flow economically. This architecture implemented a lightweight 2D CNN to make detection highly accurate while minimizing the cost of computation. The spatial stream was fine-tuned on RGB frames and the temporal stream on SG3I (stacked grayscale 3-channel images). The designed system achieved 99.12%, 99.71% and 98.81% accuracy on the UMN, Hockey Fights and Violent Flow datasets, respectively, and successfully detected various anomalies in experimentation.

W. Song et al. (2019) introduced an improved 3D CNN-based system for detecting violence in video footage [10]. This work not only improved the data pre-processing scheme but also devised a new sampling methodology that uses the key frame as the segmentation node. Next, the input frame sequence was formed with a random sampling approach. The introduced sampling methodology proved its efficiency in trials on the crowd violence dataset. This work used a uniform sampling technique to build a 3D CNN for short videos and employed the new sampling approach for longer videos. The presented system achieved outstanding results of 99.62%, 99.97% and 94.3% on hockey fights, movies and crowd violence, respectively. The presented methodology proved simple and effective in the experimentation.

F. U. M. Ullah et al. (2022) designed an IIoT-supported architecture with a VD-Network empowered by artificial intelligence [11]. First, vital information related to humans or suspicious objects such as knives or guns was collected by passing the original video sequence through a lightweight ConvNet. After a suspicious object was detected, a warning was produced in the IIoT system as a preliminary VD, and the information was passed only to the relevant departments. Only the frames containing objects were propagated to the cloud for querying, where a convolutional long short-term memory (ConvLSTM) network extracted features. The experiments confirmed the efficiency of the system, which improved accuracy by 3.9% over other methods.

L. Zhang et al. (2020) developed WiVi, a ubiquitous passive framework for detecting violence that relies on commercial WiFi devices [12]. In addition to the time-series attributes applied in existing behavior detection tasks, WiVi used correlation attributes derived from combined subcarriers, which leverage the state information of the channel, to capture the patterns of complex violent activities. The new framework selected suitable features for the classifier in various conditions by applying a feature fusion strategy. Many practical scenarios were considered for the implementation and evaluation of the new architecture. It obtained 93.46% recall and 93.57% specificity in the trials.

C. Gu et al. (2020) formulated a new VVD (violent video detection) framework based on semantic correspondence between multimodal data from the same video [13]. This work extracted attributes of three different modalities, namely appearance, motion, and audio, using deep learning. Next, it applied common subspace learning to combine the attributes of the multiple modalities through feature-level fusion. This process was guided by two learning strategies, multitask learning and semantic embedding, which implement semantic correspondence. The efficiency of the architecture was evaluated through experiments on several open-source datasets and on a separate dataset known as VCD (Violence Correspondence Detection). The architecture obtained fairly reasonable outcomes.

P. Wang et al. (2020) presented an architecture integrating a CNN and a trajectory-based methodology to detect violence [14]. This methodology derived the spatio-temporal attributes of the video frames with a CNN by using artificial and deep attributes. This work introduced two CNN variants, namely the multi-foot input architecture and the SPP-enabled architecture, to address the accurate recognition of faces in surveillance video frames. In the evaluation of violence recognition, the technique reached an accuracy of 92% on the Crowd dataset and 97.6% on the Hockey dataset. The presented solution detected violence in video clips more accurately in the experiments.

D. J. Samuel et al. (2019) devised a real-time violent behavior recognition framework. This framework processed huge amounts of streaming input data and detected violent activities using AI (artificial intelligence) models [15]. The Spark framework used HOG (Histogram of Oriented Gradients) to derive the attributes of single frames following the separation of frames. Next, the frames were labelled by feature as a violence framework, an individual fragment framework and a negative framework. This process supported the training of a BDLSTM (bidirectional long short-term memory) network for recognizing violent activities. In validation, the new framework established its strength in detecting violent activities by obtaining 94.5% accuracy.

III. RESEARCH METHODOLOGY
The core of this project is to apply deep learning models to human violence estimation. The deep learning model is trained using Local Histogram of Oriented Gradient features. Deep learning learns multiple layers of models that


are related to multiple levels of concepts; they produce a hierarchy of concepts in which the higher the level, the more abstract the ideas that are learned. The semantic relationship between a query and a document in a search engine is an example of a deeply structured, non-linear pattern. A search engine retrieves and ranks documents according to a user's search requirement, called a search query. The major breakthrough of deep learning lies in its potential to accurately learn distributed representations, that is, vector representations or structured arrangements of them. Deep learning methodologies applied to information retrieval and web search also incorporate natural language processing techniques such as word embedding, for tasks such as matching, translation, classification, structured prediction, searching, query answering and image retrieval. The Deep Structured Semantic Model (DSSM) is a popular deep learning model for increasing search relevance. This model uses search relevance labels directly in training. In particular, DSSM is applied in two phases: document retrieval and ranking. Document retrieval aims at obtaining as many potentially related documents as possible on the basis of content similarity. Ranking arranges the retrieved documents in order of relevance on the basis of different features, such as content similarity, originality and acceptance. The document retrieval phase is expected to retrieve previously missed relevant documents based on DSSM-specific similarity levels. It also removes irrelevant documents that have good lexical matching. Furthermore, the ranking phase is expected to improve ranking relevance by adding DSSM features into the ranker. DSSM-based retrieval and ranking search engines outperform a baseline model with standard semantic similarity features (e.g., latent semantic analysis and latent Dirichlet allocation) and standard ranking features. Figure 2 shows the flowchart of the proposed work.

Fig. 2. Proposed Framework

Fig. 3. Proposed Model

IV. RESULT AND DISCUSSION
This project concentrates on human violence detection. Human violence detection has several phases: pre-processing, feature extraction and classification. The classification phase predicts human violence, and a deep learning model is applied for the prediction. The performance of the model is analyzed in terms of accuracy, precision and recall.

As shown in Figure 3, a deep learning model is applied for human violence prediction. The Local Histogram of Oriented Gradient features are derived and fed to the deep learning model for training.

TABLE I. PERFORMANCE ANALYSIS
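The feature extraction and evaluation steps described in this section can be illustrated with a minimal Python sketch. The paper does not give the exact formulation of its Local Histogram of Oriented Gradient features, so the version below is an assumption: each grayscale frame is divided by a grid into blocks, and one magnitude-weighted histogram of unsigned gradient orientations is computed per block. The function names `local_hog` and `precision_recall_accuracy`, the 4 x 4 grid and the 9 orientation bins are hypothetical choices made for illustration, not the authors' settings.

```python
import numpy as np


def local_hog(frame, grid=(4, 4), bins=9):
    """Block-wise histogram of oriented gradients for one grayscale frame.

    Assumed formulation (not taken from the paper): the frame is split by
    a grid into blocks and each block contributes one L2-normalized,
    magnitude-weighted histogram of unsigned gradient orientations.
    """
    frame = np.asarray(frame, dtype=np.float64)
    gy, gx = np.gradient(frame)            # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation folded into [0, 180] degrees
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0

    rows, cols = grid
    h_edges = np.linspace(0, frame.shape[0], rows + 1, dtype=int)
    w_edges = np.linspace(0, frame.shape[1], cols + 1, dtype=int)

    features = []
    for i in range(rows):
        for j in range(cols):
            block = (slice(h_edges[i], h_edges[i + 1]),
                     slice(w_edges[j], w_edges[j + 1]))
            hist, _ = np.histogram(angle[block], bins=bins,
                                   range=(0.0, 180.0),
                                   weights=magnitude[block])
            norm = np.linalg.norm(hist)
            features.append(hist / norm if norm > 0 else hist)
    return np.concatenate(features)


def precision_recall_accuracy(y_true, y_pred):
    """Precision, recall and accuracy for binary labels (1 = violent)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / y_true.size
    return float(precision), float(recall), float(accuracy)
```

With these assumed settings, a frame yields a vector of 4 x 4 x 9 = 144 values. Vectors of this kind, stacked over the frames of a clip, are what would be fed to the classifier, and the predicted labels would then be scored with the second function. Any deep learning model could consume such vectors; the sketch deliberately leaves the model itself out because the paper does not specify its architecture.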


Fig. 4. Performance Analysis

Figure 4 shows the three evaluation indices (precision, accuracy and recall) used to evaluate the performance of the proposed framework. The performance is compared with that of a logistic regression (LR) model. The obtained outcomes show an improvement in performance of up to 5% with the new methodology.

V. CONCLUSION
In this paper, it is concluded that human violence detection is a complex task owing to the nature of the input. The human violence detection pipeline includes many processes. A feature algorithm called Local Histogram of Oriented Gradient is applied to extract relevant features from the video clips. The deep learning model is applied to the extracted features to produce predictions. The proposed architecture, implemented in Python, is validated with three evaluation measures. Its reliability is tested by comparing it with the LR algorithm. The obtained outcomes show an improvement in accuracy of up to 5% over LR for human violence detection. In future, the proposed model can be improved using transform learning algorithms.

REFERENCES
[1] M. Ramzan et al., "A Review on State-of-the-Art Violence Detection Techniques," IEEE Access, vol. 7, pp. 107560-107575, 2019.
[2] Y. Baveye, C. Chamaret, E. Dellandréa and L. Chen, "Affective Video Content Analysis: A Multidisciplinary Insight," IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 396-409, Oct.-Dec. 2018.
[3] H. Fradi, B. Luvison and Q. C. Pham, "Crowd Behavior Analysis Using Local Mid-Level Visual Descriptors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 589-602, March 2017.
[4] M. Bianculli, N. Falcionelli and A. F. Dragoni, "A dataset for automatic violence detection in videos," Data in Brief, vol. 1, no. 46, pp. 256-261, Nov. 2020.
[5] T. Khalil, J. I. Bangash and D. A. Ramli, "Detection of Violence in Cartoon Videos Using Visual Features," Procedia Computer Science, vol. 18, no. 5, pp. 2149-2163, Oct. 2021.
[6] D. K., V. L.K.P. and C. S., "Autocorrelation of gradients based violence detection in surveillance videos," ICT Express, vol. 12, no. 7, pp. 127-134, July 2020.
[7] M.-S. Kang, R.-H. Park and H.-M. Park, "Efficient Spatio-Temporal Modeling Methods for Real-Time Violence Recognition," IEEE Access, vol. 9, pp. 76270-76285, 2021.
[8] X. Hu, Z. Fan and D. Zhang, "TOP-ALCM: A Novel Video Analysis Method for Violence Detection in Crowded Scenes," Information Sciences, vol. 1, no. 9, pp. 172-177, May 2022.
[9] A. Mehmood, "Efficient Anomaly Detection in Crowd Videos Using Pre-Trained 2D Convolutional Neural Networks," IEEE Access, vol. 9, pp. 138283-138295, 2021.
[10] W. Song, D. Zhang, X. Zhao, J. Yu, R. Zheng and A. Wang, "A Novel Violent Video Detection Scheme Based on Modified 3D Convolutional Neural Networks," IEEE Access, vol. 7, pp. 39172-39179, 2019.
[11] F. U. M. Ullah et al., "AI-Assisted Edge Vision for Violence Detection in IoT-Based Industrial Surveillance Networks," IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5359-5370, Aug. 2022.
[12] L. Zhang, X. Ruan and J. Wang, "WiVi: A Ubiquitous Violence Detection System With Commercial WiFi Devices," IEEE Access, vol. 8, pp. 6662-6672, 2020.
[13] C. Gu, X. Wu and S. Wang, "Violent Video Detection Based on Semantic Correspondence," IEEE Access, vol. 8, pp. 85958-85967, 2020.
[14] P. Wang, P. Wang and E. Fan, "Violence detection and face recognition based on deep learning," Pattern Recognition Letters, vol. 12, no. 7, pp. 1572-1579, Dec. 2020.
[15] D. J. Samuel and F. E. A., "Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM," Computer Networks, vol. 7, no. 60, pp. 62412-62420, Jan. 2019.
