
Human Violence Recognition in Video Surveillance in Real-Time

Herwin Alayn Huillcen Baca1, Flor de Luz Palomino Valdivia2, Ivan Soria Solis3, Mario Aquino Cruz4, and Juan Carlos Gutierrez Caceres5

1 Jose Maria Arguedas National University, Apurimac, Peru, hhuillcen@unajma.edu.pe
2 Jose Maria Arguedas National University, Apurimac, Peru, fpalomino@unajma.edu.pe
3 Jose Maria Arguedas National University, Apurimac, Peru, isoria@unajma.edu.pe
4 Micaela Bastidas University, Apurimac, Peru, maquino@unamba.edu.pe
5 San Agustin National University, Arequipa, Peru, jgutierrezca@unsa.edu.pe

Abstract. The automatic detection of human violence in video surveillance is an area of great attention due to its applications in security, monitoring, and prevention systems. Detecting violence in real time could prevent criminal acts and even save lives. There are many investigations and proposals for the detection of violence in video surveillance; however, most of them focus on effectiveness rather than efficiency. They aim to surpass the accuracy of other proposals rather than to be applicable in a real scenario and in real-time. In this work, we propose an efficient deep learning model for recognizing human violence in real-time, composed of two modules: a spatial attention module (SA) and a temporal attention module (TA). SA extracts spatial features and regions of interest through the frame difference of two consecutive frames and morphological dilation. TA extracts temporal features by averaging the three RGB channels of each frame into a single channel, so that three frames can be fed to a 2D CNN backbone. The proposal was evaluated in terms of efficiency, accuracy, and real-time latency. The results show that our work has the best efficiency compared to other proposals, its accuracy is very close to that of the best proposal, and its latency allows real-time operation. Therefore, our model can be applied in real scenarios and in real-time.

Keywords: Human violence recognition · Video surveillance · Real-time · Frame difference · Channel average · Real scenario

1 Introduction
Human action recognition is an area of great interest to the scientific community
due to its various applications, such as robotics, medicine, psychology, human-
computer interaction, and primarily video surveillance. An automatic violence
detection system could alert about an occurrence or a crime and allow actions to
be taken to mitigate said occurrence. Therefore, it is essential to detect violent
activity in real-time. Although the recognition of violence in videos has achieved many improvements, most works aim to improve performance on known datasets, but few aim at a real-time scenario.
There are many techniques to detect violence in videos. Typical methods rely on optical flow [1–5]. When optical flow is combined with other inputs, such as RGB frames, two-stream CNN variants [6–10] and 3D CNN variants [11–14] achieve good results. Thus, optical flow is a useful motion representation for video action recognition tasks. However, extracting optical flow is time-consuming and inefficient for real-time recognition tasks.
The most promising techniques are based on deep learning [12, 15–19], which, unlike optical flow, uses neural networks for feature extraction, encoding, and classification. These techniques achieve better performance while reducing the computational cost of optical flow, but they are still heavy in terms of parameters and FLOPs, so applying them in a real scenario remains a challenge.
We focus on recognizing human violence in video surveillance in a way that can be applied in a real scenario. Classification models must identify human violence at the precise moment of occurrence, that is, in real-time. Thus, three objectives are proposed in our approach:

1. The model must be efficient in terms of parameters and FLOPs.
2. Good and cutting-edge accuracy results.
3. Minimum latency times that guarantee recognition in real-time.

The motivation is that the main difficulty when processing videos is dealing with their spatio-temporal nature. This alone makes video processing computationally expensive, even for short clips, since they can contain a large number of frames. In addition, the dynamics between the spatial content of consecutive frames create a temporal dimension. How to describe spatial and temporal information to understand the content of a video remains a challenge, and our proposal takes on this challenge to contribute a model to the current state of the art.
Another motivation is that, although there are different proposals for the recognition of human violence in video surveillance, most have focused on effectiveness rather than efficiency. Thus, there are very accurate models, but with computational costs so high that they could not be used in real scenarios and in real-time.
Our proposal contributes to the video surveillance domain. The installation of surveillance cameras in streets has become widespread worldwide with the aim of combating crime; however, dedicated personnel are still needed to watch the videos and identify violence. With our proposal, this activity is carried out by a computer system that alerts personnel about a violent human action in real time, so that they can take the corresponding action, mitigate the violent act, and even save lives. In addition, in line with the stated objectives, the proposal contributes to the state of the art with a model that is efficient in number of parameters and FLOPs and has minimal latency.
The pipeline of our proposal consists of two modules. The spatial attention module (SA) extracts a spatial feature map for each frame using background extraction, RGB difference, and morphological dilation. The temporal attention module (TA) extracts features from sequences of three consecutive frames, since violent acts, such as punching, pushing, or kicking, last only short periods of time. We average the RGB channels of each of three consecutive frames and use the result as input to a pretrained 2D CNN.
The rest of the document deals with the related works that inspired the
proposal, then shows the details of the proposal, and finally, the experiments
and results.

2 Related Work

A key factor in achieving efficiency and accuracy in recognizing human violence is the extraction of the regions or elements of the frames that are involved in a violent act; these regions considerably reduce the computational cost and provide better features for recognition.
This process is called spatial attention, and there are good proposals for it; regularization and object detection methods [20, 21] based on Persistence of Appearance (PA) recognize the motion boundaries of objects using the Euclidean distance between two consecutive frames. We adopt this method as a spatial attention mechanism; however, since we consider these boundaries too sharp, we add a step that widens them through morphological dilation, as illustrated in the sketch below.
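To illustrate this idea, the following is a minimal Python sketch of frame-difference motion boundaries widened by morphological dilation, using OpenCV; the threshold value, the 5x5 kernel, and the function name are illustrative assumptions, not components of the cited works or of our proposal.

import cv2
import numpy as np

def motion_boundaries(frame_a: np.ndarray, frame_b: np.ndarray,
                      kernel_size: int = 5, thresh: int = 20) -> np.ndarray:
    """Frame-difference motion boundaries thickened by morphological dilation.
    frame_a, frame_b: consecutive RGB frames as uint8 arrays of shape (H, W, 3).
    The threshold and kernel size are illustrative choices."""
    # PA-style difference: per-pixel Euclidean distance between the two frames.
    diff = np.sqrt(((frame_b.astype(np.float32) -
                     frame_a.astype(np.float32)) ** 2).sum(axis=2))
    # Keep only pixels with noticeable motion.
    _, mask = cv2.threshold(diff.astype(np.uint8), thresh, 255, cv2.THRESH_BINARY)
    # Dilate so that thin motion boundaries become wider regions of interest.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)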
2D CNNs are commonly used for spatial processing of images, that is, for processing along the width and height of the image; they have an adequate computational cost, but they cannot encode temporal information and therefore cannot extract spatio-temporal features from a video. A solution to this problem was a proposal based on a 3D CNN [12], which takes several frames of an end-to-end video as input to extract features. This proposal achieves good accuracy but unfortunately has a high computational cost and cannot be used in a real-time scenario. The inefficiency of 3D CNNs was addressed by various proposals, such as the combination of 3D CNN and DenseNet [26, 30], and others, such as R(2+1)D [23] and P3D [24], which replace full 3D convolutions with factorized 2D and 1D convolutions, achieving almost the same performance with far fewer parameters and FLOPs than a 3D CNN.
Referring specifically to the recognition of human violence, several current works use 2D CNN, 3D CNN, and LSTM techniques [22, 25–30]; a good approach was presented by Sudhakaran and Lanz [18], which uses the difference of frames as input to a CNN combined with an LSTM. Another group of works still uses optical flow in two-stream models [12]. These approaches have greatly improved the performance of human violence recognition in videos but have not yet yielded results in a real-time scenario.
The temporal attention module of our proposal is based on these approaches. It uses a 2D CNN with a straightforward strategy: take groups of three consecutive frames and average the three RGB channels of each frame into a single channel, so that the 2D CNN processes the three averaged frames as if they were three color channels. In other words, we convert the color information into temporal information, since color is not essential for recognizing violence in a video. This strategy keeps the model lightweight, which is our goal.
Finally, as a summary of the state of the art, Table 1 presents a comparison of the results of the efficiency-oriented proposals and Table 2 of the accuracy-oriented proposals.

Table 1. Comparison of efficiency-oriented proposals.

Model                              #Params (M)   FLOPs (G)
C3D [12]                           78            40.04
I3D [14]                           12.3          55.7
3D CNN end to end [26]             7.4           10.43
ConvLSTM [18]                      9.6           14.4
3D CNN + DenseNet(2,4,12,8) [30]   4.34          5.73

Table 2. Comparison of accuracy-oriented proposals.

Model                              Hockey (%)   Movie (%)   RWF (%)
VGG-16+LSTM [27]                   95.1         99           -
Xception+Bi-LSTM [28]              98           100          -
Flow Gated Network [31]            98           100          87.3
SPIL [33]                          96.8         98.5         89.3
3D CNN end to end [26]             98.3         100          -
3D CNN + DenseNet(2,4,6,8) [30]    97.1         100          -

3 Proposal.
The proposal's main objective is to achieve efficiency in FLOPs, adequate accuracy results, and a latency that guarantees its use in real-time. For that, an architecture composed of two modules is proposed; Figure 1 shows the pipeline of the proposed architecture.
The spatial attention module (SA) receives T + 1 frames from the original end-to-end video. It calculates the motion boundaries of moving objects, eliminating regions that are not part of the motion (background) through the frame difference of two consecutive frames.
The temporal attention module (TA) receives T frames from the spatial attention module (SA) and creates spatio-temporal feature maps for the recognition of violence, using the average of each frame's channels and a pretrained 2D CNN.

Fig. 1. Pipeline of the Proposed Architecture.

3.1 Spatial Attention module (SA).


When recognizing human violence in a video, the movement characteristics of people are more important than the color or background of a given frame; therefore, our spatial attention module extracts the boundaries of moving objects from each frame through three steps. See Figure 2.

1. The module takes T = 30 frames and computes a frame difference D_t of two consecutive RGB frames X_t, X_{t+1}, calculating the Euclidean distance between them and summing over the channels.
2. The next step generates a spatial attention map M_t by applying two average pooling layers (kernel 15, stride 1) and four convolutional layers with ReLU activation (kernel 7, stride 1).
3. Finally, to extract the regions corresponding to the moving objects, the module computes the Hadamard product between M_t and the second frame X_{t+1}. A PyTorch sketch of these three steps is given after this list.
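Below is a minimal PyTorch sketch of the SA module under stated assumptions: the paddings, the sigmoid normalization of the attention map, and the single-channel width of the convolutions are not specified in the text and are chosen here only so the example runs end-to-end.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the SA module in Sec. 3.1: frame difference, smoothing,
    attention map, and Hadamard product with the second frame."""

    def __init__(self):
        super().__init__()
        # Two 15x15 average poolings (stride 1), as described in step 2.
        self.pool = nn.Sequential(
            nn.AvgPool2d(kernel_size=15, stride=1, padding=7),
            nn.AvgPool2d(kernel_size=15, stride=1, padding=7),
        )
        # Four 7x7 convolutions with ReLU; single-channel width is an assumption.
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(1, 1, kernel_size=7, stride=1, padding=3),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)

    def forward(self, x_t, x_t1):
        # Step 1: frame difference D_t, squared differences summed over channels.
        d_t = ((x_t1 - x_t) ** 2).sum(dim=1, keepdim=True)   # (B, 1, H, W)
        # Step 2: attention map M_t; the sigmoid is an assumed normalization.
        m_t = torch.sigmoid(self.convs(self.pool(d_t)))       # (B, 1, H, W)
        # Step 3: Hadamard product keeps only the moving regions of X_{t+1}.
        return x_t1 * m_t                                      # (B, 3, H, W)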

3.2 Temporal Attention module (TA).


Considering that human violence is mainly represented by movements of short duration, such as punches, knife blows, or kicks, we propose a temporal attention module that processes three consecutive frames at a time with a 2D CNN.
Fig. 2. Mechanism of Spatial Attention module (SA).

On the other hand, we know that a 2D CNN processes frames individually and would not be able to extract temporal features from a video. In contrast, a 3D CNN can process spatio-temporal features of a video end-to-end; however, it is too heavy in terms of parameters and FLOPs, making it unfeasible for our purposes. We therefore take advantage of the fact that a 2D CNN processes the three RGB channels of a single frame: we average the three RGB channels of each frame X_t into a single channel A_t, so that three consecutive averaged frames (A_t, A_{t+1}, A_{t+2}) can be fed to the 2D CNN as if they were the three color channels of one image. See Figure 3. In other words, we use temporal information instead of color information, since color has no significance in the recognition of human violence.
The 2D CNN used in the proposal is a pretrained EfficientNet-B0; it was chosen because this network prioritizes minimizing the number of FLOPs, which is consistent with our objective.
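The following is a minimal PyTorch sketch of this TA idea, assuming the torchvision EfficientNet-B0 backbone and a simple averaging of per-group predictions into a clip-level output; neither detail is stated in the text.

import torch
import torch.nn as nn
import torchvision.models as models

class TemporalAttention(nn.Module):
    """Sketch of the TA module in Sec. 3.2: the RGB channels of each frame are
    averaged into one channel, and every three consecutive averaged frames are
    stacked as a pseudo-RGB image for a pretrained 2D CNN."""

    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.efficientnet_b0(
            weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        backbone.classifier[1] = nn.Linear(
            backbone.classifier[1].in_features, num_classes)
        self.backbone = backbone

    def forward(self, clip):
        # clip: (B, T, 3, H, W), with T a multiple of 3 (T = 30 in the paper).
        b, t, _, h, w = clip.shape
        a = clip.mean(dim=2)                   # channel average A_t: (B, T, H, W)
        a = a.reshape(b * (t // 3), 3, h, w)   # 3 averaged frames as 3 "channels"
        logits = self.backbone(a)              # (B * T/3, num_classes)
        # Fuse the per-group outputs by averaging (an assumption; the paper
        # does not describe how the T/3 group predictions are combined).
        return logits.reshape(b, t // 3, -1).mean(dim=1)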

4 Experiment and Results.


The evaluation of our proposal is made in terms of efficiency, accuracy, and latency; before that, we describe the datasets used and the model configuration. In addition to the full SA+TA model, results are also reported for a configuration that uses only the temporal attention module (TA).

4.1 Datasets.
Several datasets are available; we take the most representative.

Fig. 3. Mechanism of Temporal Attention module (TA).

RWF-2000 [31] is the largest violence detection dataset, containing 2,000 real-life surveillance video clips, each 5 seconds long. We take RWF-2000 as the main reference because it has the largest number of videos and is very heterogeneous in speed, background, lighting, and camera position.
Hockey [32] contains 1,000 videos compiled from ice hockey footage. Each video has 50 frames, and all the videos have similar backgrounds and violent actions.
Movies [32] is a relatively smaller dataset containing 200 video clips at various resolutions. The videos are diverse in content, and the videos with the 'violent' tag are collected from different movie clips.

4.2 Model Configuration.

– The source code implementation was based on PyTorch.
– The input frames were resized to a resolution of 224 x 224.
– The models were trained using the Adam optimizer and a learning rate of 0.001.
– The batch size used was 8 due to the limitations of the Nvidia GeForce RTX 2080 Super graphics card; however, we suggest a larger batch size on more capable cards. The number of input frames was set to 30, which must always be a multiple of 3 due to the nature of the temporal attention module (TA).
– The number of epochs was 1000 (a minimal training-loop sketch with these settings follows this list).
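A training-loop sketch based on the listed settings could look as follows; the dataset object is assumed to yield (clip, label) pairs with clips of shape (30, 3, 224, 224) and is not part of the released configuration.

import torch
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda"):
    # train_set: any Dataset yielding (clip, label) pairs, clip of shape
    # (30, 3, 224, 224); the hyperparameters mirror the listed configuration.
    loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for epoch in range(1000):                       # 1000 epochs
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)  # violent vs. non-violent
            loss.backward()
            optimizer.step()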

4.3 Efficiency Evaluation.

To assess the efficiency of our proposal, we compute the number of parameters and FLOPs and compare them with the results of other prominent proposals identified in the Related Work section. See Table 3.

Table 3. Comparison of efficiency results with other proposals.

Model                              #Params (M)   FLOPs (G)
C3D [12]                           78            40.04
I3D [14]                           12.3          55.7
3D CNN end to end [26]             7.4           10.43
ConvLSTM [18]                      9.6           14.4
3D CNN + DenseNet(2,4,12,8) [30]   4.34          5.73
Proposal SA+TA                     5.29          4.17
Proposal TA only                   5.29          0.15

In terms of the number of parameters, 3D CNN + DenseNet(2,4,12,8) still beats our model, but only by a slight margin. In terms of FLOPs, however, there is a big difference: the count drops from 5.73 to 0.15 G, so we conclude that our proposal has low complexity and is light in FLOPs.
This means that our model is very efficient and can be deployed on lightweight devices with little computational power. We consider these results a contribution to the state of the art.
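For reference, the parameter and FLOP counts above could be reproduced with a sketch like the following; the use of the thop profiler and the 2x MAC-to-FLOP conversion are assumptions, since the paper does not state how these figures were obtained.

import torch
from thop import profile  # third-party FLOP/MAC counter; an assumed tool choice

def count_params_and_flops(model, clip_shape=(1, 30, 3, 224, 224)):
    # Parameters in millions.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    # Multiply-accumulate operations for one 30-frame clip.
    dummy = torch.randn(*clip_shape)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    # FLOPs are often approximated as 2 * MACs, but conventions differ
    # between papers, so treat this number as an estimate.
    flops_g = 2 * macs / 1e9
    return params_m, flops_g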

4.4 Accuracy Evaluation.

To evaluate the quality of our proposal, the model was trained and tested with 5-fold cross-validation on the Hockey and Movie datasets, while the RWF-2000 dataset was split into 80% for training and 20% for testing. These results were then tabulated and compared with the accuracy of the proposals reviewed in the Related Work section. See Table 4.
For a sound evaluation, we took the RWF-2000 dataset as the main reference, as it has a larger number of videos and is more heterogeneous in its characteristics.
On the Hockey dataset, our result is close to the best-performing proposals, with 3D CNN end-to-end [26] obtaining the highest accuracy; on the Movie dataset, the result achieved is the maximum. These results are therefore competitive with the state of the art and support the quality of our proposal.
Finally, in the case of RWF-2000, our result is only slightly below SPIL [33], which also confirms the quality of our proposal and a significant contribution to the state of the art.

Table 4. Comparison of accuracy results with other proposals.

Model                              Hockey (%)   Movie (%)   RWF (%)
VGG-16+LSTM [27]                   95.1         99           -
Xception+Bi-LSTM [28]              98           100          -
Flow Gated Network [31]            98           100          87.3
SPIL [33]                          96.8         98.5         89.3
3D CNN end to end [26]             98.3         100          -
3D CNN + DenseNet(2,4,6,8) [30]    97.1         100          -
Proposal SA+TA                     97.2         100          87.75
Proposal TA only                   97.0         100          86.9

4.5 Real-time Evaluation.


As far as we know, there is no formal method to determine whether a model can be used in real-time; however, following similar works [22, 30], we perform this evaluation by measuring the processing time for every 30 frames, taking a playback speed of 30 FPS as the reference.

Table 5. Latencies of our proposal on an Nvidia GeForce RTX 2080 Super.

Model              Latency (ms)
Proposal SA+TA     12.2
Proposal TA only   10.8

Table 5 shows the results of this measurement on an Nvidia GeForce RTX 2080 Super graphics card. Both models process a 30-frame window, that is, 1 s of video at 30 FPS, in 12.2 ms and 10.8 ms respectively, far below the 1,000 ms available; in other words, both operate well within real-time. This demonstrates the efficiency and low latency required for real scenarios and real-time use.
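Such a per-30-frame latency could be measured as in the sketch below; the warm-up and the averaging over 100 runs are assumptions about the measurement protocol, which the paper does not detail.

import time
import torch

@torch.no_grad()
def latency_per_30_frames(model, device="cuda", runs=100):
    # Average time (ms) to process one 30-frame window, i.e. 1 s of 30-FPS video.
    model.to(device).eval()
    clip = torch.randn(1, 30, 3, 224, 224, device=device)
    for _ in range(10):                 # warm-up iterations
        model(clip)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0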
Finally, we compare the two variants of our proposal, SA+TA and TA only. The SA+TA model has better accuracy than TA only, but higher FLOPs and latency; it can be used on ordinary devices with good computational power. The TA-only model has slightly lower accuracy but is very light in FLOPs and latency; therefore, it can be used on devices with limited computing power.

5 Discussion and Future Work


Regarding the efficiency results in Table 3, our TA-only proposal has the best result in terms of FLOPs, with a value of 0.15 G. This is because the temporal attention module (TA) replaces the 3D CNN with a 2D CNN. Proposals based on 3D CNNs extract spatio-temporal features, but at a high computational cost, while a 2D CNN can extract spatial features but not temporal ones. However, feeding the 2D CNN single-channel frames obtained by averaging the RGB channels, so that the channel dimension carries temporal information, reduces the computational cost considerably, although it does not achieve the best accuracy.
Using the temporal attention module (TA) and the spatial attention module (SA) together improves the accuracy results, increasing them by 0.2 and 0.85 points for the Hockey and RWF datasets, respectively, while for the Movie dataset the result is maintained (see Table 4). This improvement occurs because the spatial attention module extracts regions of interest from each frame, eliminating the background and the color. However, the FLOPs increase by 4.02 G while the number of parameters remains at 5.29 M, showing that using both modules slightly improves accuracy but at a high cost in FLOPs.
Our goal is to recognize human violence in video surveillance in a real scenario
and in real-time; therefore, we propose to use only the temporal attention module
(TA).
In future work, we propose replacing the pretrained 2D CNN with other backbones, such as the MobileNet versions, which could further reduce the number of parameters or the FLOPs. On the other hand, to improve accuracy, other background extraction methods could be used that do not rely on the pooling and convolutional layers, since these layers are responsible for the increase in FLOPs.

6 Conclusions

We propose a new efficient model for recognizing human violence in video surveillance, oriented to be used in real-time and real situations. We use a spatial attention module (SA) and a temporal attention module (TA).
We demonstrate that extracting temporal features from a video can be performed by a 2D CNN by replacing the RGB color information of each frame with the time dimension.
We show that our model has low computational complexity in terms of FLOPs, allowing it to be used on devices with low computational power and, especially, in real-time. Likewise, our model contributes to the state of the art compared to other proposals: it is very light in number of parameters, only slightly outperformed by 3D CNN + DenseNet, and therefore a state-of-the-art result.
It was shown that the proposal has very low latency, processing 30 frames in 10.8 ms. Finally, we present two variations of our proposal: one for devices with light computational power (TA only) and another (SA+TA) for devices with better computational characteristics.

References
1. Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, ”Violence detection using Oriented
VIolent Flows, ” Image Vis. Comput., vol. 48–49, no. 2015, pp. 37–41, 2016, doi:
10.1016/j.imavis.2016.01.006.
2. O. Deniz, I. Serrano, G. Bueno, and T. K. Kim, ”Fast violence detection in video,
” VISAPP 2014 - Proc. 9th Int. Conf. Comput. Vis. Theory Appl., vol. 2, no.
December 2014, pp. 478–485, 2014, doi: 10.5220/0004695104780485.
3. P. Bilinski, ”Human violence recognition and detection in surveillance videos, ”
2016, pp. 30–36, doi: 10.1109/AVSS.2016.7738019.
4. T. Zhang, W. Jia, X. He, and J. Yang, ”Discriminative Dictionary Learning With
Motion Weber Local Descriptor for Violence Detection, ” IEEE Trans. Circuits Syst.
Video Technol., vol. 27, no. 3, pp. 696–709, 2017.
5. T. Deb, A. Arman, and A. Firoze, ”Machine Cognition of Violence in Videos Using
Novel Outlier-Resistant VLAD, ” in 2018 17th IEEE International Conference on
Machine Learning and Applications (ICMLA), 2018, pp. 989–994.
6. K. Simonyan and A. Zisserman, ”Two-stream convolutional networks for action
recognition in videos, ” in Advances in Neural Information Processing Systems,
2014, vol. 1, no. January, pp. 568–576.
7. C. Feichtenhofer, A. Pinz, and A. Zisserman, ”Convolutional Two-Stream Network
Fusion for Video Action Recognition, ” in Proceedings of the IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition, 2016, vol. 2016-
Decem, no. i, pp. 1933–1941, doi: 10.1109/CVPR.2016.213.
8. B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, ”Real-Time Action Recognition
with Deeply Transferred Motion Vector CNNs, ” IEEE Trans. Image Process., vol.
27, no. 5, pp. 2326–2339, 2018, doi: 10.1109/TIP.2018.2791180.
9. L. Wang et al., ”Temporal segment networks: Towards good practices for deep action
recognition, ” in Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2016, vol. 9912
LNCS, pp. 20–36, doi: 10.1007/978-3-319-46484-8-2.
10. Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann, ”Hidden Two-Stream Convolu-
tional Networks for Action Recognition, ” in Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2019, vol. 11363 LNCS, pp. 363–378, doi: 10.1007/978-3-030-20893-
6-23.
11. S. Ji, W. Xu, M. Yang, and K. Yu, ”3D Convolutional neural networks for human
action recognition, ” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp.
221–231, 2013, doi: 10.1109/TPAMI.2012.59.
12. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, ”Learning spatiotem-
poral features with 3D convolutional networks, ” in Proceedings of the IEEE Inter-
national Conference on Computer Vision, 2015, vol. 2015 Inter, pp. 4489–4497, doi:
10.1109/ICCV.2015.510.
13. Z. Qiu, T. Yao, and T. Mei, ”Learning Spatio-Temporal Representation with
Pseudo-3D Residual Networks, ” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, vol. 2017-Octob, pp. 5534–5542, doi:
10.1109/ICCV.2017.590.
14. J. Carreira and A. Zisserman, ”Quo Vadis, action recognition? A new model and
the kinetics dataset, ” in Proceedings - 30th IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, 2017, vol. 2017-Janua, pp. 4724–4733, doi:
10.1109/CVPR.2017.502.

15. Z. Dong, J. Qin, and Y. Wang, ”Multi-stream deep networks for person to person
violence detection in videos, ” in Chinese Conference on Pattern Recognition, 2016,
pp. 517–531.
16. P. Zhou, Q. Ding, H. Luo, and X. Hou, ”Violent interaction detection in video
based on deep learning, ” in Journal of physics: conference series, 2017, vol. 844, no.
1, p. 12044.
17. I. Serrano, O. Deniz, J. L. Espinosa-Aranda, and G. Bueno, ”Fight recognition
in video using hough forests and 2D convolutional neural network, ” IEEE Trans.
Image Process., vol. 27, no. 10, pp. 4787–4797, 2018.
18. S. Sudhakaran and O. Lanz, ”Learning to detect violent videos using convolutional
long short-term memory, ” in 2017 14th IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS), 2017, pp. 1–6.
19. A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, ”Bidirectional convolutional
lstm for the detection of violence in videos, ” in Proceedings of the European Con-
ference on Computer Vision (ECCV), 2018, p. 0.
20. O. Ulutan, S. Rallapalli, M. Srivatsa, C. Torres, and B. S. Manjunath,”Actor condi-
tioned attention maps for video action detection” in Proc.IEEE Winter Conf. Appl.
Comput. Vis. (WACV), Mar. 2020, pp. 516-525.
21. L. Meng, B. Zhao, B. Chang, G. Huang, W. Sun, F. Tung, and L. Sigal, ”Inter-
pretable spatio-temporal attention for video action recognition” in Proc. IEEE/CVF
Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019,pp. 1513-1522.
22. Kang, M. S., Park, R. H., & Park, H. M. (2021). Efficient spatio-temporal modeling
methods for real-time violence recognition. IEEE Access, 9, 76270-76285.
23. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, ”A closer look
at spatiotemporal convolutions for action recognition” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., Jun. 2018,pp. 6450-6459.
24. Z. Qiu, T. Yao, and T. Mei, ”Learning spatio-temporal representation with pseudo-
3D residual networks” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
pp. 5534-5542.
25. A. Hanson, K. PNVR, S. Krishnagopal, and L. Davis, ”Bidirectional convolutional
LSTM for the detection of violence in videos” in Proc. Eur. Conf. Comput. Vis.
(ECCV), Munich, Germany, Sep. 2018, pp. 280-295.
26. J. Li, X. Jiang, T. Sun, and K. Xu, ”Efficient violence detection using 3D convo-
lutional neural networks” in Proc. 16th IEEE Int. Conf. Adv. Video Signal Based
Surveill. (AVSS), Sep. 2019, pp. 1-8.
27. M. M. Soliman, M. H. Kamal, M. A. El-Massih Nashed, Y. M. Mostafa, B. S.
Chawky, and D. Khattab, ”Violence recognition from videos using deep learning
techniques” in Proc. 9th Int. Conf. Intell. Comput. Inf. Syst.(ICICIS), Dec. 2019,
pp. 80-85.
28. S. Akti, G. A. Tataroglu, and H. K. Ekenel, ”Vision-based fight detection from
surveillance cameras” in Proc. 9th Int. Conf. Image Process. Theory, Tools Appl.
(IPTA), Nov. 2019, pp. 1-6.
29. A. Traoré and M. A. Akhloufi, ”2D bidirectional gated recurrent unit convolutional
neural networks for end-to-end violence detection in videos” in Proc. Int. Conf.
Images Anal. Recognit.
30. Huillcen Baca, Herwin Alayn; Gutierrez Caceres, Juan Carlos; Palomino Valdivia, Flor de Luz. Efficiency in Human Actions Recognition in Video Surveillance Using 3D CNN and DenseNet. In Future of Information and Communication Conference. Springer, Cham, 2022, pp. 342-355.
31. M. Cheng, K. Cai, and M. Li, ”Rwf-2000: An open large scale video database for
violence detection,” arXiv preprint arXiv:1911.05913, 2019.

32. E. B. Nievas, O. D. Suarez, G. B. Garcia, and R. Sukthankar, "Violence detection in video using computer vision techniques," in International Conference on Computer Analysis of Images and Patterns. Springer, 2011, pp. 332–339.
33. Su, Y., Lin, G., Zhu, J., & Wu, Q. (2020, August). Human interaction learning on
3d skeleton point clouds for video violence recognition. In European Conference on
Computer Vision (pp. 74-90). Springer, Cham.
