Human Violence Recognition in Video Surveillance in Real-Time
Herwin Alayn Huillcen Baca1, Flor de Luz Palomino Valdivia2, Ivan Soria Solis3, Mario Aquino Cruz4, and Juan Carlos Gutierrez Caceres5

1 Jose Maria Arguedas National University, Apurimac, Peru, hhuillcen@unajma.edu.pe
2 Jose Maria Arguedas National University, Apurimac, Peru, fpalomino@unajma.edu.pe
3 Jose Maria Arguedas National University, Apurimac, Peru, isoria@unajma.edu.pe
4 Micaela Bastidas University, Apurimac, Peru, maquino@unamba.edu.pe
5 San Agustin National University, Arequipa, Peru, jgutierrezca@unsa.edu.pe
1 Introduction
Human action recognition is an area of great interest to the scientific community due to its many applications, such as robotics, medicine, psychology, human-computer interaction, and, above all, video surveillance. An automatic violence detection system could raise an alert about an incident or a crime and allow action to be taken to mitigate it. It is therefore essential to detect violent activity in real-time. Although the recognition of violence in videos has achieved
The motivation is that the main difficulty in processing videos is dealing with their spatio-temporal nature: this alone makes video processing computationally expensive even for short clips, since a clip can contain a large number of images. Moreover, the dynamics between the spatial content of consecutive frames creates a temporal dimension. How to describe spatial and temporal information so as to understand the content of a video remains a challenge. Our proposal takes on this challenge by proposing a model that contributes to the current state of the art.
Another motivation is that, although there are various proposals for recognizing human violence in video surveillance, most have focused on effectiveness rather than efficiency. There are thus highly accurate models whose computational cost is too high for use in real scenarios and in real-time.
Our proposal makes a contribution to the domain of video surveillance. The installation of surveillance cameras in the streets has become widespread worldwide as a means of combating crime; however, dedicated personnel are still needed to watch the video feeds and identify violence. With our proposal, this task is carried out by a computer system that alerts personnel to a violent human action in real-time, so that the corresponding action can be taken to mitigate the violent act, potentially even saving lives. In addition, in line with our objectives, the proposal contributes to the state of the art with a model that is efficient in terms of number of parameters, FLOPs, and latency.
The pipeline of our proposal consists of two modules. The Spatial Attention module (SA) extracts the spatial feature map of each frame using
2 Related Work
convert the color information into temporal information, since color is not essential for recognizing violence in a video. This strategy makes the model lightweight, which is our goal.
Finally, as a summary of the state of the art, Table 1 compares the results of the efficiency-oriented proposals and Table 2 those of the accuracy-oriented proposals.
3 Proposal
The proposal's main objective is to achieve efficiency in FLOPs, adequate accuracy, and a latency that guarantees real-time use. To this end, an architecture composed of two modules is proposed; Figure 1 shows the pipeline of the proposed architecture.
The Spatial Attention module (SA) receives T + 1 frames from the original end-to-end video. It computes the motion boundaries of the moving object, eliminating regions that are not part of the motion (the background) through the frame difference of each pair of consecutive frames.
The Temporal Attention module (TA) receives T frames from the Spatial Attention module (SA) and creates spatio-temporal feature maps for violence recognition by averaging the channels of each frame and feeding the result to a pretrained 2D CNN.
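The channel-averaging step can be sketched as follows (an illustration under our reading; the function name is hypothetical, and the pretrained 2D CNN that would consume the stack is omitted):

```python
import numpy as np

def temporal_stack(frames_rgb: np.ndarray) -> np.ndarray:
    """Replace color with time as the channel dimension of a 2D CNN input.

    frames_rgb: (T, H, W, 3) RGB frames.
    Averaging R, G, B collapses each frame to one gray channel; the T
    gray frames are then stacked so that the frame index plays the role
    the color channel normally plays in a 2D CNN input.
    Returns a (T, H, W) array, i.e. a single T-channel "image".
    """
    return frames_rgb.mean(axis=-1)  # average R, G, B per pixel
```

A 2D backbone would then be configured with T input channels instead of 3, which is what lets a purely spatial network observe temporal structure.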
4.1 Datasets
Several datasets are available; we take the most representative.
RWF-2000 [31] is the largest violence detection dataset, containing 2,000 real-life surveillance clips of 5 seconds each. We take RWF-2000 as the main reference because it has the greatest number of videos and is very heterogeneous in speed, background, lighting, and camera position.
Hockey [32] contains 1,000 videos compiled from footage of ice hockey games. Each video has 50 frames, and all the videos have similar backgrounds and violent actions.
Movies [32] is a relatively smaller dataset containing 200 video clips at various
resolutions. The videos are diverse in content, and videos with the ’violent’ tag
are collected from different movie clips.
To evaluate the quality of our proposal, the model was trained and tested with
5-fold cross-validation for the Hockey and Movie datasets. In contrast, for the
RWF-2000 dataset, we separated it into 80% for training and 20% for testing.
Subsequently, these results were tabulated and compared with the accuracy
of other proposals in the related works section. See Table 2.
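The evaluation protocol above can be sketched as follows (index bookkeeping only; the seed and function names are illustrative, and training itself is omitted):

```python
import numpy as np

def holdout_split(n_videos: int, train_frac: float = 0.8, seed: int = 0):
    """80%/20% train/test split, as used for RWF-2000."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_videos)
    cut = int(train_frac * n_videos)
    return idx[:cut], idx[cut:]

def kfold_splits(n_videos: int, k: int = 5, seed: int = 0):
    """k disjoint folds for 5-fold cross-validation (Hockey and Movies)."""
    rng = np.random.default_rng(seed)
    # Each fold serves once as the test set; the rest form the training set.
    return np.array_split(rng.permutation(n_videos), k)
```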
For a sound evaluation, we took the RWF-2000 dataset as the reference, as it has the largest number of videos and the most heterogeneous characteristics.
On the Hockey dataset, our result is surpassed only by end-to-end 3D CNN [26], which is a good outcome; on the Movie dataset, the result achieved is the maximum. These results are therefore state-of-the-art and support the quality of our proposal.
Finally, on RWF-2000, our result is very close to SPIL [33], trailing by only a slight margin, which further confirms the quality of our proposal and a significant contribution to the state of the art.
Table 5 shows the results of this measurement on an NVidia GeForce RTX 2080 Super graphics card. Taking 0 ms as the ideal real-time latency, the results show that both models are close to real-time: in a real scenario, our proposals take 0.0122 s and 0.0108 s to process a video of 1 s duration. This demonstrates the efficiency and low latency needed for real, real-time scenarios.
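These figures work out as follows (simple arithmetic on the reported latencies; the 30-frames-per-clip assumption is our reading of the setup):

```python
# Reported per-clip latencies for a 1-second, 30-frame video.
latency_sa_ta_s = 0.0122  # SA+TA model
latency_ta_s = 0.0108     # TA model

# Real-time factor: processing time divided by clip duration.
# A value below 1 means the model keeps up with the video stream.
rtf_ta = latency_ta_s / 1.0

# Per-frame latency of the lighter TA model, in milliseconds.
per_frame_ms = latency_ta_s / 30 * 1000  # 0.36 ms per frame
```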
In this section, we compare the two models of our proposal, SA+TA and TA. The SA+TA model is more accurate than TA but has higher FLOPs and higher latency; it can be used on ordinary devices with good computational power. The TA model has somewhat lower accuracy but is very light in FLOPs and latency, and can therefore be used on devices with limited computing power.
but with a high computational cost; a 2D CNN can extract spatial features but not temporal ones. However, using T + 1 single-channel frames, obtained by averaging the RGB channels, and feeding them to the 2D CNN as temporal information reduces the computational cost considerably, even if it does not yield the best accuracy.
Using the temporal attention module (TA) and the spatial attention module (SA) together improves on the previous experiment, increasing accuracy by 0.2 and 0.85 for the Hockey and RWF-2000 datasets, respectively, while the result on the Movie dataset is unchanged; see Table 4. The improvement comes from the spatial attention module extracting regions of interest from each frame, eliminating the background and the color. However, the FLOPs increase by 4.02 while the number of parameters remains at 5.29, showing that using both modules slightly improves accuracy at a high cost in FLOPs.
Our goal is to recognize human violence in video surveillance in a real scenario
and in real-time; therefore, we propose to use only the temporal attention module
(TA).
In future work, we propose to replace the pretrained 2D CNN with other backbones, such as the MobileNet versions, which could improve the number of parameters or the FLOPs. To improve accuracy, other background extraction methods could be used that avoid pooling and convolutional layers, since these layers drive the increase in FLOPs.
6 Conclusions
We propose a new efficient model for recognizing human violence in video surveil-
lance, oriented to be used in real-time and real situations. We use a spatial
attention module (SA) and a temporal attention module (TA).
We demonstrate that extracting temporal features from a video can be per-
formed by a 2D CNN by replacing the RGB color information of each frame
with the time dimension.
We show that our model has low computational complexity in terms of FLOPs, allowing it to be used on devices with low computational power, especially in real-time. Likewise, our model contributes to the state of the art compared with other proposals: it is very light in number of parameters, only slightly outperformed by 3D CNN + DenseNet, and thus achieves a state-of-the-art result.
It was shown that the proposal has near-real-time latency, taking 10.8 ms to process 30 frames. Finally, we present two variants of our proposal: one for devices with light computational power (TA), and another (SA+TA) for devices with better computational characteristics.
References
1. Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, "Violence detection using Oriented VIolent Flows," Image Vis. Comput., vol. 48-49, pp. 37-41, 2016, doi: 10.1016/j.imavis.2016.01.006.
2. O. Deniz, I. Serrano, G. Bueno, and T. K. Kim, "Fast violence detection in video," in Proc. 9th Int. Conf. Comput. Vis. Theory Appl. (VISAPP), vol. 2, 2014, pp. 478-485, doi: 10.5220/0004695104780485.
3. P. Bilinski, "Human violence recognition and detection in surveillance videos," in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), 2016, pp. 30-36, doi: 10.1109/AVSS.2016.7738019.
4. T. Zhang, W. Jia, X. He, and J. Yang, "Discriminative dictionary learning with motion Weber local descriptor for violence detection," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 3, pp. 696-709, 2017.
5. T. Deb, A. Arman, and A. Firoze, "Machine cognition of violence in videos using novel outlier-resistant VLAD," in Proc. 17th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), 2018, pp. 989-994.
6. K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, vol. 1, 2014, pp. 568-576.
7. C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 1933-1941, doi: 10.1109/CVPR.2016.213.
8. B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, "Real-time action recognition with deeply transferred motion vector CNNs," IEEE Trans. Image Process., vol. 27, no. 5, pp. 2326-2339, 2018, doi: 10.1109/TIP.2018.2791180.
9. L. Wang et al., "Temporal segment networks: Towards good practices for deep action recognition," in Lecture Notes in Computer Science, vol. 9912, 2016, pp. 20-36, doi: 10.1007/978-3-319-46484-8-2.
10. Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann, "Hidden two-stream convolutional networks for action recognition," in Lecture Notes in Computer Science, vol. 11363, 2019, pp. 363-378, doi: 10.1007/978-3-030-20893-6-23.
11. S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, 2013, doi: 10.1109/TPAMI.2012.59.
12. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 4489-4497, doi: 10.1109/ICCV.2015.510.
13. Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 5534-5542, doi: 10.1109/ICCV.2017.590.
14. J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 4724-4733, doi: 10.1109/CVPR.2017.502.
15. Z. Dong, J. Qin, and Y. Wang, "Multi-stream deep networks for person to person violence detection in videos," in Chinese Conference on Pattern Recognition, 2016, pp. 517-531.
16. P. Zhou, Q. Ding, H. Luo, and X. Hou, "Violent interaction detection in video based on deep learning," in Journal of Physics: Conference Series, vol. 844, no. 1, 2017, p. 12044.
17. I. Serrano, O. Deniz, J. L. Espinosa-Aranda, and G. Bueno, "Fight recognition in video using Hough forests and 2D convolutional neural network," IEEE Trans. Image Process., vol. 27, no. 10, pp. 4787-4797, 2018.
18. S. Sudhakaran and O. Lanz, "Learning to detect violent videos using convolutional long short-term memory," in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), 2017, pp. 1-6.
19. A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, "Bidirectional convolutional LSTM for the detection of violence in videos," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018.
20. O. Ulutan, S. Rallapalli, M. Srivatsa, C. Torres, and B. S. Manjunath, "Actor conditioned attention maps for video action detection," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 516-525.
21. L. Meng, B. Zhao, B. Chang, G. Huang, W. Sun, F. Tung, and L. Sigal, "Interpretable spatio-temporal attention for video action recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1513-1522.
22. M. S. Kang, R. H. Park, and H. M. Park, "Efficient spatio-temporal modeling methods for real-time violence recognition," IEEE Access, vol. 9, pp. 76270-76285, 2021.
23. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6450-6459.
24. Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5534-5542.
25. A. Hanson, K. PNVR, S. Krishnagopal, and L. Davis, "Bidirectional convolutional LSTM for the detection of violence in videos," in Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany, Sep. 2018, pp. 280-295.
26. J. Li, X. Jiang, T. Sun, and K. Xu, "Efficient violence detection using 3D convolutional neural networks," in Proc. 16th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Sep. 2019, pp. 1-8.
27. M. M. Soliman, M. H. Kamal, M. A. El-Massih Nashed, Y. M. Mostafa, B. S. Chawky, and D. Khattab, "Violence recognition from videos using deep learning techniques," in Proc. 9th Int. Conf. Intell. Comput. Inf. Syst. (ICICIS), Dec. 2019, pp. 80-85.
28. S. Akti, G. A. Tataroglu, and H. K. Ekenel, "Vision-based fight detection from surveillance cameras," in Proc. 9th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), Nov. 2019, pp. 1-6.
29. A. Traoré and M. A. Akhloufi, "2D bidirectional gated recurrent unit convolutional neural networks for end-to-end violence detection in videos," in Proc. Int. Conf. Images Anal. Recognit.
30. H. A. Huillcen Baca, J. C. Gutierrez Caceres, and F. de Luz Palomino Valdivia, "Efficiency in human actions recognition in video surveillance using 3D CNN and DenseNet," in Future of Information and Communication Conference, Springer, Cham, 2022, pp. 342-355.
31. M. Cheng, K. Cai, and M. Li, "RWF-2000: An open large scale video database for violence detection," arXiv preprint arXiv:1911.05913, 2019.