

Indoor Violence Detection using Lightweight Transformer Model
Arushi Kumar, Arpan Shetty, Archit Sagar, Charushree A, Preet Kanwal
Computer Science & Engineering, PES University, Bengaluru, India
arushi32001@gmail.com, arpan.sundarram@gmail.com, archit2098@gmail.com, charushree.a@gmail.com, preetkanwal@pes.edu

Abstract—Human activity recognition from surveillance videos is an active research area in computer vision and machine learning. The detection of indoor violent activities is a major challenge in the surveillance sector due to the lack of sufficient indoor footage and the presence of obstacles that induce occlusions in a household environment. Several studies on violence detection have been conducted, with the best-performing models using Convolutional Neural Networks for spatial feature learning followed by a Recurrent Neural Network variant for temporal feature learning. Recent studies suggest that Vision Transformers have a better run-time and are more robust to partial occlusions than their deep convolutional counterparts. In this paper, we present a lightweight transformer model, drawing upon the recent success of video vision transformers in action recognition. The model extracts spatial features from the input frames and adds temporal relations to the selected frames with the help of tubelet embedding; the resulting tokens are then encoded by a stack of transformer layers. Although transformer models are generally shown to need extensive training on large datasets, we show that with efficient preprocessing the model can be trained on comparatively small datasets. The lightweight model achieved an accuracy of 88.24% on our self-curated indoor violence dataset and 98% accuracy on a subset containing videos in which the subjects involved in the activity are occluded from view.

Index Terms—transformer, tubelet embedding, indoor violence detection, occlusion

I. INTRODUCTION

Violence in indoor settings is a threat to individual safety; there are numerous cases of domestic violence and child abuse that are either reported too late or dismissed for lack of evidence. Recent digital advancements have made affordable in-house surveillance widely available, greatly increasing the security offered to potential victims. Although the increased availability of CCTV cameras provides the means for detecting indoor violence, it is impractical to constantly monitor the footage; this creates the need for an indoor violence detection tool that is both efficient and effective.

This paper proposes a solution to these problems: an automated indoor violence detection system. The aim was to create a supervised model that detects acts of violence against a victim in pseudo real-time. The proposed system takes in visual data (typically recorded by CCTV cameras installed in a user's living quarters) and identifies physical violence in near real-time, upon which consequential action can be taken.

The detection of violence from surveillance videos is an active research area in computer vision and machine learning; most of the top-performing existing systems use a variant of a Convolutional Neural Network for spatial feature extraction, followed by a variant of a Recurrent Neural Network for temporal feature extraction and classification, trained on an outdoor violence dataset. Although this deep-learning approach is very successful in visual classification, it is susceptible to failure when the network is fed images whose regions of interest are partially occluded while such occlusions were absent from the training data. This is because CNNs are composed of linear filters and are strongly affected by occlusions. However, occlusion cannot be avoided in an indoor violence detection system, given the severe lack of indoor violence training data.

In contrast to their convolutional counterparts, Transformers use the self-attention mechanism and are more robust to partial occlusions in test data. The Vision Transformer (ViT) presented in [1] is a pure-transformer architecture that has outperformed its convolutional counterparts in image classification. ViT achieves excellent results compared to state-of-the-art convolutional networks when pre-trained on vast amounts of data and transferred to several mid-sized or small image recognition benchmarks, while requiring significantly fewer computational resources to train. Arnab et al. presented a pure transformer-based approach for video classification, the Video Vision Transformer (ViViT) [2]. ViViT is an extension of the Vision Transformer that accounts for the fact that each video consists of several image frames.
The fundamental computation performed in this architecture is self-attention, in which each token attends to every other token of the same sequence. Furthermore, ViT can be used for real-time deployment because of its lower computational requirements.

The current study implements a variant of the Video Vision Transformer (ViViT) for the violence classification task. It aggregates spatio-temporal tokens from the input video, encodes them using a network of transformer layers, and applies pre-processing so that the pure transformer is able to classify videos in real-time.

II. RELATED WORK

The detection of violent activities falls under the category of action recognition. This requires a model to attend to the spatial orientation of the actors as well as capture the relation of their positions over time, i.e., from one frame to another in a video. Substantial work has been done on this subject using a combination of convolutional neural networks (CNNs) to capture the spatial information and recurrent neural networks (RNNs) to capture the temporal information.

M. M. Soliman and M. H. Kamal [3] investigated the use of deep learning techniques for recognizing violence in videos. They proposed a model that uses a VGG-16 pre-trained on ImageNet as the spatial feature extractor, followed by a Long Short-Term Memory (LSTM) network as the temporal feature extractor. They also introduced a new benchmark dataset called Real-Life Violence Situations (RLVS) [3], which contains 2000 short videos divided into violent and non-violent classes. The RLVS benchmark was used to fine-tune the proposed model, which helped it achieve an accuracy of 88.24%.

Vaswani et al. proposed a simpler network architecture, the Transformer [4], which depends entirely on the attention mechanism. The architecture proved superior in quality to convolutional and recurrent models and is able to model global dependencies between input and output well. Since the model contains neither convolution nor recurrence, positional embeddings are added to inject information about the absolute or relative position of each token. Following this work, many transformer models emerged that began replacing RNNs for capturing temporal information and CNNs for capturing spatial information.

Dosovitskiy et al. [1] used a transformer as a spatial feature extractor. They found that, unlike convolutional and recurrent neural networks, Transformers do not inherently encode inductive biases. Since Transformers assume minimal prior knowledge about the structure of a problem, the network is expected to figure out these image-specific concepts on its own, and relationships between sequential frames need to be discovered using the self-attention mechanism. This necessitates a very large amount of training data, which is why they trained the Vision Transformer in two stages: the Transformer was first trained on a large-scale dataset in an unsupervised or self-supervised manner, and these pre-trained weights were then adapted to the downstream tasks using smaller datasets.

Abdali proposed a Data-efficient Video Transformer (DeVTr) [5], which uses a pre-trained 2D CNN as an embedding layer for the input data. The pre-trained CNN enables faster learning and lower memory consumption, and the transformer proved to be a better temporal feature extractor for videos than recurrent models. The model achieved 96.25% accuracy on the RLVS dataset, outperforming the violence detection models that preceded it.

In 2021, Arnab et al. proposed a pure transformer model [2] which, with effective regularization, was able to perform well on smaller datasets. They make use of tubelet embedding, which efficiently extracts tokens spatio-temporally, and they experiment with various ways of factorizing the model along the spatial and temporal dimensions to process the large number of tokens efficiently. The model achieved state-of-the-art results on multiple video classification benchmarks, outperforming prior methods based on deep 3D convolutional networks.

III. METHODOLOGY

A. Experimental Setup

Initially, the model was trained on the two standard datasets for violence detection: the Hockey Fight Detection Dataset [6] and the Real-Life Violence Situations (RLVS) Dataset [3]. Upon evaluation of the respective trained models, it was found that certain common actions like handshakes, waving and walking while swinging the arms were wrongly classified. Because of this, a custom dataset was created, consisting of clips collected from YouTube together with relevant clips from GitHub [7] and the RLVS dataset. The dataset consists of 1112 videos: 556 violent and 556 non-violent. In addition to general non-violent videos, we added videos of waving, walking, dancing, jumping, exercising, doing house chores and other actions that are common in an indoor setting. Using trial and error, it was found that the training and validation accuracy of the model stabilized at around the 40th epoch (Fig. 1).

Fig. 1. Training vs Validation accuracy
The accuracy of the model for different input sequence lengths is shown in Fig. 2. From the figure, it can be deduced that the accuracy of the model increases with the sequence length of the input video frames. If the sequence length is too low and the violent act happens at the end of the input video, the video is classified as non-violent, producing a false negative. The model achieves its highest accuracy at a sequence length of 35. However, upon analysis of the input frames, it was seen that videos that did not contain enough frames had their last frame duplicated and appended to the list of frames before being given to the model, so the apparent gains at the longer lengths were partly an artifact of duplicated frames. Hence, the sequence length values of 35 and 40 were discarded, and the selected sequence length is 30.

Fig. 2. Sequence length vs Accuracy

The testing environment used was Google Colab. Table I lists the specifications provided by the environment.

TABLE I
TESTING ENVIRONMENT SPECIFICATIONS

Specification        Value
CPU Model Name       Intel® Xeon® CPU @ 2.20GHz
CPU Cores            1
RAM Size             12.68 GB
Disk Size            78.19 GB
GPU Model Name       NVIDIA Tesla T4
GPU CUDA Version     11.2
GPU Memory           16 GB

B. Hyperparameter Tuning

Through experimentation, the maximum sequence length (Mseq) of the frames selected from each video was set to 30, the image size of each frame to 90 x 90 (height, width), and the learning rate of the model to 1e-3. A batch size of 32 and 40 epochs were used for each experiment. The Adam optimizer was chosen for its high performance, low computation time and low training cost. Since the expected output for each video is either 0 (non-violent) or 1 (violent) and the model outputs the probability of violence, the loss function chosen is binary cross-entropy.

Fig. 3 shows the accuracy of the model for different learning rates. A small learning rate means the model takes a long time to learn from the input frames, whereas a high learning rate means it overshoots the optimal point. From Fig. 3, it can be deduced that the model gives the highest accuracy at a learning rate of 1e-3.

Fig. 3. Learning rate vs Accuracy
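This training configuration maps directly onto a few lines of Keras. The sketch below is only an illustration of that configuration: build_model is assumed to construct the transformer described in Section IV, and train_ds / val_ds are hypothetical tf.data pipelines yielding pre-processed clips and labels; none of these names come from the paper's released code.

```python
import tensorflow as tf

# Hyperparameter values reported in this subsection.
M_SEQ = 30           # frames selected per video
IMG_SIZE = (90, 90)  # (height, width) of each frame
LEARNING_RATE = 1e-3
BATCH_SIZE = 32
EPOCHS = 40

# build_model() is assumed to return the transformer of Section IV,
# ending in a single sigmoid unit (probability of violence).
model = build_model(input_shape=(M_SEQ, IMG_SIZE[0], IMG_SIZE[1], 1))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss=tf.keras.losses.BinaryCrossentropy(),  # labels: 0 = non-violent, 1 = violent
    metrics=["accuracy"],
)

# train_ds / val_ds are assumed to yield batches of shape
# (BATCH_SIZE, M_SEQ, 90, 90, 1) with binary labels.
history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
```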
C. Frame Selection

From the total number of frames in each video in the training dataset, only Mseq frames were sent for further processing. This was done to reduce the training and computation time of the model. The frames needed to be selected so as to ensure minimum loss of information.

Fig. 4. Frame selection method

We started with a simple selection method and improved it to discard redundant information:
1) Method I: A straightforward approach in which the first Mseq frames were selected.
   Drawback: This approach worked only for extremely short videos. Another issue was that, since the frames were consecutive, most adjacent frames had a negligible delta in the actors' positions.
2) Method II: To overcome the drawbacks of Method I, this method avoided storing consecutive frames and hence proved applicable to longer videos. Frames were uniformly selected from the total frames of the video, each frame temporally equidistant from the next. The distance between frames was calculated as (total number of frames per video / Mseq). This way, even the final second of a long video could contribute a frame to the selection. This worked well when all the videos were of the same duration.
   Drawback: When the videos were not all of the same duration, the temporal speed of the selected frames differed between videos. A longer video with a larger number of frames has its selected frames at a greater temporal distance (and thus a greater delta) than the frames selected from a shorter video. Because of this, the model lost temporal information about various actions in the videos.
3) Method III: A fixed number of frames F was selected from each second of the video until the quota of Mseq frames was met. The F frames were temporally equidistant from one another.
   Drawback: This method works only if the violent action (for videos containing violence) happens within the first T seconds, where T = Mseq / F.

For all three datasets, the violent action in the majority of the violent videos occurred within the first T seconds. As shown in Fig. 5, Method III achieved the best accuracy on all three datasets due to its effective minimization of information loss. Hence, Method III was used for the selection of frames; a sketch of this selection is given below.

Fig. 5. Frame Selection Method Comparison
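As an illustration, a minimal Python/OpenCV sketch of Method III follows. It is an assumption of how the selection could be implemented rather than the authors' released code; the default arguments reflect the Mseq = 30, F = 5 setting used in this paper.

```python
import cv2

def select_frames_method3(video_path, m_seq=30, f_per_sec=5):
    """Method III: take f_per_sec temporally equidistant frames from each
    second of the video until m_seq frames have been collected."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS metadata is missing
    step = max(int(round(fps / f_per_sec)), 1)   # frame gap between two picks

    selected, index = [], 0
    while len(selected) < m_seq:
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:      # video shorter than T = m_seq / f_per_sec seconds;
            break       # padding by duplication is handled later in pre-processing
        selected.append(frame)
        index += step
    cap.release()
    return selected
```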

D. Data Pre-processing

Out of the total videos, 80% were used for training, 10% for testing and 10% for validation.
• The pre-processing began with dividing each video into frames.
• Each frame was then converted to grayscale and cropped to a uniform size of (90, 90) to speed up the computation.
• Out of all the frames, only 30 frames were taken per video. This was done by picking 5 temporally equidistant frames from each second of the video.
• If a video did not contain the minimum number of frames (30), its last frame was replicated until the required number of frames was reached.

Each frame was converted to gray-scale for two reasons:
• Performing pixel-to-pixel processing of a megapixel image using three color channels takes almost three times the computation that one channel would require. As thousands of frames are analyzed, we reduce the computation time by eliminating the color channels we do not need;
• In addition, our target surveillance systems mostly have cameras that capture footage in gray-scale.
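The pre-processing steps listed above can be sketched as follows. This is an illustrative implementation assuming NumPy and OpenCV, reusing the hypothetical select_frames_method3 helper from the previous sketch; it is not the authors' exact pipeline.

```python
import cv2
import numpy as np

IMG_H, IMG_W = 90, 90   # uniform frame size
M_SEQ = 30              # frames kept per video

def preprocess_video(video_path):
    """Return an array of shape (M_SEQ, IMG_H, IMG_W, 1) with values in [0, 1]."""
    frames = select_frames_method3(video_path, m_seq=M_SEQ, f_per_sec=5)
    if not frames:
        raise ValueError("could not read any frames from " + video_path)

    processed = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # drop the colour channels
        gray = cv2.resize(gray, (IMG_W, IMG_H))          # uniform (90, 90) size
        processed.append(gray.astype("float32") / 255.0)

    # Pad short videos by replicating the last frame until M_SEQ is reached.
    while len(processed) < M_SEQ:
        processed.append(processed[-1].copy())

    return np.expand_dims(np.stack(processed), axis=-1)  # add the channel axis
```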
IV. IMPLEMENTATION

The architecture is inherited from [2]. Fig. 6 shows a pictorial representation of the model architecture.

Fig. 6. Model Architecture

First, the input videos were tokenized: nt frames were taken from each video using the frame selection method detailed in the previous subsection.

Then, the selected frames were pre-processed: each frame was converted from three-channel RGB to one-channel gray-scale and resized to the selected image size. The pre-processed frames were then concatenated.

In parallel, each frame was divided into non-overlapping patches of size nh × nw, as shown in Fig. 7. Hence a total of nt × nh × nw tokens were passed to the encoder, where t is the length of a tubelet and nt is the number of tubelets.

Fig. 7. Conversion of frames to patches

The non-overlapping patches from successive frames were arranged in the form of spatio-temporal tubes, as shown in Fig. 8, each tubelet having a dimension of t × h × w. Positional embeddings were added to these tokens; tokens with the same spatial index share the same embedding. Tubelet embedding, unlike uniform frame sampling, extracts a series of patches across time, so it does not lose the time index or the frame position of each patch. Hence, it captures the information of each patch and its change along the temporal track. For this reason, tubelet embedding was used to extract the tokens from the video.
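One common way to realise tubelet embedding, following the ViViT formulation in [2], is a 3D convolution whose kernel size equals its stride, so that each non-overlapping t × h × w tube maps to exactly one token. The Keras sketch below is an assumption-based illustration: the tubelet size and embedding width are placeholder values, and the positional encoder assigns one learned embedding per token position, which is a simplification of the shared-spatial-index scheme described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TubeletEmbedding(layers.Layer):
    """Split a clip into non-overlapping t x h x w tubes and project each to a token."""

    def __init__(self, embed_dim=128, tubelet_size=(2, 9, 9), **kwargs):
        super().__init__(**kwargs)
        # Kernel size == stride => non-overlapping spatio-temporal tubes.
        self.projection = layers.Conv3D(
            filters=embed_dim, kernel_size=tubelet_size,
            strides=tubelet_size, padding="valid")
        self.flatten = layers.Reshape(target_shape=(-1, embed_dim))

    def call(self, videos):               # (batch, frames, H, W, channels)
        tubes = self.projection(videos)   # (batch, nt, nh, nw, embed_dim)
        return self.flatten(tubes)        # (batch, nt * nh * nw, embed_dim)


class PositionalEncoder(layers.Layer):
    """Add a learned positional embedding to every token (simplified scheme)."""

    def build(self, input_shape):
        _, num_tokens, embed_dim = input_shape
        self.position_embedding = layers.Embedding(
            input_dim=num_tokens, output_dim=embed_dim)
        self.positions = tf.range(start=0, limit=num_tokens, delta=1)

    def call(self, tokens):
        return tokens + self.position_embedding(self.positions)
```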
Fig. 8. Tubelet embedding

The tubelets were then flattened and their corresponding position encodings in 3D space were appended to them. This series of tokens was given as input to the transformer model. First, the encoded patches were passed to a normalization layer. Next, the normalized tokens were passed through a Multi-Headed Self-Attention (MSA) layer to extract the important features. After another layer of normalization, the tokens were passed through two Dense layers, both with ReLU activation. The tokens were normalized once more before being passed to a Global Average Pooling layer. Finally, a Dense layer with sigmoid activation was used to produce the final output. Since the output is a binary classification, the optimizer used was Adam and the loss function was binary cross-entropy.
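Put together, the encoder described above (layer normalization, multi-headed self-attention, two ReLU Dense layers, a final normalization, global average pooling and a sigmoid output) could look roughly like the following sketch, reusing the layers from the previous block. The number of encoder blocks, heads and widths are placeholders rather than the authors' exact configuration, and the residual connections follow the standard transformer pattern of [2] even though the text does not spell them out.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(30, 90, 90, 1), embed_dim=128,
                num_heads=8, num_layers=4):
    inputs = layers.Input(shape=input_shape)

    # Tokenise the clip and add positional information (see the previous sketch).
    tokens = TubeletEmbedding(embed_dim=embed_dim)(inputs)
    tokens = PositionalEncoder()(tokens)

    for _ in range(num_layers):
        # Layer norm -> multi-headed self-attention, with a residual connection.
        x = layers.LayerNormalization(epsilon=1e-6)(tokens)
        attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads)(x, x)
        tokens = layers.Add()([attn, tokens])

        # Layer norm -> two Dense layers with ReLU, with a residual connection.
        x = layers.LayerNormalization(epsilon=1e-6)(tokens)
        x = layers.Dense(embed_dim * 4, activation="relu")(x)
        x = layers.Dense(embed_dim, activation="relu")(x)
        tokens = layers.Add()([x, tokens])

    # Final normalization, global average pooling and the sigmoid classification head.
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```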

V. RESULTS AND DISCUSSION

A. Analysis of Models

The main reason to create a custom dataset was to train the model on violent videos taken in an indoor environment. The model proposed in this paper was trained on three different datasets: the Real-Life Violence Situations dataset, the Hockey Fights dataset and the custom dataset curated by us; the respective accuracies are listed in Table II. To measure the accuracy of the model on occluded videos, videos containing occlusion were removed from each dataset and added to a testing batch for that dataset. The model was then tested on these batches; the resulting accuracies are also listed in Table II.

TABLE II
PERFORMANCE ON VARIOUS DATASETS

Dataset used to train the model    Accuracy on that dataset    Accuracy on occlusion videos from that dataset
RLVS                               87%                         95.12%
Hockey Fights                      92%                         95.16%
Custom Dataset                     88.24%                      98%

From Table II, it can be concluded that the model trained on our dataset works better for detecting indoor violence. There are two main reasons for this:
• The dataset contains a large number of videos that were taken in indoor environments from different angles;
• The non-violence part of the dataset contains videos collected with the numerous common indoor actions in mind.

Fig. 9 shows how well the model trained on the custom dataset fared in predicting the labels of the videos correctly.
• True Positive: the activity is violent, and is classified as violent.
• True Negative: the activity is non-violent, and is not classified as violent.
• False Positive: the activity is non-violent, but the model classifies it as violent.
• False Negative: the activity is violent, but the model does not classify it as violent.

Fig. 9. Confusion matrix of accuracy on custom dataset
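As a small worked example, the four counts read off such a confusion matrix translate into the usual evaluation metrics as follows; the numbers here are placeholders, not the values of Fig. 9.

```python
# Hypothetical counts read off a confusion matrix (placeholders, not Fig. 9 values).
tp, tn, fp, fn = 50, 48, 4, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all clips classified correctly
precision = tp / (tp + fp)                  # share of predicted-violent clips that were violent
recall = tp / (tp + fn)                     # share of violent clips that were caught
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```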

B. Comparison of Datasets

Table III shows the performance of the model, trained on the custom dataset, on the videos of the other two datasets.

TABLE III
PERFORMANCE OF MODEL TRAINED ON CUSTOM DATA ON STANDARD DATASETS

Dataset          Violence Accuracy    Non-Violence Accuracy
RLVS             82.40%               78.62%
Hockey Fights    88.40%               77.80%

C. Response to Occlusion

From the results detailed in the prior sections, it can be seen that the model, trained on any of the three datasets, gives a very high accuracy when predicting violence in videos containing occlusions. This is a consequence of the feature extraction mechanism: the multi-headed self-attention layer attends to the different important features in the patches of the frames. The model proposed in this paper is also compared with the state-of-the-art model, DeVTr [5], trained on the custom dataset curated by us. The models were further compared on their performance on the test set of occlusion videos taken from our custom dataset. The respective accuracies are listed in Table IV.
TABLE IV
COMPARISON WITH STATE-OF-THE-ART MODEL ON CUSTOM DATASET

Model                     Accuracy    Accuracy on occlusion test set
State-of-the-Art Model    76.27%      80.49%
Custom model              88.24%      98%

D. Practical Application

In practice, the input to our system is a stream of frames captured by a surveillance camera. The system pre-processes the frames, which are then sent to the transformer model, which checks for the presence of violent actions in the stream. In an environment such as a daycare centre where our system is put to use, an email alert containing a video clip of the violent action would be sent to the children's parents using our integrated notification service. When the action is predicted to be non-violent, the processed frames are dropped and the next 30 frames are added to the input stream, after which the process repeats. This removes the need to store a large volume of surveillance data for monitoring.
of surveillance data for monitoring. Person to Person Violence Detection in Videos. In: Tan, T., Li, X.,
Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR
2016. Communications in Computer and Information Science, vol 662.
VI. C ONCLUSION AND F UTURE W ORK Springer, Singapore.
[10] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa,
This paper presents a Video Transformer approach to solve Alexandre Sablayrolles and Hervé Jégou, ”Training data-efficient image
the problem of detecting indoor violence in real-time. It was transformers & distillation through attention”, 2020.
found that the state-of-the-art research in the field of violence
detection does not perform well on very common violence
acts in an indoor setting. This is due to the difference in
environment (indoor v/s outdoor) compared to the datasets
most top-performing violence detection models have been
trained on. Traditional deep learning models need to be trained
on large and diverse datasets to handle the problems of
occlusions and dimness posed by an indoor environment.
To explore a solution to the shortcomings of traditional
models (which are susceptible to failure in case of low video
resolution and partial occlusion of the subjects in the video),
we explored the implementation of the novel Pure Video
Transformers, which outperform traditional CNN and RNN
networks due to their self-attention mechanism and ability to
capture long-term temporal dependencies.
The solution proposed in this paper is aimed at accurately
predicting indoor violence in a generic real-world setting, and
hence strives to maximise accuracy on gray-scale videos in
minimal computation time. Future work towards this objective
could include integration of audio processing with our system,
and use both audio and visual cues to detect violence.
The system with the architecture detailed in this paper was
successful in producing competitive results on state-of-the
art datasets, and outperforms some top-performing violence
detection models at detecting violence in an indoor setting.

R EFERENCES
[1] Dosovitskiy, Alexey & Beyer, Lucas & Kolesnikov, Alexander & Weis-
senborn, Dirk & Zhai, Xiaohua & Unterthiner, Thomas & Dehghani,
Mostafa & Minderer, Matthias & Heigold, Georg & Gelly, Sylvain &
Uszkoreit, Jakob & Houlsby, Neil. (2020). An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale.
