Indoor Violence Detection Using Video Transformer
Transformer Model
Arushi Kumar, Computer Science & Engineering, PES University, Bengaluru, India, arushi32001@gmail.com
Arpan Shetty, Computer Science & Engineering, PES University, Bengaluru, India, arpan.sundarram@gmail.com
Archit Sagar, Computer Science & Engineering, PES University, Bengaluru, India, archit2098@gmail.com
Abstract—Human activity recognition from surveillance videos is an active research area in Computer Vision and Machine Learning. The detection of indoor violent activities is a major challenge in the surveillance sector due to the lack of sufficient indoor footage and the presence of obstacles inducing occlusions in a household environment. Several studies on the topic of violence detection have been done, with the best performing models using Convolutional Neural Networks for spatial feature learning, followed by a Recurrent Neural Network variant for temporal feature learning. Recent studies suggest that Vision Transformers have a better run-time and are more robust to partial occlusions than their deep convolutional network counterparts. In this paper, we present a light-weight transformer model, drawing upon the recent success of video vision transformers in action recognition. The model extracts spatial features from the input frames and adds temporal relation to the selected frames with the help of tubelet embedding; the resulting tokens are then encoded using several transformer layers. Although transformer models are generally shown to need extensive training on large datasets, it has been shown that, with efficient preprocessing, the model can be trained on comparatively small datasets. The light-weight model has achieved an accuracy of 88.24% on our self-curated indoor violence dataset and a 98% accuracy on a subset containing videos in which the subjects involved in the activity are occluded from view.
Index Terms—transformer, tubelet embedding, indoor violence detection, occlusion
I. INTRODUCTION
Violence in indoor settings is a threat to individual safety; there are numerous cases of domestic violence and child abuse that are either reported too late, or dismissed because of lack of evidence. Recent digital advancements have led to the availability of affordable in-house surveillance, greatly increasing the security offered to the victims. Although the increased availability of CCTV cameras provides the means for detection of indoor violence, it is impractical to constantly monitor the footage; this creates a need for an indoor violence detection tool that is efficient and effective.
This paper proposes a solution to these problems: an automated indoor violence detection system. The aim was to create a supervised model that detects acts of violence against a victim in pseudo real-time. The proposed system takes in visual data (typically recorded using CCTV cameras installed in a user’s living quarters) and identifies physical violence in near real-time, upon which consequential action can be taken.
The detection of violence from surveillance videos is an active research area in computer vision and machine learning; most of the top performing existing systems use a variant of a Convolutional Neural Network for spatial feature extraction, followed by a variant of a Recurrent Neural Network for temporal feature extraction and classification, trained on an outdoor violence dataset. Though this deep-learning approach is very successful in visual classification, it is susceptible to failure if the network is fed images where the regions of interest are partially occluded but the images in the training data were not. This is because CNNs are composed of linear filters, and are greatly affected by occlusions. However, occlusion is not an issue that can be avoided in an indoor violence detection system, given the severe lack of indoor violence training data.
In contrast to their convolutional counterparts, Transformers use the self-attention mechanism, and are more robust to partial occlusions in test data. The Vision Transformer (ViT) presented in [1] is a pure-transformer based architecture that has outperformed its convolutional counterparts in image classification. ViT achieves excellent results compared to state-of-the-art convolutional networks when pre-trained on vast amounts of data and transferred to several mid-sized or small image recognition benchmarks. ViT also requires significantly fewer computational resources to train. Arnab et al. presented a pure Transformer-based approach for video classification, Video Vision Transformer (ViViT) [2]. ViViT is an extension of the vision transformer, considering the fact that each video consists of several image frames. The fundamental
computation performed in this architecture is self-attention, in which every token attends to every other token of the same input sequence. Furthermore, ViT can be used for real-time deployment because of its lower computational requirements.
The current study implements a variant of the video vision transformer (ViViT) for the violence classification task. It aggregates spatio-temporal tokens from the input video, encodes them using a network of transformer layers, and implements pre-processing so that the pure transformer is able to classify videos in real-time.
II. RELATED WORK
The detection of violent activities falls under the category of action recognition. This requires a model to attend to the spatial orientation of the actors as well as capture the relation of their positions in time, i.e., from one frame to another in a video. Substantial work has been done on this subject using a combination of convolutional neural networks (CNNs) to capture the spatial information and recurrent neural networks (RNNs) to capture the temporal information.
M. M. Soliman and M. H. Kamal [3] investigated the use of deep learning techniques for the purpose of recognizing violence in videos. They proposed a model which uses a VGG-16 pre-trained on ImageNet as the spatial feature extractor, followed by a Long Short-Term Memory (LSTM) network as the temporal feature extractor. They also introduced a new benchmark dataset called Real-Life Violence Situations (RLVS) [3], which contains 2000 short videos divided into violent and non-violent videos. The RLVS benchmark was used for fine-tuning the proposed model, which achieved an accuracy of 88.24%.
Vaswani et al. proposed a simpler network architecture, the Transformer [4], which depends entirely on the attention mechanism. The architecture proved to be superior in quality to convolutional and recurrent models, and could model global dependencies between input and output well. Since the model contains neither convolution nor recurrence, positional embeddings were added to inject information about the absolute or relative position of each token. Following this, many transformer models emerged which started replacing RNNs for capturing temporal information as well as CNNs for capturing spatial information.
Dosovitskiy et al. [1] used a transformer as a spatial feature extractor. They found that, unlike convolutional and recurrent neural networks, Transformers do not inherently encode inductive biases. Since Transformers assume minimal prior knowledge of the structure of a problem, the network was expected to figure out these image-specific concepts on its own, and relationships between sequential frames needed to be discovered using the self-attention mechanism. This necessitated a very large amount of training data. For this reason, they trained the Vision Transformer in two stages: the Transformer was first trained on a large-scale dataset in an unsupervised or self-supervised manner; these pre-trained weights were then adapted to the downstream tasks using smaller datasets.
Abdali proposed a data-efficient transformer (DeVTr) [5], which used a pre-trained 2D CNN as an embedding layer for the input data. The use of a pre-trained CNN helps in faster learning and lower memory consumption. The transformer thus proves to be a better temporal feature extractor for videos compared to recurrent models. The model achieved 96.25% accuracy on the RLVS dataset, outperforming the violence detection models which preceded it.
In 2021, Arnab et al. proposed a pure transformer model [2] which, after effective regularization, was able to perform well on smaller datasets. They make use of tubelet embedding, which efficiently extracts spatio-temporal tokens. They also experiment with various ways of factorizing the model along the spatial and temporal dimensions to process the large number of tokens effectively. It achieved state-of-the-art results on multiple video classification benchmarks, outperforming prior methods based on deep 3D convolutional networks.
III. METHODOLOGY
A. Experimental Setup
Initially, the model was trained on the two standard datasets for violence detection: the Hockey Fight Detection Dataset [6] and the Real-Life Violence Situations (RLVS) Dataset [3]. Upon evaluation of the respective trained models, it was found that certain common actions such as handshakes, waving, and walking while swinging hands were wrongly classified. Due to this, a custom dataset was created, consisting of clips collected from YouTube and relevant clips from GitHub [7] and the RLVS dataset. The dataset consists of 1112 videos: 556 violent and 556 non-violent. In addition to general non-violent videos, we have added videos of waving, walking, dancing, jumping, exercising, doing house chores and other such actions that are common in an indoor setting. By trial and error, it was found that the training and validation accuracy of the model stabilized at around the 40th epoch (Fig. 1).
Fig. 1. Training vs Validation accuracy
The accuracy of the model for different input sequence lengths is shown in Fig. 2. From the figure, it can be deduced that as the sequence length of the input video frames increases,
the accuracy of the model increases. If the sequence length is too low and the violent act happens at the end of the input video, the video would be classified as non-violent, which is a false negative. The model achieves its highest accuracy at a sequence length of 35. However, analysis of the input frames showed that in cases where videos did not have enough frames, the last frame was duplicated and appended to the list of frames, which was then given as input to the model. Hence, the sequence length values of 35 and 40 were discarded, and the selected sequence length is 30.
Fig. 2. Sequence length vs Accuracy
The testing environment used was Google Colab. Table I lists the specifications provided by the environment.
TABLE I
TESTING ENVIRONMENT SPECIFICATIONS
Specification       Value
CPU Model Name      Intel® Xeon® CPU @ 2.20GHz
CPU Cores           1
RAM Size            12.68GB
Disk Size           78.19GB
GPU Model Name      NVIDIA Tesla T4
GPU CUDA Version    11.2
GPU Memory          16GB
B. Hyperparameter Tuning
Based on experiments, the maximum sequence length (Mseq) of the frames selected from the videos was set to 30, the image size of each frame to 90 × 90 (height × width), and the learning rate of the model to 1e-3. A batch size of 32 and 40 epochs were used for each experiment. The Adam optimizer was used due to its high performance, low computation time and low training cost. Since the expected output for each video is either 0 for non-violent or 1 for violent, and the model outputs the probability of violence (a fractional value), the loss function chosen is binary cross entropy.
Fig. 3 shows the accuracy of the model for different learning rates. A small learning rate means that it would take the model a long time to learn from the input frames. A high learning rate means that the optimizer would overshoot the optimal point. From Fig. 3, it can be deduced that the model gives the highest accuracy at a learning rate of 1e-3.
Fig. 3. Learning rate vs Accuracy
C. Frame Selection
From the total number of frames in each video in the training dataset, only Mseq frames were sent for further processing. This was done to reduce the training and computation time of the model. The frames needed to be selected so as to ensure minimum loss of information.
Fig. 4. Frame selection method
We started with a simple selection and improved it to discard redundant information:
1) Method I: A straightforward approach, in which the first Mseq frames were selected.
Drawback: This approach worked only for extremely short videos. Another issue was that, since the frames were consecutive, most adjacent frames had negligible delta in the actors’ positions.
2) Method II: To overcome the drawbacks of Method I, this method avoided storing consecutive frames and hence proved applicable for longer videos. Frames were uniformly selected from the total frames of the video, so that each selected frame was temporally equidistant from the next. The distance between frames was calculated as (total no. of frames per video / Mseq). This way, even the final second of a long video could have a frame in the selected frames. This worked well when all the videos were of the same duration.
Drawback: When all the videos were not of the same duration, the temporal speed of the selected frames of
two different videos differed. A longer video with a larger number of frames has selected frames at a greater temporal distance (and thus a greater delta) than the frames selected from a shorter video. Due to this, the model lost temporal information of various actions in the videos.
3) Method III: A fixed number of frames F was selected from each second of the video until the quota of Mseq frames was met. The F frames were temporally equidistant from one another.
Drawback: This method worked only if the violent action (for videos containing violence) happened within the first T seconds, where T is (Mseq / F).
D. Data Pre-processing
Out of the total videos, 80% were used for training, 10% for testing and 10% for validation.
• The pre-processing began with dividing each of the videos into frames.
• Each frame was then converted to grayscale and cropped to a uniform size (90, 90) to speed up the computation.
• Out of all the frames, only 30 frames were taken per video. This was done by picking 5 temporally equidistant frames per second of the video.
• If a certain video did not contain the minimum number of frames (30), the last frame of that video was replicated to reach the required number of frames.
Each frame was converted to grayscale for two reasons:
• Performing pixel-to-pixel processing of a megapixel image using three color channels takes almost three times the computation that one channel would require. As thousands of frames are being analyzed, we reduce the computation time by eliminating the color channels which we don’t need;
• In addition to the above, our target systems (surveillance systems) mostly have cameras which capture footage in grayscale.
IV. IMPLEMENTATION
The architecture has been inherited from [2]. Fig. 6 shows the pictorial representation of the model architecture.
Fig. 7. Conversion of frames to patches
The non-overlapping patches from successive frames were arranged in the form of spatio-temporal tubes as shown in Fig. 8, each tubelet with a dimension of t × h × w. Positional embeddings were added to these tokens. Tokens with the same spatial index have the same embedding. Tubelet embedding, unlike uniform frame sampling, extracts a series of patches. Thus, it does not lose the time index and the frame position of each patch. Hence, it is able to capture information about each patch and its change over the temporal track. For this reason, tubelet embedding was used to extract the tokens from the video.
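The pre-processing and tokenization steps described in this paper (5 temporally equidistant frames per second up to 30 frames, padding short videos with the last frame, then cutting the clip into non-overlapping t × h × w tubes) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `select_frame_indices` and `extract_tubelets`, the `fps` parameter, and the tubelet size used in the demo are assumptions made for illustration, since the paper does not state them.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[Sequence[float]]  # one grayscale frame: rows of pixel values


def select_frame_indices(total_frames: int, fps: int,
                         m_seq: int = 30, per_second: int = 5) -> List[int]:
    """Pick `per_second` equidistant frame indices from each second of video
    until `m_seq` indices are collected; pad with the last frame if short."""
    if total_frames <= 0 or not 0 < per_second <= fps:
        raise ValueError("need total_frames > 0 and 0 < per_second <= fps")
    step = fps / per_second  # spacing between picks within one second
    indices: List[int] = []
    second = 0
    while len(indices) < m_seq and second * fps < total_frames:
        for k in range(per_second):
            idx = second * fps + int(k * step)
            if idx >= total_frames or len(indices) == m_seq:
                break
            indices.append(idx)
        second += 1
    # Assumption: a video that cannot fill the quota is padded by
    # replicating the index of its last frame.
    indices.extend([total_frames - 1] * (m_seq - len(indices)))
    return indices


def extract_tubelets(video: Sequence[Frame], t: int, h: int, w: int
                     ) -> List[Tuple[Tuple[int, int, int], List[float]]]:
    """Cut a (frames x height x width) clip into non-overlapping t x h x w
    tubes and flatten each tube into one token vector."""
    n_frames, height, width = len(video), len(video[0]), len(video[0][0])
    tokens = []
    for t0 in range(0, n_frames - t + 1, t):
        for y0 in range(0, height - h + 1, h):
            for x0 in range(0, width - w + 1, w):
                token = [video[f][y][x]
                         for f in range(t0, t0 + t)
                         for y in range(y0, y0 + h)
                         for x in range(x0, x0 + w)]
                # Keep the (time, row, column) index of the tube so that a
                # positional embedding can be attached to the token later.
                tokens.append(((t0 // t, y0 // h, x0 // w), token))
    return tokens


if __name__ == "__main__":
    idx = select_frame_indices(total_frames=300, fps=30)
    # Toy 4-frame, 4x4 clip whose pixel value encodes (frame, row, column).
    clip = [[[float(f * 100 + y * 10 + x) for x in range(4)]
             for y in range(4)] for f in range(4)]
    tubes = extract_tubelets(clip, t=2, h=2, w=2)
    print(len(idx), len(tubes), len(tubes[0][1]))  # prints: 30 8 8
```

Each returned token carries its tube index alongside the flattened pixels, which is what allows the time index and frame position of a patch to be preserved, as the text notes. In a real model the flattened tube would additionally pass through a learned linear projection, which this sketch omits.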
Fig. 8. Tubelet embedding
The tubelets were then flattened and their corresponding position encoding in 3D space was appended to them. This series of tokens was given as input to the transformer model. First, the encoded patches were passed to a normalization layer. Next, these normalized tokens were passed through a Multi-Headed Self-Attention (MSA) layer to extract the important features. After another layer of normalization, the tokens were passed to two Dense layers, both having ReLU as the activation function. The tokens were normalized again before being passed to the Global Average Pooling layer. Finally, a Dense layer with a sigmoid activation function was used to get the final output. Since the output is a binary classification, the loss function used was binary cross-entropy; the optimizer used was Adam.
Fig. 9 shows how well the model trained on the custom dataset fared in predicting the labels of the videos correctly.
• True Positive: The activity is violent, and is classified as violent.
• True Negative: The activity is non-violent, and is not classified as violent.
• False Positive: The activity is non-violent, but the model classifies it as violent.
• False Negative: The activity is violent, but the model does not classify it as violent.
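The four outcome categories defined for Fig. 9 can be counted directly from the ground-truth labels and the sigmoid outputs of the model. The following is a minimal sketch under stated assumptions: the 0.5 decision threshold, the function names, and the toy labels and probabilities in the demo are hypothetical and are not results or settings reported in the paper.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_prob: Sequence[float],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Count TP/TN/FP/FN for binary labels (1 = violent, 0 = non-violent)
    given predicted probabilities of violence."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for label, prob in zip(y_true, y_prob):
        predicted_violent = prob >= threshold  # assumed decision rule
        if label == 1 and predicted_violent:
            counts["TP"] += 1
        elif label == 0 and not predicted_violent:
            counts["TN"] += 1
        elif label == 0 and predicted_violent:
            counts["FP"] += 1
        else:  # label == 1 and not predicted violent
            counts["FN"] += 1
    return counts


def accuracy(c: Dict[str, int]) -> float:
    total = sum(c.values())
    return (c["TP"] + c["TN"]) / total if total else 0.0


def precision(c: Dict[str, int]) -> float:
    denom = c["TP"] + c["FP"]
    return c["TP"] / denom if denom else 0.0


def recall(c: Dict[str, int]) -> float:
    denom = c["TP"] + c["FN"]
    return c["TP"] / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    labels = [1, 1, 0, 0, 1]
    probs = [0.9, 0.2, 0.1, 0.7, 0.6]
    c = confusion_counts(labels, probs)
    print(c, accuracy(c), precision(c), recall(c))
```

For a surveillance use case, recall is the quantity most directly tied to missed violent events (false negatives), so it is worth reporting alongside accuracy.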
REFERENCES
[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” 2020.