
Abstract:

This research proposes an extension of a Convolutional Neural Network (CNN) - Long Short-Term Memory (LSTM) approach for human activity recognition to include scenarios involving multiple individuals. Traditionally, activity recognition in videos involved analyzing each frame independently using an image classifier and selecting the most frequent activity label among frames. However, this method is ineffective because it ignores the temporal relationships between frames. Advancements in deep learning, particularly CNNs and
LSTMs, have enhanced human activity recognition systems. CNNs excel at
extracting spatial features, identifying objects, scenes, and activities within
frames, while LSTMs effectively capture temporal dependencies, understanding
how visual elements evolve over time in video sequences. Here, we train and
test the deep learning models ConvLSTM and LRCN using the UCF50 - Action
Recognition Dataset, which comprises 50 action categories, each containing 25 groups of videos. To achieve multi-person activity recognition,
YOLO (You Only Look Once) is employed for detecting humans within the
video frames. Subsequently, a tracking mechanism is utilized to follow each
person across at least 20 consecutive frames. These frames are fed into the
LRCN model for per-person activity recognition. Predicted activities are
labeled on top of the bounding boxes of each person, enabling simultaneous
recognition of multiple individuals' activities in a video stream.
1. Introduction
Human activity recognition in video streams has been a challenging yet crucial
task in computer vision and artificial intelligence. Traditional methods treat
each frame independently, discarding the temporal context necessary for
understanding complex activities. Recent advancements in deep learning,
particularly the integration of Convolutional Neural Networks (CNNs) and
Long Short-Term Memory (LSTM) networks, have significantly improved the
accuracy and efficiency of activity recognition systems. CNNs excel at
extracting spatial features, while LSTMs effectively capture temporal
dependencies, enabling a more comprehensive understanding of dynamic
activities. However, these approaches still struggle to distinguish activities
involving multiple individuals simultaneously, limiting their practical
applications.

In this research, we extend the CNN-LSTM approach to address the challenge of recognizing activities involving multiple individuals within a video stream. We present a novel approach for multi-person activity recognition that combines object detection, tracking, and deep learning methodologies. The YOLO (You Only Look Once) object detection algorithm is used to identify humans in video frames and returns the (x, y) coordinates of each detected person. Our tracking method clusters these detections by comparing the distances between the coordinates of persons across consecutive frames. We do this for at least 20 frames (the exact number depends on the fps) and collect the cropped region frames for each person.

The collected frames, each containing the activity region of a person, are then
input into the Long-term Recurrent Convolutional Network (LRCN) model for activity prediction. By using the temporal dependencies
captured by LRCN, we aim to accurately recognize activities performed by each
individual throughout the video sequence.

Our study focuses on training and evaluating deep learning models, including
ConvLSTM and LRCN, using the UCF50 - Action Recognition Dataset. This
dataset comprises diverse action categories, providing a robust testing ground
for our proposed approach.
The objective of this research is to develop a system capable of accurately
recognizing activities performed by multiple individuals in real-world video
streams. By integrating state-of-the-art deep learning techniques with object
detection and tracking mechanisms, we aim to advance the capabilities of
human activity recognition systems and contribute to applications such as
surveillance, sports analysis, and human-computer interaction.

2. Literature Survey
This paper [1] uses deep learning for human action recognition by combining a CNN and an LSTM. The CNN extracts useful patterns from each frame of a video, and the LSTM stores these patterns for overall video analysis. They tested this model on a large dataset and found that the proposed system works faster than comparable methods, with an accuracy of about 80%.
This paper [2] uses depth sensors for human action recognition. A feature set is built from the skeleton shapes found in the images captured by the sensors, and the action is then predicted with the help of a multi-class Support Vector Machine.

The main aim of this paper [3] is to alert people whenever an unusual activity is detected. This is done by combining YOLO (You Only Look Once) with deep learning models: YOLO is used for object detection, and the deep learning models classify the action performed by the recognized object. These models are able to recognize both simple and complex actions.

This paper [4] proposes a new approach for human action recognition: combining several deep learning models into a hybrid model with high accuracy. The proposed model is tested on datasets such as UCF Sports, UCF101, and KTH. The results show that the model achieves an average accuracy of 96.3% on the KTH dataset.

This paper [5] proposes an action recognition system that takes data from the accelerometers and gyroscopes in smartphones and applies data mining and machine learning techniques. They first used the Random Forest algorithm; due to its computational complexity, it was later replaced with a Modified Random Forest algorithm, which builds small decision trees for classification.

This paper [6] describes a new method called "Action Fusion". The method helps a computer understand what humans are doing using depth maps, which show how a person moves, together with information about specific body postures. Three different training streams are used, and their outputs are finally combined to obtain the best result. The model was tested on three datasets and found to work better than previous approaches.

In this paper [7], the authors studied ten recent techniques that use the Kinect camera to recognize actions and evaluated them on six different datasets. They also improved some of these techniques and tested the improved versions. They found that most methods recognize actions performed by different people more reliably than actions of the same person viewed from different angles, and that techniques focusing on the skeleton of the person handle viewpoint changes better than techniques focusing on the depth of the image.

This paper [8] focuses on identifying human activities in sports videos captured by a camera. The recognition process is split into four steps: dividing the video into frames, identifying human body parts in those frames, recognizing the activity using a convolutional neural network, and determining the timing of each activity using a deep learning algorithm. The method was tested on two different sports datasets and found to work better than existing methods for recognizing activities such as swimming and bowling.
This paper [9] proposes a new way of identifying human actions: wearable devices connected to the internet over Wi-Fi collect data and send it to cloud services, where it is analyzed using a deep learning model. This helps monitor what people are doing at home and supports areas such as healthcare, elderly care, and entertainment.
3. Proposed Method
3.1 Choosing the dataset
UCF50 Dataset:
The UCF50 dataset is a widely used benchmark dataset for action recognition. It consists of 50 action categories, each containing 25 groups of videos. The dataset covers a wide range of human activities, making it suitable for training and evaluating action recognition models.
Code for downloading and extracting the UCF50 dataset in Google Colab:
# Download the UCF50 dataset
!wget --no-check-certificate https://www.crcv.ucf.edu/data/UCF50.rar

# Extract the dataset
!unrar x UCF50.rar
Kinetics Dataset:
The Kinetics dataset is a large-scale action recognition dataset containing approximately 650,000 video clips across 700 action classes. It covers a diverse range of human actions captured from various sources, making it valuable for training deep learning models for action recognition tasks.
Code for downloading the Kinetics dataset in Google Colab:
# Download the Kinetics dataset
!wget --no-check-certificate https://www.deepmind.com/documents/88/kinetics_600_val.zip

# Extract the dataset
!unzip kinetics_600_val.zip

3.2 Visualize the data along with its annotations

We will visualize the data along with its labels to get an idea of what we will be dealing with. We will be using the UCF50 - Action Recognition Dataset, consisting of realistic videos taken from YouTube, which differentiates this dataset from most other available action recognition datasets, as those are not realistic and are staged by actors. The dataset contains:
- 50 Action Categories
- 25 Groups of Videos per Action Category
- 133 Average Videos per Action Category
- 199 Average Number of Frames per Video
- 320 Average Frame Width per Video
- 240 Average Frame Height per Video
- 26 Average Frames Per Second per Video
For visualization, we will pick 20 random categories from the dataset and a random video from each selected category, and will visualize the first frame of each selected video with its associated label written on it. This way we will be able to visualize a subset (20 random videos) of the dataset.
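A minimal sketch of this visualization step is given below, assuming the dataset has been extracted to a UCF50 directory; the variable names (DATASET_DIR, all_classes_names) are illustrative rather than taken verbatim from our implementation.
import os
import random
import cv2
import matplotlib.pyplot as plt

DATASET_DIR = "UCF50"  # assumed extraction path of the UCF50 dataset
all_classes_names = os.listdir(DATASET_DIR)

plt.figure(figsize=(20, 20))
for index, class_name in enumerate(random.sample(all_classes_names, 20), start=1):
    # Pick one random video from the chosen category and read its first frame.
    class_dir = os.path.join(DATASET_DIR, class_name)
    video_path = os.path.join(class_dir, random.choice(os.listdir(class_dir)))
    reader = cv2.VideoCapture(video_path)
    success, frame = reader.read()
    reader.release()
    if not success:
        continue
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cv2.putText(frame, class_name, (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    plt.subplot(5, 4, index)
    plt.imshow(frame)
    plt.axis("off")
plt.show()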
3.3 Preprocess the dataset for Analysis
In the first step, we create the extract_frames() function to extract frames from videos and normalize their pixel values to the [0, 1] range by dividing them by 255. Additionally, we remove unwanted frames that do not contain activity information, returning only the useful frames. The frames are resized to a standard size of 64 x 64 pixels, a common preprocessing practice. We set the sequence length to 20 frames, which serves as the default throughout the project.
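A minimal sketch of this first step is shown below, assuming OpenCV is used for reading videos; the exact filtering of non-informative frames is omitted for brevity.
import cv2

IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64
SEQUENCE_LENGTH = 20  # default sequence length used throughout the project

def extract_frames(video_path):
    """Read a video and return its frames resized to 64 x 64
    with pixel values normalized to the [0, 1] range."""
    frames = []
    reader = cv2.VideoCapture(video_path)
    while True:
        success, frame = reader.read()
        if not success:
            break
        frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames.append(frame / 255.0)  # normalization
    reader.release()
    return frames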
In the second step, we define the dataset_creation() function, which maps features, labels, and video paths for each selected video category. To sample frames evenly across each video, we calculate the skip window using the formula:
skip_frames_window = max(int(video_frames_count / SEQUENCE_LENGTH), 1)
For a typical video of around 200 frames, this means taking roughly every 10th frame, ensuring a representative subset of frames for analysis.
Finally, we convert the encoded class indexes to one-hot vectors using Keras. This conversion represents the categorical labels in a format more suitable for deep learning models.
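A sketch of this second step is given below; the selected class subset (CLASSES_LIST) is an illustrative assumption, and to_categorical is the Keras utility that performs the one-hot conversion mentioned above.
import os
import numpy as np
from tensorflow.keras.utils import to_categorical

DATASET_DIR = "UCF50"
CLASSES_LIST = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]  # example subset

def dataset_creation():
    """Map features, labels, and video paths for each selected class,
    keeping SEQUENCE_LENGTH evenly spaced frames per video."""
    features, labels, video_paths = [], [], []
    for class_index, class_name in enumerate(CLASSES_LIST):
        class_dir = os.path.join(DATASET_DIR, class_name)
        for file_name in os.listdir(class_dir):
            video_path = os.path.join(class_dir, file_name)
            frames = extract_frames(video_path)
            skip_frames_window = max(int(len(frames) / SEQUENCE_LENGTH), 1)
            frames = frames[::skip_frames_window][:SEQUENCE_LENGTH]
            if len(frames) == SEQUENCE_LENGTH:  # keep only complete sequences
                features.append(frames)
                labels.append(class_index)
                video_paths.append(video_path)
    return np.asarray(features), np.asarray(labels), video_paths

features, labels, video_paths = dataset_creation()
one_hot_encoded_labels = to_categorical(labels)  # class indexes -> one-hot vectors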
3.4 Divide the data into test and train sets
Before proceeding with the training and testing of our proposed models, it is
crucial to partition our structured dataset into separate training and testing sets.
The inputs required for the split are the features and the one_hot_encoded_labels of the categorical data. To achieve this, we utilize the sklearn library in Python to split the dataset, allocating 75% of the data for training and reserving 25% for testing. To avoid any kind of bias, we shuffle the dataset by setting shuffle = True, and we also set random_state to 27 for reproducibility. This partitioned dataset will serve as the basis for training and testing our ConvLSTM and LRCN models.
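A minimal sketch of this split, assuming the features and one_hot_encoded_labels arrays from the previous step:
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels,
    test_size=0.25,     # reserve 25% of the data for testing
    shuffle=True,       # rearrange the dataset to avoid bias
    random_state=27)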
4.1: Implement ConvLSTM
In this step, we will implement the first approach: the ConvLSTM model. This
model blends Convolutional Neural Networks (CNNs) with Long Short-Term
Memory (LSTM) cells. A ConvLSTM cell is essentially an LSTM network but
with convolutions integrated into its structure. This unique combination allows
the model to recognize spatial features in the data while also considering the
temporal relationships between them.

4.2: Develop the model architecture

For constructing our model, we'll utilize the Keras library's ConvLSTM2D
recurrent layers. These layers are crucial as they integrate convolutional
operations into the LSTM architecture. We'll specify the number of filters and
kernel size required for these convolutions. After processing through
ConvLSTM2D layers, the output is flattened and passed to a Dense layer with
softmax activation. This layer computes the probability of each action category.
We will also use MaxPooling3D layers to reduce the dimensions of the frames
and avoid unnecessary computations and Dropout layers to prevent overfitting
the model on the data. The architecture is a simple one and has a small number
of trainable parameters. This is because we are only dealing with a small subset
of the dataset which does not require a large-scale model.
The ConvLSTM model we've built consists of a total of 44,524 parameters, all
of which are trainable. There are no non-trainable parameters in this model.
Below is a detailed breakdown of the layers in the ConvLSTM model:
1. ConvLSTM2D Layer: This layer integrates convolutional operations into the LSTM architecture, enabling the model to learn spatial features while considering temporal relationships.
2. MaxPooling3D Layer: This layer reduces the dimensions of the frames, helping to streamline computations and prevent overfitting.
3. Dropout Layer: This layer aids in preventing overfitting by randomly dropping a fraction of input units during training.
4. Flatten Layer: This layer flattens the output of the ConvLSTM2D layer into a one-dimensional array.
5. Dense Layer: This layer computes the probability of each action category using softmax activation.
Overall, the ConvLSTM model is designed to effectively process video data,
capturing both spatial and temporal information to make accurate predictions.
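A possible Keras sketch of this architecture is given below; the filter counts, kernel sizes, and dropout rates are assumptions chosen to keep the model small, not the exact values of our trained network.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (ConvLSTM2D, MaxPooling3D, TimeDistributed,
                                     Dropout, Flatten, Dense)

def create_convlstm_model():
    """Stack of ConvLSTM2D + MaxPooling3D + Dropout blocks followed by
    Flatten and a softmax Dense layer (one unit per action class)."""
    model = Sequential([
        ConvLSTM2D(filters=4, kernel_size=(3, 3), activation='tanh',
                   recurrent_dropout=0.2, return_sequences=True,
                   input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
        MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
        TimeDistributed(Dropout(0.2)),
        ConvLSTM2D(filters=8, kernel_size=(3, 3), activation='tanh',
                   recurrent_dropout=0.2, return_sequences=True),
        MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
        TimeDistributed(Dropout(0.2)),
        ConvLSTM2D(filters=16, kernel_size=(3, 3), activation='tanh',
                   recurrent_dropout=0.2, return_sequences=True),
        MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
        Flatten(),
        Dense(len(CLASSES_LIST), activation='softmax')
    ])
    model.summary()
    return model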

4.3: Configure and train the model


Before diving into model training, we initialize an instance for EarlyStopping
Callback. This allows us to monitor our model's performance during training
and stop it early if the validation loss stops improving.
Next, we evaluate our model's performance by analyzing the loss and other relevant metrics. Once we are satisfied with the setup, we commence model training. During this process, we set the batch size to 4, and we shuffle the dataset before training to prevent bias. Below is a summary of the trained ConvLSTM model, providing an overview of its architecture and parameters.
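The corresponding training setup is sketched below; the patience and epoch count are assumed values, while the batch size of 4, shuffling, and validation-loss monitoring follow the description above.
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss stops improving and keep the best weights.
early_stopping = EarlyStopping(monitor='val_loss', patience=10,
                               mode='min', restore_best_weights=True)

convlstm_model = create_convlstm_model()
convlstm_model.compile(loss='categorical_crossentropy', optimizer='Adam',
                       metrics=['accuracy'])

convlstm_history = convlstm_model.fit(
    x=features_train, y=labels_train,
    epochs=50, batch_size=4,     # batch size of 4
    shuffle=True,                # randomize the data before training
    validation_split=0.2,
    callbacks=[early_stopping])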

4.4: Visualize the training metrics


Visualizing training metrics for the ConvLSTM (Convolutional Long Short-Term Memory) model involves plotting various metrics over time as the model trains. Here, those metrics are the total loss and the total validation loss.
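A small plotting helper of the kind used for these curves is sketched below (matplotlib assumed):
import matplotlib.pyplot as plt

def plot_metric(history, metric, val_metric, title):
    """Plot a training metric against its validation counterpart over the epochs."""
    epochs = range(len(history.history[metric]))
    plt.plot(epochs, history.history[metric], 'b', label=metric)
    plt.plot(epochs, history.history[val_metric], 'r', label=val_metric)
    plt.title(title)
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()

plot_metric(convlstm_history, 'loss', 'val_loss',
            'Total Loss vs Total Validation Loss')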

5.1: Implement LRCN


LRCN models generally contain both convolutional neural networks and recurrent neural networks. The convolutional layers extract useful features from the input data, and the recurrent layers process those extracted features while remembering what happened before.
5.2: Develop the model architecture
There are two variants of the LRCN architecture: LRCN-fc6 and LRCN-fc7. In the LRCN-fc6 architecture the LSTM is placed after the first fully connected layer, and in LRCN-fc7 the LSTM is placed after the second fully connected layer. LRCN (Long-term Recurrent Convolutional Network) models consist of several layers: convolutional layers, recurrent layers, fully connected layers, and an output layer. The convolutional layers extract spatial features from the input image, the recurrent layers memorize those extracted features, the fully connected layers extract complex patterns from them, and the output layer finally produces the desired output.
1. Convolutional layers: These layers are stacked one upon another in order to extract useful features from the input. As the information passes through these layers, the features are extracted.
2. Recurrent layers: These are usually LSTM or GRU layers. They process the features extracted by the convolutional layers, handling the present information while remembering what happened before.
3. Fully connected layers: These layers use the extracted features to create a final representation suitable for the target task. Activation functions are used in the fully connected layers in order to capture complex patterns in the extracted features.
4. Output layer: Finally, the output layer produces the desired output. The number of nodes in the output layer depends on the specific task the LRCN is designed for. In conclusion, each layer in the LRCN has its specific role in processing the input video to produce the required output.
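A possible Keras sketch of such a layer stack is shown below; the per-frame convolutions are wrapped in TimeDistributed so that the LSTM receives one feature vector per frame. The specific filter counts, pooling sizes, and the 32-unit LSTM are assumptions for illustration.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, LSTM, Dense)

def create_lrcn_model():
    model = Sequential([
        # Convolutional layers (applied per frame) extract spatial features.
        TimeDistributed(Conv2D(16, (3, 3), padding='same', activation='relu'),
                        input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
        TimeDistributed(MaxPooling2D((4, 4))),
        TimeDistributed(Dropout(0.25)),
        TimeDistributed(Conv2D(32, (3, 3), padding='same', activation='relu')),
        TimeDistributed(MaxPooling2D((4, 4))),
        TimeDistributed(Dropout(0.25)),
        TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
        TimeDistributed(MaxPooling2D((2, 2))),
        TimeDistributed(Dropout(0.25)),
        TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
        TimeDistributed(MaxPooling2D((2, 2))),
        TimeDistributed(Flatten()),
        # Recurrent layer memorizes how the per-frame features evolve over time.
        LSTM(32),
        # Output layer produces one probability per action class.
        Dense(len(CLASSES_LIST), activation='softmax')
    ])
    model.summary()
    return model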
5.3: Configure and train the model
To train the model we use a technique known as Early Stopping, which stops training early once the required criterion is met; this helps avoid overfitting. It is implemented with an EarlyStopping callback that monitors the validation loss during training.
Next, in the compilation step, we specify the metric we want to monitor during training using the compile method; here we specified accuracy. After this we train the model until we obtain a good accuracy, providing a batch_size of 4, which tells how many training examples are processed together before the model parameters are updated.
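The LRCN training configuration mirrors the ConvLSTM setup; a short sketch (epoch count assumed) is given below, reusing the EarlyStopping callback defined earlier.
lrcn_model = create_lrcn_model()
lrcn_model.compile(loss='categorical_crossentropy', optimizer='Adam',
                   metrics=['accuracy'])   # accuracy monitored during training

lrcn_history = lrcn_model.fit(
    x=features_train, y=labels_train,
    epochs=70, batch_size=4,   # 4 examples per parameter update
    shuffle=True,
    validation_split=0.2,
    callbacks=[early_stopping])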
5.4: Visualize the training metrics

From the metrics, it is observed that LRCN performed better than ConvLSTM, achieving 95% accuracy with a 20% loss, while ConvLSTM achieved 80% accuracy.
6. Integrate Multi-Person Recognition Technique:
From the results, it appears that the LRCN model has performed very well on a small number of classes. So, in this step, we create a method that combines the YOLO model, the LRCN prediction model, and the multi-person activity recognition logic. Initially, the YOLO model is loaded to identify persons in each frame of the video. Upon detecting persons, their regions are processed for prediction using the LRCN model.

The processing involves clustering the person regions based on a distance threshold. We calculate the distance between each detected person in the current frame and the existing person clusters and add the detection to the corresponding cluster. If the calculated distance is greater than the threshold, a new person cluster is created. In this way, each person in the video is tracked separately. Once a cluster reaches the predetermined sequence length, its frames are resized, normalized, and given to the LRCN model, which predicts the corresponding action label. A bounding box is drawn around the person region with the predicted label on top of it. This happens for each person cluster. Finally, the processed frames with predicted labels are written to an output video file, providing a comprehensive analysis of the activities captured in the input video.
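A condensed sketch of this pipeline is given below. It assumes the Ultralytics YOLO package as the person detector and a simple centre-distance threshold for the per-person clustering; the weight file, threshold value, output fps, and file names are illustrative, not the exact settings of our system.
import cv2
import numpy as np
from collections import defaultdict
from ultralytics import YOLO   # assumed YOLO implementation

yolo = YOLO("yolov8n.pt")      # assumed person-detection weights
DIST_THRESHOLD = 50            # pixel distance for assigning a detection to a cluster
clusters = defaultdict(list)   # cluster id -> collected person-region frames
centers = {}                   # cluster id -> last known (x, y) centre

def assign_cluster(center):
    """Return the nearest existing person cluster, or a new cluster id."""
    for cid, prev in centers.items():
        if np.hypot(center[0] - prev[0], center[1] - prev[1]) < DIST_THRESHOLD:
            return cid
    return len(centers)        # distance above threshold -> new person cluster

video = cv2.VideoCapture("input_video.mp4")   # illustrative file name
writer = None
while True:
    ok, frame = video.read()
    if not ok:
        break
    results = yolo(frame, classes=[0], verbose=False)   # class 0 = person
    for box in results[0].boxes.xyxy.cpu().numpy():
        x1, y1, x2, y2 = box.astype(int)
        center = ((x1 + x2) / 2, (y1 + y2) / 2)
        cid = assign_cluster(center)
        centers[cid] = center
        crop = cv2.resize(frame[y1:y2, x1:x2], (IMAGE_WIDTH, IMAGE_HEIGHT)) / 255.0
        clusters[cid].append(crop)
        if len(clusters[cid]) == SEQUENCE_LENGTH:
            # Predict the action for this person and label the bounding box.
            probs = lrcn_model.predict(np.expand_dims(clusters[cid], axis=0))[0]
            label = CLASSES_LIST[np.argmax(probs)]
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
            clusters[cid] = []                 # start a new sequence for this person
    if writer is None:
        h, w = frame.shape[:2]
        writer = cv2.VideoWriter("output_video.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"), 25, (w, h))
    writer.write(frame)

video.release()
if writer is not None:
    writer.release()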
7. Conclusion:
References:
1. Sanchez-Caballero, Adrian, David Fuentes-Jimenez, and Cristina Losada-Gutiérrez. "Exploiting the ConvLSTM: Human action recognition using raw depth video-based recurrent neural networks." arXiv preprint arXiv:2006.07744 (2020).
2. Taha, Ahmed, et al. "Exploring behavior analysis in video surveillance applications." International Journal of Computer Applications 93.14 (2014): 22-32.
3. Padmaja, Budi, Madhu Bala Myneni, and Epili Krishna Rao Patro. "A comparison on visual prediction models for MAMO (multi activity-multi object) recognition using deep learning." Journal of Big Data 7.1 (2020): 24.
4. Jaouedi, Neziha, Noureddine Boujnah, and Med Salim Bouhlel. "A new hybrid deep learning model for human action recognition." Journal of King Saud University - Computer and Information Sciences 32.4 (2020): 447-453.
5. Polu, Sandeep Kumar, and S. K. Polu. "Human activity recognition on smartphones using machine learning algorithms." International Journal for Innovative Research in Science & Technology 5.6 (2018): 31-37.
6. Kamel, Aouaidjia, et al. "Deep convolutional neural networks for human action recognition using depth maps and postures." IEEE Transactions on Systems, Man, and Cybernetics: Systems 49.9 (2018): 1806-1819.
7. Wang, Lei, Du Q. Huynh, and Piotr Koniusz. "A comparative review of recent Kinect-based action recognition algorithms." IEEE Transactions on Image Processing 29 (2019): 15-28.
8. Deotale, Disha, and Madhushi Verma. "Human activity recognition in untrimmed video using deep learning for sports domain." (2020).
9. Bianchi, Valentina, et al. "IoT wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment." IEEE Internet of Things Journal 6.5 (2019): 8553-8562.
