
Video-based action recognition

Balgynbek Dikhan
School of Engineering and Digital Sciences, Data Science
balgynbek.dikhan@nu.edu.kz

Rustem Sankubayev
School of Engineering and Digital Sciences, Computer Science
rustem.sankubayev@nu.edu.kz

Sultan Abylkairov
School of Sciences and Humanities, Mathematics
sultan.abylkairov@nu.edu.kz

1. Introduction

In today's society, advancements in camera technology have made cameras affordable and accessible to everyone; every mobile phone now comes with a built-in camera. Video devices are also widely used in fields such as security, traffic management, healthcare, and sports. The abundance of video data generated by these sources offers opportunities for research within the computer vision community. For example, the ability to understand actions is crucial in many applications that greatly impact society, including intelligent surveillance [9], patient monitoring, sports analysis, and human-computer interaction [13].

2. Literature Review

In the field of human action recognition, researchers have conducted numerous studies using various methods and types of data. Early research focused on RGB or grayscale video as input, mainly because of its wide availability and ease of use [11]. In recent years, however, there has been a notable shift towards alternative data modalities, as evidenced by numerous works [1, 10, 12, 5, 17, 4, 19, 7]. These modalities include, but are not limited to, skeletal data, depth information, infrared sequences, point clouds, event streams, audio, acceleration, and radar. This trend has been driven by the proliferation of accurate and cost-effective sensor technologies. Moreover, Convolutional Neural Networks (CNNs) [18, 3, 6] and Long Short-Term Memory networks (LSTMs) [2] have demonstrated remarkable performance when applied to human action recognition.

In this project, we have chosen to use the freely available UCF Sports Dataset [16]. We find this dataset particularly valuable due to its rich diversity of actions, encompassing a wide spectrum of sports and physical activities. Furthermore, it presents challenging real-world scenarios, including variations in lighting, backgrounds, and camera angles, as well as instances of occlusion. These inherent complexities make it an ideal testing ground for algorithms designed to address real-world challenges.

3. Technical approach

3.1. Dataset

The first step is to download the UCF Sports Dataset [14], which consists of video clips showcasing ten actions. The dataset contains a total of 150 sequences at a resolution of 720 x 480 and covers diving, golf swing, kicking, lifting, riding horse, running, skateboarding, swinging bench, swinging side, and walking. These actions represent a range of scenarios and viewpoints. The dataset is divided into two subsets, training and testing, and this initial phase sets the foundation for model training and evaluation.

Data preprocessing plays a key role in ensuring consistency across the dataset. This involves standardizing video parameters such as resolution and frame rate. Additionally, individual frames are extracted from the video clips to serve as the basis for feature extraction; a minimal sketch of this step follows.
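The snippet below is one hedged way to implement this preprocessing with OpenCV. The target resolution matches the dataset's native 720 x 480, but the frame-sampling step and the directory layout are illustrative assumptions rather than values fixed by the project.

```python
import cv2
from pathlib import Path

TARGET_SIZE = (720, 480)  # (width, height); the dataset's native resolution
FRAME_STEP = 2            # assumed: keep every 2nd frame to equalize frame rates

def extract_frames(video_path: Path, out_dir: Path) -> int:
    """Decode a clip, resize each kept frame, and save it as a PNG."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    saved = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % FRAME_STEP == 0:
            frame = cv2.resize(frame, TARGET_SIZE)
            cv2.imwrite(str(out_dir / f"frame_{saved:05d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical layout: extract_frames(Path("ucf_sports/Diving/001.avi"),
#                                     Path("frames/Diving/001"))
```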
3.2. Baseline

To capture information from the video frames, we utilize the Space-Time Interest Point (STIP) feature extraction method [8]. STIP captures details about motion patterns within each frame and enables us to identify the patterns that define different actions. We systematically extract STIP features from every frame to create a representation of action dynamics. The approach involves conducting a series of experiments to find suitable extraction parameters, such as scale and location thresholds; a simplified illustration of the detector idea appears below.
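Laptev's detector extends the Harris corner criterion from space to space-time. The following is a deliberately simplified sketch of that idea (a smoothed 3D structure tensor scored with a Harris-style response), not the reference implementation of [8]; the smoothing scales, the constant k, and the threshold are placeholder assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_interest_points(volume, sigma=2.0, tau=1.5, k=0.005, thresh=1e-4):
    """Toy space-time corner detector on a (T, H, W) grayscale volume."""
    # Smooth in time (tau) and space (sigma), then take t/y/x gradients.
    v = gaussian_filter(volume.astype(np.float64), (tau, sigma, sigma))
    lt, ly, lx = np.gradient(v)

    # Entries of the 3x3 second-moment matrix, integrated over a local window.
    w = (tau, sigma, sigma)
    g = {na + nb: gaussian_filter(a * b, w)
         for a, na in ((lt, "t"), (ly, "y"), (lx, "x"))
         for b, nb in ((lt, "t"), (ly, "y"), (lx, "x"))}

    # Harris-style response: det(M) - k * trace(M)^3 at every voxel.
    det = (g["tt"] * (g["yy"] * g["xx"] - g["yx"] * g["xy"])
           - g["ty"] * (g["yt"] * g["xx"] - g["yx"] * g["xt"])
           + g["tx"] * (g["yt"] * g["xy"] - g["yy"] * g["xt"]))
    trace = g["tt"] + g["yy"] + g["xx"]
    response = det - k * trace ** 3

    # Interest points: (t, y, x) voxels whose response exceeds the threshold.
    return np.argwhere(response > thresh)
```

In this toy version, sigma and tau play the role of the scale parameters tuned experimentally in the text, and the response threshold corresponds to the location threshold.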

During data representation, each video is converted into a sequence of STIP feature vectors. To improve action recognition, segmentation techniques are used to divide videos into segments or shots, which helps the model identify action sequences effectively; one possible heuristic is sketched below.
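The text does not name a particular segmentation method, so the following sketch uses a common heuristic as an assumed stand-in: a shot boundary is declared wherever the color-histogram distance between consecutive frames spikes. The histogram binning and the threshold are illustrative.

```python
import cv2

def shot_boundaries(video_path: str, threshold: float = 0.5) -> list[int]:
    """Return frame indices where the HSV histogram changes sharply,
    a simple proxy for shot/segment boundaries."""
    cap = cv2.VideoCapture(video_path)
    prev_hist, boundaries, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: 0 = identical, 1 = very different.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:
                boundaries.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    return boundaries
```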
3.3. Main approach

In this step, we choose the Support Vector Machine (SVM) [15] as our classification model to differentiate between the ten actions. The SVM is a widely used algorithm in machine learning and computer vision, including video-based action recognition, and selecting a suitable SVM configuration is important for good performance on this task.

We train our model using the SVM classifier and the STIP features extracted from the video frames. The training process involves refining the SVM model by experimenting with regularization settings to enhance its accuracy and overall performance; a sketch of this training step follows.
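The text pairs STIP features with an SVM but does not specify how each clip's variable-length set of descriptors becomes a fixed-length input. The sketch below assumes a common bag-of-visual-words encoding (k-means codebook plus normalized histogram) before fitting a scikit-learn SVM; the vocabulary size, kernel choice, and C grid are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

VOCAB_SIZE = 200  # assumed codebook size

def encode_bow(clip_descriptors: np.ndarray, kmeans: KMeans) -> np.ndarray:
    """Quantize one clip's STIP descriptors into a normalized histogram."""
    words = kmeans.predict(clip_descriptors)
    hist = np.bincount(words, minlength=VOCAB_SIZE).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def train_action_svm(train_descriptors, train_labels):
    """train_descriptors: list of (n_points_i, dim) arrays, one per clip."""
    # Build the visual vocabulary from all training descriptors.
    kmeans = KMeans(n_clusters=VOCAB_SIZE, n_init=10, random_state=0)
    kmeans.fit(np.vstack(train_descriptors))

    X = np.array([encode_bow(d, kmeans) for d in train_descriptors])

    # Tune the regularization strength C, per the refinement step above.
    grid = GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        param_grid={"svc__C": [0.1, 1, 10, 100]},
        cv=3,
    )
    grid.fit(X, train_labels)
    return grid.best_estimator_, kmeans
```

Grid-searching C corresponds to the "experimenting with regularization settings" described in this section.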
3.4. Evaluation Metric

To thoroughly assess the model's performance, we implement the Leave-One-Out (LOO) validation methodology [14]. This approach guarantees that every data point is used for both training and testing, resulting in a reliable assessment of the system. We record evaluation metrics such as accuracy, precision, recall, and F1 score for each action category, providing a comprehensive picture of the model's efficacy; a sketch of the protocol follows.
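A minimal sketch of the LOO protocol with scikit-learn is shown below, assuming each clip has already been encoded as a fixed-length feature vector and labeled with an integer action ID; the SVM settings are placeholders carried over from the previous sketch.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def leave_one_out_eval(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: (n_clips, n_features) encoded clips; y: (n_clips,) integer labels."""
    predictions = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Retrain from scratch with one clip held out, then predict that clip.
        clf = SVC(kernel="rbf", C=10)  # assumed settings; see Section 3.3
        clf.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = clf.predict(X[test_idx])

    # Per-class precision, recall, and F1, plus overall accuracy.
    print(classification_report(y, predictions))
    return predictions
```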
4. Preliminary results

In our project, we have made progress in preparing and planning our video-based action recognition system. We conducted a review of existing literature focused on methods for extracting features, designing classifiers, and evaluating video-based action recognition. This initial step provided us with insights into the practices and techniques used by researchers in this field.

After consideration, we decided to use the UCF Sports Dataset for action recognition. We thoroughly examined the dataset to gain an understanding of its contents, which comprise ten distinct actions.

For feature extraction, we chose the Space-Time Interest Point (STIP) method. This method allows us to capture motion information within video frames, making it well suited to our action recognition task.

As for the classification component of our project, we opted to use the Support Vector Machine (SVM) as our classifier. SVMs are known for their effectiveness in action recognition tasks and their ability to capture linear relationships within data.

Currently, we are actively developing the program code for our action recognition system. We are making progress with the coding process, specifically working on incorporating the chosen STIP feature extraction method and SVM classifier.

As we move forward with developing and testing our action recognition system, we expect to obtain our first results. These findings will help shape our project and add to our understanding of video-based action recognition using the UCF Sports Dataset.

5. Conclusion

The proposed technical method provides a structure for video-based action recognition using the UCF Sports Dataset. It emphasizes the significance of data preparation, feature extraction, model selection, and evaluation in guaranteeing the recognition system's accuracy and dependability. This project contributes to the field of computer vision and action recognition by providing insights into model design, parameter optimization, and evaluation strategy.

References

[1] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[2] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes. Two stream LSTM: A deep fusion framework for human action recognition. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 177–186. IEEE, 2017.

[3] C. Gao, Y. Du, J. Liu, J. Lv, L. Yang, D. Meng, and A. G. Hauptmann. InfAR dataset: Infrared action recognition at different times. Neurocomputing, 212:36–47, 2016.

[4] R. Ghosh, A. Gupta, A. Nakagawa, A. Soares, and N. Thakor. Spatiotemporal filtering for event-based action recognition. arXiv preprint arXiv:1903.07067, 2019.

[5] Z. Jiang, V. Rozgic, and S. Adali. Learning spatiotemporal features for infrared action recognition with 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 115–123, 2017.

[6] R. Kavi, V. Kulathumani, F. Rohit, and V. Kecojevic. Multiview fusion for activity recognition using deep neural networks. Journal of Electronic Imaging, 25(4):043010, 2016.

[7] Y. Kim and T. Moon. Human detection and activity classification based on micro-Doppler signatures using deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 13(1):8–12, 2015.

[8] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64:107–123, 2005.

[9] W. Lin, M.-T. Sun, R. Poovandran, and Z. Zhang. Human activity recognition for video surveillance. In 2008 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2737–2740, 2008.
[10] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):3007–3021, 2017.

[11] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.

[12] H. Rahmani and A. Mian. 3D action recognition from novel viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1506–1515, 2016.

[13] I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami, and P. Maragos. Multimodal human action recognition in assistive human-robot interaction. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2702–2706, 2016.

[14] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

[15] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014.

[16] K. Soomro and A. R. Zamir. Action recognition in realistic sports videos. In Computer Vision in Sports, pages 181–208. Springer, 2015.

[17] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan. 3DV: 3D dynamic voxel for action recognition in depth video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 511–520, 2020.

[18] R. Yang and R. Yang. DMM-pyramid based deep architectures for action recognition with depth cameras. In Asian Conference on Computer Vision, pages 37–49. Springer, 2014.

[19] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services, pages 197–205. IEEE, 2014.
