
Hindawi

Mobile Information Systems


Volume 2022, Article ID 5168898, 9 pages
https://doi.org/10.1155/2022/5168898

Research Article
A Method of Key Posture Detection and Motion Recognition in
Sports Based on Deep Learning

Shaohong Pan
Xichang University, Xichang City, Sichuan 615000, China

Correspondence should be addressed to Shaohong Pan; xcc03300051@xcc.edu.cn

Received 16 February 2022; Revised 20 March 2022; Accepted 21 March 2022; Published 25 April 2022

Academic Editor: Tongguang Ni

Copyright © 2022 Shaohong Pan. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Moving target recognition and analysis is an important research direction in the field of computer vision and is widely applied in daily life, in areas such as intelligent robots, video surveillance, medical education, sports competition, and national defense security. By analyzing weightlifting videos, this paper extracts the key postures of athletes' training, so as to help coaches train athletes more professionally. Based on DL (Deep Learning), a key pose extraction method for sports video (RoI_KP for short) based on classified learning of regions of interest is proposed. By fine-tuning a CNN (Convolutional Neural Network), a network model suitable for classifying weightlifting video within the region of interest is obtained. Finally, according to the classification results, a selection strategy over those results is designed to extract key poses. According to the characteristics of the different modal information, different DNNs (deep neural networks) are adopted, and the depth networks are combined to mine the multi-modal spatio-temporal depth features of human movements in video. Experimental results show that the method proposed in this paper is very competitive.

1. Introduction

Video analysis technology is the basis of content-based video retrieval. In traditional text-based query technology, query keywords can basically reflect the query intention [1]. However, in content-based video retrieval, there is a gap between the underlying features and the upper-level understanding, mainly because the underlying features cannot fully reflect or match the query intention [2]. At the same time, human movements in sports videos are very complicated and skillful, and compared with daily movements [3], the analysis of sports videos is more difficult and challenging [4]. Therefore, the analysis of sports videos can not only bring better viewing effects to sports competitions, but can also be used to analyze competitions for athletes and coaches and to assist athletes in training. In comparison to manual monitoring, the computer does not need to rest and will not miss important information due to external factors, allowing it to significantly reduce manpower and material resources while improving work efficiency. This technology is also essential in the medical field. Ward monitoring, for example, can automatically detect abnormal behavior in wards and notify the hospital in real time, which not only reduces the risk from a patient's sudden deterioration, but also lowers the cost of medical treatment. The query condition in content-based video retrieval is an image sequence or a description of video content. To create an index, the underlying features are first extracted; the distance between these features and the query condition is then calculated and compared to judge similarity.

Intelligent license plate recognition, real-time 3D playback of sports competitions, and other technologies have emerged one after another, which makes the advantages of computer vision tasks based on machine learning [5–8] more and more obvious; such systems have high flexibility and openness, which indicates the development direction of intelligent image processing [9–11]. In the process of weightlifting, whether the athletes' movements are standard or not directly affects their performance. In the whole process of weightlifting, there are several key postures that affect the athletes' performance, and these key postures are very important for the success of the lift.

However, due to the limitation of the hardware conditions of the monitoring equipment (camera) and the influence of environmental factors, there are many complicated and unavoidable interference factors when acquiring surveillance video, such as occlusion between objects, noise, background influence, and sudden changes of light, which makes it very challenging to use target detection and tracking in intelligent monitoring and related applications. The recognition of motion and weight in video also faces a huge test: the number of super-large-scale videos places ever higher demands on the computing performance of the algorithm. At the same time, with the emergence of a large amount of video content such as aerial photography and first-person footage, the shortcomings of traditional algorithms in dealing with different camera perspectives, cluttered backgrounds, occlusion, and other issues are becoming more and more obvious.

In the training process of athletes, the judgment of key postures in weightlifting videos was previously based mainly on coaches' experience, which not only wasted manpower, but also produced key postures that were inaccurate and easily influenced by subjective factors. Focusing on weightlifting, this paper applies video analysis technology to weightlifting training and analyzes the movement of the weightlifting process, aiming to put forward a reliable automatic extraction method for key posture frames in the movement process, which is of far-reaching significance for athletes and coaches to master movement techniques as soon as possible, improve training efficiency, and raise the training level.

2. Related Work

Literature [12] detects the edges of the input video stream, uses the edge features to establish gesture descriptors, and then clusters the features calculated from the video stream to get the important gestures. In this way, good model parameters can be learned through iterative modeling. Literatures [13, 14] apply SVM (Support Vector Machine) to human motion recognition. They use local spatio-temporal features to represent human motion, and then input the local spatio-temporal feature vectors into an SVM to judge the category of human motion. Literature [15] extracts spatial features from static points of interest in video, extracts spatio-temporal features from dynamic points of interest at the same time, and obtains a composite feature set containing both static spatial features and dynamic spatio-temporal features, which is input into a linear SVM for classification. The algorithm proposed in the literature [16] is as follows: gesture features are obtained by training on the edge features computed after edge detection of the video, and the motion class is then decided by majority voting. Literature [17] proposes a full-body and half-body model, in which the body model employs densely sampled shape context descriptors and the prior model is exhaustively trained in the database using multiple angles and poses. Finally, the image's features are used to detect human body parts in the image, determined by the gesture space representation and an inference algorithm.

Literature [18] introduced the spatial pyramid pooling layer into CNN (Convolutional Neural Network) and proposed SPPnet, which reduced the limitation of the CNN network on the size of input pictures and improved accuracy. On this basis, literature [19] put forward the faster detection method Faster-RCNN, whose key innovation is the RPN (Region Proposal Network) for extracting candidate regions, thus completing the technical evolution from RCNN to Faster-RCNN. Literature [20] put forward a new detection method whose performance is greatly improved compared with traditional one-stage and two-stage frameworks, especially in real-time situations, with accuracy much higher than YOLOv3 at the same rate. Literatures [21, 22] use an NN (nearest neighbor) classifier to judge whether tracking is successful; the classifier measures the similarity between the collected correct image and the current new target image. The DNN (deep neural network) and detector in the tracking-learning-detection framework were improved in literature [23]: the tracking failure point is identified by calculating the DNN's forward-backward error and the spatial-temporal similarity of the output. Literature [24] uses a twin (Siamese) network to directly calculate the representation error of the target appearance model, with the goal of selecting the region closest to the target model as the tracking result. HRNet is an end-to-end multi-person 2D pose estimation method proposed in literature [25]; as a top-down method, it has some flaws, such as misalignment of key points in crowded scenes or inaccurate detection in half-body shots. According to literature [26], organizing two-dimensional posture nodes by guiding all pelvic joints can be done quickly and with good results; however, if the person in the video performs fast and complex actions, the estimated two-dimensional posture will be unstable, resulting in motion distortion.

3. Research Method

3.1. Sports Key Posture Detection. In order to detect the target from an image in any scene, the key is to choose a robust feature set and apply it so that targets of different classes are well discriminated while differences within the same class are tolerated. The features used for target recognition mainly include shape, texture, and color; the shape of the target is not affected by factors such as illumination and is often used as a good feature for target recognition. Different descriptions of the shapes of objects produce different feature sets.

The process of weightlifting is generally divided into five stages: extending the knee to lift the bell, leading the knee to lift the bell, exerting strength, squatting and supporting, and standing up. Coaches and athletes want to observe the movement process of each stage, and hope to observe biomechanical parameters such as the movement track of each snatch or clean and jerk, the speed of each stage, and the work performed. It is impossible to use traditional pose estimation methods to precisely locate the positions of the various parts of the human body and then extract the motion key frames according to the pose estimates. Moreover, there is a serious problem: the similarity between frames. Because of the continuity of motion, the difference between adjacent video frames is very small.
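As a rough illustration of this inter-frame similarity (not part of the paper's method), the mean absolute pixel difference between adjacent frames can be computed directly; in a continuous motion video it is typically tiny compared to the difference between frames from different movement stages. The toy frames below are invented stand-ins:

```python
import numpy as np

def mean_abs_diff(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Mean absolute per-pixel difference between two grayscale frames."""
    return float(np.mean(np.abs(frame_a.astype(np.float32) - frame_b.astype(np.float32))))

# Toy stand-in for video frames: a bright square that shifts over time.
frame1 = np.zeros((64, 64), dtype=np.uint8)
frame1[20:40, 20:40] = 255
frame2 = np.roll(frame1, 1, axis=1)   # adjacent frame: subject barely moved
frame9 = np.roll(frame1, 15, axis=1)  # frame from a different stage: large shift

adjacent = mean_abs_diff(frame1, frame2)
distant = mean_abs_diff(frame1, frame9)
assert adjacent < distant  # adjacent frames differ far less than distant ones
```

This is why simply thresholding frame differences cannot isolate key postures, motivating the classification-based selection strategy that follows.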

In this chapter, a key pose extraction method for sports video (RoI_KP for short) based on region-of-interest classification learning is proposed. The algorithm flow is shown in Figure 1.

First, the video is split into frames. Then, based on the first frame of the video, the trained model is used to segment and extract the foreground, and the video is segmented according to the standard of the first frame, yielding the region of interest directly related to key frame extraction. Furthermore, a CNN is used to extract and classify the features of each video frame to obtain candidate key frames. Finally, a key frame extraction strategy is formulated, and key frames are selected according to the probability value of the corresponding class output for each frame.

Because there is a lot of background interference information in weightlifting video, this paper extracts the region of interest to reduce the influence of the background on key frame extraction. Then, the region is divided and fine-tuned: the image is traversed starting from each of its four corners in turn, and the label value of each pixel is updated to the minimum non-zero label value among its four neighboring points, until the label values of all points no longer change:

$$\mathrm{label}_{x,y} = \min \{\, \mathrm{label} \mid \mathrm{label} \in \mathrm{LABEL}_{x,y} \,\}. \quad (1)$$

Up to this point, a new annotation map of the region of interest has been obtained, and each non-zero connected region has a different number. Finally, only the largest region needs to be selected as the region of interest.

Unlike 2D human pose estimation, 3D human pose estimation based on DL is more challenging. This is mainly because 2D pose estimation has wider training data sets, so it can better handle occlusion and accuracy, while 3D pose estimation faces great challenges in occlusion and accuracy due to the lack of training data.

Three-dimensional human posture estimation estimates the three-dimensional coordinates $(x, y, z)$ of the related nodes from pictures or videos, which is essentially a regression problem. It is widely used in motion capture systems, animation, behavior understanding, and games. It can also serve as an auxiliary link for other algorithms such as pedestrian recognition, and it can be combined with other human-related tasks such as human body parsing.

Each block is based on a simple, deep, multi-layer neural network with batch normalization. After obtaining the 2D key points, they are input into 3DPoseNet, which outputs the estimated 3D key points, each expressed as $P_i = (|X_i|, |Y_i|, |Z_i|)$. Assuming that the 3D joint ground truth is available, the 3D joint MSE (Mean Square Error) loss is adopted as the loss function of 3DPoseNet:

$$L_{\mathrm{MSE}} = \sum_i \left\| P_i - P_i' \right\|^2. \quad (2)$$

Among them, $P_i'$ is the marked true coordinate of the 3D skeleton, while $P_i$ is the corresponding predicted coordinate, i.e., the key point coordinates of the 3D skeleton predicted by 3DPoseNet.

In this paper, a loss function with a symmetric constraint is used; the loss function defined in this paper is formula (3):

$$L_{\mathrm{sym}} = \sum_{(i,j)\in E} \left( \left\| P_i - P_j \right\|_2^2 - \left\| P_i' - P_j' \right\|_2^2 \right), \quad (3)$$

where $E$ is the set of all adjacent point pairs, and $P_i', P_j'$ represent the key points of the symmetric part. Of course, there are other constraints on human bones, such as the limitation of joint angles.

This paper presents a general solution to the occlusion problem that makes extensive use of the torso, limbs, and head. Because 2D posture can accurately predict 3D posture, the restored 3D posture can be divided into four states, including whether the limbs are covered and whether the posture of the entire limbs can be predicted. This paper introduces the idea of exploiting time series information [17] in this case. The difference is that this paper only needs to determine some position data, i.e., the unpredictable (occluded) joint positions can be determined using the joint points from the previous moment:

$$P_i^t = \lambda P_i^{t-1}. \quad (4)$$

Here, $P_i^t$, written as $(|X_i|, |Y_i|, |Z_i|)^T$, is the predicted position of joint point $i$ (an occluded node) at time $t$, and $P_i^{t-1}$ represents its position at time $t-1$. $\lambda$ is a vector representing the general direction of movement of the person as an object.

3.2. Motion Recognition Method Based on DL. With the rapid development of computer hardware and the successful application of DL in various fields of computer vision, moving target recognition based on DL has also developed into a key technology in the field of computer vision. It is widely used in medical treatment, transportation, security, and so on, and can also serve as the basis of other application technologies, such as image processing and three-dimensional modeling. Traditional target detection methods can be divided into three steps: selecting candidate areas from images, extracting visual features, and classifying with commonly used classifiers such as SVM models. The development and application of DL not only simplifies the complexity of the traditional pipeline, but also improves detection performance. DL-based target detection methods are generally based on region extraction and can be divided into two types: one-stage methods and two-stage methods.

Obtaining 3D coordinates from video is extremely difficult. Another approach is to represent human motion directly using 2D human posture changes. Poses of the two-dimensional human body take different forms in different viewpoints. Viewpoint normalization, multi-viewpoint traversal, and viewpoint-invariant feature extraction are three common techniques for achieving a viewpoint-independent representation of human motion.
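The four-corner label-propagation step in formula (1) can be sketched as follows. This is an illustrative reimplementation, not the paper's code: labels on a toy binary mask are repeatedly replaced by the minimum non-zero label among the 4-neighbours until convergence, which merges each connected region under one label. The seed mask below is hypothetical:

```python
import numpy as np

def label_regions(mask: np.ndarray) -> np.ndarray:
    """Propagate minimum non-zero labels over 4-neighbourhoods until stable."""
    h, w = mask.shape
    # Initial labels: a unique positive number per foreground pixel.
    labels = np.where(mask > 0, np.arange(1, h * w + 1).reshape(h, w), 0)
    changed = True
    while changed:
        changed = False
        for y in range(h):
            for x in range(w):
                if labels[y, x] == 0:
                    continue
                neigh = [labels[ny, nx]
                         for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] > 0]
                best = min(neigh + [labels[y, x]])
                if best < labels[y, x]:
                    labels[y, x] = best
                    changed = True
    return labels

# Toy mask with two separate foreground blobs.
mask = np.zeros((6, 6), dtype=np.uint8)
mask[0:2, 0:2] = 1   # small blob
mask[3:6, 3:6] = 1   # larger blob
labels = label_regions(mask)
# Each connected blob collapses to a single label value.
assert len(np.unique(labels[mask > 0])) == 2
```

The larger of the resulting regions would then be kept as the region of interest.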
4 Mobile Information Systems
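The two training losses (2) and (3) can be written directly in NumPy; this is a hedged sketch with made-up joint coordinates and an assumed skeleton edge list, not the paper's implementation:

```python
import numpy as np

def mse_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """Formula (2): sum of squared differences between predicted and true joints."""
    return float(np.sum((pred - gt) ** 2))

def sym_loss(pred: np.ndarray, gt: np.ndarray, edges) -> float:
    """Formula (3): squared bone lengths of the prediction compared with the
    squared bone lengths of the ground truth over the edge set E."""
    total = 0.0
    for i, j in edges:
        pred_len = np.sum((pred[i] - pred[j]) ** 2)
        gt_len = np.sum((gt[i] - gt[j]) ** 2)
        total += pred_len - gt_len
    return float(total)

# Hypothetical 4-joint skeleton (x, y, z per joint) and its edges.
gt = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
edges = [(0, 1), (1, 2), (2, 3)]
pred = gt + 0.1  # uniform offset: joints are wrong but bone lengths are preserved
assert mse_loss(pred, gt) > 0.0
assert abs(sym_loss(pred, gt, edges)) < 1e-9  # a pure translation leaves L_sym at zero
```

The example shows why the symmetric constraint is a complement to, not a substitute for, the MSE loss: it only penalizes distorted bone lengths.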

Figure 1: Algorithm framework. (Flowchart: video → first frame? → automatic segmentation of the region of interest → region-of-interest extraction → video analysis → feature extraction → candidate attitude recognition → key frame extraction strategy.)

The camera's position should be fixed, but this is not always the case in sports videos, because the target moves very quickly, and the camera often moves with it to keep up with the target's movement, or the moving target would disappear. When filming a diving video, for example, the camera moves down as the human body falls. This chapter proposes a multi-modal information-based method for action recognition. Figure 2 depicts the general framework of this method.

The static information used in this paper is the RGB map and the depth map. The RGB map contains global information about the environment and the human body, which can be used to eliminate background clutter and other problems. On the basis of the DNN learning model, different network structures are designed for the different modes, so as to discover the temporal and spatial depth characteristics of motion video and improve the performance of the motion recognition model.

In this chapter, CNNs with the same structure are used for the RGB map and the depth map. The CNN for the RGB map represents the color and texture features of the human body and background in the video, while the CNN for the depth map accurately distinguishes the foreground and background of the video to prevent feature confusion caused by background interference.

At the top of the network, this chapter uses the Softmax loss function. Assuming that the output of the last fully connected layer is a K-dimensional vector $x$, the Softmax function is defined as:

$$\sigma(x_i) = \frac{\exp(x_i)}{\sum_{k=1}^{K} \exp(x_k)}. \quad (5)$$

Independent RGB map features and depth map features can thus be obtained. On the basis of this model, the output of the first fully connected layer can be regarded as a convolution feature with good representation effect, and it has classification and discrimination characteristics.

Human motion in video has temporal characteristics, so mining spatial depth features alone cannot express the temporal characteristics of motion. The networks that deal with timing information in common DL methods are all based on the RNN (recurrent neural network) architecture. In the task of human motion recognition, the trajectory of key skeleton points carries effective temporal information.

An RNN processes sequence data iteratively according to the time scale of the sequence. This processing method gives RNNs obvious advantages in modeling and feature extraction of sequence data. This method uses the cross-entropy loss function:

$$E(y_t, y_t') = -\sum_t y_t \log y_t', \quad (6)$$

where $y_t$ is the correct label at time $t$, and $y_t'$ is the network's predicted label. The training goal is to drive the value of the loss function down by computing its gradient to optimize the parameters $W$.
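Formulas (5) and (6) can be sketched in a few lines of NumPy (an illustration, not the paper's training code; the logits and labels below are invented):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Formula (5), with the usual max-subtraction for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Formula (6): -sum_t y_t * log(y_t')."""
    return float(-np.sum(y_true * np.log(y_pred)))

logits = np.array([2.0, 1.0, 0.1])       # output of the last fully connected layer
probs = softmax(logits)
assert abs(np.sum(probs) - 1.0) < 1e-9   # a valid probability distribution

one_hot = np.array([1.0, 0.0, 0.0])      # the true class is the first one
loss = cross_entropy(one_hot, probs)
assert loss > 0.0
```

With a one-hot label, the cross-entropy reduces to the negative log-probability the network assigns to the correct class at each time step.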

Figure 2: Architecture diagram of deep space-time feature motion recognition. (The RGB map and depth map each pass through a CNN to produce deep spatial features, the action sequence passes through RNN units to produce deep temporal features, and a classifier combines the deep spatio-temporal features.)

The static model and the dynamic model learn the spatial and temporal characteristics of the action video, respectively. Compared with single-modal information, the features under multi-modal information have more differences and complementarities, so the fusion of spatial and temporal features can provide information with stronger representation ability. The goal of this experiment is to directly concatenate the temporal and spatial depth features and to estimate the weights of the two features in a linear combination based on the test accuracy of the two network models, in order to identify the importance of the different features.

In the late stage of fusion, the probability outputs using spatial and temporal depth features are superimposed by linear weighting to get the final predicted value. Specifically, late fusion adopts the following method:

$$P = \frac{1}{N} \sum_{i=1}^{N} \left( \alpha P_1 + (1 - \alpha) P_2 \right), \quad (7)$$

where $\alpha$ is the weighting parameter, $P_1$ and $P_2$ are the output probabilities of the networks using the spatial depth feature and the temporal depth feature, respectively, $P$ is the final prediction probability, and $N$ is the number of sample features of the video after multiple sampling.

4. Results Analysis and Discussion

The action behavior of the human body is distinguishable not only in image space, but also in time series. The tasks of image recognition and detection need to mine spatial features, while video adds the time dimension relative to still images. Therefore, a video human motion recognition algorithm needs to dig deep into both temporal and spatial features. A segment of video contains many frames.

If all the frames of a video are used, the task will consume a lot of computing time. At the same time, because not all frames are relevant, the recognition effect will be reduced. Therefore, finding the most distinctive spatio-temporal features in a video is a very important task, which can improve the accuracy of the above algorithms.

Testing with 512 image frames, the correct recognition rate of the four poses is shown in Figure 3.

The experimental results show that the algorithm presented in this paper achieves a good recognition result for each category. The incorrectly classified frame images of the knee-joint posture were output along with the probability value for each category. All of the incorrectly classified results were found to belong to the first category (knee extension) or the third category (force posture). Different scenes may have different backgrounds even for the same actions. At the same time, it should be noted that if the human body's clothing matches the video's background color, problems such as video exposure caused by outdoor lighting and weather will make it difficult to distinguish actions and behaviors. The same video may not have a static background, and the task becomes more difficult as the background changes. Figure 4 depicts the probability change curves of the four categories for each video frame.

The RoI_KP method proposed in this paper greatly improves performance, as can be seen. The traditional CNN classification algorithm shows great fluctuation in classification performance and is unstable in classification.

With the emergence of various new shooting methods, such as selfies, aerial photography, and first-person shooting, video information cannot always keep the same camera angle. Therefore, the same video action sequence may produce different feature representations due to different camera angles, and the identification of different camera angles is also an urgent problem to be solved.
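The late-fusion rule in formula (7) amounts to a weighted average of the two networks' class-probability outputs over the N samples; a sketch with invented probabilities and an assumed weight of 0.6:

```python
import numpy as np

def late_fusion(p_spatial: np.ndarray, p_temporal: np.ndarray, alpha: float) -> np.ndarray:
    """Formula (7): average of alpha*P1 + (1-alpha)*P2 over the N samples (rows)."""
    return np.mean(alpha * p_spatial + (1.0 - alpha) * p_temporal, axis=0)

# Two hypothetical networks, N = 2 samples, 3 action classes each.
p1 = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1]])  # spatial (CNN) stream
p2 = np.array([[0.5, 0.4, 0.1],
               [0.4, 0.4, 0.2]])  # temporal (RNN) stream
p = late_fusion(p1, p2, alpha=0.6)
assert abs(np.sum(p) - 1.0) < 1e-9   # still a probability distribution
assert int(np.argmax(p)) == 0        # the fused prediction picks class 0
```

Because each row of P1 and P2 sums to one, the convex combination and averaging preserve a valid distribution, so the final class is simply the argmax of P.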

Figure 3: Test accuracy of four key frames. (Bar chart: counts of correct and wrong classifications, and accuracy between roughly 94% and 98%, for postures 1–4.)

The experimental input is a continuous and ordered video segment, so the feature representation has all the spatial information and at the same time contains part of the temporal information. However, there is little pre-training information for the depth-map modality, so that network is trained from scratch. Figures 5 and 6 are the comparison results of two groups of videos.

While the traditional CNN algorithm shows great fluctuation in the fourth key frame, and the probability values of the other poses are relatively large, which affects the classification results, the proposed method performs well on the fourth key frame, where the probability value of the fourth key pose is very large. The pre-training process is not used for the RNN but is used for the bone point information, because the processed features obtained after pre-processing are sufficient for training. Transfer learning allows researchers to obtain good training results even when the new data set is small, labeling is incomplete or scarce, and the distributions of the training and test sets differ. Low-level features can also be reused for similar tasks, because the representation of convolution kernels in a CNN is a process from low level to high level and from general to special.

Figure 4: Test video results. (Probability value of the four key poses — stretch knees, knee leading, power, culmination — over each frame image.)

In this paper, the 3DPoseNet network is evaluated on the public data set Human3.6M, and the other two data sets are used for testing. Human3.6M is currently the most widely used public data set in 3D pose estimation; it contains 3.6 million video frames for 11 subjects, 7 of which have 3D-annotated poses. Each subject performs 15 actions, and the video frames (at a frame rate of 50 fps) are captured by four cameras with different angles of view, as shown in Figure 7.

Figure 7 demonstrates that the 3DPoseNet model's results have a lower average error than all other methods and are unaffected by other data or operation methods. The average MPJPE error of the 3DPoseNet network of this method is reduced by at least 6 mm, demonstrating the method's effectiveness. In general, if the human posture is similar at two different moments of an action, the distance between the two descriptor features describing the posture will be small; on the other hand, if the posture is very different at two different moments, the distance between the two descriptor features will be large, and this property has little relation to the viewpoint from which the movement is observed. Self-similar matrices with the same characteristics have similar patterns, as shown in Figure 7, indicating that changes in the performers of the movement have little impact on the self-similar matrix.
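MPJPE (mean per-joint position error), the Human3.6M metric referred to above, is simply the mean Euclidean distance between predicted and ground-truth joints; a sketch with made-up coordinates:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean distance (e.g., in mm)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Hypothetical 3-joint skeleton in millimetres.
gt = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0], [100.0, 100.0, 0.0]])
pred = gt + np.array([[3.0, 4.0, 0.0]])  # every joint off by a 3-4-5 triangle
assert abs(mpjpe(pred, gt) - 5.0) < 1e-9  # each joint is exactly 5 mm off
```

A 6 mm reduction in this quantity, as reported above, therefore means the predicted joints move 6 mm closer to the ground truth on average.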

Figure 5: CNN test results. (Probability value of the four key poses — stretch knees, knee leading, power, culmination — over each frame image.)

Figure 6: Results based on RoI_KP. (Probability value of the four key poses over each frame image.)

Figure 7: Results of average position error per joint. (Error of the methods of refs. [15], [17], [19], [22], [23] and ours on the Eat, Greet, Phone, Photo, and Pose actions.)

Figure 8: Joint position error based on Procrustes analysis. (Error of the same methods and actions as in Figure 7.)

The difference in the self-similar matrices based on different features is very obvious from a vertical perspective, so the self-similar matrix depends on the image's underlying features. As shown in Figure 8, it can be clearly seen that the accuracy of this method is greatly improved.

All the videos of human body movements were shot in fixed scenes. In order to prevent the moving target from leaving the field of vision, a moving lens is often used for shooting, and it is then difficult to detect the movement of the human body from the optical flow field. Figure 9 shows the loss of the 3D CNN in the pre-training stage and the network fine-tuning stage, respectively. It can be seen that the fluctuation and the loss output are larger in the pre-training process, and the network's output loss is reduced after fine-tuning, thus improving the final classification effect of the model and reducing the time cost of network learning.

The action recognition methods based on DL in single mode include the CNN and the recurrent neural network. This paper also explains the problems and difficulties brought by single-mode video motion recognition, and introduces the multi-mode motion recognition method. DL gets better results based on a large amount of training data. Different depth networks have different characteristics; the CNN pays more attention to the relationship between local information, so it is suitable for image recognition and detection tasks.

Experiments show that the self-similar matrix of human posture obtained by using the directional gradient histogram (HOG) of the human foreground image to represent human posture in different viewpoints has better stability. By applying a filter to the post-processing of the 3D pose, the stability of the predicted pose can be effectively improved. On the basis of successful motion capture, through the analysis and extraction of human motion feature parameters, we can automatically analyze and understand various human motions and behaviors.
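The self-similar matrix mentioned above is the matrix of pairwise distances between per-frame pose descriptors; a hedged sketch, with invented descriptors standing in for the HOG features:

```python
import numpy as np

def self_similarity_matrix(descriptors: np.ndarray) -> np.ndarray:
    """S[i, j] = Euclidean distance between the pose descriptors of frames i and j."""
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical descriptors for 4 frames; frames 0 and 3 share the same posture.
d = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [1.0, 0.0]])
s = self_similarity_matrix(d)
assert np.allclose(s, s.T)   # the matrix is symmetric
assert s[0, 3] == 0.0        # identical postures give zero distance
assert s[0, 1] > s[0, 2]     # a more different posture gives a larger distance
```

Because the matrix depends only on descriptor distances, not on absolute descriptor values, its pattern stays similar when the performer or (to a large extent) the viewpoint changes, which is the stability property exploited above.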

× 104 Conflicts of Interest


6
The author does not have any possible conflicts of interest.
5
Loss of function output

4 References
[1] J. Miller, U. Nair, R. Ramachandran, and M. Maskey, “Detec-
3
tion of transverse cirrus bands in satellite imagery using deep
learning,” Computers & Geosciences, vol. 118, pp. 79–85, 2018.
2
[2] H. Liu, J. Jin, Z. Xu, Y. Bu, Y. Zou, and L. Zhang, “Deep learn-
1
ing based code smell detection,” IEEE Transactions on Soft-
ware Engineering, vol. 99, 2021.
0 [3] M. Gao, W. Cai, and R. Liu, “AGTH-Net: attention-based
0 50 100 150 graph convolution-guided third-order hourglass network for
sports video classification,” Engineering, vol. 2021, pp. 1–10,
Iteration 2021.
Pre-training [4] Z. Wang, Z. Zhou, H. Lu, and J. Jiang, “Global and local sensi-
Fine-tune tivity guided key salient object re-augmentation for video
saliency detection,” Pattern Recognition, vol. 103, no. 2, article
Figure 9: 3-D CNN pre-training and fine-tuning network loss. 107275, 2020.
advanced human-computer interaction, rehabilitation engineering, sports analysis, somatosensory game control, and content-based retrieval.

5. Conclusion

As the most important element in the human environment, rich and diverse human movements carry a great deal of information that is vital for human social interaction; studying them therefore has profound economic and social value. Vision-based human motion recognition draws on multi-disciplinary knowledge, integrating results from computer vision, pattern recognition, image processing, machine learning, and many other fields.
When regressing 2D to 3D pose nodes with DL, computing the per-frame scale between the 2D and 3D node coordinates by the least-squares method effectively improves the accuracy of the network model. The human body in the weightlifting video is represented by its extracted skeleton to enhance feature expression, further improving the accuracy of key-frame extraction. A deep CNN mines the spatial characteristics of the static information and improves the representation of the dynamic information, which a recurrent neural network then processes. Experimental results show that the proposed method is highly competitive.
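The least-squares scaling step between regressed and reference node coordinates admits a closed form: the scalar s minimizing the sum of squared errors ||s·p_i − q_i||^2 over a frame's joints is s = ⟨p, q⟩ / ⟨p, p⟩. A minimal sketch in plain Python (the joint arrays and the function name are illustrative assumptions, not the paper's code):

```python
def least_squares_scale(pred, target):
    """Optimal scalar s minimizing sum ||s*p - q||^2 over all joints.

    pred, target: lists of (x, y, z) joint coordinates for one frame.
    Closed form: s = <p, q> / <p, p>.
    """
    num = sum(p * q for pt, qt in zip(pred, target) for p, q in zip(pt, qt))
    den = sum(p * p for pt in pred for p in pt)
    return num / den

# Toy frame: the target skeleton is exactly twice the predicted one.
pred = [(1.0, 0.0, 0.5), (0.0, 2.0, 1.0)]
target = [(2.0, 0.0, 1.0), (0.0, 4.0, 2.0)]
s = least_squares_scale(pred, target)  # -> 2.0
```

Applying the recovered scale frame by frame normalizes the regressed skeleton before the loss is computed, which is what makes the accuracy gain possible.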
From 2D pose estimation to 3D pose prediction, occlusion (mainly self-occlusion) greatly degrades the visual result. The focus of our future work is therefore to better handle occlusion and to improve the processing speed of the framework.
based stacked autoencoder deep learning for umpire detection
and classification,” Scalable Computing, vol. 21, no. 2, pp. 173–
Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.