
The Visual Computer

https://doi.org/10.1007/s00371-020-01868-8

ORIGINAL ARTICLE

Improved human action recognition approach based on two-stream convolutional neural network model

Congcong Liu1 · Jie Ying1 · Haima Yang1 · Xing Hu1 · Jin Liu2

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
In order to improve the accuracy of human abnormal behavior recognition, a two-stream convolutional neural network model is proposed. The model includes two main parts, VMHI and FRGB. First, motion history images are extracted and input into a VGG-16 convolutional neural network for training. Then, the RGB images are input into the Faster R-CNN algorithm for training, using Kalman filter-assisted data annotation. Finally, the results of the two streams, VMHI and FRGB, are fused. The algorithm can recognize not only single-person behavior but also two-person interaction behavior, and it improves the recognition accuracy of similar actions. Experimental results on the KTH, Weizmann, UT-interaction, and TenthLab datasets show that the proposed algorithm achieves higher accuracy than other approaches reported in the literature.

Keywords Human action recognition · Kalman filter · Motion history image · Faster R-CNN · Video surveillance

Corresponding author: Haima Yang, snowyhm@sina.com
Congcong Liu, lcccate@163.com · Jie Ying, yingjsh@163.com · Xing Hu, huxing@usst.edu.cn · Jin Liu, flyingpine@sina.com

1 School of Optical Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
2 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

1 Introduction

Human action recognition has been widely applied in public security, surveillance, smart robotics, and industrial automation. In real-life scenarios, there are often many uncontrollable factors, such as changes in light level, shadows, and camera angles, which can significantly limit the practical application of human action recognition. To identify a human action or behavior in a video, not only a single-frame image but also the relationship between consecutive frames of the video should be considered [1]. Human action recognition algorithms are mainly divided into traditional feature extraction methods and methods based on deep learning [2–28].

The traditional human action recognition methods are further divided into two categories: feature extraction methods based on human motion information and feature extraction methods based on spatiotemporal interest points. In [2–5], feature extraction methods based on human motion information were studied. Fujiyoshi et al. [2] used a star graph whose five vertices correspond to the limbs and head to represent the human posture in the current frame and used the vector of the five feature points and the center of gravity as a feature vector of the action. Yang et al. [3] collected the 3D coordinates of the joints from depth images of the human body, and the human contour formed by these joints was used as a feature in behavior identification. Chaudhry et al. [4] normalized the half-wave rectified flow in two directions, obtaining the motion vectors in the upper and lower left directions, and formed the final motion descriptor. Bobick et al. [5] followed the same idea but extracted different features for identification; namely, they used motion energy images and motion history images to describe people's movement in a sequence of images. In [6–8], feature extraction methods based on spatiotemporal interest points were introduced. Schuldt et al. [6] expanded Harris's spatial feature points into three-dimensional space-time interest points. Through Gaussian blurring and local corner extraction in three-dimensional space and time, the space-time interest points were obtained; pixel histogram statistics of the space-time interest points were then computed, and finally, a feature vector describing the actions was formed. Rapantzikos et al. [7] applied the discrete wavelet transform in three dimensions and selected points of interest in space and time based on the low-pass and high-pass filtering responses in each dimension. Hu et al. [8] proposed a novel histogram of oriented contextual gradient (HOCG) descriptor for abnormal event detection (AED) based on contextual gradients.

The development of deep learning has contributed to significant progress in the field of target detection. Many deep learning models have been proposed for target detection, such as AlexNet [9], VGGNet [10], GoogleNet [11], ResNet [12], Faster R-CNN [13], YOLO [14], and SSD [15]. Many researchers have also applied different deep learning methods to human action recognition. Li et al. [16] proposed a method based on LSTM (long short-term memory) and CNN (convolutional neural network): first, different features were extracted and then, respectively, input into three LSTM networks and seven CNN networks; next, all ten networks were fused using three fusion methods, maximum fusion, average fusion, and element-by-element multiplication fusion; lastly, the final results were output. Donahue et al. [17] proposed the LRCN (long-term recurrent convolutional network), where the spatial information is extracted by a CNN, the temporal information is then extracted from the video by an LSTM network, and the result is finally classified. Ji et al. [18] proposed a method based on a 3D CNN, where the time dimension is added on the basis of a 2D CNN; this method can extract both spatial and temporal information from a video. Wang et al. [19] proposed an improved recognition network combining a 3D CNN and LSTM, effectively reducing the number of network parameters and facilitating the training process. Simonyan et al. [20] put forward the two-stream convolutional network; they adopted two identical CNNs, one of which obtained spatial information from the input video frames, while the other obtained temporal information from the input optical flow of the video; the two networks were then fused by means of average fusion or fusion classification using an SVM (support vector machine), and the best performance was achieved by using the SVM for fusion classification. Feichtenhofer et al. [21] improved the fusion strategy and conducted the fusion from the middle layers of a two-stream network; the experimental results showed that the proposed strategy was better than the original two-stream network, and the number of parameters was significantly reduced. On the basis of a two-stream network, Wang et al. [22] introduced the idea of segmentation and sparse sampling and proposed the TSN (temporal segment network), which can fuse multiple segments and obtain more context information. Chen et al. [23] incorporated a semi-coupled two-stream fusion network and applied it to video with extremely low resolution for behavior recognition; additive fusion, splicing fusion, and convolution fusion were proposed. Wang et al. [24] used a 3D CNN instead of a 2D CNN and adopted STPP (spatiotemporal pyramid pooling) in the last convolution layer to achieve consistent output feature dimensions. Zhao et al. [25] used a 3D CNN, RNN, and bidirectional GRU, where a human skeleton sequence was used as the input. Afrasiabi et al. [26] extracted optical flow fields from video frames using convolutional neural networks as features. Imran et al. [27] proposed a three-stream architecture for the fusion of RGB, inertial, and skeleton data. Yi et al. [28] proposed a new trajectory descriptor based on HOG, HOF, and MBH, as well as a novel approach to calculate a saliency map based on optical flow. However, there are still many problems in human action recognition, such as the differentiation of similar actions, the recognition of interactions between people, and the need for manual annotation of a large amount of data.

In this paper, we present a robust pedestrian action recognition approach based on the motion history image (MHI), RGB frames, and a convolutional neural network. We study the relationship between a single frame and continuous frames of a video. The proposed approach is capable of locating individuals in a fixed-camera video and recognizing single-human motions and human–human interactions. Moreover, we use a Kalman filter for data annotation to assist manual annotation. The entire algorithm includes the VMHI (VGG-16 and MHI) structure, the FRGB (Faster R-CNN and RGB frames) structure, and the resulting fusion structure. During training, in the VMHI branch, one MHI image is generated for every ten consecutive frames of a video and then fed to the VGG-16 neural network input. In the FRGB branch, the annotation information obtained by the Kalman filter and the RGB frames are fed to the Faster R-CNN network input. During testing, the MHI and the last frame of each set of ten continuous frames are input to the trained VGG-16 model and Faster R-CNN model, respectively, and their output signals are fed to the Softmax classifier, whose output is then combined with the border information of the Faster R-CNN to provide the final output result. We performed experiments on the KTH, Weizmann, TenthLab, and UT-interaction datasets to evaluate the performance of the proposed approach and then compared the obtained results with those of the state-of-the-art approaches.

The rest of this paper is organized as follows. Section 2 gives a detailed description of our approach. Section 3 presents the experimental setup and experimental results analysis, and Sect. 4 concludes the paper.


Fig. 1 Human action recognition approach

2 Proposed approach for human action detection

The algorithm proposed in this paper benefits from the MHI, RGB frames, and CNN. The overall framework of the algorithm is shown in Fig. 1, where it can be seen that the proposed algorithm consists of three main parts.

The first part is the VMHI structure. As shown in Fig. 1a, the VMHI consists of the MHI block and the VGG-16 network. The MHI block expresses the target motion in the form of image brightness. An MHI image is generated from 10 successive frames of a video and then fed to the VGG-16 network input. The second part of the proposed algorithm is the FRGB structure. As shown in Fig. 1b, the FRGB consists of the RGB frames with annotations, the Kalman filter algorithm, and the Faster R-CNN algorithm. The Kalman filter algorithm is used to extract the information on the moving-target position and to generate annotations (path, image size, human action name, and ground-truth information of the human object). The Faster R-CNN deep architecture is used to detect human activities from the RGB frames. The Faster R-CNN mainly includes four parts: the deep convolutional layers (VGG-16), the Region Proposal Network (RPN), the Region of Interest Pooling (ROI Pooling) layer, and the classification layer. The third part of the proposed algorithm is the fusion process, which takes the output of the Softmax algorithm of the VMHI structure and the output of the FRGB structure to determine the target behavior category. Lastly, the information on the target motion type is combined with the target border information of the FRGB output to obtain the final output result.

2.1 VMHI algorithm architecture

2.1.1 Motion history image

The MHI expresses the target motion through image brightness by calculating the change of each pixel value over time. The idea of the MHI was first proposed in [29]. Assume τ denotes the moving time of the human, and δ is the decay parameter. When τ is too small, a part of the motion information is lost. On the other hand, when τ is too large, the pixel intensity cannot be accurately determined, so it is difficult to judge the movement direction. As δ becomes larger, the parts of the motion that occurred earlier are eliminated first. The MHI intensity value H_τ(x, y, t) is defined by Eq. (1):

$$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \psi(x, y, t) = 1 \\ \max\bigl(0,\, H_\tau(x, y, t-1) - \delta\bigr), & \text{otherwise} \end{cases} \quad (1)$$

The update function ψ(x, y, t) is defined by the inter-frame difference method, and ξ is the manually set difference threshold. Assume I(x, y, t) represents the intensity value of the pixel with coordinates (x, y) in frame t, and Δ represents the inter-frame distance. As ξ becomes larger, the background noise disappears, but an overly large ξ leads to a hollow in the center area.

$$\psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) \ge \xi \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

$$D(x, y, t) = \lvert I(x, y, t) - I(x, y, t \pm \Delta) \rvert \quad (3)$$

In this paper, we first extract MHI images from the video frames and then organize the images into a dataset, which is used for VGG-16 neural network training. In VGG-16 neural network testing, MHI images are also generated from the frames and then fed to the trained VGG-16 network input for judgment. The frames used in training and testing are different. The obtained judgment result is merged with the FRGB structure judgment result, and then the action type is determined.
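To make the update concrete, the following NumPy sketch implements Eqs. (1)–(3) for grayscale frames. It is only an illustration under the parameter values quoted later in the experiments (τ = 100, δ = 2, ξ in the range 50–75), not the authors' implementation; the function names and the frame format are assumptions.

```python
import numpy as np

def update_mhi(mhi, frame, prev_frame, tau=100.0, delta=2.0, xi=50.0):
    """One MHI update step following Eqs. (1)-(3).

    mhi, frame, prev_frame: 2-D grayscale arrays of identical shape.
    tau: intensity assigned to moving pixels; delta: decay per step;
    xi: inter-frame difference threshold.
    """
    # Eq. (3): absolute inter-frame difference D(x, y, t)
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    # Eq. (2): binary update function psi(x, y, t)
    psi = diff >= xi
    # Eq. (1): moving pixels are set to tau, the rest decay by delta
    return np.where(psi, tau, np.maximum(0.0, mhi - delta))

def mhi_from_clip(frames, **kwargs):
    """Fold consecutive frames (e.g., ten, as in the VMHI branch) into one MHI."""
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    for prev, cur in zip(frames[:-1], frames[1:]):
        mhi = update_mhi(mhi, cur, prev, **kwargs)
    return mhi
```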

123
C. Liu et al.

The MHIs obtained from the benchmark datasets are presented in Figs. 2, 3, 4, and 5, where it can be seen that the brightness of the pixels is high wherever there is movement. Although some actions cannot be clearly distinguished from single-frame pictures by the deep learning algorithm, they are recognized well from MHIs; for instance, running and jogging in the KTH dataset show a significant difference in their MHIs. Conversely, some actions cannot be accurately recognized using the MHI but are recognized well from single-frame pictures; for instance, running and skipping in the Weizmann dataset show a significant difference in single-frame pictures. The combination of MHI and single-frame images therefore makes the results more accurate.

Fig. 2 The MHIs obtained from the KTH dataset: (a) walking, (b) jogging, (c) running, (d) boxing, (e) hand waving, (f) hand clapping

Fig. 3 The MHIs obtained from the Weizmann dataset: (a) run, (b) walk, (c) skip, (d) jack, (e) jump, (f) pjump, (g) side, (h) wave2, (i) wave1, (j) bend

Fig. 4 The MHIs obtained from the UT-interaction dataset: (a) handshaking, (b) pointing, (c) hugging, (d) pushing, (e) kicking, (f) punching

Fig. 5 The MHIs obtained from the TenthLab dataset: (a) walk, (b) run, (c) wave1, (d) wave2, (e) crouch, (f) bench, (g) hand, (h) hug, (i) push, (j) kick

2.1.2 VGG-16 model training

A well-known VGG-16 model is trained with the MHI images. The VGG-16 is a deep convolutional neural network with superior recognition capability; its configuration is described in detail in [10]. The input images are resized to 224 × 224 pixels. The size of the convolutional kernel is set to 3 × 3 for all convolutional layers. All hidden layers use the ReLU (rectified linear unit) activation function. As shown in Fig. 6a, b, the network model is divided into six parts: the first five parts form the convolutional network, and the last part is the fully connected network. The first part consists of two conv3–64 layers and a maxpool layer. In this part, the image size changes from 224 × 224 to 224 × 224 × 64, which is regarded as the input of the second part. The second part is similar to the first part, and it consists of two conv3–128 layers and a maxpool layer. In this part, the image is resized to 56 × 56 × 128. The third part consists of three conv3–256 layers and a maxpool layer. In this part, the image size changes from 56 × 56 × 128 to 28 × 28 × 256. Similar to the third part, the fourth part consists of three conv3–512 layers and a maxpool layer. In this part, the image size changes from 28 × 28 × 256 to 14 × 14 × 512. The output of the fifth part is converted into a one-dimensional vector consisting of 7 × 7 × 512 = 25,088 parameters and then fed to two fully connected layers with 4096 neurons each and a Dropout layer. A fully connected layer with 1000 neurons and the Softmax layer are used to produce the classification output probability.
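As a hedged illustration of the VMHI branch (the paper does not specify a deep learning framework), a VGG-16 action classifier of the kind described above could be set up as follows in PyTorch/torchvision; the function name and the choice of framework are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_vmhi_classifier(num_actions: int) -> nn.Module:
    """VGG-16 (five conv blocks + three FC layers, cf. Fig. 6) with the
    final 1000-way layer replaced by an action classifier."""
    net = models.vgg16()                               # randomly initialized VGG-16
    net.classifier[6] = nn.Linear(4096, num_actions)   # last FC layer: 4096 -> action classes
    return net

model = build_vmhi_classifier(num_actions=6)   # e.g., the six KTH classes
x = torch.randn(1, 3, 224, 224)                # an MHI resized to 224 x 224 (replicated to 3 channels)
probs = torch.softmax(model(x), dim=1)         # Softmax scores later used as the M-score
```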


Fig. 6 The architecture of the VGG-16 model trained and tested with the MHIs and RGB frames

2.2 FRGB algorithm architecture

2.2.1 Extraction of moving target annotation

Annotating data manually is time-consuming. Acuna et al. [30] proposed an interactive object labeling tool called Polygon-RNN++, which followed the idea of Polygon-RNN [31]. However, the Polygon-RNN++ is designed to annotate the contour of an object with precise curves; in this work, the Kalman filter is instead used to annotate the moving target in each frame with a minimum bounding rectangle, and the information on the obtained rectangle coordinates is then used for model training. The Kalman filter is an optimal linear recursive filtering method that is extremely efficient at solving target tracking problems [32]. It is insensitive to noise and has the capability of adaptive and predictive correction. The Kalman filter is based on a state equation and a measurement equation, and it uses recursive methods to predict changes in linear systems. The state equation (4) and the measurement equation (5) are, respectively, defined by:

$$x_k = A_{k,k-1}\, x_{k-1} + \xi_{k-1} \quad (4)$$

$$z_k = H_k\, x_k + \eta_k \quad (5)$$

where x_k is the state at time k, z_k is the measured value at time k, A_{k,k−1} is the state transition matrix, H_k is the measurement matrix, ξ_k is the system noise with ξ_k ∼ N(0, Q_k), η_k is the measurement noise with η_k ∼ N(0, R_k), and Q_k and R_k are the covariances of ξ_k and η_k, respectively. The Kalman filter can be summarized as a process of state prediction (6, 7) and state correction (8–10).

The state prediction equation is given by:

$$\hat{x}_{k,k-1} = A_{k,k-1}\, \hat{x}_{k-1} \quad (6)$$

The error covariance prediction equation is given by:

$$P_{k,k-1} = A_{k,k-1}\, P_{k-1}\, A_{k,k-1}^{T} + Q_{k-1} \quad (7)$$

The Kalman filter gain is defined by:

$$K_k = P_{k,k-1} H_k^{T} \left( H_k P_{k,k-1} H_k^{T} + R_k \right)^{-1} \quad (8)$$

The state correction equation is given by:

$$\hat{x}_k = \hat{x}_{k,k-1} + K_k \left( z_k - H_k\, \hat{x}_{k,k-1} \right) \quad (9)$$

The error covariance correction equation is given by:

$$P_k = P_{k,k-1} - K_k H_k P_{k,k-1} \quad (10)$$

The state prediction is based on the state equation (4); the state prediction vector x̂_{k,k−1} and the error covariance prediction matrix P_{k,k−1} are obtained by (6) and (7). The state correction is based on the measurement equation (5); it corrects the state prediction vector, determines the corrected state vector x̂_k, and yields the minimum error covariance matrix.
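The prediction-correction cycle of Eqs. (6)–(10) can be written compactly in NumPy, as in the sketch below. This is illustrative only; the matrices A, H, Q, and R (here, a constant-velocity model of the bounding-box center) are assumptions, not the authors' settings.

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One Kalman filter cycle: prediction (Eqs. 6-7), gain (Eq. 8), correction (Eqs. 9-10)."""
    # state and covariance prediction
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Kalman gain
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # state and covariance correction using the measurement z
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new

# Example: constant-velocity tracking of a box center (state = [cx, cy, vx, vy]).
A = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
Q, R = 0.01 * np.eye(4), 1.0 * np.eye(2)
x, P = np.zeros(4), np.eye(4)
x, P = kalman_step(x, P, z=np.array([120.0, 80.0]), A=A, H=H, Q=Q, R=R)
```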


The Kalman filter detection proceeds as follows. First, the initial parameters are set and the video sequence is read. Next, background estimation is performed to generate an initial background image. Then, the video sequence is read frame by frame, and the foreground target in the current frame is obtained from the background and the current-frame data, which is the estimate produced by the Kalman filter algorithm from the previous-frame data. Further, the foreground target is formed by connecting the regions using the morphological dilation algorithm, and the moving target is thereby detected. Finally, the coordinates of the moving target are saved as an input for neural network training.

2.2.2 Faster R-CNN model training

The Faster R-CNN model is trained with the images and annotations obtained by the Kalman filter algorithm. The Faster R-CNN mainly includes the VGG-16 convolutional layers, the RPN, ROI Pooling, and the classification layer. The VGG-16 feature extraction convolutional layers are shown in Fig. 6a. The output of the fifth part denotes the RPN input. Since the classical methods for region proposal generation, such as the Sliding Window and the Selective Search used in R-CNN, are time-consuming, the Faster R-CNN employs the RPN to generate region proposals, which dramatically reduces the running time [13]. As shown in Fig. 7, a 3 × 3 sliding window first slides on the feature map, mapping the center point of the current 3 × 3 region back to the original image. Anchors with areas of (128², 256², 512²) and length–width ratios of (1:1, 1:2, 2:1) are placed on the original image; therefore, each pixel corresponds to nine anchors. The formula for mapping the anchors back to the original image can be expressed as follows:

$$(x, y) = \left( S x',\; S y' \right) \quad (11)$$

where S represents the cumulative stride (the product of the strides) of the convolutional neural network, x and y represent the coordinates on the original image, and x' and y' represent the coordinates on the feature map.

Fig. 7 The RPN architecture

Then, the anchors are fed to two parallel fully connected layers, the box-regression layer and the box-classification layer. The box-regression layer is used to adjust the position of a candidate box, and the box-classification layer is used to distinguish whether the object in an anchor is a target or not. Finally, the proposals are saved for the following ROI Pooling. Since the sizes of the input proposals are not consistent, ROI Pooling is introduced to solve this problem. Assume the ROI coordinates are (x0, y0, x1, y1), so the input size is (y1 − y0) × (x1 − x0). Then, when the output size is height_pool × width_pool, the size of the sliding kernel of the ROI, kernel_sliding, is defined by:

$$\mathrm{kernel}_{\mathrm{sliding}} = \frac{y_1 - y_0}{\mathrm{height}_{\mathrm{pool}}} \times \frac{x_1 - x_0}{\mathrm{width}_{\mathrm{pool}}} \quad (12)$$

After the ROI Pooling layer, the proposal feature maps are extracted as the input of the classification stage. The fully connected layer and the Softmax layer are used to judge which human action the target belongs to, and the corresponding probability is determined. At the same time, bounding box regression is used to make the detected target box more accurate. The bounding box regression targets are given by:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a \quad (13)$$

$$t_w = \ln(w/w_a), \quad t_h = \ln(h/h_a) \quad (14)$$

$$t_x^{*} = (x^{*} - x_a)/w_a, \quad t_y^{*} = (y^{*} - y_a)/h_a \quad (15)$$

$$t_w^{*} = \ln(w^{*}/w_a), \quad t_h^{*} = \ln(h^{*}/h_a) \quad (16)$$

where x, y, w, and h denote the box's center coordinates, width, and height, respectively; x, x_a, and x* are the x coordinates of the predicted box, anchor box, and ground-truth box, respectively; y, y_a, and y* are the y coordinates of the predicted box, anchor box, and ground-truth box, respectively; w, w_a, and w* are the widths of the predicted box, anchor box, and ground-truth box, respectively; and h, h_a, and h* are the heights of the predicted box, anchor box, and ground-truth box, respectively.
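For reference, the box-regression parameterization of Eqs. (13)–(16) and its inverse can be sketched in NumPy as follows; boxes are given as center coordinates plus width and height, and the helper names are hypothetical.

```python
import numpy as np

def encode_box(anchor, box):
    """Regression targets of Eqs. (13)-(16) for a box relative to an anchor.
    Both are given as (x_center, y_center, width, height)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode_box(anchor, t):
    """Invert the encoding: recover a box from predicted regression outputs t."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([tx * wa + xa, ty * ha + ya, wa * np.exp(tw), ha * np.exp(th)])

# round-trip check on a hypothetical anchor/ground-truth pair
anchor = (100.0, 100.0, 128.0, 128.0)
gt = (110.0, 96.0, 140.0, 150.0)
assert np.allclose(decode_box(anchor, encode_box(anchor, gt)), gt)
```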


2.3 Decision level fusion

The combination of multiple classifiers' results has long been a topic of discussion [33]. In this paper, we use the probability scores generated by the Softmax classifier to combine the two streams of RGB frames and MHI, as shown in Fig. 1. The Softmax classifier is generally used to solve multi-classification problems, and it is defined by:

$$P(i) = \exp\!\left(\theta_i^{T} x\right) \Big/ \sum_{k=1}^{K} \exp\!\left(\theta_k^{T} x\right) \quad (17)$$

The output of the Softmax classifier P(i) is a normalized classification probability, so the value of P(i) is at most 1. In (17), θ_i^T x denotes the input (logit) of output node i. When the final output node is selected, the node with the highest probability is taken as the prediction target. In this paper, the M-score denotes the Softmax probability of the VMHI algorithm, and the R-score denotes the Softmax probability of the FRGB algorithm. The fusion of these scores is formulated by (18); the behavior type corresponding to the highest score is taken as the final behavior type.

$$FS = \max\left[ M_{\mathrm{score}},\; R_{\mathrm{score}} \right] \quad (18)$$
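The decision-level fusion of Eqs. (17)–(18) amounts to an element-wise maximum over the two Softmax score vectors followed by an argmax. A minimal NumPy sketch, with hypothetical scores and class names, is given below.

```python
import numpy as np

def softmax(logits):
    # Eq. (17), written with the usual max-subtraction for numerical stability
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def fuse(m_logits, r_logits, class_names):
    m_score = softmax(np.asarray(m_logits, dtype=float))   # VMHI branch (M-score)
    r_score = softmax(np.asarray(r_logits, dtype=float))   # FRGB branch (R-score)
    fused = np.maximum(m_score, r_score)                   # Eq. (18)
    best = int(np.argmax(fused))
    return class_names[best], float(fused[best])

classes = ["walking", "jogging", "running", "boxing", "hand waving", "hand clapping"]
label, score = fuse([2.1, 0.3, 0.1, 0.2, 1.8, 0.0],
                    [1.9, 0.2, 0.4, 0.1, 2.3, 0.1], classes)
```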


3 Experiments and results

We conducted experiments on four different datasets to evaluate the performance of our approach: the KTH dataset [34], the Weizmann dataset [35], the UT-interaction dataset [36], and the TenthLab dataset. The KTH and Weizmann datasets consist of videos that represent single-human actions, while the UT-interaction dataset contains human–human interactions. The results obtained on each of these datasets were compared with previously reported results to demonstrate the performance of our approach. For each class in the first three datasets, there were 500 RGB images and 500 MHIs in the experiment, 80% for training and 20% for testing. The TenthLab dataset contains both single-human actions and human–human interactions; for each of its classes, there were 200 RGB images and 200 MHIs in the experiment, 80% for training and 20% for testing. In the MHI block, τ was set to 100 and δ was set to 2 based on our previous research; the value of ξ differed between datasets and was set in the range 50–75. In the VGG-16 model, the learning rate was set to 10⁻⁴, the weight decay was set to 0.0005, and the momentum was set to 0.9 with a batch size of 16. All the values were chosen based on previously reported results and our analysis of the datasets. In the testing process, each MHI was generated from 10 consecutive frames and then input into the trained VGG-16 model for judgment. After that, the last frame of each set of 10 successive frames was input into the trained Faster R-CNN model for testing. Finally, the results of the two models were combined. All of the experiments were performed on a PC with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz, 32 GB RAM, and an NVIDIA GeForce GTX 1080 GPU, running a Windows 10 64-bit operating system.

Fig. 8 Frame examples of KTH dataset

Fig. 9 Frame examples of Weizmann dataset

Fig. 10 Frame examples of UT-interaction dataset

Fig. 11 Frame examples of TenthLab dataset

Fig. 12 Recognition results of KTH dataset (each output box is associated with a category label and a softmax score)

Fig. 13 Recognition results of Weizmann dataset (each output box is associated with a category label and a softmax score)

Fig. 14 Recognition results of UT-interaction dataset (each output box is associated with a category label and a softmax score)

3.1 Performance metrics used for the evaluation

The confusion matrix is a highly visual analysis table, which summarizes the classification results of the model in the form of a matrix. From the confusion matrix, the four values of accuracy, precision, recall, and F-score can be obtained. These four values can be calculated from the true positive (TP), false negative (FN), false positive (FP), and true negative (TN) values in the confusion matrix.

If the index of each type of action is i, then TP_i is the number of pictures whose real and predicted action category is action i; FP_i is the number of pictures whose real action category is not action i but which are recognized as action i; FN_i is the number of pictures whose real action category is action i but which are not recognized as action i; and TN_i is the number of pictures whose real and predicted action category is not action i. Precision (π) and recall (ρ) are defined as follows:

$$\pi = TP_i / (TP_i + FP_i) \quad (19)$$

$$\rho = TP_i / (TP_i + FN_i) \quad (20)$$

The F-score (F) and accuracy (A) are calculated as follows:

$$F = 2\pi\rho / (\rho + \pi) \quad (21)$$

$$A = (TP_i + TN_i) / (TP_i + FP_i + FN_i + TN_i) \quad (22)$$
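The per-class measures of Eqs. (19)–(22) follow directly from a confusion matrix such as the one in Fig. 20. A minimal NumPy sketch, assuming a square matrix with rows as actual classes and columns as predicted classes, is shown below.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, F-score and accuracy per class, Eqs. (19)-(22).
    cm: square confusion matrix, rows = actual class, columns = predicted class."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp          # actually class i but predicted as another class
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)                                  # Eq. (19)
    recall = tp / (tp + fn)                                     # Eq. (20)
    f_score = 2 * precision * recall / (precision + recall)     # Eq. (21)
    accuracy = (tp + tn) / total                                # Eq. (22)
    return precision, recall, f_score, accuracy
```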


Fig. 15 Recognition results of TenthLab dataset (each output box is associated with a category label and a softmax score)

Fig. 16 Comparison of two-stream-based human action algorithm in KTH dataset

3.2 Description of datasets

3.2.1 KTH dataset

The KTH video dataset [34] consists of videos that represent six different human actions: walking, jogging, running, boxing, hand waving, and hand clapping, as shown in Fig. 8. These videos were performed by 25 actors in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. Each video was taken at a resolution of 160 × 120 pixels, 25 fps.

3.2.2 Weizmann dataset

The Weizmann dataset [35] contains videos of nine different actors performing 10 types of human actions: run, walk, skip, jumping-jack (jack), jump-forward-on-two-legs (jump), jump-in-place-on-two-legs (pjump), gallop-side-ways (side), wave-two-hands (wave2), wave-one-hand (wave1), and bend, as shown in Fig. 9. Each video was taken at a resolution of 180 × 144 pixels, 50 fps.

3.2.3 UT-interaction dataset

The UT-interaction video dataset [36] contains videos that represent six types of human–human interactions: handshaking, pointing, hugging, pushing, kicking, and punching, as shown in Fig. 10. Several participants under 15 different clothing conditions appear in the videos. The videos are divided into two sets: Set 1 was recorded in a parking lot, and Set 2 was recorded on a lawn on a windy day. The videos were taken at a resolution of 720 × 480 pixels, 30 fps, and the person height in the videos is about 200 pixels.

3.2.4 TenthLab dataset

The TenthLab dataset was collected on the 10th floor of the Optical Electrical building at the University of Shanghai for Science and Technology. It contains videos of three different actors performing 10 types of human actions (6 types of single-human actions and 4 types of human–human interactions): walk, run, wave1, wave2, crouch, bench, hand, hug, push, and kick, as shown in Fig. 11. Each video was taken at a resolution of 1920 × 1080 pixels, 30 fps. Compared with the three public datasets, the TenthLab dataset is more suitable for the actual scene: firstly, an overhead view angle was used to collect the videos; secondly, the direction of movement of the people in the videos changes.


Fig. 17 Comparison of two-stream-based human action algorithm in Weizmann dataset

Fig. 18 Comparison of two-stream-based human action algorithm in UT-interaction dataset

Fig. 19 Comparison of two-stream-based human action algorithm in TenthLab dataset

Fig. 20 Confusion matrix of the KTH dataset (rows: actual action, columns: predicted action, in %)

                 walking  handwaving  running  boxing  jogging  handclapping
walking              100           0        0       0        0             0
handwaving             0         100        0       0        0             0
running                0           0       96       0        4             0
boxing                 0           0        0     100        0             0
jogging                0           0        3       0       97             0
handclapping           0           0        0       0        0           100


3.3 Results analysis

3.3.1 Experimental results

The last frame of each set of ten consecutive frames was fed to the trained Faster R-CNN model for testing, and the score obtained by the softmax layer of the Faster R-CNN model was combined with the score obtained by the softmax layer of VGG-16 to obtain the final category. At the same time, the target position was determined and marked with a red rectangle. The results on the four datasets are displayed in Figs. 12, 13, 14 and 15, where the numbers in the brackets denote the highest scores.

3.3.2 Performance analysis

Figures 16, 17, 18, and 19 depict the comparison of the two-stream-based human action algorithm on the KTH dataset, Weizmann dataset, UT-interaction dataset, and TenthLab dataset. The combination of the VMHI and FRGB algorithms makes the results more accurate.

The confusion matrices of the four datasets are illustrated in Figs. 20, 21, 22, and 23. For each class in the first three datasets, there were 100 RGB images and 100 MHIs for testing; for each class in the TenthLab dataset, there were 40 RGB images and 40 MHIs for testing. Each MHI corresponded to the last frame of 10 consecutive RGB images in a video. In these matrices, rows represent the actual classes and columns represent the predicted classes. In Figs. 20, 21, 22, and 23, the diagonal values denote the rate of correctly recognized actions, and the off-diagonal values denote the rate of misrecognized actions.

The proposed algorithm showed strong applicability because it could not only recognize single-human actions but also achieve good recognition of human–human interactions.

Fig. 21 Confusion matrix of the UT-interaction dataset (rows: actual action, columns: predicted action, in %)

               handshaking  pushing  hugging  pointing  kicking  punching
handshaking            100        0        0         0        0         0
pushing                  0       98        0         0        0         2
hugging                  0        0      100         0        0         0
pointing                 0        0        0       100        0         0
kicking                  0        0        0         0      100         0
punching                 0        4        0         0        0        96

Fig. 22 Confusion matrix of the Weizmann dataset (rows: actual action, columns: predicted action, in %)

        walk  run  jump  jack  skip  pjump  side  wave2  wave1  bend
walk     100    0     0     0     0      0     0      0      0     0
run        0   96     0     0     4      0     0      0      0     0
jump       0    0   100     0     0      0     0      0      0     0
jack       0    0     0   100     0      0     0      0      0     0
skip       0    5     0     0    95      0     0      0      0     0
pjump      0    0     0     0     0    100     0      0      0     0
side       0    0     0     0     0      0   100      0      0     0
wave2      0    0     0     0     0      0     0    100      0     0
wave1      0    0     0     0     0      0     0      0    100     0
bend       0    0     0     0     0      0     0      0      0   100

Fig. 23 Confusion matrix of the TenthLab dataset


The proposed approach performed well in the recognition of walking, hand waving, boxing, and hand clapping in the KTH dataset; walking, jumping-jack, jump-forward-on-two-legs, jump-in-place-on-two-legs, gallop-side-ways, wave-two-hands, wave-one-hand, and bending in the Weizmann dataset; handshaking, hugging, pointing, and kicking in the UT-interaction dataset; and wave1, wave2, hug, and kick in the TenthLab dataset. However, some actions were misrecognized because their features are similar in motion and shape. In the KTH dataset, the recognition accuracy for running was 96%, and 4% of frames were misclassified as jogging, while the recognition accuracy for jogging was 97%, and 3% of frames were misclassified as running. In the UT-interaction dataset, the recognition accuracy for pushing was 98%, and 2% of frames were misclassified as punching, while the recognition accuracy for punching was 96%, and 4% of frames were misclassified as pushing. Further, in the Weizmann dataset, the recognition accuracy for running was 96%, and 4% of frames were misclassified as skipping, while the recognition accuracy for skipping was 95%, and 5% of frames were misclassified as running. Lastly, in the TenthLab dataset, the recognition accuracy for walk was 97.5%, and 2.5% of frames were misclassified as run, while the recognition accuracy for run was 95%, and 5% of frames were misclassified as walk; the recognition accuracy for crouch was 95%, and 5% of frames were misclassified as bench, while the recognition accuracy for bench was 97.5%, and 2.5% of frames were misclassified as crouch; the recognition accuracy for hand was 97.5%, and 2.5% of frames were misclassified as push, while the recognition accuracy for push was 95%, and 5% of frames were misclassified as hand. Tables 1, 2, 3, and 4 provide the performance measures, accuracy, precision, recall, and F-score, for the various actions in the KTH dataset, the Weizmann dataset, the UT-interaction dataset, and the TenthLab dataset. The results show that the proposed approach is robust in terms of human action recognition.

Table 1 Experimental evaluation of KTH dataset

Action          Accuracy (%)  Precision  Recall  F-score
Walking                  100      1          1    1
Hand waving              100      1          1    1
Running                   96      0.970      1    0.985
Boxing                   100      1          1    1
Jogging                   97      0.960      1    0.980
Hand clapping            100      1          1    1

Table 2 Experimental evaluation of Weizmann dataset

Action   Accuracy (%)  Precision  Recall  F-score
Walk              100      1          1    1
Run                96      0.950      1    0.974
Jump              100      1          1    1
Jack              100      1          1    1
Skip               95      0.960      1    0.980
Pjump             100      1          1    1
Side              100      1          1    1
Wave2             100      1          1    1
Wave1             100      1          1    1
Bend              100      1          1    1

Table 3 Experimental evaluation of UT-interaction dataset

Action        Accuracy (%)  Precision  Recall  F-score
Handshaking            100      1          1    1
Pushing                 98      0.961      1    0.980
Hugging                100      1          1    1
Pointing               100      1          1    1
Kicking                100      1          1    1
Punching                96      0.980      1    0.990

Table 4 Experimental evaluation of TenthLab dataset

Action   Accuracy (%)  Precision  Recall  F-score
Walk              97.5      0.951      1    0.975
Run               95        0.974      1    0.987
Wave1            100        1          1    1
Wave2            100        1          1    1
Crouch            95        0.904      1    0.950
Bench             90        0.947      1    0.973
Hand              97.5      0.951      1    0.975
Hug              100        1          1    1
Push              95        0.974      1    0.987
Kick             100        1          1    1

Fig. 24 Computational time for each approach part

To conduct a quantitative evaluation, we calculated the average recognition rate and compared the obtained results with the results of the state-of-the-art recognition approaches proposed by Qian [37], Xu [38], Chou [39], Ko [40], Wang [41], Vishwakarma [42], Sahoo [43], and Vishwakarma [44].


Table 5 Recognition accuracy comparison of different approaches on different datasets (%)

Approach                   KTH dataset  Weizmann dataset  UT-interaction dataset
Qian et al. [37]                 96.66             97.78                       –
Xu et al. [38]                   95.80             99.10                       –
Chou et al. [39]                 90.58             95.56                       –
Ko et al. [40]                       –                 –                   83.80
Wang et al. [41]                     –                 –                   85.80
Vishwakarma et al. [42]          96.60             97.50                       –
Sahoo et al. [43]                    –                 –                   87.50
Vishwakarma et al. [44]          96.66             96.00                  100.00
Ours                             98.83             99.10                   99.00

The chosen approaches are the most recent approaches related to the subject of this paper. Qian [37] re-interpreted the newly proposed spatial-temporal motion accumulative images and proposed a hierarchical cascaded classifier based on multiple nearest neighbor classifiers. Xu [38] proposed a two-stream dictionary learning architecture to detect human actions, which consisted of an interest patch (IP) detector and descriptor, two-stream dictionary models, and an SVM for classification. Chou [39] compared three generic systems for multi-view video-based human action recognition, namely, the nearest neighbor classifier, the Gaussian mixture model classifier, and the nearest mean classifier; the nearest mean classifier showed the best performance, so we used it in the comparison. Ko [40] extracted the corresponding frames to construct a separate dataset and trained the VGG-16 model using the Caffe library. Wang [41] proposed a new approach for interaction recognition based on the sparse representation of feature covariance matrices. Vishwakarma [42] proposed an efficient and robust HAR framework by integrating the spatial distribution of gradients (SDGs) and difference of Gaussian (DoG)-based spatiotemporal interest points (STIP). Sahoo [43] proposed a local maxima of difference image (LMDI)-based interest point detection technique. Vishwakarma [44] proposed a twofold transformation via the Gabor wavelet transform (GWT) and the ridgelet transform (RT) to identify human actions. The comparison results for all the mentioned approaches and our approach are given in Table 5, where it can be seen that the recognition accuracy of our approach was better than that of the other approaches. In Table 5, some of the cells do not contain a number because results for those datasets are not available in the literature.

To evaluate the running time of the approach, we divided it into four parts: MHI, VGG-16, Kalman filter, and Faster R-CNN. The average computational time for each part is depicted in Fig. 24, which shows that the proposed approach for human action recognition can be performed in an affordable time.

4 Conclusions

In this paper, an approach for human behavior recognition based on the MHI, RGB frames, and CNN is proposed. The proposed approach combines the MHI and the VGG-16 deep convolutional neural network to capture the spatiotemporal information, and it uses the Kalman filter and the Faster R-CNN deep architecture to capture the static information and detect the target location. Extensive experiments were conducted to evaluate the performance of the proposed approach on four datasets: the KTH, Weizmann, UT-interaction, and TenthLab datasets. The experimental results showed that the average accuracy of our approach was 98.83%, 99.10%, 99.00%, and 97.00% on these datasets, respectively. Moreover, the results obtained by our approach were compared with the results of the state-of-the-art approaches, and it was shown that our approach had better classification accuracy than the other advanced recognition approaches. Therefore, the proposed approach is suitable for human action recognition. However, the introduced algorithm still has two drawbacks. Firstly, the recognition of more complicated multi-human interactions, such as a fight involving three persons, is not efficient. Secondly, two people at a very close distance from each other can easily be recognized as interacting. In our future work, we will try to improve the proposed approach to solve these problems.

Acknowledgements Shanghai Natural Science Foundation (No. 17ZR1443500), Fund Project of National Natural Science Foundation of China (No. 61701296), Joint Funds of the National Natural Science Foundation of China (No. U1831133).

References

1. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)
2. Fujiyoshi, H., Lipton, A.J.: Real-time human motion analysis by image skeletonization. Appl. Comput. Vis. 87, 113–120 (1998)
3. Yang, X., Tian, Y.L.: Effective 3D action recognition using EigenJoints. J. Vis. Commun. Image Represent. 25(1), 2–11 (2014)


4. Chaudhry, R., Ravichandran, A., Hager, G.: Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 20–25 (2009)
5. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
6. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: IEEE International Conference on Pattern Recognition, pp. 23–26 (2004)
7. Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 43–48 (2009)
8. Hu, X., Huang, Y., Duan, Q., et al.: Abnormal event detection in crowded scenes using histogram of oriented contextual gradient descriptor. EURASIP J. Adv. Signal Process. 2018(1), 54 (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems (2012)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
11. Szegedy, C., Liu, W., Jia, Y.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
12. He, K., Zhang, X., Ren, S.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
13. Ren, S., He, K., Girshick, R.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems (2015)
14. Redmon, J., Divvala, S., Girshick, R.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
15. Liu, W., Anguelov, D., Erhan, D.: SSD: single shot multibox detector. In: European Conference on Computer Vision (2016)
16. Li, C., Wang, P., Wang, S.: Skeleton-based action recognition using LSTM and CNN. In: IEEE International Conference on Multimedia and Expo Workshops (2017)
17. Donahue, J., Hendricks, L.A., Guadarrama, S.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
18. Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
19. Wang, X., Gao, L., Song, J.: Beyond frame-level CNN: saliency-aware 3D CNN with LSTM for video action recognition. IEEE Signal Process. Lett. 99, 1 (2016)
20. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Conference and Workshop on Neural Information Processing Systems (2014)
21. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
22. Wang, L., Xiong, Y., Wang, Z.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision (2016)
23. Chen, J., Wu, J., Konrad, J.: Semi-coupled two-stream fusion convnets for action recognition at extremely low resolutions. In: IEEE Winter Conference on Applications of Computer Vision (2017)
24. Wang, X., Gao, L., Wang, P.: Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 20, 634–644 (2018)
25. Zhao, R., Ali, H., Smagt, P.V.D.: Two-stream RNN/CNN for action recognition in 3D videos. In: IEEE International Conference on Intelligent Robots and Systems (2017)
26. Afrasiabi, M., Khotanlou, H., Mansoorizadeh, M.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01722-6
27. Imran, J., Raman, B.: Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition. J. Ambient Intell. Hum. Comput. 11, 189–208 (2020)
28. Yi, Y., Li, A., Zhou, X.F.: Human action recognition based on action relevance weighted encoding. Signal Process. Image Commun. 80, 115640 (2020)
29. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
30. Acuna, D., Ling, H., Kar, A.: Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
31. Castrejon, L., Kundu, K., Urtasun, R.: Annotating object instances with a polygon-RNN. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
32. Siswantoro, J., Prabuwono, A.S., Abdullah, A.: A linear model based on Kalman filter for improving neural network classification performance. Expert Syst. Appl. 49, 112–122 (2016)
33. Duin, R.P.W.: The combining classifier: to train or not to train. In: International Conference on Pattern Recognition (2002)
34. The KTH Dataset: http://www.nada.kth.se/cvap/actions/. Accessed 18 Jan. (2005)
35. The Weizmann Dataset: http://www.wisdom.weizmann.ac.il/. Accessed 24 Dec. (2007)
36. The UT-Interaction Dataset: http://cvrc.ece.utexas.edu/SDHA2010 (2007)
37. Qian, H., Zhou, J., Mao, Y.: Recognizing human actions from silhouettes described with weighted distance metric and kinematics. Multimed. Tools Appl. 76, 21889–21910 (2017)
38. Xu, K., Jiang, X., Sun, T.: Two-stream dictionary learning architecture for action recognition. IEEE Trans. Circuits Syst. Video 27, 567–576 (2017)
39. Chou, K.P., Prasad, M., Wu, D.: Robust feature-based automated multi-view human action recognition system. IEEE Access 6, 1 (2018)
40. Ko, K.E., Sim, K.B.: Deep convolutional framework for abnormal activities recognition in a smart surveillance system. Eng. Appl. Artif. Intell. 67, 226–234 (2018)
41. Wang, J., Zhou, S.C., Xia, L.M.: Human interaction recognition based on sparse representation of feature covariance matrices. J. Central South Univ. 25(2), 304–314 (2018)
42. Vishwakarma, D.K., Dhiman, C.: A unified model for human activity recognition using spatial distribution of gradients and difference of Gaussian kernel. Vis. Comput. 35, 1595–1613 (2019)
43. Sahoo, P.S., Ari, S.: On an algorithm for human action recognition. Expert Syst. Appl. 115, 524–534 (2019)
44. Vishwakarma, D.K.: A twofold transformation model for human action recognition using decisive pose. Cognit. Syst. Res. 61, 1–13 (2020)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Congcong Liu (1994–) is a master student at the School of Optical Electrical and Computer Engineering, University of Shanghai for Science and Technology. Her research interests cover image processing and pattern recognition.

Jie Ying received the Ph.D. degree in optical engineering from the University of Shanghai for Science and Technology. She is an Associate Professor with the University of Shanghai for Science and Technology, China. Her research interests include image processing, pattern recognition, and intelligent detection.

Haima Yang (1979–) received the Ph.D. degree in Signal and Information Processing from Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 2015. He is an Associate Professor with the Instrument Science and Technology discipline, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China. His research interests include digital signal analysis and processing, SPR sensor mechanism and simulation, pattern recognition system development, and symbolic slider variable structure control.

Xing Hu received the Ph.D. degree in control science and engineering from Shanghai Jiao Tong University in 2016. He is currently a Lecturer with the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China. His current research interests include image processing, computer vision, and machine learning.

Jin Liu (1978–) is currently an Associate Professor and Associate Dean at the School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China. Her research interests include intelligent detection and control technology, distributed sensor networks, and test information acquisition and processing.
