https://doi.org/10.1007/s00371-020-01868-8
ORIGINAL ARTICLE

Improved human action recognition approach based on two-stream convolutional neural network model

C. Liu et al.
Abstract
In order to improve the accuracy of human abnormal behavior recognition, a two-stream convolutional neural network model is proposed. The model includes two main parts, VMHI and FRGB. First, motion history images are extracted and input into a VGG-16 convolutional neural network for training. Then, the RGB images are input into the Faster R-CNN algorithm for training, using Kalman filter-assisted data annotation. Finally, the two-stream VMHI and FRGB results are fused. The algorithm can recognize not only single-person behavior but also two-person interaction behavior, and it improves the recognition accuracy of similar actions. Experimental results on the KTH, Weizmann, UT-interaction, and TenthLab datasets show that the proposed algorithm achieves higher accuracy than other methods in the literature.
Keywords Human action recognition · Kalman filter · Motion history image · Faster R-CNN · Video surveillance
in three-dimensional space and time, the space-time interest points were obtained, the pixel histogram statistics of the space-time interest points were computed, and finally, the feature vector describing actions was formed. Rapantzikos et al. [7] applied the discrete wavelet transform in three dimensions and selected points of interest in space and time based on the low-pass and high-pass filtering responses in each dimension. Hu et al. [8] proposed a novel histogram of oriented contextual gradient (HOCG) descriptor for AED (abnormal event detection) based on the contextual gradients.

The development of deep learning has contributed to significant progress in the target detection field. Many deep learning models have been proposed for target detection, such as AlexNet [9], VGGNet [10], GoogleNet [11], ResNet [12], Faster R-CNN [13], YOLO [14], and SSD [15]. Also, many researchers have applied different deep learning methods to human action recognition. Li et al. [16] proposed a method based on LSTM (long short-term memory) and CNN (convolutional neural network); first, different features were extracted and then input, respectively, into three LSTM networks and seven CNN networks. Next, all ten networks were fused, and three fusion methods, maximum fusion, average fusion, and element-by-element multiplication fusion, were adopted. Lastly, the final results were output. Donahue et al. [17] proposed the LRCN (long-term recurrent convolutional network), where the spatial information was extracted by a CNN, the temporal information was then extracted from the video by an LSTM network, and the result was finally classified. Ji et al. [18] proposed a method based on a 3D-CNN, where the time dimension was added on the basis of the 2D-CNN. This method could extract both spatial and temporal information from a video. Wang et al. [19] proposed an improved recognition network combining a 3D CNN and LSTM, effectively reducing the number of network parameters and facilitating the training process. Simonyan et al. [20] put forward the two-stream convolutional network; they adopted two identical CNNs, of which one obtained spatial information from the input video frame, and the other obtained temporal information from the optical flow computed from the video; the two networks were then fused by means of average fusion or fusion classification using an SVM (support vector machine), and the best performance was achieved by using the SVM for fusion classification. Feichtenhofer et al. [21] improved the fusion strategy and conducted the fusion from the middle layer of a two-stream network. The experimental results showed that the proposed strategy was better than the original two-stream network, and the number of parameters was significantly reduced. On the basis of a two-stream network, Wang et al. [22] introduced the idea of segmentation and sparse sampling and proposed the TSN (temporal segment network), which could fuse multiple segments and obtain more context information. Chen et al. [23] incorporated a semi-coupled two-stream fusion network and applied it to video with extremely low resolution for behavior recognition; additive fusion, splicing fusion, and convolution fusion methods were proposed. Wang et al. [24] used a 3D CNN instead of a 2D CNN and adopted STPP (spatiotemporal pyramid pooling) in the last convolution layer to achieve consistent feature dimensions of the output. Zhao et al. [25] used a 3D CNN, RNN, and bidirectional GRU, where the human skeleton sequence was used as the input. Afrasiabi et al. [26] extracted optical flow fields from video frames using convolutional neural networks as features. Imran et al. [27] proposed a three-stream architecture for fusion of RGB, inertial, and skeleton data. Yi et al. [28] proposed a new trajectory descriptor based on HOG, HOF, and MBH, as well as a novel approach to calculate a saliency map based on optical flow. However, there are still many problems in human action recognition, such as the differentiation of similar actions, the recognition of interaction between people, and the need for manual annotation of a large amount of data.

In this paper, we present a robust pedestrian action recognition approach based on the motion history image (MHI), RGB frames, and a convolutional neural network. We study the relationship between a single frame and the continuous frames of a video. The proposed approach is capable of locating individuals in a fixed video and recognizing single-human motion and human–human interactions. Moreover, we use a Kalman filter for data annotation to assist manual annotation. The entire algorithm includes the VMHI (VGG-16 and MHI) structure, the FRGB (Faster R-CNN and RGB frames) structure, and the resulting fusion structure. During training, in the VMHI branch, one MHI image is generated for every ten consecutive frames of a video and then fed to the VGG-16 neural network input. In the FRGB branch, the annotation information obtained by the Kalman filter and the RGB frames are fed to the Faster R-CNN network input. During testing, the MHI and the last frame of the ten continuous frames are input to the trained VGG-16 model and Faster R-CNN model, respectively, and their output signals are fed to the Softmax classifier, whose output is then combined with the border information of the Faster R-CNN, providing the final output result. We performed experiments on the KTH, Weizmann, TenthLab, and UT-interaction datasets to evaluate the performance of the proposed approach and then compared the obtained results with the results of state-of-the-art approaches.

The rest of this paper is organized as follows. Section 2 gives a detailed description of our approach. Section 3 presents the experimental setup and experimental results analysis, and Sect. 4 concludes the paper.
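To make the Kalman-filter-assisted annotation mentioned above more concrete, a minimal sketch is given below. It is not the paper's implementation: the constant-velocity state model, the OpenCV-based tracker, the noise settings, and the function names are all assumptions made for illustration. The idea is that a manually drawn bounding box can be propagated to later frames so that only occasional manual corrections are needed.

    import numpy as np
    import cv2

    def make_box_tracker(cx, cy, w, h):
        """Constant-velocity Kalman filter over a box state (cx, cy, w, h) -- an assumed state model."""
        kf = cv2.KalmanFilter(8, 4)                    # 8 state variables (box + velocities), 4 measurements
        kf.transitionMatrix = np.eye(8, dtype=np.float32)
        for i in range(4):
            kf.transitionMatrix[i, i + 4] = 1.0        # position/size advance by their velocities each frame
        kf.measurementMatrix = np.eye(4, 8, dtype=np.float32)
        kf.processNoiseCov = np.eye(8, dtype=np.float32) * 1e-2    # assumed noise levels
        kf.measurementNoiseCov = np.eye(4, dtype=np.float32) * 1e-1
        kf.statePost = np.array([cx, cy, w, h, 0, 0, 0, 0], dtype=np.float32).reshape(8, 1)
        return kf

    def propagate_annotation(kf, measured_box=None):
        """Predict the box for the next frame; correct with a measurement (e.g., a manual fix) if available."""
        pred = kf.predict()
        if measured_box is not None:
            kf.correct(np.array(measured_box, dtype=np.float32).reshape(4, 1))
        cx, cy, w, h = pred[:4, 0]
        return int(cx - w / 2), int(cy - h / 2), int(w), int(h)   # (x, y, w, h) written to the annotation file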
2 Proposed approach for human action detection

The algorithm proposed in this paper benefits from the MHI, RGB frames, and CNN. The overall framework of the algorithm is shown in Fig. 1, where it can be seen that the proposed algorithm consists of three main parts.

The first part is the VMHI structure. As shown in Fig. 1a, the VMHI consists of the MHI block and the VGG-16 network. The MHI block expresses the target motion in the form of image brightness. An MHI image is generated from 10 successive frames of a video, and then it is fed to the VGG-16 network input. The second part of the proposed algorithm is the FRGB structure. As shown in Fig. 1b, the FRGB consists of the RGB frames with annotations, the Kalman filter algorithm, and the Faster R-CNN algorithm. The Kalman filter algorithm is used to extract the information on the moving-target position and to generate annotations (path, image size, human action name, and ground-truth information of the human object). The Faster R-CNN deep architecture is used to detect human activities from the RGB frames. The Faster R-CNN mainly includes four parts: the deep convolutional layers (VGG-16), the Region Proposal Network (RPN), the Region of Interest Pooling (ROI Pooling) layer, and the classification layer. The third part of the proposed algorithm is the fusion process, which takes the output of the Softmax algorithm of the VMHI structure and the output of the FRGB structure to determine the target behavior category. Lastly, the information on the target motion type is combined with the target border information of the FRGB output to obtain the final output result.

2.1 VMHI algorithm architecture

2.1.1 Motion history image

The MHI expresses the target motion through image brightness by calculating the change in each pixel value over time. The idea of the MHI was first proposed in [29]. Assume τ denotes the moving time of the human, and δ is the decay parameter. When τ is too small, a part of the motion information is lost. On the other hand, when τ is too large, the pixel intensity cannot be accurately determined, so it is difficult to judge the movement direction. As δ becomes larger, the part that occurs earlier is eliminated first. The MHI intensity value Hτ(x, y, t) is defined by Eq. (1):

Hτ(x, y, t) = τ, if ψ(x, y, t) = 1
             max(0, Hτ(x, y, t − 1) − δ), otherwise   (1)

The update function ψ(x, y, t) is defined by the inter-frame difference method, and ξ is the artificially set difference threshold. Assume I(x, y, t) represents the intensity value of a pixel with coordinates (x, y) in frame t, and Δ represents the inter-frame distance. As ξ becomes bigger, the background noise disappears, but an excessively large ξ leads to a hollow in the center area.

ψ(x, y, t) = 1, if D(x, y, t) ≥ ξ
             0, otherwise   (2)

D(x, y, t) = |I(x, y, t) − I(x, y, t ± Δ)|   (3)

In this paper, we first extract MHI images from the video frames and then organize the images into a dataset, which is used for the VGG-16 neural network training. In the VGG-16 neural network testing, MHI images are also generated from the frames and then fed to the trained VGG-16 network input for judgment. The frames used in training and testing are different. The obtained judgment result is merged with the FRGB structure judgment result, and then the action type is determined.

The MHIs obtained from the benchmark datasets are presented in Figs. 2, 3, 4, and 5, where it can be seen that the brightness of the pixels in the images is high where there is recent motion.
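For illustration, a minimal NumPy sketch of Eqs. (1)–(3) is given below; the parameter values (τ, δ, ξ) and the function names are illustrative assumptions, not the settings used in the paper:

    import numpy as np

    def update_mhi(H, frame_t, frame_prev, tau=255.0, delta=25.0, xi=30.0):
        """One MHI update step following Eqs. (1)-(3); frames are grayscale float arrays."""
        D = np.abs(frame_t - frame_prev)                        # Eq. (3): inter-frame difference
        psi = D >= xi                                           # Eq. (2): update function
        return np.where(psi, tau, np.maximum(0.0, H - delta))   # Eq. (1): refresh moving pixels, decay the rest

    def mhi_from_clip(frames):
        """Accumulate an MHI over a short clip, e.g., the 10 consecutive frames used per sample."""
        H = np.zeros_like(frames[0], dtype=np.float32)
        for prev, cur in zip(frames[:-1], frames[1:]):
            H = update_mhi(H, cur.astype(np.float32), prev.astype(np.float32))
        return H.astype(np.uint8)                               # brightness encodes how recently each pixel moved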
MHI examples (panels from Figs. 2–5): (a) walking, (b) jogging, (c) running; (a) handshaking, (b) pointing, (c) hugging
The Faster R-CNN model is trained with the images and annotations obtained by the Kalman filter algorithm. The Faster R-CNN mainly includes the VGG-16 convolutional layers, the RPN, ROI Pooling, and the classification layer. The VGG-16 feature extraction convolutional layers are shown in Fig. 6a. The output of the fifth part denotes the RPN input. Since the classical methods for region proposal generation, such as the Sliding Window and the Selective Search used in R-CNN, are time-consuming, the RPN is employed in the Faster R-CNN to generate region proposals, which dramatically reduces the running time [13]. As shown in Fig. 7, first a 3×3 sliding window slides on the feature map, mapping the center point of the current 3×3 region back to the original image. Anchors with areas of (128², 256², 512²) and length-width ratios of (1:1, 1:2, 2:1) are placed on the original image. Therefore, each pixel corresponds to nine anchors. The formula for mapping the anchors back to the original image can be expressed as follows:

(x, y) = (S x′, S y′)   (11)

where S represents the final stride product of the convolutional neural network, x and y represent the coordinates on the original image, and x′ and y′ represent the coordinates on the feature map.

Then, the anchors are fed to two parallel fully connected layers, the box-regression layer and the box-classification layer. The box-regression layer is used to adjust the position of a candidate box, and the box-classification layer is used to distinguish whether the object in an anchor is a target or not. Finally, the proposals are saved for the following ROI Pooling. The size of the input proposals is not consistent; ROI Pooling is introduced to solve this problem. The sliding kernel size of the pooling is given by:

kernel_sliding = (y1 − y0)/height_pool × (x1 − x0)/width_pool   (12)

After the ROI Pooling layer, the proposal feature maps are extracted as the input of the classification. The fully connected layer and Softmax layer are used to judge which human action the target belongs to, and then the probability is determined. At the same time, the bounding box regression is used to make the detected target box more accurate. The bounding box regression output is given by:

tx = (x − xa)/wa,  ty = (y − ya)/ha   (13)
tw = ln(w/wa),  th = ln(h/ha)   (14)
tx* = (x* − xa)/wa,  ty* = (y* − ya)/ha   (15)
tw* = ln(w*/wa),  th* = ln(h*/ha)   (16)

where x, y, w, and h denote the box's center coordinates, width, and height, respectively; x, xa, and x* are the x coordinates of the predicted box, anchor box, and ground-truth box, respectively; y, ya, and y* are the y coordinates of the predicted box, anchor box, and ground-truth box, respectively; w, wa, and w* are the widths of the predicted box, anchor box, and ground-truth box, respectively; lastly, h, ha, and h* are the heights of the predicted box, anchor box, and ground-truth box, respectively.

2.3 Decision level fusion

Combination of multiple classifiers' results has always been a discussion topic [33]. In this paper, we use the probability scores generated by the Softmax classifier to combine the two streams of RGB frames and MHI, as shown in Fig. 1.
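As a concrete illustration of this decision-level fusion, the following minimal sketch combines the two streams' softmax score vectors for one test sample. The weighted-average rule, the weight value, and the function name are assumptions made for illustration; the paper's own combination rule should be substituted where it differs.

    import numpy as np

    def fuse_streams(p_vmhi, p_frgb, w=0.5):
        """Fuse the VMHI (VGG-16 on MHI) and FRGB (Faster R-CNN on RGB) softmax scores.
        w is an assumed fusion weight; 0.5 corresponds to simple averaging."""
        p_vmhi = np.asarray(p_vmhi, dtype=np.float32)
        p_frgb = np.asarray(p_frgb, dtype=np.float32)
        fused = w * p_vmhi + (1.0 - w) * p_frgb        # element-wise weighted combination of class scores
        label = int(np.argmax(fused))                  # predicted action category
        return label, fused

    # The predicted label would then be attached to the bounding box returned by the FRGB branch.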
Fig. 11 Frame examples of TenthLab dataset
Fig. 13 Recognition results of Weizmann dataset (each output box is associated with a category label and a softmax score)
F = 2PR/(P + R)   (21)
A = (TPi + TNi)/(TPi + FPi + FNi + TNi)   (22)

where P and R denote the per-class precision and recall, and TPi, FPi, FNi, and TNi are the true-positive, false-positive, false-negative, and true-negative counts for class i.
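These measures can be computed per class from the confusion counts. In the sketch below, the precision and recall expressions are the standard definitions (the paper's corresponding equations are not included in this excerpt), and the function name is illustrative:

    def per_class_metrics(tp, fp, fn, tn):
        """Precision, recall, F-score (Eq. 21), and accuracy (Eq. 22) for one class i."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0   # standard definition, assumed here
        recall = tp / (tp + fn) if (tp + fn) else 0.0      # standard definition, assumed here
        f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0   # Eq. (21)
        accuracy = (tp + tn) / (tp + fp + fn + tn)         # Eq. (22)
        return precision, recall, f_score, accuracy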
Fig. 17 Comparison of two-stream-based human action algorithm in Weizmann dataset

Fig. 19 Comparison of two-stream-based human action algorithm in TenthLab dataset

Fig. 20 Confusion matrix of the KTH dataset (rows: actual action, columns: predicted action; values in %)

                 walking   hand waving   running   boxing   jogging   hand clapping
walking            100         0            0         0        0           0
hand waving          0       100            0         0        0           0
running              0         0           96         0        4           0
boxing               0         0            0       100        0           0
jogging              0         0            3         0       97           0
hand clapping        0         0            0         0        0         100
Fig. 21 Confusion matrix of the UT-interaction dataset (rows: actual action, columns: predicted action; values in %)

                 handshaking   pushing   hugging   pointing   kicking   punching
handshaking          100          0          0          0         0          0
pushing                0         98          0          0         0          2
hugging                0          0        100          0         0          0
pointing               0          0          0        100         0          0
kicking                0          0          0          0       100          0
punching               0          4          0          0         0         96

Fig. 22 Confusion matrices of Weizmann dataset

The TenthLab dataset is more suitable for the actual scene: firstly, it used the overhead view angle to collect the video; secondly, the direction of the movement of the characters in the video is changed.

3.3 Results analysis

3.3.1 Experimental results

The last frame of each set of ten consecutive frames was fed to the trained Faster R-CNN model for testing, and the score obtained by the softmax layer of the Faster R-CNN model was combined with the score obtained by the softmax layer of the VGG-16 model.

For each class in the KTH, Weizmann, and UT-interaction datasets, there were 100 RGB images and 100 MHIs for testing; for each class in the TenthLab dataset, there were 40 RGB images and 40 MHIs for testing. Each MHI corresponded to the last frame of 10 consecutive RGB images in a video. In these matrices, rows represent the actual classes and columns represent the predicted classes. In Figs. 20, 21, 22, and 23, the diagonal values denote the rate of correctly recognized actions, and the off-diagonal values denote the rate of misrecognized actions.

The proposed algorithm showed strong applicability because not only could it recognize a single-human action, but it also achieved good recognition of human–human interaction. The proposed approach performed well in the recognition of walking, hand waving, boxing, and hand clapping in the KTH dataset; walking, jumping-jack, jump-forward-on-two-legs, jump-in-place-on-two-legs, gallop-side-ways,
wave-two-hands, wave-one-hand, and bending in the Weizmann dataset; handshaking, hugging, pointing, and kicking in the UT-interaction dataset; and wave1, wave2, hug, and kick in the TenthLab dataset. However, some actions were misrecognized because the features of these actions were similar in motion and shape. In the KTH dataset, the recognition accuracy for running was 96%, and 4% of frames were misclassified as jogging, while the recognition accuracy for jogging was 97%, and 3% of frames were misclassified as running. In the UT-interaction dataset, the recognition accuracy for pushing was 98%, and 2% of frames were misclassified as punching, while the recognition accuracy for punching was 96%, and 4% of frames were misclassified as pushing. Further, in the Weizmann dataset, the recognition accuracy for running was 96%, and 4% of frames were misclassified as skipping, while the recognition accuracy for skipping was 95%, and 5% of frames were misclassified as running. Lastly, in the TenthLab dataset, the recognition accuracy for walk was 97.5%, and 2.5% of frames were misclassified as run, while the recognition accuracy for run was 95%, and 5% of frames were misclassified as walk; the recognition accuracy for crouch was 95%, and 5% of frames were misclassified as bench, while the recognition accuracy for bench was 97.5%, and 2.5% of frames were misclassified as crouch; the recognition accuracy for hand was 97.5%, and 2.5% of frames were misclassified as push, while the recognition accuracy for push was 95%, and 5% of frames were misclassified as hand. Tables 1, 2, 3, and 4 provide the performance measures such as accuracy, precision, recall, and F-score for the various actions in the KTH dataset, the Weizmann dataset, the UT-interaction dataset, and the TenthLab dataset. The results prove that the proposed approach is robust in terms of human action recognition.

Table 1 Experimental evaluation of KTH dataset

Action          Accuracy (%)   Precision   Recall   F-score
Walking         100            1           1        1
Hand waving     100            1           1        1
Running         96             0.970       1        0.985
Boxing          100            1           1        1
Jogging         97             0.960       1        0.980
Hand clapping   100            1           1        1

Table 2 Experimental evaluation of Weizmann dataset

Action   Accuracy (%)   Precision   Recall   F-score
Walk     100            1           1        1
Run      96             0.950       1        0.974
Jump     100            1           1        1
Jack     100            1           1        1
Skip     95             0.960       1        0.980
Pjump    100            1           1        1
Side     100            1           1        1
Wave2    100            1           1        1
Wave1    100            1           1        1
Bend     100            1           1        1

Table 3 Experimental evaluation of UT-interaction dataset

Action        Accuracy (%)   Precision   Recall   F-score
Handshaking   100            1           1        1
Pushing       98             0.961       1        0.980
Hugging       100            1           1        1
Pointing      100            1           1        1
Kicking       100            1           1        1
Punching      96             0.980       1        0.990

To conduct a quantitative evaluation, we calculated an average recognition rate and compared the obtained results with the results of the state-of-the-art recognition approaches proposed by Qian [37], Xu [38], Chou [39], Ko [40], Wang [41], Vishwakarma [42], Sahoo [43], and Vishwakarma [44].
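For example, if the average recognition rate is taken as the unweighted mean of the per-class accuracies in Tables 1, 2, and 3 (the exact averaging scheme is an assumption here), it can be reproduced as follows:

    kth = [100, 100, 96, 100, 97, 100]                            # Table 1
    weizmann = [100, 96, 100, 100, 95, 100, 100, 100, 100, 100]   # Table 2
    ut_interaction = [100, 98, 100, 100, 100, 96]                 # Table 3

    for name, acc in [("KTH", kth), ("Weizmann", weizmann), ("UT-interaction", ut_interaction)]:
        print(name, round(sum(acc) / len(acc), 2))                # KTH -> 98.83, Weizmann -> 99.1, UT-interaction -> 99.0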
4. Chaudhry, R., Ravichandran, A., Hager, G.: Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 20–25 (2009)
5. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
6. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: IEEE International Conference on Pattern Recognition, pp. 23–26 (2004)
7. Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 43–48 (2009)
8. Hu, X., Huang, Y., Duan, Q., et al.: Abnormal event detection in crowded scenes using histogram of oriented contextual gradient descriptor. EURASIP J. Adv. Signal Process. 2018(1), 54 (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems (2012)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
11. Szegedy, C., Liu, W., Jia, Y.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
12. He, K., Zhang, X., Ren, S.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
13. Ren, S., He, K., Girshick, R.: Faster R-CNN: towards real-time object detection with region proposal networks. In: International Conference on Neural Information Processing Systems (2015)
14. Redmon, J., Divvala, S., Girshick, R.: You only look once: unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
15. Liu, W., Anguelov, D., Erhan, D.: SSD: single shot multibox detector. In: European Conference on Computer Vision (2016)
16. Li, C., Wang, P., Wang, S.: Skeleton-based action recognition using LSTM and CNN. In: IEEE International Conference on Multimedia and Expo Workshops (2017)
17. Donahue, J., Hendricks, L.A., Guadarrama, S.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
18. Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
19. Wang, X., Gao, L., Song, J.: Beyond frame-level CNN: saliency-aware 3D CNN with LSTM for video action recognition. IEEE Signal Process. Lett. 99, 1 (2016)
20. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Conference and Workshop on Neural Information Processing Systems (2014)
21. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
22. Wang, L., Xiong, Y., Wang, Z.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision (2016)
23. Chen, J., Wu, J., Konrad, J.: Semi-coupled two-stream fusion convnets for action recognition at extremely low resolutions. In: IEEE Winter Conference on Applications of Computer Vision (2017)
24. Wang, X., Gao, L., Wang, P.: Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 20, 634–644 (2018)
25. Zhao, R., Ali, H., Smagt, P.V.D.: Two-stream RNN/CNN for action recognition in 3D videos. In: IEEE International Conference on Intelligent Robots and Systems (2017)
26. Afrasiabi, M., Khotanlou, H., Mansoorizadeh, M.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. (2019). https://doi.org/10.1007/s00371-019-01722-6
27. Imran, J., Raman, B.: Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition. J. Ambient Intell. Hum. Comput. 11, 189–208 (2020)
28. Yi, Y., Li, A., Zhou, X.F.: Human action recognition based on action relevance weighted encoding. Signal Process. Image Commun. 80, 115640 (2020)
29. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
30. Acuna, D., Ling, H., Kar, A.: Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In: IEEE Conference on Computer Vision and Pattern Recognition (2018)
31. Castrejon, L., Kundu, K., Urtasun, R.: Annotating object instances with a Polygon-RNN. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
32. Siswantoro, J., Prabuwono, A.S., Abdullah, A.: A linear model based on Kalman filter for improving neural network classification performance. Expert Syst. Appl. 49, 112–122 (2016)
33. Duin, R.P.W.: The combining classifier: to train or not to train? In: International Conference on Pattern Recognition (2002)
34. The KTH Dataset: http://www.nada.kth.se/cvap/actions/. Accessed 18 Jan 2005
35. The Weizmann Dataset: http://www.wisdom.weizmann.ac.il/. Accessed 24 Dec 2007
36. The UT-Interaction Dataset: http://cvrc.ece.utexas.edu/SDHA2010 (2007)
37. Qian, H., Zhou, J., Mao, Y.: Recognizing human actions from silhouettes described with weighted distance metric and kinematics. Multimed. Tools Appl. 76, 21889–21910 (2017)
38. Xu, K., Jiang, X., Sun, T.: Two-stream dictionary learning architecture for action recognition. IEEE Trans. Circuits Syst. Video Technol. 27, 567–576 (2017)
39. Chou, K.P., Prasad, M., Wu, D.: Robust feature-based automated multi-view human action recognition system. IEEE Access 6, 1 (2018)
40. Ko, K.E., Sim, K.B.: Deep convolutional framework for abnormal activities recognition in a smart surveillance system. Eng. Appl. Artif. Intell. 67, 226–234 (2018)
41. Wang, J., Zhou, S.C., Xia, L.M.: Human interaction recognition based on sparse representation of feature covariance matrices. J. Central South Univ. 25(2), 304–314 (2018)
42. Vishwakarma, D.K., Dhiman, C.: A unified model for human activity recognition using spatial distribution of gradients and difference of Gaussian kernel. Vis. Comput. 35, 1595–1613 (2019)
43. Sahoo, P.S., Ari, S.: On an algorithm for human action recognition. Expert Syst. Appl. 115, 524–534 (2019)
44. Vishwakarma, D.K.: A twofold transformation model for human action recognition using decisive pose. Cognit. Syst. Res. 61, 1–13 (2020)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.