
sensors

Article
Human Interaction Classification in Sliding Video Windows
Using Skeleton Data Tracking and Feature Extraction †
Sebastian Puchała, Włodzimierz Kasprzak * and Paweł Piwowarski

Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19,
00-665 Warszawa, Poland; sebastian.puchala.stud@pw.edu.pl (S.P.); pawel@piwowarski.com.pl (P.P.)
* Correspondence: wlodzimierz.kasprzak@pw.edu.pl
† This paper is an extended version of our paper published in Puchała, S.; Kasprzak, W.; Piwowarski, P. Feature
engineering techniques for skeleton-based two-person interaction classification in video. In Proceedings of
the 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore,
2022; pp. 66–71.

Abstract: A “long short-term memory” (LSTM)-based human activity classifier is presented for
skeleton data estimated in video frames. A strong feature engineering step precedes the deep neural
network processing. The video was analyzed in short-time chunks created by a sliding window. A
fixed number of video frames was selected for every chunk and human skeletons were estimated
using dedicated software, such as OpenPose or HRNet. The skeleton data for a given window
were collected, analyzed, and eventually corrected. A knowledge-aware feature extraction from the
corrected skeletons was performed. A deep network model was trained and applied for two-person
interaction classification. Three network architectures were developed—single-, double- and triple-
channel LSTM networks—and were experimentally evaluated on the interaction subset of the ”NTU
RGB+D” data set. The most efficient model achieved an interaction classification accuracy of 96%.
This performance was compared with the best reported solutions for this set, based on “adaptive
graph convolutional networks” (AGCN) and “3D convolutional networks” (e.g., OpenConv3D). The
sliding-window strategy was cross-validated on the ”UT-Interaction” data set, containing long video
clips with many changing interactions. We concluded that a two-step approach to skeleton-based
human activity classification (a skeleton feature engineering step followed by a deep neural network
model) represents a practical tradeoff between accuracy and computational complexity, due to an
early correction of imperfect skeleton data and a knowledge-aware extraction of relational
features from the skeletons.

Keywords: human interaction videos; LSTM; preliminary skeleton features; skeleton tracking; sliding
window; many-interaction videos

Academic Editor: Eui Chul Lee
Received: 20 May 2023; Revised: 30 June 2023; Accepted: 7 July 2023; Published: 10 July 2023

1. Introduction

Human activity recognition in image sequences and video has lately been a hot
research topic in the computer vision, multimedia, and machine learning communities.
Two-person interactions constitute a specific category of human activities. Currently,
the best performing solutions are based on deep learning techniques, in particular on
deep neural networks (DNN), such as CNN (convolutional neural networks), GCN (graph
convolutional networks), or LSTM (long short-term memory networks) [1–4]. Practical
applications of the related technology are expected in video surveillance, robotics, or
content-based video filtering.
Human activity recognition in video can be divided into two main categories: applying
the activity recognition method directly to video data [5] or first performing a human
pose estimation (i.e., skeleton detection) in every frame of the sequence [6]. Nowadays,
2-dimensional (2D) or 3-dimensional (3D) human skeleton representations of human-populated
image regions are generated sufficiently reliably, even with the support of
specialized devices, such as the Microsoft Kinect. Some popular solutions to human skele-
ton estimation (i.e., detection and localization) in images can be mentioned: OpenPose [7],
DeepPose [8], and DeeperCut [9]. There are three fundamental architectures that have
been employed as backbones in human pose estimation research: AlexNet (e.g., in the
DeepPose model), the Visual Geometry Group network (VGG) (e.g., in OpenPose), and the
Residual Neural Network (ResNet) (e.g., in DeeperCut). In early solutions, hand-designed
features, such as edges, contours, the Scale-Invariant Feature Transform (SIFT), and the
Histogram of Oriented Gradients (HOG), were usually used for the detection and
localization of human body parts or key points in the image [10]. More recently, deep
neural network-based solutions were successfully proposed [4], as they have the capability
to automatically learn rich semantic and discriminative features. Initially, Multi-layer
Perceptrons (MLP) and LSTM models were explored but, currently, Convolutional Neural
Networks (CNN) and Graph CNNs [11] dominate the research. CNNs can learn both
spatial and temporal information from signals and can effectively model scale-invariant
features as well.
In a recent work [12], we proposed knowledge-aware feature extraction from skeleton
data. As relational features are mostly created from skeletons, this allowed us to focus
subsequently on the temporal aspect and to use a single-channel LSTM network instead of
the often-proposed CNNs and GCNs. In this work, two novel issues were studied. First,
various multi-stream networks (single-, double- and triple-channel networks) with LSTM
layers were proposed, performing feature processing and classification. This led to new
findings and increased the classification accuracy. The second issue was the implementation
of a sliding window technique to process longer time video clips, containing many different
activities. This will allow the development of different strategies for the overall classification
of a video clip.
There are four remaining sections of this work. Section 2 describes recent approaches
to human-activity classification. Our solution is presented in Section 3. In Section 4,
experiments are described that verify different network architectures when processing
three different feature sets. All models were learned and evaluated on the interaction subset
of the NTU RGB+D data set [1]. The models learned on the main data set and the sliding
window strategy were also cross-validated on the UT-Interaction data set [13]. Finally,
in Section 5, we summarize our results.

2. Related Work
The recognition of human activities in video has been a hot research topic for the last
fifteen years. Typically, human activity recognition in images and video first requires the
detection of human body parts or key-points of a human skeleton. The skeleton-based
methods compensate for some of the drawbacks of vision-based methods, such as assuring
the privacy of persons and reducing the sensitivity to scene lighting.
Most of the research is based on the use of artificial neural networks. However, more
classical approaches have also been tried, such as the SVM (e.g., [14,15]). Yan et al. [16]
used multiple features, such as a “bag of interest points” and a “histogram of interest
point locations”, to represent human actions. They proposed a combination of classifiers in
which AdaBoost and “sparse representation” were used as basic algorithms. In the work of
Vemulapalli et al. [17], human actions were modeled as curves in a Lie group of Euclidean
distances. The classification process uses a combination of dynamic time warping, Fourier
temporal pyramid representation, and linear “support vector machine” (SVM).
Thanks to higher quality results, artificial neural networks are replacing other methods.
Thus, the most recently conducted research in human activity classification differs only in
terms of the proposed network architecture. Networks based on the LSTM architecture or
a modification of this architecture (a ST-LSTM network with trust gates) were proposed
by Liu et al. [18] and Shahroudy et al. [1]. They introduced so-called “Trust Gates” for
controlling the content of an LSTM cell and designed an LSTM network capable of capturing
spatial and temporal dependencies at the same time (denoted as ST-LSTM). The task

performed by the gates is to assess the reliability of the obtained joint positions based on
the temporal and spatial context. This context was based on the position of the examined
junction in the previous moment (temporal context) and the position of the previously
studied junction in the present moment (spatial context). This behavior is intended to help
network memory cells assess which locations should not be remembered and which ones
should be kept in memory. The authors also drew attention to the importance of capturing
default spatial dependencies already in the skeleton data. They experimented with different
joint set-to-sequence mappings. For example, they mapped the skeleton data into a tree
representation, duplicating joints when necessary to keep spatial neighborhood relation
and performed a tree traversal to obtain a sequence of joints. Such an enhancement of the
input data allowed an increase of the classification accuracy by several percent.
The work [19] introduced the idea of applying convolutional filters to pseudo-images
in the context of action classification. A pseudo-image is a map (a 2D matrix) of feature
vectors from successive time points, aligned along the time axis. Thanks to these two
dimensions, the convolutional filters find local relationships of a combined time–space
nature. Liang et al. [20] extended this idea to a multi-stream network with three stages.
They used three types of features, extracted from the skeleton data: positions of joints,
motions of joints, and orientations of line segments between joints. Every feature type was
processed independently in its own stream but after every stage the results were exchanged
between streams.
Graph convolutional networks are currently considered a natural approach to the
action (and interaction) recognition problem. They are able to achieve high quality results
with only modest requirements of computational resources [21,22].
One of the best performances on the NTU RGB+D interaction data set is reported in
the work [3]. Its main contribution is a powerful two-stream network with three stages,
called the "Interaction Relational Network" (IRN). Its input is the set of basic relations
between the joints of two interacting persons, tracked over the length of the image sequence;
the network then performs further encoding, decoding, and an LSTM-based final classification.
In our view, the most important contribution is the initial preparation of well-structured
pair-wise input relations that contain both distance and motion information between joints,
where the first stream processes within-person relations, while the second one processes
between-person relations. The version with a final LSTM, called the IRN-LSTM network,
is the highest-quality model. It processes a dense frame sequence, so all frames of the
video clip can be used. In ordinary versions of the IRN network, a simple densely connected
classifier is used instead of an LSTM and a sparse sequence of frames is processed.
Another recent development is the pre-processing of the skeleton data to extract
different types of information (e.g., information on joints and bones and their relations
in space and time). Such data streams are first separately processed by so-called multi-stream
neural networks and later fused into a result. Examples of such solutions are the
“Two-Stream Adaptive Graph Convolutional Network” (2S-AGCN) and the “Multistream
Adaptive Graph Convolutional Network” (AAGCN), proposed by Shi et al. [23,24].
The current best results for small-size networks were reported by Zhu et al. [25],
where two new modules were proposed for a baseline 2S-AGCN network. The first
module extends the idea of modelling relational links between two skeletons by a spatial-
temporal graph to a “Relational Adjacency Matrix (RAM)”. The second novelty is a
processing module, called “Dyadic Relational Graph Convolution Block”, which com-
bines the RAM with spatial graph convolution and temporal convolution to generate new
spatial-temporal features.
Very recently, exceptionally high performance was reported when using networks
with 3D convolutional layers, applied to data tensors that constitute skeleton "heatmaps"
(i.e., preprocessed image data) [26]. The approach, called PoseConv3D, can even be topped
when fused with the processing of ordinary RGB-data streams [27]. Obviously, this requires
the creation of a heavy network and produces high computational load.

From the analysis of the recent successful solutions, we drew three main conclusions
and motivation for our research work:
1. Using many streams of skeleton data (i.e., joints, branches, spatial and temporal
interrelations) was proved to provide essential and meaningful information for activ-
ity classification (e.g., interaction relational networks, two- and multi-stream DNN
architectures);
2. The use of light-weight solutions is preferred in practice, achieved by using graph
CNNs combined with ordinary CNNs and using CNNs with 2-D kernels instead
of 3-D CNNs, although heavy-weight solutions, such as 3D CNNs, are topping the
performance rankings;
3. In practice, a video clip (or a particular time-window), apparently containing a human
action or interaction, is reduced to a sparse frame sequence, although using all the
available frames improves the performance.

3. The Approach
3.1. Structure
A video clip may contain many activities of the same or different persons. Thus,
the video is analyzed in short-time chunks created by a sliding window. A fixed number of
video frames is selected from every data chunk for further analysis. As shown in Figure 1,
the proposed solution consists of the following main processing stages:
1. Sliding window and key-frame selection: a fixed number of frames, selected from a
time-window of frames, is assumed to be analyzed further;
2. Skeleton detection and estimation: a pose estimator (e.g., the OpenPose net [7]) is
applied to detect and localize human skeletons and their 2D joints in every RGB image
(selected video frame) of an image sequence;
3. Skeleton tracking and correcting: two “main” skeletons are tracked in the image
sequence; low-certainty joints and missing joints are replaced by interpolated data;
4. Feature extraction: features are created from the two streams of joints; we studied
three types of relational features, besides the raw skeleton data;
5. Neural network models: alternative LSTM-based models are trained and applied for
action and interaction classification (please note that the topic of this paper is limited
to the interaction classification case).

Figure 1. General structure of our approach.

3.2. Sliding Window and Key-Frame Selection


A basic design question is the generation of image (frame) sequences from a video
clip. Videos can be of different lengths; the duration of actions and frame rates can differ.
Theoretically, Recurrent Neural Networks (RNN) can be adopted to operate on a variable-length
input. However, this is not recommended (such networks are more difficult to learn).
Thus, we decided to use image sequences of fixed length, extracted by a sliding window
(Figure 2a). With this approach, many sub-sequences may be created for an input video
in the testing and active work mode. It must be noted that, for training a neural network
model, short-time video clips are used, converted to single windows, as a single reference
label is assigned to every sample clip.

Figure 2. Sparse frame sequence created by (a) the current location of a sliding window, with
adjustable size and window interlace ratio; and (b) selection of a fixed number of frames.

The key issue is to choose the right length of the sliding window. If a short-time
video clip is processed, which contains one activity type only, the window should cover
nearly the entire clip. When a longer-time video may contain many activity instances,
the window should be able to cover a single activity only. We decided to operate with a
window length of 2.133 s, which corresponds to a number M of 64 frames. As the labeled
training samples with single interactions are typically of length 2.5–3 s, the selected window
length should satisfy both above requirements.
The number of key frames N in a window must be consistent with the input size
of the trained or applied neural network model. In the literature dedicated to this topic,
typically N is chosen in the range from 8 to 32, or all frames of a video clip are considered
(limited only by the assumed window size). By choosing N = 32 keyframes, we achieved a
fair comparison with recent results of other researchers, using the same amount of information,
and also had a chance to process the video in real time. With experiments, we confirmed
that, with a growing number of keyframes, the classification accuracy is steadily improving.
After fixing the window length and the number of key frames in the sliding window,
another two parameters must be selected: the interlace ratio of (or delay ∆M between) two
consecutive windows and the frame rate (or delay ∆N between consecutive key frames) in
a window (Figure 2b).
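To make the windowing concrete, the following minimal Python sketch generates sliding windows and picks N evenly spaced key frames per window; the function names, the uniform key-frame spacing, and the default parameter values (M = 64, N = 32, 50% interlace) are illustrative assumptions rather than the exact implementation used in our experiments.

```python
def sliding_windows(num_frames, window_len=64, interlace=0.5):
    """Yield (start, end) frame index ranges of consecutive windows.

    `interlace` is the overlap ratio between two consecutive windows,
    so the window step equals window_len * (1 - interlace)."""
    step = max(1, int(window_len * (1.0 - interlace)))
    for start in range(0, max(1, num_frames - window_len + 1), step):
        yield start, start + window_len


def select_key_frames(start, end, n_key=32):
    """Pick n_key frame indices evenly spread over the window [start, end)."""
    span = end - start
    return [start + (i * span) // n_key for i in range(n_key)]


# Example: a 20 s clip at 30 fps (600 frames), 64-frame windows, 50% interlace.
for w_start, w_end in sliding_windows(600):
    key_frames = select_key_frames(w_start, w_end)
    # key_frames indexes the N = 32 frames passed on to pose estimation
```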
3.3. Skeleton Detection and Estimation

In the paper [7], a multi-person 2D pose estimation architecture was proposed, based on
"Part Affinity Fields" (PAFs). The work introduced an explicit nonparametric representation
of the key point association, which encodes both position and orientation of the human
limbs. The designed architecture can learn both human key point detection and association
using heatmaps of human key-points and part affinity fields, respectively. It iteratively
predicts part affinity fields and part detection confidence maps. The part affinity fields
encode part-to-part association, including part locations and orientations. In the iterative
architecture, both PAFs and confidence maps are iteratively refined over successive stages
with intermediate supervision at each stage. Subsequently, a greedy parsing algorithm was
employed to effectively parse human poses. The work ended up releasing the OpenPose
library, the first real-time system for multi-person 2D pose estimation.
In our research, we alternatively used OpenPose or HRNet. The core block of OpenPose
is the "body_25" model, containing 25 characteristic points (joints) located in the image.

Every joint oi , estimated by the OpenPose system, is described in the following format:
oi = ( xi , yi , ci ), where ( xi , yi ) are absolute pixel coordinates of junction i in the image and ci
is the certainty of joint detection—a value from the range [0, 1].
Thus, for each frame t and skeleton p, we get a vector of raw characteristics vtp , which
has 75 elements.
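As a small illustration of this raw representation, the sketch below packs the 25 joints of one skeleton in one frame into the 75-element vector; the tuple-based input format is a simplifying assumption and not the literal OpenPose output structure.

```python
import numpy as np

def raw_skeleton_vector(joints):
    """joints: list of 25 tuples (x_i, y_i, c_i) for one skeleton in one frame.
    Returns the flat 75-element vector [x_0, y_0, c_0, ..., x_24, y_24, c_24]."""
    assert len(joints) == 25
    return np.asarray(joints, dtype=np.float32).reshape(-1)  # shape (75,)

# Example with dummy data for a single detected skeleton:
v_tp = raw_skeleton_vector([(0.5, 0.4, 0.9)] * 25)
print(v_tp.shape)  # (75,)
```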

3.4. Skeleton Tracking and Correcting


In cases where more than two skeletons in an image are returned by OpenPose/HRNet,
the two largest skeletons are selected first and next they are tracked in the remaining
frames. We focused on the first 15 joints of every skeleton—a conclusion from a statistical
evaluation of the detected skeletons (Figure 3).

Figure 3. The 15 reliable joints (marked from 0 to 14) out of 25 of the OpenPose's "body_25" skeleton
model, with the size normalization distance between joints o1 and o8.

We also canceled some joint data that were uncertain. Those values whose certainty
value is $c_i < 0.3$ were removed and replaced by a special mark representing "not a value".
Finally, the absolute image coordinates were transformed into relative coordinates by
dividing them by the corresponding image size.
The location data for joints received from OpenPose are not always perfect. It happens
that some joints are not detected, while some others are detected with low certainty, and we
removed them. Fortunately, due to the sequential nature of the available data, a variety
of techniques can be used to fill these gaps. Let $v_i$ be a series of N positions $o_i^t$ of joint i in
time: $v_i = [o_i^1, o_i^2, \ldots, o_i^N]$. The following techniques were applied to improve the quality of
skeleton data:
1. Problem: one position $o_i^t$ is missed; solution: take the average of its temporal neighbors, $o_i^t = 0.5(o_i^{t-1} + o_i^{t+1})$;
2. Problem: $o_i$ is missed k consecutive times, i.e., from t to t + k − 1; solution: take an interpolation of the values $o_i^{t-1}$ and $o_i^{t+k}$;
3. Problem: $o_i$ is missed the first k times; solution: set the first k values of $o_i$ to $o_i^{k+1}$;
4. Problem: $o_i$ is missed the last k times; solution: set the last k values of $o_i$ to $o_i^{N-k}$;
5. Problem: $o_i$ is completely missed; solution: set it by default, relative to known joints.

The result of tracking (up to) two sets of skeleton joints in N frames can be represented
as a 2D map of N × 15 × 2 entries:
 
$$V^N = \begin{bmatrix} v_{p1}^1 & v_{p2}^1 \\ v_{p1}^2 & v_{p2}^2 \\ \cdots & \cdots \\ v_{p1}^N & v_{p2}^N \end{bmatrix} \qquad (1)$$

where every $v_p^i = [o_1^i, o_2^i, \ldots, o_{15}^i]$ is a vector of the 15 joints, represented by location
coordinates, of skeleton p in frame i.
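A minimal NumPy sketch of the gap-filling rules of Section 3.4 is given below; it assumes that a missing joint position is encoded as NaN in a per-joint track of shape (N, 2), and it covers rules 1–4 (interior gaps by linear interpolation, leading/trailing gaps by padding with the nearest known value), while rule 5 has to be handled from the positions of the other joints.

```python
import numpy as np

def fill_joint_track(track):
    """track: array of shape (N, 2) with (x, y) per frame; missing entries are NaN.
    Interior gaps are linearly interpolated; leading/trailing gaps are padded
    with the nearest known value."""
    track = track.copy()
    n = track.shape[0]
    t = np.arange(n)
    for dim in range(2):
        col = track[:, dim]
        known = ~np.isnan(col)
        if not known.any():
            continue  # completely missing joint: must be set from other joints (rule 5)
        # np.interp pads with the first/last known value outside the known range
        track[:, dim] = np.interp(t, t[known], col[known])
    return track

# Example: a joint missing in frames 2-3 and in the last frame.
xy = np.array([[10, 5], [11, 6], [np.nan, np.nan],
               [np.nan, np.nan], [14, 9], [np.nan, np.nan]], dtype=float)
print(fill_joint_track(xy))
```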

3.5. Feature Extraction


Unfortunately, such a strict representation of junction data, as in Equation (1), has
obvious disadvantages—the data are not invariant with respect to the position in the image
and do not explicitly represent relationships between both skeletons. First, the coordinates
of the joints may randomly change but still represent the same semantic meaning (i.e., an ac-
tion stage). The second problem is that the distance of points during interaction depends
on the scale of the presentation of the scene in the image and the size of people. Thirdly,
the point representation does not explicitly model other important relationships between
silhouettes such as relative orientation and movement. Of course, a deep network would
also be able to learn such dependencies but then we unnecessarily lose computing resources
and deteriorate the quality of predictions for learning data transformations, which can
be easily performed analytically. Therefore, three types of mutual representation of both
skeletons were developed, which reduce the disadvantages of the “raw” representation
of joints:
1. Limb-angle features—in the further part of the work, also called “LA features”;
2. Polar dense features (PD);
3. Polar sparse features (PS).

3.5.1. Size Normalization


The invariance of features with respect to the size of the skeleton in the image was
obtained by normalizing the coordinates of the junction points with the section between
the neck o1 and the center of the hips o8 (Figure 3). This distance is most often correctly
detected by OpenPose. Secondly, it does not depend on the angle of human position in
relation to the camera. The only exception is when the person’s spine is positioned along
the depth axis of the camera system (this case does not occur in the data sets used). After
calculating the length of the segment between o1 and o8, it becomes a normalization value for all
other measured distances in the feature sets. This distance is measured only for the first
person and both persons are normalized by it.
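One possible way to apply this normalization is sketched below; the joint indexing (1 = neck, 8 = mid-hip, following the "body_25" order) is taken from Figure 3, while dividing the coordinates of both skeletons by the first person's neck-to-hip length is an equivalent, illustrative way of normalizing all subsequently computed distances.

```python
import numpy as np

def normalize_skeletons(person_a, person_b, neck=1, mid_hip=8):
    """person_a, person_b: arrays of shape (15, 2) with 2D joint coordinates.
    All coordinates (and therefore all later distances) are divided by the
    neck-to-mid-hip length of the first person, as described in Section 3.5.1."""
    scale = float(np.linalg.norm(person_a[neck] - person_a[mid_hip]))
    scale = max(scale, 1e-6)  # guard against a degenerate skeleton
    return person_a / scale, person_b / scale, scale
```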

3.5.2. LA Features
For every skeleton a and b the following are obtained (Figure 4): the lengths of 14 line
segments (called “limbs”) f a , fb (distances between two neighbor joints) and 13 orientation
changes (angles) r a , rb between two neighbor segments (Figure 4). Additionally, distances
d( j) between pairs of corresponding joints (the same index j) of two skeletons a and b are
also considered (15 distances).
Thus, for every frame, 69 features are defined (= (14 + 13) · 2 + 15). The N · 69 features
are split into two maps, one for each skeleton, $F_a^N$ and $F_b^N$, with the common part (15 distances
$d(j)^t$ for every frame t) provided in both maps:

$$F_a^N = \begin{bmatrix} f_a^1 & r_a^1 & d^1 \\ f_a^2 & r_a^2 & d^2 \\ \cdots & \cdots & \cdots \\ f_a^N & r_a^N & d^N \end{bmatrix} \qquad (2)$$

$$F_b^N = \begin{bmatrix} f_b^1 & r_b^1 & d^1 \\ f_b^2 & r_b^2 & d^2 \\ \cdots & \cdots & \cdots \\ f_b^N & r_b^N & d^N \end{bmatrix} \qquad (3)$$
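A per-frame sketch of the LA feature computation is shown below; the specific limb pairs and angle pairs over the 15 retained joints are illustrative assumptions (the paper only fixes their counts: 14 limbs, 13 angles, and 15 inter-skeleton distances).

```python
import numpy as np

# Assumed parent-child pairs over the first 15 "body_25" joints: 14 "limbs".
LIMBS = [(1, 0), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (10, 11), (8, 12), (12, 13), (13, 14)]
# Assumed 13 pairs of neighboring limbs (sharing a joint) for the angle features.
ANGLE_PAIRS = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7),
               (7, 8), (8, 9), (9, 10), (7, 11), (11, 12), (12, 13)]

def la_features(sk_a, sk_b):
    """sk_a, sk_b: size-normalized skeletons of shape (15, 2).
    Returns the 69-element LA vector for one frame: 14 limb lengths (f) and
    13 angles (r) per skeleton, plus 15 distances (d) between corresponding joints."""
    def limbs_and_angles(sk):
        vecs = np.array([sk[j] - sk[i] for i, j in LIMBS])   # 14 limb vectors
        lengths = np.linalg.norm(vecs, axis=1)                # f
        angles = []
        for i, j in ANGLE_PAIRS:                              # r: angle between two
            a, b = vecs[i], vecs[j]                           # neighboring limbs
            cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6)
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        return lengths, np.array(angles)

    fa, ra = limbs_and_angles(sk_a)
    fb, rb = limbs_and_angles(sk_b)
    d = np.linalg.norm(sk_a - sk_b, axis=1)                   # 15 joint-to-joint distances
    return np.concatenate([fa, ra, fb, rb, d])                # shape (69,)
```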

Figure 4. Illustration of the LA features: 14 line segments (called "limbs") of a skeleton and
13 orientation changes between neighbor segments.

3.5.3. PD Features
We define a vector u between the center points of the $o_1$–$o_8$ segments of both
skeletons (Figure 5). This vector will be used to normalize the distances between joints
of different skeletons and to make relative the orientation of lines connecting the joints of
different skeletons. The PD feature set includes vectors connecting every joint of the first
skeleton (a) with every joint of the second skeleton (b) and vice versa—skeleton 2 with
skeleton 1 (Figure 5). Every vector is represented in polar form by its magnitude $q_{a,j}$, $q_{b,j}$
(normalized by the length of u) and by its relative orientation $r_{a,j}$, $r_{b,j}$ (relative to the
orientation of vector u). Thus, for every frame, there are 900 features defined
(= (225 vector magnitudes + 225 orientations) · 2). The N · 900 features are split into two maps,
$Q_a^N$ and $Q_b^N$, one for each skeleton:

$$Q_a^N = \begin{bmatrix} q_a^1 & r_a^1 \\ q_a^2 & r_a^2 \\ \cdots & \cdots \\ q_a^N & r_a^N \end{bmatrix} \qquad (4)$$

$$Q_b^N = \begin{bmatrix} q_b^1 & r_b^1 \\ q_b^2 & r_b^2 \\ \cdots & \cdots \\ q_b^N & r_b^N \end{bmatrix} \qquad (5)$$
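The following per-frame sketch illustrates the PD construction under the assumptions already made above (15 joints per skeleton, u drawn between the centers of the two spinal segments); the vectorized NumPy layout is our own choice.

```python
import numpy as np

def pd_features(sk_a, sk_b, neck=1, mid_hip=8):
    """Polar dense (PD) features for one frame.
    sk_a, sk_b: arrays of shape (15, 2). Every joint-to-joint vector between the
    two skeletons is expressed in polar form (magnitude, orientation) relative to
    the inter-person vector u. Returns two 450-element maps (Q_a, Q_b)."""
    center = lambda sk: 0.5 * (sk[neck] + sk[mid_hip])
    u = center(sk_b) - center(sk_a)
    u_len = np.linalg.norm(u) + 1e-6
    u_ang = np.arctan2(u[1], u[0])

    def polar_map(src, dst):
        vecs = dst[None, :, :] - src[:, None, :]               # (15, 15, 2) vectors
        mags = np.linalg.norm(vecs, axis=2) / u_len             # magnitudes / |u|
        angs = np.arctan2(vecs[..., 1], vecs[..., 0]) - u_ang   # orientations w.r.t. u
        return np.concatenate([mags.ravel(), angs.ravel()])     # 225 + 225 = 450

    return polar_map(sk_a, sk_b), polar_map(sk_b, sk_a)
```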

Figure 5. Illustration of the polar dense (PD) features: vectors between every pair of joints from
different skeletons are computed and their lengths and orientations are normalized with respect to
the vector u, drawn between centers of spinal segments (between joints 1 and 8) of both skeletons;
S is the center point of vector u.

3.5.4. PS Features

Let us define the center point S of a vector u (Figure 5). Now, 15 vectors are defined for
every skeleton. Every vector connects the point S with a joint of skeleton 1 or 2 (Figure 6).
Again, as for PD features, every vector is represented in polar form by two features—
normalized magnitude $h_{a,j}$, $h_{b,j}$ and relative orientation $r_{a,j}$, $r_{b,j}$ (both magnitude and
orientation are normalized with respect to u). Thus, for every frame, there are only 60 features
defined (= (15 + 15) · 2). The N · 60 features are split into two maps, $H_a^N$ and $H_b^N$, one
for each skeleton:

$$H_a^N = \begin{bmatrix} h_a^1 & r_a^1 \\ h_a^2 & r_a^2 \\ \cdots & \cdots \\ h_a^N & r_a^N \end{bmatrix} \qquad (6)$$

$$H_b^N = \begin{bmatrix} h_b^1 & r_b^1 \\ h_b^2 & r_b^2 \\ \cdots & \cdots \\ h_b^N & r_b^N \end{bmatrix} \qquad (7)$$

3.6. LSTM Models
3.6.1. Single Channel LSTM

The "single channel" LSTM (SC-LSTM) has three versions corresponding to the three
types of features (LA, PD or PS). Thus, we call them SC-LSTM-LA, SC-LSTM-PD, and
SC-LSTM-PS, appropriately. We also considered a baseline feature version, SC-LSTM-RAW,
which processes the raw skeleton joints obtained by OpenPose. These versions differ by
the input layer only, as there are different numbers of features considered. The network
configuration consists of two LSTM layers, interleaved by two dropout layers, and the final
two dense layers (Figure 7). In the SC-LSTM-PS version, there are 3,359,931 trainable parameters.
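A compact sketch of such a single-channel model in Keras is given below; the layer widths, dropout rate, and optimizer are illustrative assumptions (only the overall layer pattern—two LSTM layers interleaved with dropout, followed by two dense layers—follows the description above).

```python
import tensorflow as tf

def build_sc_lstm(num_frames=32, num_features=60, num_classes=11):
    """Single-channel LSTM: LSTM -> Dropout -> LSTM -> Dropout -> Dense -> Dense.
    num_features=60 corresponds to the PS variant; LA uses 69 and PD uses 900."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(512, return_sequences=True,
                             input_shape=(num_frames, num_features)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(512),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_sc_lstm()  # e.g., the SC-LSTM-PS configuration
```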

Figure 6. Illustration of the polar sparse (PS) features: vectors between the center point S of vector u
and every skeleton joint are computed and their lengths and orientations are normalized with respect
to vector u, which connects the centers of spinal segments (between joints 1 and 8) of every skeleton.

3.6.2. Double Channel LSTM

The "double channel" LSTM (DC-LSTM) has three versions corresponding to the three
types of features (LA, PD, or PS). Thus, we call them DC-LSTM-LA, DC-LSTM-PD, and
DC-LSTM-PS, appropriately. These versions differ in terms of the input layer only, as there
are different numbers of features considered. The network configuration consists of two
independent LSTM streams, a concatenation layer and two dense layers. Every LSTM
stream has two LSTM layers interleaved by two dropout layers (Figure 8). The skeleton
features are separated into two subsets, each corresponding to one skeleton. In the case of
LA features, there is also a common part for both skeletons (15 distances between joints).
These common data are added to the input of every stream. The DC-LSTM-PS network
consists of 6,612,155 trainable parameters.

3.6.3. Triple Channel LSTM

The "triple channel" LSTM (TC-LSTM-LA) comes in one version only—for the LA
features—as the other two feature types (PD and PS) have strictly two data streams only. The
network configuration consists of three independent LSTM streams, a concatenation layer,
and two dense layers. Every LSTM stream has two LSTM layers interleaved by two dropout
layers (Figure 9). Two of the LSTM streams process the feature subsets of every skeleton
separately, while the third one processes the common feature subset (15 distances between
joints). The TC-LSTM-LA network has 9,761,979 parameters.
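The multi-stream idea can be sketched with the Keras functional API as below; the stream width, dropout rate, and the exact per-stream feature split for the LA case are illustrative assumptions, with only the three-stream layout (one stream per skeleton plus one for the 15 shared distances) taken from the description above.

```python
import tensorflow as tf

def lstm_stream(inputs, units=256):
    """One stream: two LSTM layers interleaved with dropout."""
    x = tf.keras.layers.LSTM(units, return_sequences=True)(inputs)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.LSTM(units)(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    return x

def build_tc_lstm_la(num_frames=32, per_person=27, common=15, num_classes=11):
    """Triple-channel model for LA features: one stream per skeleton
    (14 limb lengths + 13 angles = 27 features) and a third stream for the
    15 shared joint-to-joint distances."""
    in_a = tf.keras.layers.Input(shape=(num_frames, per_person))
    in_b = tf.keras.layers.Input(shape=(num_frames, per_person))
    in_d = tf.keras.layers.Input(shape=(num_frames, common))
    merged = tf.keras.layers.Concatenate()(
        [lstm_stream(in_a), lstm_stream(in_b), lstm_stream(in_d)])
    x = tf.keras.layers.Dense(256, activation="relu")(merged)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=[in_a, in_b, in_d], outputs=out)
```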

Figure 7. Illustration of the SC-LSTM-PS network—the three versions of SC-LSTM differ only in
terms of the input layer size.

Figure 8. Illustration of the DC-LSTM-PS network—the two other versions of DC-LSTM differ only
by the input layer size.

Figure 9. Architecture of the TC-LSTM-LA network.

4. Results
For evaluation of our approach and for performance comparison with other ap-
proaches to action and interaction classification, the “accuracy” metric and the class “con-
fusion matrix” will be applied. “Accuracy” is the typical performance measure given in
DNN-related publications [2] and is defined as a ratio of the correctly classified data to the
total amount of classifications made by the model:

$$\mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \qquad (8)$$

Because of specific evaluation scenarios defined for the NTU RGB+D data set, called
CS (cross-subject) and CV (cross-view), the test set is balanced with respect to classes and
the class set is closed (i.e., all test samples belong to the known class set). Under these
conditions, the “accuracy” value is equivalent to non-weighted (mean) average “recall”:
$$\mathrm{Recall} = \frac{1}{K}\sum_{i=1}^{K}\frac{TP_i}{TP_i + FN_i}, \qquad (9)$$

where K means the number of classes, $TP_i$ is the number of true positives of class i samples,
and $FN_i$ is the number of false negatives of class i samples.

4.1. Data Sets

To evaluate and test the trained classifiers, three data sets were used. The main data
set on which our models were trained and evaluated was the interaction subset of the NTU
RGB+D data set. It includes 11 two-person interactions of 40 actors: A50: punch/slap, A51:
kicking, A52: pushing, A53: pat on back, A54: point finger, A55: hugging, A56: giving
object, A57: touch pocket, A58: shaking hands, A59: walking towards, A60: walking apart.
In our experiments, the skeleton data already provided with the NTU RGB+D data set
were used. There were 10,347 video clips in total, of which 7334 videos were in the
training set and the remaining 3013 videos were in the test set. No distinct validation subset
was distinguished.
The NTU RGB-D data set allowed us to perform a cross-subject (person) (short: CS) or
a cross-view (CV) evaluation. In the cross-subject setting, samples used for training show
actions performed by half of the actors, while test samples show actions of remaining actors,
i.e., videos of 20 persons were used for training and videos of the remaining 20 persons
were used for testing. In the cross-view setting, samples recorded by two cameras were
used for training, while samples recorded by the remaining camera were used for testing.
Each skeleton instance consists of 25 joints of 3D skeletons that apparently represent a
single person. As our research objective was to analyze video data and to focus on only
reliably detected joints, we used only the 2D information of the first 15 joints.

4.2. Verification on the NTU RGB+D Data Set


We trained and evaluated our eight models on the NTU RGB+D set, using only the
2D skeleton information, in both verification modes—CS (cross-subject) and CV (cross-
view)—proposed by the authors of this data set. The training set was split into learning and
test subsets—two thirds for learning and one third for validation/testing. CS means that
actors in the training set are different than in the test set but data from all the camera views
were included in both sets. CV means that two samples from camera views are included in
the training set, while samples from the remaining camera view are in the test set. Some
examples of proper interaction classification are shown in Figure 10.
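As a small companion to Equations (8) and (9), the sketch below computes both metrics from integer class labels; it is a generic illustration, not the evaluation script used for the reported experiments.

```python
import numpy as np

def accuracy_and_mean_recall(y_true, y_pred, num_classes=11):
    """Accuracy (Eq. 8) and non-weighted mean recall (Eq. 9) from label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))          # Eq. (8)
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))   # TP_c / (TP_c + FN_c)
    return accuracy, float(np.mean(recalls))             # Eq. (9)

# Example with three classes:
print(accuracy_and_mean_recall([0, 0, 1, 2], [0, 1, 1, 2], num_classes=3))
```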
Figure 10. Illustration of properly classified interactions of hugging (top row) and punching
(bottom row).

Confusion matrices allow for an accurate analysis of the incorrect predictions of individual
classes. In total, we prepared and analyzed 16 confusion matrices (= 8 models × 2 modes).
Figure 11 shows fragments of a confusion matrix obtained for the SC-LSTM-LA model in
the CS mode. We deliberately show the results of an average-performing model, so that
any mistakes are more visible than in cases of better-performing models. The vast majority
of class predictions are correct. The confused results are as follows:
• The punch class is confused with the finger pointing class—in both cases, a similar
hand movement is made towards the other person;
• The class of pat on the back is confused with the class of touching a pocket—touching
a pocket involves touching another person's pocket in an interaction (a simulation of
stealing a wallet), so the movement is close to patting someone on the back;
• The giving object class and the shaking hands class are very similar interactions—both
involve the contact of the hand;
• The walking towards and walking apart classes are detected virtually flawlessly.
In addition, for three models, the per-class classification accuracy was computed
(Table 1). We see exactly which classes cause the biggest problems. The worst-detected
classes are: “punch”, “touch pockets” and “point finger”. However, all these errors almost
disappear with the TC-LSTM-LA model, which detects all interaction classes at a similarly
proper level.

Figure 11. The most confusing cases of classification by the SC-LSTM-LA model.

Table 1. The per-class test accuracy of three models trained on the NTU-RGB+D interaction set,
verified in the CS (cross subject) mode.

Model A050 Punch A051 Kicking A052 Pushing A053 Pat on Back
SC-LSTM-RAW 61.8% 80.7% 81.8% 63.5%
SC-LSTM-LA 79.2% 91.2% 91.3% 84.7%
TC-LSTM-LA 94.6% 94.4% 97.4% 95.2%

Model A054 Point Finger A055 Hugging A056 Giving Object A057 Touch Pocket
SC-LSTM-RAW 64.0% 93.5% 64.2% 53.7%
SC-LSTM-LA 82.4% 97.1% 89.2% 80.4%
TC-LSTM-LA 95.0% 99.4% 95.6% 94.4%

Model A058 Shaking Hands A059 Walking Towards A060 Walking Apart
SC-LSTM-RAW 73.9% 98.5% 99.5%
SC-LSTM-LA 88.4% 100% 99.6%
TC-LSTM-LA 97.4% 100% 99.8%

The summary of results obtained by all the considered network architectures is given
in Table 2. First, we clearly see the advantage of our feature engineering step, as all
our models perform better with relational features than when using RAW skeleton data
(SC-LSTM-RAW).
Sensors 2023, 23, 6279 15 of 20

Table 2. The test accuracy of the eight models trained on the NTU-RGB+D interaction set, verified in
the CS (cross subject) and CV (cross view) modes. The best CS and CV performances are highlighted
by a bold font.

No. Model CS CV Parameters


1 SC-LSTM-RAW 75.7% 80.1% 3.33 M
2 SC-LSTM-LA 89.3% 90.5% 3.36 M
3 SC-LSTM-PD 91.7% 93.5% 4.12 M
4 SC-LSTM-PS 91.0% 94.5% 3.33 M
5 DC-LSTM-LA 95.5% 94.4% 6.61 M
6 DC-LSTM-PD 90.0% 91.8% 8.09 M
7 DC-LSTM-PS 90.1% 91.7% 6.53 M
8 TC-LSTM-LA 96.6% 93.2% 9.76 M

Consider now the effects of feature type and channel number. In the case of the
SC-LSTM architecture, polar features (PD, PS) perform much better than the LA features.
This was expected because the aim of using polar features is to more accurately represent
interpersonal relationships. On the other hand, when the DC-LSTM architectures were
compared, we see something completely different. The separation of channels for persons
significantly improved the use of limb-angle features, while worsening the quality of polar
features. In fact, the separation is very natural for LA features and the information related
to every single person is independent of the other person. In the case of polar features, even
when separated into two channels, they contain mutual information. This split of features
gives no benefit and even causes a deterioration in quality. An interesting conclusion is also
the similar level of performance of dense and sparse “polar” features, although their feature
numbers are much different. The triple-channel configuration TC-LSTM-LA provides
mixed results. It improves the accuracy of CS testing by 1.1% but deteriorates the CV
testing by 1.2%.
We have chosen our three best performing models, SC-LSTM-PS, DC-LSTM-LA, and
TC-LSTM-LA, for a comparison with other recent works.

4.3. Comparison Study


A complexity-to-quality tradeoff of our approach is demonstrated when comparing it
with other works referred to in recent literature. Many works on two-person interaction
classification have been evaluated on the NTU RGB+D interaction data set. In Table 3,
we list some of the leading works with the accuracies given in the referred works. The
competitiveness of our three best models, regarding the criteria of quality and complexity,
is observed. It must be noted that the top solutions use multi-data stream architectures.
The PoseConv3D(J+L) solution is processing two types of image sequences in parallel—
skeleton heatmaps and RGB images. The 2S DR-AGCN solution employs graph structures
besides the skeleton joints and branches. The top approaches analyze all the frames of a
video clip, contrary to other methods, which process a sparse frame sequence only. Our
results were obtained for 32 frames selected for windows of 64 frames.

4.4. Sliding Window Validation on a UT Subset


4.4.1. The UT-Interaction Data Set
The models were also tested on the UT-Interaction data set [13], which contains longer
videos with multiple interactions occurring one after the other. In total, five videos with
eight interactions each were tested (the interactions were consistent with NTU classes).
The classification accuracy of our eight models is given in Table 4. The results confirm our findings from the NTU RGB+D data set: the RAW features yield the worst classification accuracy, while the comparison of the remaining models leads to the same ranking as before. The three best-performing models are TC-LSTM-LA, DC-LSTM-LA, and SC-LSTM-PS.

Table 3. Test accuracy of leading works evaluated on the NTU-RGB+D interaction set in the CS (cross subject) and CV (cross view) modes. Our three best models are highlighted in bold font.

Approach Year CS CV Parameters


FSNET [28] 2019 74.0% 80.5% -
ST-LSTM [18] 2016 83.0% 87.3% -
ST-GCN [21] 2018 83.3% 87.1% 3.1 M
IRNinter+intra [3] 2019 85.4% - 9.0 M
GCA-LSTM [29] 2017 85.9% 89.0% -
2-stream GCA-LSTM [30] 2018 87.2% - -
AS-GCN [22] 2019 89.3% 93.0% 9.5 M
LSTM-IRN [3] 2019 90.5% 93.5% 9.08 M
2S-AGCN [23] 2019 93.4% - 3.0 M
DR-GCN [25] 2021 93.6% 94.0% 3.18 M
2S DR-AGCN [25] 2021 94.68% 97.19% 3.57 M
PoseConv3D(J+L) [27] 2022 97.0% 99.6% 6.9 M
SC-LSTM-RAW 2022 75.7% 80.1% 3.33 M
SC-LSTM-PS 2022 91.0% 94.5% 3.33 M
DC-LSTM-LA 2022 95.5% 94.4% 6.51 M
TC-LSTM-LA 2022 96.6% 93.2% 9.76 M

Table 4. Cross-domain test accuracy of our eight models obtained on the UT-Interaction data set.

No. Model Accuracy
1 SC-LSTM-RAW 72.5%
2 SC-LSTM-LA 82.5%
3 SC-LSTM-PD 90.0%
4 SC-LSTM-PS 92.5%
5 DC-LSTM-LA 95.0%
6 DC-LSTM-PD 87.5%
7 DC-LSTM-PS 90.0%
8 TC-LSTM-LA 97.5%

4.4.2. Example of Multi-Interaction Video


Let us illustrate the sliding window classification strategy on one example from the UT data set. Figure 12 presents the evolution of interaction class likelihoods over the sequence of windows. For every window, the class with the highest likelihood is chosen. The obtained results are collected in Table 5 and illustrated in Figure 13. The window size was 2 s, with an overlap of 0.5 (i.e., a window rate of one window per second).
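A minimal sketch of this sliding-window procedure is given below; `classify_window` stands for any of the trained models, the frame rate value is an assumption, and the 2 s window with 0.5 overlap matches the setting described above.

```python
import numpy as np

def sliding_window_classification(frame_features, classify_window, fps=30,
                                  window_sec=2.0, overlap=0.5):
    """Classify a long clip window by window and keep the most likely class per window.

    frame_features: per-frame feature vectors of the whole clip.
    classify_window: callable returning class likelihoods for one window (assumed).
    """
    win = int(window_sec * fps)            # 2 s window, e.g., 60 frames at 30 fps
    step = int(win * (1.0 - overlap))      # 0.5 overlap -> one window per second
    detections = []
    for start in range(0, len(frame_features) - win + 1, step):
        likelihoods = classify_window(frame_features[start:start + win])
        detections.append((start, start + win - 1, int(np.argmax(likelihoods))))
    return detections
```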

Table 5. Interactions detected in consecutive window periods of a 20 s video clip.

No. First Frame Last Frame Detected Interaction True/False


1 1 15 point finger True
2 16 105 hugging True
3 106 195 pushing False (“walking apart”)
4 196 255 giving an object True
5 256 285 pushing True
6 286 315 punch True
7 316 375 walking apart True
8 376 435 walking towards True
9 436 526 kicking True
10 527 594 point finger True

Figure 12. Illustration of interaction class likelihoods in true sliding window classification.


Figure 13. Illustration of detected interactions by sliding window classification.



5. Discussion
As can be seen from Table 3, many works on skeleton-based human activity recognition in video have been published over the last several years. They have been trained and evaluated on short video clips containing single activities. Our aim was to design an approach that solves the more realistic problem of processing longer videos with varying interactions between two actors. A second goal was to reach real-time processing with satisfactory classification performance. Our solution can be briefly characterized by three concepts: knowledge-aware skeleton feature extraction in the feature engineering step; the use of multi-stream neural network models based on LSTM layers; and sliding-window-controlled processing of long videos.
We have trained several models on the interaction subset of the NTU RGB+D data set. The models have been evaluated in a short-video mode on the test part of the above training set and in a cross-domain mode on long videos from the UT-Interaction data set. The first evaluation resulted in the selection of the three best-performing single-, double- and triple-channel models: SC-LSTM-PS, DC-LSTM-LA, and TC-LSTM-LA. These models represent a tradeoff between accuracy and complexity, as the highest accuracy (94.9% when averaging the CS and CV scores) was achieved by the most complex model, TC-LSTM-LA (with 9.76 M weights), while the least complex model (with 3.33 M weights) showed the lowest accuracy (92.75%). The usefulness of our feature engineering step is also confirmed by these results: when raw skeleton data were used, the corresponding model reached an average accuracy of only 77.9%.
A comparison with the top-performing complex DNN models confirmed the good standing of our solutions. Our moderate-complexity models with standard LSTM layers perform 3.4–5.55% lower than the currently best PoseConv3D(J+L), which has an average performance of 98.3%. Please note that this top version of the PoseConv3D family was trained not only on skeleton heatmaps but also on the original RGB data. The performance of our models is only slightly lower than that of the second-best performing adaptive graph convolutional network (the 2S DR-AGCN model), which reaches 95.93%.
Our models and the sliding window step have also been validated on a second data set, the UT-Interaction set of longer videos with many interactions. Again, the TC-LSTM-LA model showed the highest accuracy, 97.5%. By monitoring the results obtained for consecutive window locations, one could also verify the nearly perfect classification of multiple interactions (in the presented example, nine out of ten interactions were classified correctly).
The main scientific contribution is related to the proposed feature engineering algorithm that performs skeleton tracking and knowledge-aware ("hand-crafted") relational feature extraction. This contribution can be formulated as follows:
1. We demonstrated the superiority of our approach, which combines hand-crafted relational features with an LSTM-based classification model, over simple neural network models that learn relational features from pairs of joints, such as the IRNinter+intra and LSTM-IRN models.
2. Our hand-crafted features can offset the advantages of modern graph neural networks and graph convolutional networks over LSTMs when both are applied in the feature transformation stage (as an encoder). Even complex configurations, such as the AS-GCN and 2S-AGCN models, can be challenged by our approach.

6. Conclusions
An approach to two-person interaction classification has been designed and experimentally evaluated. The input data come from the OpenPose tool, an efficient deep network solution for generating human skeleton sets from an image or video frame. The quality of the skeleton data is improved by the proposed skeleton tracking and joint correction procedure. An important quality contribution comes from the knowledge-aware feature engineering step, which generates relational data from the raw skeletons.

Various network configurations, based on LSTM layers, were trained and evaluated. The high-quality test results prove our concept. By applying our relational features, accuracy gains of 12–14% were achieved compared with the use of RAW skeleton data. A practical advantage is the assumed sparsity of video frames: by adjusting the number of key frames, real-time processing is possible even with moderate computational resources. The approach can easily be adapted to process true image sequences, such as image galleries.
The limitations of this study are as follows: a strong dependence on the proper estimation of human skeleton data by OpenPose or HRNet, and a focus on the main body parts, i.e., human actions performed by feet, hands, and fingers cannot be properly distinguished from each other.

Author Contributions: Conceptualization, S.P. and W.K.; methodology, W.K.; software, S.P. and P.P.;
validation, S.P. and P.P.; formal analysis, W.K.; writing—original draft preparation, S.P. and W.K.;
writing—review and editing, P.P.; project administration, W.K. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by “Narodowe Centrum Badań i Rozwoju”, Warszawa, Poland,
grant No. CYBERSECIDENT/455132/III/NCBR/2020. The APC was funded by Warsaw University
of Technology.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Links to the data sets are included in the Reference section.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.

References
1. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A Large Scale Data Set for 3D Human Activity Analysis. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Las Vegas, NV, USA, 26 June–1 July 2016.
[CrossRef]
2. [Online]. NTU RGB+D 120 Data Set. Papers with Code. Available online: https://paperswithcode.com/dataset/ntu-rgb-d-120
(accessed on 25 May 2023).
3. Perez, M.; Liu, J.; Kot, A.C. Interaction Relational Network for Mutual Action Recognition. arXiv 2019, arXiv:1910.04963. Available
online: https://arxiv.org/abs/1910.04963 (accessed on 20 May 2023).
4. Stergiou, A.; Poppe, R. Analyzing human-human interactions: A survey. In Computer Vision and Image Understanding; Elsevier:
Amsterdam, The Netherlands, 2019; Volume 188, p. 102799. [CrossRef]
5. Liu, M.; Yuan, J. Recognizing Human Actions as the Evolution of Pose Estimation Maps. In Proceedings of the 2018 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1159–1168.
[CrossRef]
6. Cippitelli, E.; Gambi, E.; Spinsante, S.; Florez-Revuelta, F. Evaluation of a skeleton-based method for human activity recognition
on a large-scale RGB-D data set. In Proceedings of the 2nd IET International Conference on Technologies for Active and Assisted
Living (TechAAL 2016), London, UK, 24–25 October 2016. [CrossRef]
7. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity
Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [CrossRef] [PubMed]
8. Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 2014 IEEE
Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [CrossRef]
9. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose
estimation model. In Computer Vision—ECCV 2016; LNCS Volume 9907; Springer: Cham, Switzerland, 2016; pp. 34–50. [CrossRef]
10. Zhang, S.; Wei, Z.; Nie, J.; Huang, L.; Wang, S.; Li, Z. A review on human activity recognition using vision-based method.
J. Healthc. Eng. 2017, 2017, 3090343. [CrossRef] [PubMed]
11. Bevilacqua, A.; MacDonald, K.; Rangarej, A.; Widjaya, V.; Caulfield, B.; Kechadi, T. Human Activity Recognition with Convo-
lutional Neural Networks. In Machine Learning and Knowledge Discovery in Databases; LNAI Volume 11053; Springer: Cham,
Switzerland, 2019; pp. 541–552. [CrossRef]
12. Puchała, S.; Kasprzak, W.; Piwowarski, P. Feature engineering techniques for skeleton-based two-person interaction classification
in video. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV),
Singapore, 11–13 December 2022; pp. 66–71. [CrossRef]
Sensors 2023, 23, 6279 20 of 20

13. UT-Interaction. SDHA 2010 High-Level Human Interaction Recognition Challenge. Available online: https://cvrc.ece.utexas.
edu/SDHA2010/Human_Interaction.html (accessed on 10 May 2022).
14. Meng, H.; Freeman, M.; Pears, N.; Bailey, C. Real-time human action recognition on an embedded, reconfigurable video processing
architecture. J. Real-Time Image Process. 2008, 3, 163–176. [CrossRef]
15. Chathuramali, M.; Rodrigo, R. Faster human activity recognition with SVM. In Proceedings of the International Conference on
Advances in ICT for Emerging Regions (ICTer2012), IEEE, Colombo, Sri Lanka, 13–14 December 2012. [CrossRef]
16. Yan, X.; Luo, Y. Recognizing human actions using a new descriptor based on spatial–temporal interest points and weighted-output
classifier. Neurocomputing 2012, 87, 51–61. [CrossRef]
17. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, 23–28
June 2014. [CrossRef]
18. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Computer
Vision—ECCV 2016; LNCS Volume 9907; Springer: Cham, Switzerland, 2016; pp. 816–833. [CrossRef]
19. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-Based Action Recognition with Convolutional Neural Networks. arXiv 2017, arXiv:1704.07595. Available online: https://arxiv.org/abs/1704.07595 (accessed on 10 May 2022).
20. Liang, D.; Fan, G.; Lin, G.; Chen, W.; Pan, X.; Zhu, H. Three-Stream Convolutional Neural Network with Multi-Task and
Ensemble Learning for 3D Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), IEEE, Long Beach, CA, USA, 16–17 June 2019. [CrossRef]
21. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018,
arXiv:1801.07455. Available online: https://arxiv.org/abs/1801.07455v2 (accessed on 20 May 2022).
22. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based
Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Long Beach, CA, USA, 15–20 June 2019; pp. 3590–3598. [CrossRef]
23. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H.-Q. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action
Recognition. arXiv 2019, arXiv:1805.07694v3. Available online: https://arxiv.org/abs/1805.07694v3 (accessed on 15 July 2022).
24. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H.-Q. Skeleton-based action recognition with multi-stream adaptive graph convolutional
networks. IEEE Trans. Image Process. 2020, 29, 9532–9545. [CrossRef] [PubMed]
25. Zhu, L.; Wan, B.; Li, C.-Y.; Tian, G.; Hou, Y.; Yuan, K. Dyadic relational graph convolutional networks for skeleton-based human
interaction recognition. In Pattern Recognition; Elsevier: Amsterdam, The Netherlands, 2021; Volume 115, p. 107920. [CrossRef]
26. Duan, H.; Zhao, Y.; Chen, K.; Shao, D.; Lin, D.; Dai, B. Revisiting Skeleton-based Action Recognition. arXiv 2021,
arXiv:2104.13586v1. Available online: https://arxiv.org/abs/2104.13586v1 (accessed on 20 August 2022).
27. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting Skeleton-based Action Recognition. arXiv 2022, arXiv:2104.13586v2.
Available online: https://arxiv.org/abs/2104.13586v2 (accessed on 20 April 2023).
28. Liu, J.; Shahroudy, A.; Wang, G.; Duan, L.-Y.; Kot, A.C. Skeleton-Based Online Action Prediction Using Scale Selection Network.
IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2019, 42, 1453–1467. [CrossRef] [PubMed]
29. Liu, J.; Wang, G.; Hu, P.; Duan, L.-Y.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; pp. 3671–3680. [CrossRef]
30. Liu, J.; Wang, G.; Duan, L.-Y.; Abdiyeva, K.; Kot, A.C. Skeleton-Based Human Action Recognition with Global Context-Aware
Attention LSTM Networks. IEEE Trans. Image Process. (TIP) 2018, 27, 1586–1599. [CrossRef] [PubMed]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
