HARD-Net: Hardness-AwaRe Discrimination Network for 3D Early Activity Prediction

Anonymous ECCV submission

Paper ID 1360
Abstract. Predicting the class label from a partially observed activity sequence is a very hard task, as the observed early segments of different activities can be very similar. In this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net) to specifically investigate the relationships between similar activity pairs that are hard to discriminate. Specifically, a Hard Instance-Interference Class (HI-IC) bank is designed, which dynamically records the hard similar pairs. Based on the HI-IC bank, a novel adversarial learning scheme is proposed to train our HARD-Net, which grants our network a strong capability in mining subtle discrimination information for 3D early activity prediction. We evaluate our proposed HARD-Net on two public activity datasets and achieve state-of-the-art performance.

Keywords: Early Activity Prediction, Action/Gesture Understanding, 3D Skeleton Data, Hardness-Aware Learning
1 Introduction

Early human activity prediction (predicting the class label of an action or gesture before it is completely performed) is an important and active research problem in the human behavior analysis domain, thanks to its relevance to many real-world applications, such as online human-robot interaction, self-driving vehicles, and security surveillance [6, 35, 37]. Existing works [26, 34] show that 3D skeleton data, which can be conveniently acquired with 3D pose estimation frameworks [19, 23, 29], is a concise yet informative and powerful representation for human behavior analysis. Therefore, in this paper, we focus on the task of early human activity prediction from 3D skeleton data, namely, 3D early activity prediction.

Unlike 3D activity recognition, where the full-length skeleton sequences, which often contain sufficient discrimination information, can be used, in 3D early activity prediction only the beginning segments of the sequences are observed. This makes early activity prediction much more challenging than recognition. More specifically, when performing early activity prediction, the observed beginning segments of many activities can be very similar, i.e., there may be only subtle discrepancies among them for discrimination.
Fig. 1. Illustration of two example activities ("shaking hand" and "pointing to someone") from the NTU RGB+D dataset [26], shown along a timeline (20%, 80%, 100% observed). Though sufficient discrimination information can be used to distinguish these two activities when their full sequences are observed, at the early stages (e.g., when only 20% is observed), the two activities are quite similar (with only subtle discrimination information contained, as labelled by the red boxes). This makes early prediction hard.
Thus, due to the lack of significant discrimination information, these partially observed segments can easily be "mis-predicted" into other categories. For example, in Fig. 1, at the early stage (20% observation ratio), the "pointing to someone" sample can easily be mis-predicted into the "shaking hand" class, since there are only very subtle differences between them. Here we call the easily mis-predicted segments hard instances, and the classes that they are easily mis-predicted into their interference classes. We also call the pair containing a hard instance and its interference class a hard pair.

To deal with the challenging task of 3D early activity prediction, some of the existing works [6] adopt soft regression schemes [7] or exclude the easily mismatched categories [37] to better handle the ambiguities in partially observed skeleton sequences. Other works [9, 35] focus on inferring or distilling information from the full activity sequences, which contain richer discrimination information, to assist activity prediction from the partial sequences. Though remarkable progress has been achieved by these existing methods, most of them do not explicitly consider the hard pair discrimination issue, i.e., specifically investigating the relationships within each hard pair in order to exploit their minor discrepancies for better early activity prediction.

As mentioned above, the high similarity of the partially observed activity sequences between a hard instance and its corresponding interference classes makes 3D early activity prediction challenging. Thus, to achieve reliable prediction performance, a desired prediction model should be discriminative and powerful enough to comprehend the relationships between the confusing hard pair samples, while prudently investigating their inherent subtle discrepancies that can be exploited for discrimination.

Inspired by this, in this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net), which is able to explicitly mine, perceive, and
exploit the relationships and the minor discrepancies within each hard pair, in order to achieve a discriminative model for early activity prediction.

Concretely, in our HARD-Net, a Hard Instance-Interference Class (HI-IC) bank is specifically designed, which is able to dynamically record the hard pairs during the model learning procedure. Based on our HI-IC bank, an effective adversarial learning scheme for discriminating the features of the hard pair samples is proposed. To investigate the relationship between a hard instance and its interference class, a feature generator is designed, which produces confusing yet plausible hard instance features by conditioning on the similarities of this instance to the corresponding interference class. Meanwhile, to obtain the ability of mining subtle discrimination information between the features of hard pair samples, a class discriminator is further designed that pushes the prediction model to distinguish the confusing features of the hard instance from its interference classes. Therefore, as the adversarial learning goes on, the generated features of the hard instance become more confusing with regard to its interference class, which in turn promotes the capability of the class discriminator in mining the subtle differences that exist between the features of the hard pair samples for class discrimination. As a result, the proposed HARD-Net, with the class discriminator as the classifier, becomes very powerful in handling hard pairs that are often difficult for early activity prediction models to discriminate.

The contributions of this paper are summarized as follows: 1) We propose a novel end-to-end HARD-Net with an effective adversarial learning scheme, in which a powerful class discriminator, able to mine subtle discrimination information within the features of the hard pairs, can be obtained, thus improving the 3D early activity prediction performance. 2) We design an HI-IC bank to store the hard instance-interference class pairs. The HI-IC bank is dynamically updated during network learning, which continuously boosts the discrimination ability of our early activity prediction model. 3) The proposed method achieves state-of-the-art performance on the NTU RGB+D dataset for 3D early action prediction and the FPHA dataset for 3D early gesture prediction.
2 Related Work

3D human activity analysis has been widely studied in the computer vision community. In this section, we first briefly review existing works on 3D human activity recognition, and then introduce current achievements on early human activity prediction. Finally, we review the hard example learning methods that are relevant to our proposed HARD-Net.

3D Human Activity Recognition. Some of the existing methods [2, 34] used hand-crafted features for 3D human activity recognition. Evangelidis et al. [2] proposed a local skeleton descriptor that encodes the relative positions of each joint quadruple, and then utilized Fisher kernel representations to describe the skeletal quads. Vemulapalli et al. [33] proposed to represent human activities
as curves in a Lie group instead of 3D joint locations or joint angles, and then mapped the Lie group to its Lie algebra for recognizing the 3D activities.

With the development of deep neural networks, RNN/LSTM-based methods have also attracted much research attention [26, 20, 18, 8, 1]. Shahroudy et al. [26] proposed a part-aware LSTM to model human activities as interactions of human joints, and split the LSTM memory cell into part-based sub-cells to learn long-term context information for each body part. Zhu et al. [39] proposed an end-to-end fully-connected LSTM network for human activity recognition with a regularization strategy to discern the co-occurrences of body joints.

Besides the RNN/LSTM models, 2D convolutional neural networks (CNNs) have also been investigated for 3D activity recognition. Ke et al. [10] proposed to represent a skeleton sequence as a three-channel RGB image, where each channel encodes one cylindrical coordinate of the 3D human joints. Li et al. [15] further proposed an attention module on top of a 2D CNN to learn global co-occurrences from the skeleton data.

More recently, graph convolutional networks (GCNs) have become prevalent for 3D activity recognition [28, 38, 32, 16, 31, 27]. Yan et al. [38] proposed to use a spatial-temporal GCN for 3D activity recognition. Shi et al. [28] proposed an adaptive graph convolutional network that adaptively learns the graph topology for different layers, and employed the second-order information of the raw skeleton data as an extra input stream to boost the performance.

Early Human Activity Prediction. Unlike activity recognition, which observes full-length activity sequences that often contain sufficient discrimination information, early activity prediction can only use segments from the beginning part of the activity sequence. Due to this drop in discrimination information, early activity prediction is much more challenging than activity recognition.
Different approaches [6, 13, 9, 35, 37, 12, 14] have been proposed for early activity prediction. Ke et al. [9] proposed to learn latent global information from full-length sequences and local information from partial-length sequences. Wang et al. [35] introduced a teacher-student learning architecture to transfer knowledge from long-term sequences to shorter-term sequences. Kong et al. [12] added a memory cell to the original LSTM module to memorize hard-to-predict training samples and improve prediction performance. Hu et al. [6] proposed a soft regression scheme for handling the similarities among the early segments of different activities.

Overall, the aforementioned works on 3D early activity prediction do not focus on improving the discrimination ability of the prediction model by specifically handling very similar hard pair samples, though the discrimination ability for the hard pairs is often one of the bottlenecks in early activity prediction. Different from these works, we construct an HI-IC bank to explicitly and dynamically record the hard pair data, and propose a novel HARD-Net with adversarial learning that pushes the prediction model to specifically discriminate hard pair samples by exploiting their relationships and mining their subtle discrimination information.
Hard Example Learning. Explicitly learning from hard-to-predict examples has been shown to be very helpful for a wide range of computer vision and machine learning tasks [21, 30, 17, 5, 22, 36, 3]. For example, Shrivastava et al. [30] proposed a hard example mining scheme that automatically selects hard data to improve object classification performance. Felzenszwalb et al. [3] proposed a margin-sensitive method for handling hard negative examples with a latent SVM, which iteratively fixes the latent values for positive examples and optimizes the latent SVM objective function.

Unlike these methods on hard example learning, we focus on improving the ability to mine subtle discrimination information within the hard pairs, by explicitly pairing the easily mis-predicted early activity segments with their corresponding interference classes via an adversarial learning scheme. An HI-IC bank is also introduced to specifically store the hard instances and their interference classes, in order to facilitate the comprehension of the relationships and minor differences within the pairs. This boosts the discrimination capability of the early activity prediction model.
3 Method

3.1 Problem Formulation

Given a full-length activity sequence $S = \{s_t\}_{t=1}^{T}$, where $s_t$ denotes the $t$-th frame and $T$ represents the sequence length, following existing works [6, 9], the full-length sequence $S$ is first divided into $N$ segments, i.e., each segment contains $T/N$ frames. Thus a partial sequence can be denoted as $P = \{s_t\}_{t=1}^{\tau}$, where $\tau = i \cdot T/N$ and $i \leq N$. The task of early activity prediction is to identify the activity category $c \in \mathcal{C} = \{1, 2, \ldots, C\}$ that the partial activity sequence $P$ belongs to, under various observation ratios.
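To make this segmentation concrete, here is a minimal sketch of extracting a partial observation under the above formulation; the array layout (frames on the first axis) and the function name are illustrative choices of ours, not from the paper.

import numpy as np

def partial_sequence(full_seq: np.ndarray, i: int, n_segments: int = 10) -> np.ndarray:
    """Return the partial observation P = {s_t}_{t=1..tau}, with tau = i * T / N."""
    T = full_seq.shape[0]                 # full sequence length
    assert 1 <= i <= n_segments
    tau = int(i * T / n_segments)         # observed length at ratio i / N
    return full_seq[:tau]

# Example: a 100-frame skeleton sequence (25 joints, 3D) observed at a 20% ratio.
seq = np.random.randn(100, 25, 3)
partial = partial_sequence(seq, i=2, n_segments=10)
print(partial.shape)  # (20, 25, 3)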
3.2 Hardness-AwaRe Discrimination Network

3.2.1 Overview. The overall architecture of our end-to-end Hardness-AwaRe Discrimination Network (HARD-Net) is shown in Fig. 2. As mentioned above, certain activities can be quite similar at their early stages. Thus 3D early activity prediction often suffers from a lack of sufficient discrimination information at low observation ratios. Here we introduce a new method that explicitly records the hard pairs, which lack sufficient discrimination information, using an HI-IC bank, and investigates the relationship between the hard instances and their corresponding interference classes via an adversarial learning scheme, which thus enhances the capability of our prediction model in mining the minor discrimination information within the hard samples in feature space, for better activity prediction.
3.2.2 Hard Instance-Interference Class (HI-IC) Bank. We design an HI-IC bank to record hard pairs, where each pair contains a hard instance as well as its corresponding interference class, as shown in the top part of Fig. 2. This enables our model to become aware of the specific categories into which a hard partial activity sequence can easily be mis-predicted.
Fig. 2. Illustration of our end-to-end HARD-Net. (The diagram shows the HI-IC bank of instance-interference class entries, the feature encoder, the hard feature generator, the RealOrFake discriminator, and the class discriminator.) Our network is built on a replaceable feature encoder (e.g., the CNN skeleton encoder [9] or the GCN skeleton encoder [28]) that encodes features for partial sequences. Following the red arrows, partial sequences (P) are fed to the encoder, followed by the classifier, to obtain classification scores that serve as the criterion for storing hard pairs into the HI-IC bank. Then, following the green arrows, we randomly select a hard pair, comprising a hard instance (P^H) and an interference class sample (P^I), from the HI-IC bank for feature encoding. The obtained feature pair, f^hard and f^inter, is then used for adversarial learning, in order to improve the capability of our prediction model in mining subtle discrepancies within each hard pair in feature space.
As shown in Fig. 2, the base structure of our network includes a feature encoder E that learns features for the partial activity sequence, and a classifier (denoted as the class discriminator in Fig. 2) for class prediction. At each training iteration, each partial activity sequence P is fed to the encoder to obtain the features f^ori, which are then fed to the class discriminator to produce the prediction confidence scores ŷ. If the partial sequence instance P is wrongly predicted by the class discriminator, the activity class c_r1 that has the rank-1 score in ŷ is considered the interference class c^I of P (i.e., c^I = c_r1), as c_r1 carries the most ambiguous information with regard to P.

We regard the wrongly predicted partial sequence P, which does not contain sufficient discrimination information, as the hard instance P^H. We then pack the pair of P^H and c^I as a hard pair, which is stored into the HI-IC bank, as illustrated in Fig. 2.
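As a rough sketch of this bookkeeping (the paper does not specify the bank's data structure, so the class and method names below are hypothetical), the HI-IC bank can be viewed as a bounded buffer of (hard instance, interference class) pairs; the first-in-first-out selection and the repeat-when-small behavior follow the description in Sec. 3.2.4.

import random
from collections import deque

class HIICBank:
    """Bounded FIFO buffer of (hard instance, interference class) pairs."""

    def __init__(self, max_size: int = 5000):   # 5000 is used for NTU RGB+D (Sec. 4.1)
        self.pairs = deque(maxlen=max_size)     # oldest pairs are evicted first

    def add(self, hard_instance, interference_class: int):
        self.pairs.append((hard_instance, interference_class))

    def sample(self, n: int):
        """Return n pairs; repeat entries while the bank is still small (assumes
        the bank is non-empty), otherwise select pairs first-in-first-out."""
        if len(self.pairs) < n:
            return [random.choice(self.pairs) for _ in range(n)]
        return [self.pairs.popleft() for _ in range(n)]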
3.2.3 Adversarial Hardness-Aware Discrimination Learning Scheme. To investigate the relationship within each hard pair, we design a feature generator
(G) conditioned on the hard instance and its corresponding interference class from the pair, in order to derive latent features that are confusing yet plausible for representing the original hard instance. Meanwhile, to enhance the capability of our network in mining subtle discrimination information, we design a class discriminator (D^cls) by granting the prediction model the power of distinguishing the generated latent features of the hard instance from its interference class. With such an adversarial learning scheme, the generated latent features of the hard instance become more and more confusing with regard to its interference class, which in turn boosts the power of the class discriminator in mining the subtle discrimination information needed to distinguish the confusing hard instance from its interference class. As a result, the overall discrimination capability of the prediction model is strengthened during the adversarial learning procedure.
Feature Generator. We design a generator (G) that exploits the relationship between the hard instance and the corresponding interference class, in order to produce latent features that are confusing and hard to predict, yet still plausible, retaining the inherent information representing the original hard instance.

Concretely, to exploit the relationship between the hard instances and the corresponding interference classes, for a hard instance P^H, we refer to the HI-IC bank and identify its interference class. We then randomly sample an instance P^I from the interference class. Thus we obtain the paired hard samples (P^H and P^I). These two samples are fed to the feature encoder to obtain the feature pair (f^hard and f^inter) representing them.

In our design, we aim to generate latent features f^latent of the hard instance that are very confusing and ambiguous with regard to its interference class. Thus, besides feeding the original features f^hard of the hard instance to the generator G, the interference class features f^inter are used as the interference information and are also fed to G, as shown in Fig. 2. Therefore, conditioned on the aggregated features of P^H and P^I, the produced latent features f^latent from G become very confusing and hard to discriminate between these two classes.
Moreover, to specifically ensure that f^latent is ambiguous and hard enough for activity prediction, we further introduce an "ambiguous label" for the hard instance to assist the learning of G. This "ambiguous label" can be explained as follows. Usually, in classification, the ground-truth label of category j is a one-hot vector y, in which the j-th element is set to 1 and all other elements are 0. Unlike this one-hot label, our "ambiguous label" is represented as a vector y^amb, in which the two positions that correspond to the ground-truth category of the hard instance and the category of its interference class are both set to 0.5, and all other elements are set to 0.

Such an "ambiguous label" (y^amb) can then be used as a constraint to drive the generated latent features f^latent to be ambiguous between these two classes. This constraint can be formulated as follows:
L^{G}_{amb} = -\sum_{k=1}^{K} y_k^{amb} \cdot \log \hat{y}_k^{latent}    (1)
where K denotes the total number of activity classes, and ŷ^latent is the output vector of the class discriminator that performs classification based on the generated latent features f^latent.
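A minimal PyTorch-style sketch of the ambiguous label and the constraint in Eq. (1) might look as follows; the function names are ours, and we assume ŷ^latent is already a softmax probability vector.

import torch

def ambiguous_label(gt_class: int, interference_class: int, num_classes: int) -> torch.Tensor:
    """Build y^amb: 0.5 at the ground-truth and interference positions, 0 elsewhere."""
    y_amb = torch.zeros(num_classes)
    y_amb[gt_class] = 0.5
    y_amb[interference_class] = 0.5
    return y_amb

def loss_amb(y_amb: torch.Tensor, y_hat_latent: torch.Tensor) -> torch.Tensor:
    """Eq. (1): cross-entropy between the ambiguous label and the class
    discriminator's softmax output on the generated latent features."""
    return -(y_amb * torch.log(y_hat_latent + 1e-8)).sum()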
The constraint in Eq. (1) ensures that the generated latent features f^latent are ambiguous enough. However, as mentioned before, f^latent still needs to be plausible and retain the inherent information representing the original hard instance. To achieve this, we apply a real-or-fake constraint on f^latent to make it plausible, as well as a mean absolute error constraint to drive f^latent closer to the hard instance's original features f^hard.
The mean absolute error constraint L^G_con, which narrows the distance between f^latent and f^hard, is formulated in Eq. (2). The real-or-fake constraint L^G_rof, imposed by the RealOrFake discriminator D^rof to ensure that the generated features f^latent and the original features f^hard stay in the same feature domain, is formulated in Eq. (3).
L^{G}_{con} = \| f^{latent} - f^{hard} \|_1    (2)

L^{G}_{rof} = \mathbb{E}[\log D^{rof}(f^{hard})] + \mathbb{E}[\log(1 - D^{rof}(f^{latent}))]    (3)
The overall objective function for the generator (G) can thus be formulated as:

L^{G} = L^{G}_{con} + \lambda_1 L^{G}_{rof} + \lambda_2 L^{G}_{amb}    (4)
With the above objective function, the feature generator G is able to generate latent features f^latent that are very confusing with regard to the interference class, yet still retain the inherent information representing the input hard instance.
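Putting Eqs. (1)-(4) together, a sketch of the generator objective could read as below; the names are ours, D^rof is assumed to output a probability in (0, 1), and we use the non-saturating generator form of Eq. (3) that is common in GAN training (λ1 = λ2 = 1, as in Sec. 4.1).

import torch
import torch.nn.functional as F

def generator_loss(f_latent, f_hard, y_amb, y_hat_latent, d_rof,
                   lambda1: float = 1.0, lambda2: float = 1.0):
    """Eq. (4): L^G = L^G_con + lambda1 * L^G_rof + lambda2 * L^G_amb."""
    l_con = F.l1_loss(f_latent, f_hard)                      # Eq. (2): MAE constraint
    # Generator side of Eq. (3): drive D^rof to score f^latent as "real"
    # (non-saturating form, rather than the minimax form stated in Eq. (3)).
    l_rof = -torch.log(d_rof(f_latent) + 1e-8).mean()
    l_amb = -(y_amb * torch.log(y_hat_latent + 1e-8)).sum()  # Eq. (1)
    return l_con + lambda1 * l_rof + lambda2 * l_amb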
Class Discriminator. To obtain strong discrimination power, we design a class discriminator (D^cls) that is able to distinguish the generated latent features f^latent of each hard instance from its corresponding interference class. To learn our class discriminator, a softmax classification constraint L^{D^cls} is applied on D^cls that pushes it to predict the accurate label y of the original hard instance based on the confusing latent features f^latent:
L^{D^{cls}} = -\sum_{k=1}^{K} y_k \cdot \log \hat{y}_k^{latent}    (5)
Therefore, as the adversarial learning goes on, the generated latent features f^latent representing the hard instance become more and more confusing with regard to its interference class (i.e., they contain less and less discrimination information for D^cls to perform class distinguishing). This, however, in turn boosts the power of D^cls in comprehending the remaining subtle discriminative information in f^latent to distinguish it from its interference class, i.e., D^cls becomes more and more powerful in mining very minor discrimination information for class distinguishing.
Algorithm 1: Learning procedure of our HARD-Net.

Input: Partial skeleton sequences (P) and ground-truth labels (c^τ)
while not converged do
    Phase 1: Backbone learning and HI-IC bank filling
        Calculate f^ori by E;
        Calculate ŷ by D^cls;
        Calculate L^{D^cls}_ori with Eq. (6);
        Update E and D^cls;
        if rank-1(ŷ) != c^τ then
            P^H ← P;  c^I ← rank-1(ŷ);
            HI-IC bank ← {P^H, c^I};
        end
    Phase 2: Adversarial HARD-Net learning
        Freeze E;
        Select and sample P^H and P^I from the HI-IC bank;
        Calculate f^hard and f^inter by E;
        Calculate f^latent by G;
        Calculate L^{D^cls} and L^{D^rof};
        Freeze G; update D^rof and D^cls;
        Calculate L^G;
        Freeze D^rof and D^cls; update G;
end
Note that during adversarial learning, besides feeding the generated latent features f^latent to train D^cls, the original features f^ori encoded from the original samples are also fed to D^cls during training, as shown in Fig. 2. Therefore, the objective function below is also applied when learning D^cls:
L^{D^{cls}}_{ori} = -\sum_{k=1}^{K} y_k \cdot \log \hat{y}_k^{ori}    (6)
As mentioned before, in our adversarial learning scheme, the original features and the generated features are kept in the same domain. Thus such a training scheme (combining Eq. (5) and Eq. (6)) is able to stabilize the overall network learning, which further yields a powerful D^cls for mining the subtle discrimination information contained in both the latent features and the original features for class distinguishing. Therefore, the obtained class discriminator D^cls, which has very strong power in mining subtle discrimination information for distinguishing the hard instances from the interference classes, can act as the final classifier for activity prediction.
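For concreteness, a hedged sketch of the discriminator-side update (with the generator frozen, per Alg. 1), combining Eqs. (3), (5), and (6), might be as follows; the module and optimizer names are ours, and the class discriminator is assumed to output pre-softmax logits.

import torch
import torch.nn.functional as F

def discriminator_step(d_cls, d_rof, opt_d, f_latent, f_hard, f_ori, y_true, y_ori):
    """Update D^cls with Eqs. (5) and (6) and D^rof with Eq. (3); G stays frozen."""
    f_latent = f_latent.detach()  # block gradients from flowing back into G

    # Eq. (5): classify confusing latent features with the true hard-instance label.
    l_cls_latent = F.cross_entropy(d_cls(f_latent), y_true)
    # Eq. (6): also classify the original features, stabilizing the training.
    l_cls_ori = F.cross_entropy(d_cls(f_ori), y_ori)
    # Eq. (3): D^rof scores original features as real and generated ones as fake.
    l_rof = -(torch.log(d_rof(f_hard) + 1e-8)
              + torch.log(1.0 - d_rof(f_latent) + 1e-8)).mean()

    opt_d.zero_grad()
    (l_cls_latent + l_cls_ori + l_rof).backward()
    opt_d.step()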
3.2.4 Training and Testing. Each training iteration of our HARD-Net comprises two phases, namely backbone training with HI-IC bank filling, and adversarial learning.

Backbone training & HI-IC bank filling. The backbone of our network mainly consists of the encoder E and the class discriminator, as shown in Fig. 2.
This backbone can be trained based on Eq. (6). To fill the HI-IC bank, a mini-batch of original partial sequences P with batch size B is first fed to the encoder E to extract features f^ori. Then, based on f^ori, the class discriminator produces predicted scores, which serve as the criterion for storing hard pairs into our HI-IC bank, i.e., if a sample is mis-predicted, this sample and its mis-predicted class are packed as a hard pair, which is stored into the HI-IC bank.
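A small sketch of this filling criterion (assuming the hypothetical HIICBank sketched in Sec. 3.2.2; function and variable names are ours):

import torch

def fill_bank(bank, encoder, d_cls, batch_p, batch_labels):
    """Store each mis-predicted partial sequence with its rank-1 (interference) class."""
    with torch.no_grad():
        scores = d_cls(encoder(batch_p))   # prediction confidence scores
        pred = scores.argmax(dim=-1)       # rank-1 class per sample
    for p, y, c in zip(batch_p, batch_labels, pred):
        if c.item() != y.item():           # wrongly predicted -> hard instance
            bank.add(p, c.item())          # pack (P^H, c^I) into the HI-IC bank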
Adversarial learning scheme. During adversarial learning, the parameters of the encoder E are first frozen. We then sample rB hard pairs from the HI-IC bank, where 0 < r ≤ 1. If there are not enough pairs in the bank (i.e., at the early stage of the training process), all pairs in the bank are selected and repeated to reach rB. Otherwise, we follow a first-in-first-out strategy to select the rB hard pairs. Based on the interference class from each sampled hard pair, we sample an instance of that class from the dataset as P^I. After that, P^H and P^I are fed into the encoder E for feature encoding, and the encoded features f^hard and f^inter are fed into the generator G to obtain f^latent. The generated latent features f^latent are fed into the RealOrFake discriminator D^rof and the class discriminator D^cls with the ambiguous label to update G and D^rof. Finally, f^latent is used to update D^cls with the ground-truth category. This training procedure is detailed in Alg. 1.
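The corresponding sampling step might be sketched as below (again with illustrative names; `dataset_by_class` maps a class label to its instances and is our own construct):

import random

def sample_adversarial_pairs(bank, dataset_by_class, batch_size: int, r: float):
    """Draw r*B hard pairs and pair each hard instance with an instance
    sampled from its interference class."""
    n = max(1, int(r * batch_size))
    pairs = bank.sample(n)  # FIFO selection, repeated while the bank is small
    p_hard = [p for p, _ in pairs]
    p_inter = [random.choice(dataset_by_class[c]) for _, c in pairs]
    return p_hard, p_inter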
Testing. As shown by the red arrows in Fig. 2, at the inference phase we input a partial skeleton sequence to the encoder to obtain features, which are then fed to D^cls for activity prediction. Since D^cls has a strong capability of mining subtle discrimination information for distinguishing hard samples from their similar interference classes, our network is powerful in early activity prediction.
4 Experiments

We test the proposed method for 3D early action prediction on the NTU RGB+D dataset [26], and for 3D early gesture prediction on the First Person Hand Action (FPHA) dataset [4]. We conduct extensive experiments on these two datasets as described below. More results can be found in the supplementary file.
NTU RGB+D dataset [26] is a large dataset that has been widely used for 3D action recognition and 3D early action prediction. This dataset was captured with Microsoft Kinect (v2) from three different views. It contains more than 56 thousand videos and over 4 million frames from 60 activity categories. Each human skeleton in the dataset comprises 25 body joints represented by 3D coordinates. This dataset is very challenging for 3D early action prediction, as it contains a large number of samples that are confusing at the beginning of the activity sequences. The dataset provides two standard evaluation protocols. The first is the Cross Subject (CS) setting, where 20 subjects are used for training and the remaining 20 subjects are held out for testing. The second is the Cross View (CV) protocol, where two viewpoints are used for training and the third is used for testing.
Fig. 3. Comparison of 3D early activity prediction performance on the NTU RGB+D and FPHA datasets. (Accuracy vs. observation ratio; panels: CS for NTU RGB+D, CV for NTU RGB+D, FPHA.) Our method outperforms the state-of-the-art by a large margin.
First Person Hand Action (FPHA) dataset [4] is a challenging 3D hand gesture dataset. The samples in this dataset are first-person hand activities interacting with 3D objects, recorded from six subjects. It contains over 100K frames of 45 different hand activity categories. Each hand skeleton comprises 21 hand joints represented by 3D coordinates. We test our method on FPHA following the standard evaluation protocol of [4], where 600 activity sequences are used for training and the remaining 575 activity sequences are used for testing.
Evaluated Models. To test the efficacy of our method, we evaluate two different models, namely "w/o HARD-Net" and "w/ HARD-Net". (1) "w/o HARD-Net": this is the backbone model of our network, which contains the feature encoder and the classifier. (2) "w/ HARD-Net": this is our proposed activity prediction model (HARD-Net), which has a strong capability of discriminating hard pair samples via Hardness-Aware Discrimination adversarial learning.
4.1 Implementation Details

To comprehensively evaluate the efficacy of our HARD-Net model, we construct our method on top of two state-of-the-art baseline encoders, namely the CNN encoder [15] and the GCN encoder [28], as shown in Tab. 4. The details of these two baseline encoders can be found in the corresponding papers [15, 28]. We design our generator and real-or-fake discriminator following Radford et al. [25], and implement the class discriminator as a multi-layer perceptron. The weights λ1 and λ2 in Eq. (4) are both set to 1.

All experiments are performed with the PyTorch framework. The Adam [11] algorithm is used to train our end-to-end network. The learning rate, betas, and weight decay are set to 2 × 10^-4, (0.9, 0.999), and 1 × 10^-5, respectively. We set the size of the HI-IC bank to 5000 for the large NTU RGB+D dataset and to 100 for the much smaller FPHA dataset. In each training iteration, the proportion between the original instances and the hard pair instances used for network learning is set to 4:1.
4.2 Experiments of 3D Early Action Prediction on NTU RGB+D

We compare the proposed HARD-Net with state-of-the-art approaches on NTU RGB+D. The comparison results with different observation ratios are shown in Tab. 1 (cross-subject protocol) and Tab. 2 (cross-view protocol).
Table 1. Performance comparison (%) on NTU RGB+D (cross-subject protocol). Our method outperforms the backbone model ("w/o HARD-Net") significantly. It also outperforms the state-of-the-art 3D early activity prediction methods by a large margin.

                          Observation ratios
Methods                  20%    40%    60%    80%    100%   AUC
Ke et al. [10]           8.34   26.97  56.78  75.13  80.43  45.63
Jain et al. [8]          7.07   18.98  44.55  63.84  71.09  37.38
Aliakbarian et al. [1]   27.41  59.26  72.43  78.10  79.09  59.98
Wang et al. [35]         35.85  58.45  73.86  80.06  82.01  60.97
Pang et al. [24]         33.30  56.94  74.50  80.51  81.54  61.07
Weng et al. [37]         35.56  54.63  67.08  72.91  75.53  57.51
Ke et al. [9]            32.12  63.82  77.02  82.45  83.19  64.22
w/o HARD-Net             37.82  67.87  79.22  83.39  84.52  66.91
w/ HARD-Net              42.39  72.24  82.99  86.75  87.54  70.56
Table 2. Performance comparison (%) on NTU RGB+D (cross-view protocol). We observe that only [37] has reported early action prediction results on the cross-view protocol.

                    Observation ratios
Methods            20%    40%    60%    80%    100%   AUC
LSTM [37]          33.86  54.70  68.85  74.86  77.84  57.93
Weng et al. [37]   37.22  57.18  69.92  75.41  77.99  59.71
w/o HARD-Net       47.71  78.95  88.49  91.51  91.79  75.50
w/ HARD-Net        53.15  82.87  91.34  93.71  94.03  78.84
Results on Cross-Subject Protocol. Comparison results on the cross-subject protocol are shown in Tab. 1 and Fig. 3 (left). As shown in Tab. 1, our proposed HARD-Net achieves the best performance over all observation ratios, which indicates its efficacy. Our method outperforms the state-of-the-art works and the backbone model significantly, especially when the observation ratio is very low. These significant improvements demonstrate that our proposed approach can mine minor yet significant discrepancies for discrimination.

Moreover, following [24, 1, 37], we also use the area under curve metric, denoted as AUC, which gives the average precision over all observation ratios, to investigate the overall efficacy of our proposed HARD-Net. As shown in Tab. 1, our approach achieves the highest AUC of 70.56%, compared to the existing methods and also the backbone model ("w/o HARD-Net"). Note that our HARD-Net outperforms the backbone model by 3.65%, which further demonstrates that the proposed adversarial learning scheme can perceive and comprehend the subtle differences within hard classes and strengthen the discrimination capacity of the class discriminator.
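Since the exact numerical scheme behind this AUC is not spelled out in the paper, the snippet below is one plausible reading, using the trapezoidal rule over the accuracy-vs-ratio curve; the function name is ours.

import numpy as np

def prediction_auc(ratios, accuracies):
    """Area under the accuracy-vs-observation-ratio curve, normalized by the span."""
    ratios = np.asarray(ratios, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    return np.trapz(accuracies, ratios) / (ratios[-1] - ratios[0])

# Example with the "w/ HARD-Net" cross-subject numbers from Tab. 1; the paper
# reports 70.56, presumably averaging over a denser set of observation ratios
# than the five tabulated ones.
print(prediction_auc([0.2, 0.4, 0.6, 0.8, 1.0], [42.39, 72.24, 82.99, 86.75, 87.54]))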
Results on Cross-View Protocol. We also evaluate our HARD-Net on the cross-view protocol as in [37]; the comparison results are shown in Tab. 2 and Fig. 3 (middle). As shown in Tab. 2, our proposed HARD-Net model outperforms the existing works by a large margin over all observation ratios, which demonstrates the efficacy of our approach.
It is worth noting that the AUC of HARD-Net exceeds the previous work [37] by 19.13% and exceeds the baseline encoder by 3.34%, which indicates that our class discriminator benefits from the adversarial learning scheme and obtains stronger discrimination ability for 3D early activity prediction.

Table 3. Quantitative results (%) comparison on FPHA with the state-of-the-art.

                    Observation ratios
Methods            20%    40%    60%    80%    100%   AUC
LSTM [37]          54.26  63.30  69.22  72.17  74.43  64.11
Weng et al. [37]   59.65  65.91  70.43  73.57  74.96  66.66
w/o HARD-Net       62.26  74.61  79.65  82.09  83.48  72.17
w/ HARD-Net        71.83  82.78  86.09  87.13  87.30  78.56
4.3 Experiments on 3D Early Gesture Prediction

To demonstrate the efficacy of our proposed HARD-Net on a 3D gesture dataset, extensive experiments are conducted on the publicly available 3D hand gesture dataset FPHA. As illustrated in Tab. 3 and Fig. 3 (right), our proposed HARD-Net consistently achieves better performance over all observation ratios compared to Weng et al. [37].

Compared to the baseline model, at the early stages, when the observation ratio is very low and sufficient discrimination information is lacking, our HARD-Net, being powerful in mining minor discrepancies, achieves its most significant performance gain of 9.57% at the 20% observation ratio.
4.4 Ablation Study

In this section, extensive ablation experiments are conducted on the NTU RGB+D dataset (cross-subject protocol), which is widely used by existing works [9, 24, 35, 37] in the early activity prediction community.
Fig. 4. Left: Evaluation of the impact of using different proportions of original samples and hard pair samples for network training (AUC vs. proportion). When the proportion between the original sample number and the hard pair sample number is 4:1, our model achieves the highest prediction accuracy. Right: Evaluation of the impact of different HI-IC bank sizes (AUC vs. bank size).
Table 4. Performance gain (%) brought by our HARD-Net with different backbones.

                                      Observation ratios
Backbone            Methods          20%    40%    60%    80%    100%
CNN backbone [9]    w/o HARD-Net     34.01  63.16  75.87  81.39  82.24
                    w/ HARD-Net      35.86  64.97  77.12  82.22  82.98
                    ∆                +1.85  +1.81  +1.25  +0.83  +0.74
GCN backbone [28]   w/o HARD-Net     37.82  67.87  79.22  83.39  84.52
                    w/ HARD-Net      42.39  72.24  82.99  86.75  87.54
                    ∆                +4.57  +4.37  +3.77  +3.36  +3.02

Impact of Bank Size. We evaluate the performance of the HI-IC bank with different bank sizes. The result is shown in Fig. 4 (right). The AUC, reflecting the average precision over all observation ratios, increases rapidly from smaller bank sizes to larger ones, and then remains stable once the bank size is large enough (e.g., size 5000). This can be explained by the number of hard pairs in a dataset: once this intrinsic threshold is reached, the performance gain is limited.
Impact of Proportions between Original Features and Latent Features for Training. Our experimental results in Fig. 4 (left) demonstrate that the optimal proportion of original features to latent features used for network training is 4:1. This can be explained as follows: if we use too many original features for network training, less useful discrimination information will be mined via our adversarial learning scheme; however, if too many latent features are employed for training, performance may drop on the original samples.
Impact of Backbone Encoder. We test our algorithm on both a CNN backbone and a GCN backbone to show the efficacy of the proposed method. As shown in Tab. 4, our HARD-Net clearly boosts early prediction performance on both backbone models, especially at very low observation ratios. This indicates that our network is powerful in mining subtle discrimination information for early activity prediction.
5 Conclusion

In this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net) for 3D early activity prediction. The proposed HARD-Net is able to explicitly investigate the relationship between an easily mis-predicted instance, named the hard instance, and the particular category that it is mis-predicted into, named the interference class. An adversarial learning scheme is proposed to mine subtle discrepancies within each hard instance-interference class pair by generating ambiguous and less discriminative latent features, conditioned on that particular pair, to represent the original hard instances. We further design a class discriminator to distinguish the derived latent features from the corresponding interference classes. As this adversarial learning proceeds, the generated latent features become more and more confusing with regard to the interference classes, which in turn strengthens the class discriminator's capability to mine subtle discrimination information. Thus, the overall discrimination capability for early activity prediction is strengthened. With such a network design, our proposed HARD-Net achieves state-of-the-art performance on two challenging datasets.
References

1. Aliakbarian, M., Saleh, F., Salzmann, M., Fernando, B., Petersson, L., Andersson, L.: Encouraging LSTMs to anticipate actions very early (2017)
2. Evangelidis, G., Singh, G., Horaud, R.: Skeletal quads: Human action recognition using joint quadruples. In: Proceedings of the International Conference on Pattern Recognition (2014). https://doi.org/10.1109/ICPR.2014.772
3. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1627-1645 (2010). https://doi.org/10.1109/TPAMI.2009.167
4. Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.K.: First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In: Proceedings of Computer Vision and Pattern Recognition (CVPR) (2018)
5. Girshick, R.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 1440-1448 (2015)
6. Hu, J.F., Zheng, W.S., Ma, L., Wang, G., Lai, J.: Real-time RGB-D activity prediction by soft regression. In: European Conference on Computer Vision. pp. 280-296 (2016)
7. Hu, J.F., Zheng, W.S., Ma, L., Wang, G., Lai, J., Zhang, J.: Early action prediction by soft regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(11), 2568-2583 (2018)
8. Jain, A., Singh, A., Koppula, H., Soh, S., Saxena, A.: Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. arXiv (2015)
9. Ke, Q., Bennamoun, M., Rahmani, H., An, S., Sohel, F., Boussaid, F.: Learning latent global network for skeleton-based action prediction. IEEE Transactions on Image Processing 29, 959-970 (2020)
10. Ke, Q., Bennamoun, M., An, S., Sohel, F.A., Boussaïd, F.: A new representation of skeleton sequences for 3D action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4570-4579 (2017)
11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
12. Kong, Y., Gao, S., Sun, B., Fu, Y.: Action prediction from videos via memorizing hard-to-predict samples. In: AAAI (2018)
13. Kong, Y., Kit, D., Fu, Y.: A discriminative model with multiple temporal scales for action prediction. In: European Conference on Computer Vision. vol. 8693 (2014)
14. Kong, Y., Tao, Z., Fu, Y.: Deep sequential context networks for action prediction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3662-3670 (2017)
15. Li, C., Zhong, Q., Xie, D., Pu, S.: Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In: IJCAI. pp. 786-792 (2018)
16. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
17. Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2), 318-327 (2020)
18. Liu, J., Wang, G., Duan, L., Abdiyeva, K., Kot, A.C.: Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing 27(4), 1586-1599 (2018)
19. Liu, J., Ding, H., Shahroudy, A., Duan, L.Y., Jiang, X., Wang, G., Kot, A.C.: Feature boosting network for 3D pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(2), 494-501 (2020)
20. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision - ECCV 2016. pp. 816-833. Springer International Publishing, Cham (2016)
21. Loshchilov, I., Hutter, F.: Online batch selection for faster training of neural networks (2015)
22. Lou, Y., Bai, Y., Liu, J., Wang, S., Duan, L.: VERI-Wild: A large dataset and a new method for vehicle re-identification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3235-3243 (2019)
23. Moreno-Noguer, F.: 3D human pose estimation from a single image via distance matrix regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2823-2832 (2017)
24. Pang, G., Wang, X., Hu, J.F., Zhang, Q., Zheng, W.S.: DBDNet: Learning bi-directional dynamics for early action prediction. In: IJCAI. pp. 897-903 (2019)
25. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: Bengio, Y., LeCun, Y. (eds.) 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (2016), http://arxiv.org/abs/1511.06434
26. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1010-1019 (2016)
27. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
28. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: CVPR (2019)
29. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1297-1304 (2011)
30. Shrivastava, A., Mulam, H., Girshick, R.: Training region-based object detectors with online hard example mining (2016)
31. Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
32. Tang, Y., Tian, Y., Lu, J., Li, P., Zhou, J.: Deep progressive reinforcement learning for skeleton-based action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
33. Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3D skeletons as points in a Lie group. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. pp. 588-595 (2014). https://doi.org/10.1109/CVPR.2014.82
34. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(5), 914-927 (2013)
35. Wang, X., Hu, J., Lai, J., Zhang, J., Zheng, W.: Progressive teacher-student learning for early action prediction. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3551-3560 (2019)
36. Wang, X., Shrivastava, A., Gupta, A.: A-Fast-RCNN: Hard positive generation via adversary for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
37. Weng, J., Jiang, X., Zheng, W., Yuan, J.: Early action recognition with category exclusion using policy-based reinforcement learning. IEEE Transactions on Circuits and Systems for Video Technology pp. 1-1 (2020). https://doi.org/10.1109/TCSVT.2020.2976789
38. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI (2018)
39. Zhu, W., Lan, C., Xing, J., Li, Y., Shen, L., Zeng, W., Xie, X.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI (2016)