A Critical Review of Action Recognition Benchmarks
Tal Hassner
The Open University of Israel
hassner@openu.ac.il
…Table 1. We note that our survey focuses exclusively on action recognition from visual data, excluding benchmarks which use additional channels of information (e.g., depth, as in [24]). We further limit the discussion to benchmarks designed to measure performance on general action recognition tasks, rather than specific applications (e.g., violence detection in [8] or gait recognition [31]).

Whenever possible, we report existing state-of-the-art results. We note that different benchmarks define different testing protocols and different performance measures. Due to space considerations, we report only the best scores for each set without elaborating on the different measures. Care must therefore be taken when interpreting these values, as different benchmarks use different evaluation protocols and have different levels of chance recognition.

2.1. The early years: Action recognition “in the lab”

Two early benchmarks are the KTH [32] and Weizmann [1] sets, both used extensively over the years. These sets provide low-resolution videos of a few “atomic” action categories, such as walking, jogging, and running. These videos were produced “in the lab” and present actors performing scripted behavior. The videos they provide were acquired under controlled conditions, with static cameras and static, uncluttered backgrounds. In addition, actors appear without occlusions, thus allowing action recognition to be performed by considering silhouettes alone [1].

Performance on these databases has saturated over the years, with perfect accuracy reported on the Weizmann set [40] and near-perfect accuracy on KTH (e.g., ∼95% in [7]), which has been reported saturated as far back as [23]. Curiously, despite being saturated and despite their simplistic settings, these benchmarks are still frequently used today, with recent examples including [13] on the Weizmann set and [34] on KTH.

Over the years, other benchmarks have been proposed, providing videos obtained under laboratory conditions, with different emphases placed on the data set design and the settings these benchmarks attempted to reflect. These sets include the IXMAS benchmark, proposed in [41] and designed to study action recognition under varying viewpoints. Although IXMAS provides synchronized, multi-view footage of each action, and this has been used to include depth information for recognition, tests on this benchmark often use the videos alone. Additional sets are the UMD Common Activities Dataset [37], which records activities from a synchronized pair of viewpoints, the University of Illinois at Urbana-Champaign UIUC1 benchmark [36], and finally, the University of Rochester Activities of Daily Living (ADL) benchmark [22], released in 2009 and, to our knowledge, the latest such benchmark proposed.

As we report in Table 1, perfect or near-perfect results have been reported on all these benchmarks. We believe this is a strong testament to the capabilities of modern action recognition systems: controlled settings such as those offered by these early sets no longer seem to pose significant challenges to modern computer vision systems.

2.2. Interim years: TV, sports, and motion pictures

In an effort to increase the diversity of appearances and viewing conditions, benchmark designers have turned to TV, sports broadcasts, and motion pictures as sources for challenging videos of human actions. These benchmarks no longer represent controlled conditions; viewpoints, illuminations, and occlusions are all arbitrary, thereby significantly raising the bar for action recognition systems.

One early example is the UIUC2 benchmark [36], which provided unconstrained sports videos of badminton matches downloaded from YouTube. To our knowledge, this is the first benchmark to provide such unconstrained data. Another early, popular example is the UCF-Sports benchmark [28], with its 200 videos of nine different sports events collected from TV broadcasts. Sports videos were also considered in the Olympic-Games benchmark of [23].

Feature films were used by some as a source for challenging action recognition videos. These included the UCF Feature Films Dataset [28], sometimes referred to as the “Kissing-Slapping” benchmark, which provides 90 videos of kissing and 110 of slapping scenes obtained from a range of classic movies, and the challenging Hollywood-2 (HOHA2) [21] from 2009, an extension of the Hollywood-1 (HOHA) benchmark [16] released a year earlier. Finally, the most recent benchmark to use TV and motion picture data included in this survey is the High-Five collection of [26], which, like the UCF Feature Films Dataset, focuses on interactions rather than single-person activities.

Of these benchmarks, the two Hollywood sets seem to remain challenging even today, despite the relatively small number of action categories they include. One possible reason for this is that the action samples in these sets are not well localized in time; action recognition systems tested on these sets must therefore perform temporal action detection, localizing the action in time rather than merely recognizing it, which makes these sets more challenging than others.

2.3. Recent years: Action recognition “in the wild”

TV and motion picture videos introduced unconstrained viewing conditions, yet they were in some sense still limited in the range of challenges they represented. Specifically, although unconstrained, the videos in these sets were all typically produced under favorable conditions, were of high quality, and were shot from carefully selected viewpoints. Not surprisingly, high performance has already been reported for most of these sets. All this cannot be said of the many videos accumulating in online repositories such as YouTube. In recent years, several benchmarks assembled from such “in-the-wild” sources have been offered.
Table 1. Action Recognition Databases. #acn. is the number of action classes; #clips is the number of clips in the collection. SotA reflects the best performance obtained on each benchmark in recent years. It is important to note that these measures are not comparable, as the different benchmarks use different performance measures. Moreover, these scores do not reflect the improvement over chance, which varies from one benchmark to the other (e.g., 2% for the UCF50 benchmark [27] and 50% for ASLAN [12]).
One of the first to offer such videos was the UCF-YouTube set [20], also referred to as the YouTube Actions benchmark. Taking advantage of the availability of huge numbers of videos, it provides 1,168 videos in eleven categories, obtained from both YouTube and personal video collections. More recently, and returning to Olympic sports events, the Olympic Sports Dataset [25] was released, containing 16 complex actions represented by 50 videos each, all downloaded from YouTube. Here, an attempt was made to go beyond atomic actions by considering events involving several different stages or different actions performed at once; an example is the long-jump, which combines standing, running, jumping, landing, and standing up again.

2.4. Emerging trends

Two conclusions are clearly evident from the survey above and further illustrated in Figure 1. The first is that many of these benchmarks have been saturated. This testifies to the high performance and capabilities of modern action recognition techniques, and suggests that new benchmarks must be designed to reflect more challenging problem settings. The second conclusion is evident when comparing these benchmarks to those used in photo classification applications: action recognition benchmarks currently provide a few dozen categories, compared to the thousands offered by photo classification benchmarks.
Figure 1. State-of-the-art results (SotA) vs. number of action classes (# classes) for the controlled sets, the TV / motion picture (TV / MP) sets, and the “in the wild” sets.

Table 2. Results on the ASLAN benchmark. See text for details.

      Method                      Acc. ± SE      AUC
 1    HOG, √Σ(x1 .∗ x2) [12]      56.58 ± .74    61.6
 2    HOF, √Σ(x1 .∗ x2) [12]      56.82 ± .57    58.5
 3    HNF, √Σ(x1 .∗ x2) [12]      58.87 ± .89    62.1
 4    STIP, 12 sim. [12]          60.88 ± .77    65.3
 5    OSSML [11]                  64.25 ± .70    69.1
 6    MIP [10]                    64.62 ± .80    70.4
 7    MIP+STIP [10]               65.45 ± .80    71.9
 8    Traj. [17]                  62.02 ± 1.1    66.9
 9    MBH [17]                    64.25 ± .90    69.9
10    Multi-rep. [17]             66.13 ± 1.0    73.2
11    Human [12]                  94.19          97.8
…(rows 1–3). Twelve different similarities were computed for all three STIP representations and combined into 26D vectors representing each video pair. These were again classified as same/not-same using a linear SVM (Table 2, row 4). In [11], a metric learning approach was designed for the One Shot Similarity of [42]. Their OSSML approach improved performance, but at a significant computational cost (row 5).
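To make the same/not-same protocol concrete, the following is a minimal sketch (not the code of [12] or [11]) of the general recipe described above: compute a handful of similarity scores between the descriptors of a video pair, stack them into a short feature vector, and train a linear SVM to separate same from not-same pairs. The particular similarity functions, the descriptor normalization, and the use of scikit-learn are all assumptions made here for illustration.

```python
# Sketch only: stack per-pair similarity scores and classify same/not-same with a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def pair_similarities(x1, x2):
    """A few simple similarities between two L2-normalized video descriptors."""
    x1 = x1 / (np.linalg.norm(x1) + 1e-8)
    x2 = x2 / (np.linalg.norm(x2) + 1e-8)
    return np.array([
        x1 @ x2,                          # cosine similarity
        np.sqrt(np.abs(x1 * x2).sum()),   # square root of the summed element-wise product
        -np.linalg.norm(x1 - x2),         # negative Euclidean distance
        -np.abs(x1 - x2).sum(),           # negative L1 distance
    ])

def pair_features(pairs):
    """pairs: iterable of (descriptor, descriptor) tuples -> (N, 4) similarity matrix."""
    return np.vstack([pair_similarities(a, b) for a, b in pairs])

def train_and_score(train_pairs, train_labels, test_pairs, test_labels):
    """Labels: 1 for same-action pairs, 0 for not-same pairs."""
    clf = LinearSVC(C=1.0)
    clf.fit(pair_features(train_pairs), train_labels)
    return clf.score(pair_features(test_pairs), test_labels)
```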
Motion Interchange Patterns (MIP), presented in [10], are low-level features encoding, at each space-time pixel, a measure of the likelihood of a change in motion occurring at that pixel. A video is represented by building a bag-of-features (BoF) over these codes. Mechanisms are described for both motion stabilization and screening of irrelevant codes. Their system is more efficient than OSSML, yet achieves better performance (row 6). Combining MIP with the three STIP descriptors (HOG, HOF, and HNF) improves results only slightly (row 7), suggesting that the MIP descriptors already capture much of the information available from the STIP descriptors.
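The BoF pooling step mentioned above can be sketched as follows. This is a generic bag-of-features illustration rather than the MIP computation of [10]; the local codes, the codebook size, and the use of scikit-learn's KMeans are assumptions.

```python
# Sketch: generic bag-of-features (BoF) pooling of local space-time codes
# into a single, fixed-length video descriptor.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(sampled_local_codes, k=256, seed=0):
    """Cluster local codes sampled from training videos into a k-word codebook."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(sampled_local_codes)

def bof_descriptor(local_codes, codebook):
    """Assign each local code to its nearest codeword and return a normalized histogram."""
    words = codebook.predict(local_codes)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / (hist.sum() + 1e-8)
```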
The Dense Motion Trajectories representation (Traj.), originally described in [38], encodes point trajectories over time, on a dense grid in space and scale, using optical flow computed between subsequent frames. Each trajectory is encoded as a vector of (normalized) 2D displacements of points. ASLAN results obtained with this representation were reported in [17], using the code made available by the authors of Traj. [38], and appear here in row 8 of Table 2.
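As a rough illustration of the trajectory encoding described above (a sketch under simplifying assumptions, not the released code of [38]): points seeded on a regular grid are advected through Farneback optical flow, and each track is described by its sequence of 2D displacements, normalized by the total track length. The grid step, track length, and OpenCV flow parameters are arbitrary choices here.

```python
# Sketch: track grid points through dense optical flow and encode each track
# as a normalized vector of 2D displacements.
import cv2
import numpy as np

def track_grid_points(frames, step=10, track_len=15):
    """frames: list of grayscale frames (H, W). Returns tracks of shape (N, T+1, 2)."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)  # (x, y) points
    tracks = [pts.copy()]
    for prev, nxt in zip(frames, frames[1:track_len + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        xi = np.clip(pts[:, 0].round().astype(int), 0, w - 1)
        yi = np.clip(pts[:, 1].round().astype(int), 0, h - 1)
        pts = pts + flow[yi, xi]           # advect each point by its local flow vector
        tracks.append(pts.copy())
    return np.stack(tracks, axis=1)

def trajectory_descriptors(tracks):
    """Encode each track as its flattened displacements, normalized by total track length."""
    disp = np.diff(tracks, axis=1)                                   # (N, T, 2)
    total = np.linalg.norm(disp, axis=2).sum(axis=1)[:, None, None]  # (N, 1, 1)
    return (disp / (total + 1e-8)).reshape(len(tracks), -1)
```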
In [17], the Motion Boundary Histogram (MBH) representation of [4] was also evaluated. MBH represents motion by considering the horizontal and vertical derivatives of motion at each pixel, separately, in order to compare the relative motion of pixels to their neighbors. In [38], this representation was used to encode actions by quantizing these derivatives into 8-bin, weighted histograms. Results from [17] using MBH are reported in row 9.
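The MBH idea described above lends itself to a short sketch: take the spatial derivatives of each optical-flow component and pool their orientations into 8-bin, magnitude-weighted histograms. This illustrates the principle only and is not the implementation of [4] or [38]; in practice the histograms are computed over local cells rather than whole frames.

```python
# Sketch: motion boundary histograms from a single optical-flow field.
import cv2
import numpy as np

def mbh_histograms(flow, bins=8):
    """flow: (H, W, 2) array. Returns MBHx and MBHy 8-bin histograms, concatenated."""
    hists = []
    for c in range(2):                    # horizontal (u) and vertical (v) flow components
        comp = flow[..., c].astype(np.float32)
        gx = cv2.Sobel(comp, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(comp, cv2.CV_32F, 0, 1, ksize=3)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
        hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
        hists.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(hists)
```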
Finally, [17] propose a number of extensions to the MIP descriptor [10]. For the HistMIP descriptor, MIP codes are computed separately over different channels of each frame. Each such channel captures spatial gradient orientations, quantized into local spatial histograms and weighted by their magnitudes. HistMIP is thus designed to reflect changes in the motion of local textures, rather than of the intensities used by the original MIP. DoGMIP, on the other hand, first processes each frame to produce multiple Difference of Gaussians (DoG) layers; these are then processed to produce MIP codes for each layer. The results reported in [17] show that none of these representations, on its own, provides a significant performance boost over existing methods. Their combination, along with the other descriptors described above, does, however, improve performance, as evident in row 10 of Table 2, which is the best performance reported on the ASLAN benchmark to date.
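For completeness, the Difference-of-Gaussians preprocessing attributed above to DoGMIP can be sketched as below; the number of layers and the sigma values are assumptions, and the subsequent MIP coding of each layer is not shown.

```python
# Sketch: build Difference-of-Gaussians layers for one grayscale frame.
import cv2
import numpy as np

def dog_layers(frame_gray, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Return len(sigmas) - 1 DoG layers, each the difference of two blurred frames."""
    blurred = [cv2.GaussianBlur(frame_gray.astype(np.float32), (0, 0), s) for s in sigmas]
    return [coarse - fine for fine, coarse in zip(blurred, blurred[1:])]
```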
Table 2 also provides human performance on the ASLAN benchmark, originally published in [12]. These results demonstrate the remaining gap between man and machine on the task represented by the ASLAN benchmark.

4. Conclusions

The survey presented here demonstrates that contemporary action recognition systems can now perform very well in controlled settings, with few atomic actions and clear, known viewing positions, and even on uncontrolled videos of high quality, when actions are well localized in time. Unconstrained action recognition from “real-world” videos, however, remains a challenging problem. State-of-the-art methods are currently very far from the perfect scores obtained by systems developed for image classification. This fact is underscored by the properties of the benchmarks used to develop, train, and test systems for action recognition in videos, compared to those used for image classification: where the former include at most hundreds of action categories, peaking at thousands of video samples, the latter provide thousands of categories and millions of photo samples. This is particularly troubling when considering that actions can vary greatly in how they appear in videos, possibly far more than static items vary in appearance in photos, implying that data sets for action recognition must be designed to offer more examples, not fewer.

In an effort to bridge this gap, the ASLAN benchmark has recently been proposed in [12]. With 432 classes, it offers a far greater variability of categories than other existing benchmarks. In this paper we review the performance reported on this set and show the wide gap that remains between the best performing methods and human capabilities. We do so in an effort to motivate subsequent research into action recognition on larger and more challenging video collections, reflecting realistic, “in the wild” conditions.

References

[1] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2005.
[2] M. Bregonzio, J. Li, S. Gong, and T. Xiang. Discriminative topics modelling for action feature selection and recognition. In Proc. British Mach. Vision Conf., volume 6, 2010.
[3] J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero. A survey of video datasets for human action and activity recognition. Comput. Vision Image Understanding, 2013.
[4] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conf. Comput. Vision, pages 428–441. Springer, 2006.
[5] C. Fellbaum. WordNet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010.
[6] A. Gaidon, Z. Harchaoui, C. Schmid, et al. Recognizing activities with cluster-trees of tracklets. In Proc. British Mach. Vision Conf., 2012.
[7] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proc. IEEE Int. Conf. Comput. Vision, 2009.
[8] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent flows: Real-time detection of violent crowd behavior. In Int. Workshop on Socially Intelligent Surveillance and Monitoring at the IEEE Conf. Comput. Vision Pattern Recognition, 2012.
[9] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. UMASS, TR 07-49, 2007.
[10] O. Kliper-Gross, Y. Gurovich, T. Hassner, and L. Wolf. Motion interchange patterns for action recognition in unconstrained videos. In European Conf. Comput. Vision, 2012.
[11] O. Kliper-Gross, T. Hassner, and L. Wolf. One shot similarity metric learning for action recognition. Proc. of the Workshop on Similarity-Based Pattern Recognition (SISM), 2011.
[12] O. Kliper-Gross, T. Hassner, and L. Wolf. The action similarity labeling challenge. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):615–621, 2012.
[13] T. Kobayashi and N. Otsu. Motion recognition using local auto-correlation of space–time gradients. Pattern Recognition Letters, 33(9):1188–1195, 2012.
[14] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proc. IEEE Int. Conf. Comput. Vision, 2011.
[15] I. Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2):107–123, 2005.
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2008.
[17] N. Levy, Y. Hanani, and L. Wolf. Evaluating new variants of motion interchange patterns. In CVPR Workshop on Action Similarity in Unconstrained Videos, 2013.
[18] B. Li, M. Ayazoglu, T. Mao, O. I. Camps, and M. Sznaier. Activity recognition using dynamic subspace angles. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 3193–3200. IEEE, 2011.
[19] H. Liu, R. Feris, and M.-T. Sun. Benchmarking datasets for human activity recognition. In Visual Analysis of Humans, pages 411–427. Springer, 2011.
[20] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos “in the wild”. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 1996–2003, 2009.
[21] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 2929–2936, 2009.
[22] R. Messing, C. Pal, and H. Kautz. Activity recognition using the velocity histories of tracked keypoints. In Proc. IEEE Int. Conf. Comput. Vision, pages 104–111, 2009.
[23] K. Mikolajczyk and H. Uemura. Action recognition with motion-appearance vocabulary forest. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2008.
[24] B. Ni, G. Wang, and P. Moulin. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Consumer Depth Cameras for Comput. Vision. 2013.
[25] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In European Conf. Comput. Vision, 2010.
[26] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid. High five: Recognising human interactions in TV shows. In Proc. British Mach. Vision Conf., 2010.
[27] K. K. Reddy and M. Shah. Recognizing 50 human action categories of web videos. Mach. Vision and Applications, 2012.
[28] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 1–8, 2008.
[29] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 1234–1241. IEEE, 2012.
[30] M. Sapienza, F. Cuzzolin, and P. Torr. Learning discriminative space-time actions from weakly labelled videos. In Proc. British Mach. Vision Conf., 2012.
[31] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer. The HumanID gait challenge problem: Data sets, performance, and analysis. IEEE Trans. Pattern Anal. Mach. Intell., 27(2):162–177, 2005.
[32] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proc. 17th Int. Conf. Pattern Recognition, volume 3, pages 32–36, 2004.
[33] A. H. Shabani, D. A. Clausi, and J. S. Zelek. Evaluation of local spatio-temporal salient feature detectors for human action recognition. In Comput. and Robot Vision, 2012.
[34] F. Shi, E. Petriu, and R. Laganiére. Sampling strategies for real-time action recognition. In Proc. IEEE Conf. Comput. Vision Pattern Recognition. IEEE, 2013.
[35] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.
[36] D. Tran and A. Sorokin. Human activity recognition with metric learning. In European Conf. Comput. Vision, 2008.
[37] A. Veeraraghavan, R. Chellappa, and A. K. Roy-Chowdhury. The function space of an activity. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 959–968, 2006.
[38] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vision, pages 1–20, 2012. Available: lear.inrialpes.fr/people/wang/dense trajectories.
[39] J. Wang, Z. Chen, and Y. Wu. Action recognition with multiscale spatio-temporal contexts. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, pages 3185–3192. IEEE, 2011.
[40] Y. Wang and G. Mori. Human action recognition by semilatent topic models. IEEE Trans. Pattern Anal. Mach. Intell., 31(10):1762–1774, 2009.
[41] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vision Image Understanding, 104(2-3):249–257, 2006.
[42] L. Wolf, T. Hassner, and Y. Taigman. The one-shot similarity kernel. In ICCV, 2009.