
A Bag of Words approach for recognition of Greek folk dances

Eftychia Fotiadou (eftifot@aiia.csd.auth.gr)
Ioannis Kapsouras (jkapsouras@aiia.csd.auth.gr)
Nikos Nikolaidis (nikolaid@aiia.csd.auth.gr)
Anastasios Tefas (tefas@aiia.csd.auth.gr)
Department of Informatics
Aristotle University of Thessaloniki

ABSTRACT
In this paper we explore the problem of distinguishing Greek folk dances from other kinds of activities, as well as from other dance genres, using video recordings. For this purpose, we adopt dense trajectory descriptors along with a Bag of Words (BoW) model to represent the motion depicted in the videos. Subsequently, a Support Vector Machine (SVM) classifier is used for classification. In order to evaluate the performance and the adequacy of this framework for the dance recognition task, we performed experiments for three different classification setups: 1) Greek folk dances vs other activities (including different dance genres), 2) six different dance genres and 3) Greek vs Balkan folk dances. The experiments exhibited satisfactory results, with the best correct recognition rate in each setup being at least 89.74%.

CCS Concepts
• Computing methodologies → Activity recognition and understanding;

Keywords
dance recognition, classification, Bag of Words model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SETN '16, May 18-20, 2016, Thessaloniki, Greece
© 2016 ACM. ISBN 978-1-4503-3734-2/16/05...$15.00
DOI: http://dx.doi.org/10.1145/2903220.2903221

1. INTRODUCTION
Activity recognition refers to the process through which human motions depicted in videos or other types of media (e.g. motion capture data) are automatically classified into a number of predefined classes. Surveillance systems, semantic annotation of content in multimedia databases, and human-computer interaction are some examples of the diverse applications that benefit from the development of efficient activity recognition algorithms. As a consequence, activity recognition has become a very active research field of computer vision and various approaches have been proposed in recent years [11],[6].

In this paper, we deal with a more complex activity class, namely dance. Dance can be thought of as a composition of elementary motions (i.e. dance steps), the combination of which, in a certain temporal order, gives rise to the choreography. Therefore, the activity of dancing, unlike other types of everyday activities commonly treated in activity recognition research (e.g. running, jumping or waving), is characterized by high complexity and variation, rendering the recognition problem more challenging.

While there exists a large body of research related to recognition of simple, everyday activities, there is a relatively limited number of publications regarding dance recognition. In [8], a method for distinguishing among different Indian classical dances is proposed. A pose descriptor, based on the histogram of oriented optical flow, is introduced to represent each frame of a video sequence in a hierarchical manner. An alternative SVD-based scheme, called Segmental SVD (SegSVD), suitable for recognition of dance motions acquired through motion capture, is proposed in [3]. The advantage of this method is that it takes into consideration the temporal order of the samples in the motion capture sequence. In [7], a real-time framework for classification of dance gestures from skeleton animation data is proposed. It includes an angular representation of the skeleton and a classifier based on cascaded correlation. Furthermore, a distance metric based on dynamic time warping is adopted, which estimates the difference between an input gesture and an oracle.

A method for classifying ballet moves from motion data acquired with the Microsoft Kinect sensor is presented in [2]. Another method for classification of ballet moves is introduced in [1], where a deformable model is proposed to represent the trajectories of the joints of a moving body, provided by a motion capture system. Finally, a method for recognition of different dance gestures of Bali traditional dance can be found in [4]. The authors adopt a skeletal representation similar to [7], in order to represent data captured with a statically mounted depth sensor. Subsequently, classification is performed using a linguistically motivated method.

In this paper, we explore various classification scenarios, which involve dances from the Greek folk dance genre, recorded on video. Folk dances constitute an important part of a country's cultural heritage, conveying information associated with old customs, aspects of everyday life, as well as historic events. For this reason, the systematic organization, processing and analysis of multimedia material related to folk dances is of great significance for educational and cultural purposes. The classification scenarios considered in this paper include distinguishing Greek folk dance videos from videos depicting other kinds of activities (including other dance genres), discrimination among different dance genres, as well as discrimination between Greek and Balkan folk dances. The latter have been selected as a special and challenging case, since they usually exhibit significant resemblance to Greek folk dances, which makes distinction more difficult. In contrast to most of the aforementioned approaches, where recognition is mostly performed at the level of individual dance, dance gesture or dance move, we consider a more generic problem. Furthermore, as far as the definition of dance is concerned, we choose to work with dance video sequences consisting of an arbitrary number of consecutive dance moves, i.e. real world dance videos collected from Youtube, rather than isolated moves, which is a more realistic scenario.

In order to distinguish Greek folk dances from other activities or dance genres, we combined the method described in [9], which extracts feature descriptors from video data based on dense trajectories, with a Bag of Words approach, commonly used in pattern recognition tasks. Classification of the histograms is then performed using an SVM classifier with a $\chi^2$ kernel.

2. METHOD DESCRIPTION
The utilized method for folk dance recognition comprises a feature extraction stage, where dense trajectories are calculated from the videos, and a classification stage, which involves the construction of a Bag of Words model from the data. These steps are shown in the flowchart of Fig. 1. In the following sub-sections, the details of the utilized method are provided.

Figure 1: Flowchart for the recognition framework.

2.1 Dense Trajectories
Dense trajectories ([9],[10]) provide an efficient representation of motion depicted in video sequences, which is suitable for activity recognition tasks. Dense trajectories are calculated by first densely sampling points of the video at different spatial scales. The sampled points located in areas of the video containing motion are tracked on each spatial scale separately, using the information from the dense optical flow field. A trajectory is formed by points in successive frames: $(P_t, P_{t+1}, P_{t+2}, \ldots, P_{t+L})$, where $L$ is the length of the trajectory. Consequently, each trajectory is represented by a descriptor defined by the following formula:

$$T = \frac{(\Delta P_t, \Delta P_{t+1}, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert}, \qquad (1)$$

where $\Delta P_t = (P_{t+1} - P_t) = (x_{t+1} - x_t, y_{t+1} - y_t)$ is the displacement vector between two successive frames.

Apart from the trajectories themselves, additional descriptors are calculated in the spatio-temporal volumes surrounding each trajectory. In more detail, an $N \times N \times L$ sized volume around a trajectory is subdivided into a grid of cells and for each of them a bundle of histograms is computed: Histogram of Oriented Gradients (HOG), which captures static local appearance information, Histogram of Optical Flow (HOF), which carries information related to local motion, and Motion Boundary Histograms in horizontal (x) and vertical (y) directions (MBHx and MBHy respectively), which capture the relative motion between pixels, eliminating in this way the unwanted camera motion.
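As an illustration of Eq. (1), the following minimal sketch computes the normalized displacement descriptor for a single trajectory. It assumes the tracked points are already available as a list of (x, y) tuples; the function name `trajectory_descriptor` and the handling of a motionless trajectory are assumptions made here for illustration and are not taken from [9].

```python
import math


def trajectory_descriptor(points):
    """Normalized displacement descriptor in the spirit of Eq. (1).

    points: list of (x, y) tuples P_t, ..., P_{t+L} (L + 1 tracked points).
    Returns a flat list of 2 * L values (dx, dy for each displacement).
    """
    if len(points) < 2:
        raise ValueError("a trajectory needs at least two points")
    # Displacement vectors dP_j = P_{j+1} - P_j between successive frames
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    # Denominator of Eq. (1): sum of the displacement magnitudes
    total = sum(math.hypot(dx, dy) for dx, dy in deltas)
    if total == 0.0:
        # Assumption: a motionless trajectory maps to an all-zero descriptor
        return [0.0] * (2 * len(deltas))
    return [component / total for delta in deltas for component in delta]


# Toy trajectory with L = 2 displacements of magnitude 5 each
example = trajectory_descriptor([(0, 0), (3, 4), (3, 9)])
```

Dividing by the summed magnitudes makes the descriptor insensitive to the overall amount of displacement, so that only the shape of the trajectory is encoded.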
Figure 2: Sample frames from videos of the activity classes included in our dataset: (a) Greek folk dance, (b) Balkan folk dance, (c) ballet, (d) ballroom, (e) hip hop, (f) tap, (g) kiss from Hollywood2 dataset.

2.2 Bag of Words Model and classification
The process described in the previous subsection results in the extraction of five different types of descriptors (feature vectors) for each trajectory. A Bag of Words (BoW) model is built for each type of features, using the procedure described below.

Let us represent a video sequence with $X = \{v_1, v_2, \ldots, v_F\}$, where $v_i$ denotes each of the $F$ feature vectors. Initially, a codebook is created by applying the k-means algorithm in order to cluster the feature vectors from a set of training videos into $C$ clusters. Each codeword represents the centroid of a cluster. Subsequently, for each training video, a histogram can be calculated as follows: a video $X$ is transformed to a sequence $X_D = \{d_1, d_2, \ldots, d_F\}$, where each feature vector $d_i$ is derived by replacing the feature vector $v_i$ with the codeword (cluster center) which is closest to it. By calculating the frequency of occurrence for each of the $C$ codewords in the sequence $X_D$, a histogram $s = \{s_1, s_2, \ldots, s_C\}$ of the codeword appearances can be constructed.

The same procedure is followed in order to calculate the histogram of a test sequence. For each feature vector of a test sequence, the distances to all codewords are estimated, and the feature vector is replaced by the codeword which is closest to it. The test histogram is constructed by calculating the frequencies of occurrence of each codeword in the sequence.

After the histograms for all the training videos have been calculated, they are used along with their labels to train an SVM classifier. In order to classify test sequences, the corresponding histograms are extracted and subsequently fed to the SVM classifier. The SVM is used along with a $\chi^2$ kernel, calculated by the following formula:

$$K(s_i, s_j) = \exp\left(-\frac{1}{A} D(s_i, s_j)\right) \qquad (2)$$

where $D(s_i, s_j)$ denotes the $\chi^2$ distance between histograms $s_i = \{s_{i1}, s_{i2}, \ldots, s_{iC}\}$ and $s_j = \{s_{j1}, s_{j2}, \ldots, s_{jC}\}$:

$$D(s_i, s_j) = \frac{1}{2} \sum_{k=1}^{C} \frac{(s_{ik} - s_{jk})^2}{s_{ik} + s_{jk}}, \qquad (3)$$

while $A$ is a scaling factor, calculated as the mean of $\chi^2$ distances between all training samples.

In order to calculate the kernel function, we fuse the five different histograms derived from the descriptor types discussed in Section 2.1 in two ways. According to the first, we perform kernel addition, i.e. we compute separate kernels, one for each descriptor type, and subsequently add them. Alternatively, we apply histogram concatenation, i.e. we concatenate the five different histograms calculated for each video in a vector and calculate a single kernel.

3. EXPERIMENTAL RESULTS
The objective of our experiments was to test whether the method described in the previous section is suitable for distinguishing Greek folk dances from other activities or dance genres. For this reason, we built a dataset consisting of diverse types of activities. In more detail, we included videos of:

• Greek folk dances: 39 videos from 13 different dances (3 videos per dance), with an approximate duration of 1-2 min per video. The folk dances we selected were Aloniotikos, Kori Eleni, Mpaintouskino, Mpoufiko, Poustseno, Ramna, Sarakina, Stankena, Syrtos, Tikfeskino, Tsourapia, Zavlitsena and Zaramo.

• Balkan folk dances: 39 videos from folk dances from countries of the Balkans region, except for Greece, approximately 1-2 min long.

• Other dance genres: 39 videos from ballet, ballroom, hip hop and tap dance, approximately 0.5-6 min long.

• Hollywood2 dataset [5]: 40 videos from all 12 available motion classes, namely AnswerPhone, DriveCar, Eat, FightPerson, GetOutCar, HandShake, HugPerson, Kiss, Run, SitDown, SitUp and StandUp. The duration of the videos varies between 2.5 sec and 2 min.

Sample frames from the dataset used in our experiments are shown in Fig. 2. Most videos, except for those from the Hollywood2 dataset, were collected from Youtube. The dances depicted in the respective videos are performed mainly by professional dancers. Also, they exhibit high variability with respect to the recording conditions (i.e. lighting conditions, professional or non-professional camera operator, static or moving camera), a fact that renders the recognition task more realistic but more challenging at the same time.

For the extraction of dense trajectory features from the video data, we used the default parameter values used in [9]. Furthermore, codebooks consisting of 4000 codewords were calculated for each feature type using the training data.

The first test we conducted involved a two-class classification scenario, where Greek folk dances are distinguished from all other activity types of our dataset. Therefore, the first class consists of the Greek folk dance video sequences (26 training and 13 testing), while the second class contains videos from Balkan folk dances, other dance genres, as well as from the Hollywood2 dataset (71 training and 38 testing videos in total). As can be observed from Table 1, for this experimental setting the method achieved a high recognition rate (96.08%) for both types of descriptor fusion.

Table 1: Correct recognition rates

Recognition problem                     Histogram concatenation   Kernel addition
Greek dances vs all other activities    96.08%                    96.08%
Dance genres                            79.49%                    89.74%
Greek vs Balkan folk dances             88.46%                    92.31%

Furthermore, we tested the method in a six-class classification experiment for dance genre recognition. For the purposes of this experiment we used video sequences from the following activity types: Greek folk dances, Balkan folk dances, ballet, ballroom, hip hop and tap. The training set for this experiment consisted of 68 videos and the testing set of 39 videos. In this case, the recognition rates (79.49% and 89.74%) were significantly lower than the ones of the previous test, which is expected, as the recognition task involves six classes instead of two. It is observed that fusing the descriptors by adding the individual kernels achieves a better performance than concatenating them and using a single kernel.

Finally, we performed experiments in order to test whether the method is capable of distinguishing Greek from Balkan folk dances, a challenging setup due to the similarities of the two genres. In this case, we consider the two-class classification problem, where the classes consist of the videos corresponding to Greek and Balkan dances respectively (52 training and 26 testing videos in total). The correct recognition rate for the above problem was 92.31% for separately calculated kernels and 88.46% for a single kernel. As in the case of dance genre classification, kernel addition yielded higher rates.

4. CONCLUSIONS AND FUTURE WORK
In this paper we adopted an existing framework for recognition of generic activities (i.e. the Bag of Words model in combination with dense trajectory descriptors), in order to distinguish Greek folk dances from other kinds of activities, as well as from other dance genres. The framework exhibited very good results in the case of two-class classification problems. Specifically, a rate as high as 96.08% was observed in the recognition task for Greek dances vs all other activities and 92.31% in that of distinguishing Greek from Balkan folk dances. The performance of the method was also satisfactory in the case of a six-class classification scenario involving different dance genres, where a correct recognition rate of 89.74% was achieved. Moreover, the method appears promising for recognition among different individual Greek folk dances, a scenario that we intend to pursue in our future research.

5. ACKNOWLEDGMENTS
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: THALIS-UOA-ERASITECHNIS MIS 375435.

6. REFERENCES
[1] S. Boukir and F. Cheneviére. Compression and recognition of dance gestures using a deformable model. Pattern Analysis and Applications, 7(3):308–316, 2004.
[2] J. Dancs, R. Sivalingam, G. Somasundaram, V. Morellas, and N. Papanikolopoulos. Recognition of ballet micro-movements for use in choreography. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pages 1162–1167, Nov 2013.
[3] L. Deng, H. Leung, N. Gu, and Y. Yang. Recognizing Dance Motions with Segmental SVD. In Proc. 20th Int. Conf. Pattern Recognition, ICPR '10, pages 1537–1540, 2010.
[4] Y. Heryadi, M. Fanany, and A. Arymurthy. Grammar of dance gesture from Bali traditional dance. Int. Journal of Computer Science Issues, 9:144–149, 2012.
[5] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, 2009.
[6] R. Poppe. A survey on vision-based human action recognition. Image Vision Comput., 28(6):976–990, June 2010.
[7] M. Raptis, D. Kirovski, and H. Hoppe. Real-time classification of dance gestures from skeleton animation. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '11, pages 147–156, 2011.
[8] S. Samanta, P. Purkait, and B. Chanda. Indian classical dance classification by learning dance pose bases. In 2012 IEEE Workshop on Applications of Computer Vision (WACV), pages 265–270, Jan 2012.
[9] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. Int. Journal of Computer Vision, 103(1):60–79, 2013.
[10] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In Proc. IEEE Int. Conf. Computer Vision (ICCV 2013), pages 3551–3558. IEEE, 2013.
[11] D. Weinland, R. Ronfard, and E. Boyer. A Survey of Vision-based Methods for Action Representation, Segmentation and Recognition. Computer Vision and Image Understanding, 115(2):224–241, Feb 2011.
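
As a supplementary illustration of the histogram construction described in Section 2.2, the following minimal sketch quantizes the feature vectors of one video against a given codebook. It assumes the codebook (the k-means centroids) has already been computed; the function names, the toy three-word codebook and the normalization of counts to relative frequencies are assumptions made here for illustration, not details stated in the paper.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_codeword(vector, codebook):
    """Index of the codeword (cluster center) closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda c: squared_distance(vector, codebook[c]))


def bow_histogram(features, codebook):
    """Histogram s = {s_1, ..., s_C} of codeword occurrences for one video.

    Assumption: counts are divided by the number of feature vectors F, so
    that videos of different duration yield comparable histograms.
    """
    counts = [0] * len(codebook)
    for vector in features:
        counts[nearest_codeword(vector, codebook)] += 1
    total = len(features)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Hypothetical toy data: C = 3 codewords, F = 4 two-dimensional features
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
features = [(1.0, 1.0), (9.0, 1.0), (0.0, 9.0), (1.0, 0.0)]
hist = bow_histogram(features, codebook)
```

The same function serves training and test videos alike, since both are quantized against the codebook learned from the training data.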

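As a supplementary illustration of Eqs. (2) and (3) and of the two fusion schemes of Section 2.2, the sketch below computes a $\chi^2$ kernel matrix from a list of histograms, once with kernel addition and once with histogram concatenation. The function names and toy histograms are hypothetical. Averaging the distances over distinct pairs of samples is one possible reading of "the mean of $\chi^2$ distances between all training samples", and skipping bins that are empty in both histograms is an assumption made to avoid division by zero.

```python
import math


def chi2_distance(s_i, s_j):
    """Chi-square distance of Eq. (3) between two equal-length histograms."""
    total = 0.0
    for a, b in zip(s_i, s_j):
        if a + b > 0:  # assumption: bins empty in both histograms add nothing
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi2_kernel(histograms):
    """Kernel matrix of Eq. (2) and the scaling factor A used to build it."""
    n = len(histograms)
    dist = [[chi2_distance(histograms[i], histograms[j]) for j in range(n)]
            for i in range(n)]
    # A: mean distance over distinct pairs (one reading of the definition)
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    scale = sum(pairs) / len(pairs) if pairs else 1.0
    if scale == 0.0:
        scale = 1.0  # assumption: guard against identical training samples
    kernel = [[math.exp(-d / scale) for d in row] for row in dist]
    return kernel, scale


def kernel_addition(per_type_histograms):
    """One kernel per descriptor type, summed element-wise."""
    kernels = [chi2_kernel(hists)[0] for hists in per_type_histograms]
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(n)]
            for i in range(n)]


def histogram_concatenation(per_type_histograms):
    """Concatenate each video's histograms, then compute a single kernel."""
    joined = [sum(video_hists, []) for video_hists in zip(*per_type_histograms)]
    return chi2_kernel(joined)[0]


# Hypothetical toy data: 2 descriptor types, 3 videos, 2 codewords per type
type_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
type_b = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
K_a, A_a = chi2_kernel(type_a)
K_add = kernel_addition([type_a, type_b])
K_cat = histogram_concatenation([type_a, type_b])
```

A precomputed matrix of this kind can be handed to an SVM implementation that accepts custom kernels. For test videos, the kernel between test and training histograms would reuse the factor A estimated on the training set.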