A Bag of Words Approach For Recognition of Greek Folk Dances
as historic events. For the above reason, the systematic organization, processing and analysis of multimedia material related to folk dances is of great significance for educational and cultural purposes. The classification scenarios considered in this paper include distinguishing Greek folk dance videos from videos depicting other kinds of activities (including other dance genres), discrimination among different dance genres, as well as discrimination between Greek and Balkan folk dances. The latter have been selected as a special and challenging case, since they usually exhibit significant resemblance to Greek folk dances, which makes distinction more difficult. In contrast to most of the aforementioned approaches, where recognition is mostly performed at the level of an individual dance, dance gesture or dance move, we consider a more generic problem. Furthermore, as far as the definition of dance is concerned, we choose to work with dance video sequences consisting of an arbitrary number of consecutive dance moves, i.e. real-world dance videos collected from YouTube, rather than isolated moves, which is a more realistic scenario.

In order to distinguish Greek folk dances from other activities or dance genres, we combined the method described in [9], which extracts feature descriptors from video data based on dense trajectories, with a Bag of Words approach, commonly used in pattern recognition tasks. Classification of the histograms is then performed using a Support Vector Machine (SVM) classifier with a $\chi^2$ kernel.

2. METHOD DESCRIPTION
The utilized method for folk dance recognition comprises a feature extraction stage, where dense trajectories are calculated from the videos, and a classification stage, which involves the construction of a Bag of Words model from the data. The aforementioned steps are shown in the flowchart of Fig. 1. In the following sub-sections, the details of the utilized method are provided.

2.1 Dense Trajectories
Dense trajectories ([9],[10]) provide an efficient representation of motion depicted in video sequences, which is suitable for activity recognition tasks. Dense trajectories are calculated by first densely sampling points of the video at different spatial scales. The sampled points located in areas of the video containing motion are tracked on each spatial scale separately, using the information from the dense optical flow field. A trajectory is formed by points in successive frames: $(P_t, P_{t+1}, P_{t+2}, \ldots, P_{t+L})$, where $L$ is the length of the trajectory. Consequently, each trajectory is represented by a descriptor defined by the following formula:

$$T = \frac{(\Delta P_t, \Delta P_{t+1}, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \|\Delta P_j\|}, \tag{1}$$

where $\Delta P_t = (P_{t+1} - P_t) = (x_{t+1} - x_t, y_{t+1} - y_t)$ is the displacement vector between two successive frames.

Apart from the trajectories themselves, additional descriptors are calculated in the spatio-temporal volumes surrounding each trajectory. In more detail, an $N \times N \times L$ sized volume around a trajectory is subdivided into a grid of cells and for each of them a bundle of histograms is computed: Histogram of Oriented Gradients (HOG), which captures static local appearance information, Histogram of Optical Flow (HOF), which carries information related to local motion, and Motion Boundary Histograms in the horizontal (x) and vertical (y) directions (MBHx and MBHy respectively), which capture the relative motion between pixels, thereby eliminating the unwanted camera motion.
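The normalization in Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the implementation of [9]: the function name `trajectory_descriptor`, the input layout (one row per tracked point) and the handling of a motionless track are assumptions made here for illustration only.

```python
import numpy as np

def trajectory_descriptor(points):
    """Normalized displacement descriptor in the spirit of Eq. (1).

    points: array-like of shape (L + 1, 2) with the tracked (x, y)
    positions P_t, ..., P_{t+L} of a single trajectory.
    Returns a flat vector of length 2 * L.
    """
    points = np.asarray(points, dtype=float)
    # Displacement vectors: Delta P_j = P_{j+1} - P_j, shape (L, 2).
    displacements = np.diff(points, axis=0)
    # Denominator of Eq. (1): sum of the displacement magnitudes.
    total = np.linalg.norm(displacements, axis=1).sum()
    if total == 0.0:
        # Assumption (not from the paper): a static track maps to zeros.
        return np.zeros(displacements.size)
    return (displacements / total).ravel()

# Toy trajectory with L = 2 (three tracked points).
track = [(0.0, 0.0), (3.0, 4.0), (3.0, 9.0)]
descriptor = trajectory_descriptor(track)
```

For the toy track the two displacements are (3, 4) and (0, 5), each of magnitude 5, so every component is divided by 10. The normalization makes the descriptor depend on the shape of the motion rather than on its overall magnitude.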
Figure 2: Sample frames from videos of the activity classes included in our dataset: (a) Greek folk dance, (b) Balkan folk dance, (c) ballet, (d) ballroom, (e) hip hop, (f) tap, (g) kiss from Hollywood2 dataset.
2.2 Bag of Words Model and classification
The process described in the previous subsection results in the extraction of five different types of descriptors (feature vectors) for each trajectory. A Bag of Words (BoW) model is built for each type of features, using the procedure described below.

Let us represent a video sequence with $X = \{v_1, v_2, \ldots, v_F\}$, where $v_i$ denotes each of the $F$ feature vectors. Initially, a codebook is created by applying the k-means algorithm in order to cluster the feature vectors from a set of training videos into $C$ clusters. Each codeword represents the centroid of a cluster. Subsequently, for each training video, a histogram can be calculated as follows: a video $X$ is transformed to a sequence $X_D = \{d_1, d_2, \ldots, d_F\}$, where each feature vector $d_i$ is derived by replacing the feature vector $v_i$ with the codeword (cluster center) which is closest to it. By calculating the frequency of occurrence for each of the $C$ codewords in the sequence $X_D$, a histogram $s = \{s_1, s_2, \ldots, s_C\}$ of the codeword appearances can be constructed.

The same procedure is followed in order to calculate the histogram of a test sequence. For each feature vector of a test sequence, the distances to all codewords are estimated, and the feature vector is replaced by the codeword which is closest to it. The test histogram is constructed by calculating the frequencies of occurrence of each codeword in the sequence.

After the histograms for all the training videos have been calculated, they are used along with their labels to train an SVM classifier. In order to classify test sequences, the corresponding histograms are extracted and subsequently fed to the SVM classifier. The SVM is used along with a $\chi^2$ kernel, calculated by the following formula:

$$K(s_i, s_j) = \exp\left(-\frac{1}{A} D(s_i, s_j)\right) \tag{2}$$

where $D(s_i, s_j)$ denotes the $\chi^2$ distance between histograms $s_i = \{s_{i1}, s_{i2}, \ldots, s_{iC}\}$ and $s_j = \{s_{j1}, s_{j2}, \ldots, s_{jC}\}$:

$$D(s_i, s_j) = \frac{1}{2} \sum_{k=1}^{C} \frac{(s_{ik} - s_{jk})^2}{s_{ik} + s_{jk}}, \tag{3}$$

while $A$ is a scaling factor, calculated as the mean of $\chi^2$ distances between all training samples.

In order to calculate the kernel function, we fuse the five different histograms derived from the descriptor types discussed in 2.1 in two ways. According to the first, we perform kernel addition, i.e. we compute separate kernels, one for each descriptor type, and subsequently add them. Alternatively, we apply histogram concatenation, i.e. we concatenate the five different histograms calculated for each video in a vector and calculate a single kernel.

3. EXPERIMENTAL RESULTS
The objective of our experiments was to test whether the method described in the previous section is suitable for distinguishing Greek folk dances from other activities or dance genres. For this reason, we built a dataset consisting of diverse types of activities. In more detail, we included videos of:

• Greek folk dances: 39 videos from 13 different dances (3 videos per dance), with an approximate duration of 1-2 min per video. The folk dances we selected were Aloniotikos, Kori Eleni, Mpaintouskino, Mpoufiko, Poustseno, Ramna, Sarakina, Stankena, Syrtos, Tikfeskino, Tsourapia, Zavlitsena and Zaramo.

• Balkan folk dances: 39 videos from folk dances from countries of the Balkans region, except for Greece, ∼1-2 min long.

• other dance genres: 39 videos from ballet, ballroom, hip hop and tap dance, ∼0.5-6 min long.

• Hollywood2 dataset [5]: 40 videos from all 12 available motion classes, namely AnswerPhone, DriveCar, Eat, FightPerson, GetOutCar, HandShake, HugPerson, Kiss, Run, SitDown, SitUp, StandUp, were included. The duration of the videos varies between 2.5 sec and 2 min.

Sample frames from the dataset used in our experiments are shown in Fig. 2. Most videos, except for those from the Hollywood2 dataset, were collected from YouTube. The dances depicted in the respective videos are performed mainly by professional dancers. Also, they exhibit high variability with respect to the recording conditions (i.e. lighting
conditions, professional or non-professional camera operator, static or moving camera), a fact that renders the recognition task more realistic but more challenging at the same time.

For the extraction of dense trajectory features from the video data, we used the default parameter values used in [9]. Furthermore, codebooks for each feature type, consisting of 4000 codewords, were calculated using the training data.

The first test we conducted involved a two-class classification scenario, where Greek folk dances are distinguished from all other activity types of our dataset. Therefore, the first class consists of the Greek folk dance video sequences (26 training and 13 testing), while the second class contains videos from Balkan folk dances, other dance genres, as well as from the Hollywood2 dataset (71 training and 38 testing videos in total). As can be observed from Table 1, for this experimental setting, the method achieved high recognition rates for both types of descriptor fusion.

Table 1: Correct recognition rates

Recognition problem                     Histogram Concatenation (rate)   Kernel Addition (rate)
Greek dances vs all other activities    96.08%                           96.08%
Dance genres                            79.49%                           89.74%
Greek vs Balkan folk dances             88.46%                           92.31%

Furthermore, we tested the method in a six-class classification experiment for dance genre recognition. For the purposes of this experiment we used video sequences from the following activity types: Greek folk dances, Balkan folk dances, ballet, ballroom, hip hop and tap. The training set for this experiment consisted of 68 videos, while the testing set consisted of 39 videos. In this case, the recognition rates were significantly lower than the ones of the previous test, which is expected, as the recognition task involves six classes instead of two. It is observed that fusing the descriptors by adding the individual kernels achieves a better performance than concatenating them and using a single kernel.

Finally, we performed experiments in order to test whether the method is capable of distinguishing Greek from Balkan folk dances, a challenging setup due to the similarities of the two genres. In this case, we consider the two-class classification problem, where each class consists of the videos corresponding to Greek and Balkan dances respectively (52 training and 26 testing videos in total). The correct recognition rate for the above problem was 92.31% for separately calculated kernels and 88.46% for a single kernel. As in the case of dance genre classification, kernel addition yielded higher rates.

4. CONCLUSIONS AND FUTURE WORK
In this paper we adopted an existing framework for recognition of generic activities (i.e. the Bag of Words model in combination with dense trajectory descriptors), in order to distinguish Greek folk dances from other kinds of activities, as well as from other dance genres. The aforementioned framework exhibited very good results in the case of two-class classification problems. Specifically, a rate as high as 96.08% was observed in the recognition task for Greek dances vs all other activities and 92.31% in that of distinguishing Greek from Balkan folk dances. The performance of the method was also satisfactory in the case of a six-class classification scenario involving different dance genres, where an 89.74% successful recognition rate was achieved. Moreover, the method exhibits promising results for recognition among different individual Greek folk dances, a scenario that we intend to pursue in our future research.

5. ACKNOWLEDGMENTS
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: THALIS-UOA-ERASITECHNIS MIS 375435.

6. REFERENCES
[1] S. Boukir and F. Chenevière. Compression and recognition of dance gestures using a deformable model. Pattern Analysis and Applications, 7(3):308–316, 2004.
[2] J. Dancs, R. Sivalingam, G. Somasundaram, V. Morellas, and N. Papanikolopoulos. Recognition of ballet micro-movements for use in choreography. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pages 1162–1167, Nov 2013.
[3] L. Deng, H. Leung, N. Gu, and Y. Yang. Recognizing Dance Motions with Segmental SVD. In Proc. 20th Int. Conf. Pattern Recognition, ICPR '10, pages 1537–1540, 2010.
[4] Y. Heryadi, M. Fanany, and A. Arymurthy. Grammar of dance gesture from Bali traditional dance. Int. Journal of Computer Science Issues, 9:144–149, 2012.
[5] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, 2009.
[6] R. Poppe. A survey on vision-based human action recognition. Image Vision Comput., 28(6):976–990, June 2010.
[7] M. Raptis, D. Kirovski, and H. Hoppe. Real-time classification of dance gestures from skeleton animation. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '11, pages 147–156, 2011.
[8] S. Samanta, P. Purkait, and B. Chanda. Indian classical dance classification by learning dance pose bases. In 2012 IEEE Workshop on Applications of Computer Vision (WACV), pages 265–270, Jan 2012.
[9] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. Int. Journal of Computer Vision, 103(1):60–79, 2013.
[10] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In Proc. IEEE Int. Conf. Computer Vision (ICCV 2013), pages 3551–3558. IEEE, 2013.
[11] D. Weinland, R. Ronfard, and E. Boyer. A Survey of Vision-based Methods for Action Representation, Segmentation and Recognition. Computer Vision and Image Understanding, 115(2):224–241, Feb 2011.
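As a supplementary illustration, the histogram construction and the kernel computation of Section 2.2 (Eqs. (2) and (3)), together with the two fusion strategies, can be sketched in NumPy as follows. This is a toy-scale sketch under stated assumptions, not the code used in the experiments: all function names are hypothetical, the codebook is assumed to be given (for example from a prior k-means run), bins that are empty in both histograms are assumed to contribute zero to Eq. (3), and the scaling factor A is assumed to be the mean over distinct pairs of training histograms.

```python
import numpy as np

def bow_histogram(features, codebook):
    """Count, per codeword, the feature vectors whose nearest codeword it is."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every feature vector to every codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    return np.bincount(nearest, minlength=len(codebook)).astype(float)

def chi2_distance(s_i, s_j):
    """Chi-square distance of Eq. (3); bins empty in both inputs are skipped."""
    s_i = np.asarray(s_i, dtype=float)
    s_j = np.asarray(s_j, dtype=float)
    den = s_i + s_j
    mask = den > 0
    return 0.5 * np.sum((s_i - s_j)[mask] ** 2 / den[mask])

def chi2_distance_matrix(H1, H2):
    """Pairwise chi-square distances between the rows of H1 and H2."""
    return np.array([[chi2_distance(a, b) for b in H2] for a in H1])

def mean_train_distance(H_train):
    """Scaling factor A (assumption: mean over distinct training pairs)."""
    D = chi2_distance_matrix(H_train, H_train)
    upper = np.triu_indices(len(H_train), k=1)
    return D[upper].mean()

def chi2_kernel(H1, H2, A):
    """Kernel of Eq. (2): K = exp(-D / A)."""
    return np.exp(-chi2_distance_matrix(H1, H2) / A)

def kernel_addition(hists_per_type):
    """One kernel per descriptor type, then summed."""
    return sum(chi2_kernel(H, H, mean_train_distance(H)) for H in hists_per_type)

def histogram_concatenation(hists_per_type):
    """Concatenate the per-type histograms, then compute a single kernel."""
    H = np.hstack(hists_per_type)
    return chi2_kernel(H, H, mean_train_distance(H))

# Toy data: two codewords, two "videos" with three feature vectors each.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
hist_a = bow_histogram([[0, 1], [1, 0], [9, 9]], codebook)
hist_b = bow_histogram([[10, 11], [9, 10], [0, 0]], codebook)
H_train = np.vstack([hist_a, hist_b])

K = chi2_kernel(H_train, H_train, mean_train_distance(H_train))
K_add = kernel_addition([H_train, H_train])          # two descriptor types
K_cat = histogram_concatenation([H_train, H_train])  # same two types, fused early
```

A training kernel matrix of this kind could then be passed to any SVM implementation that accepts precomputed kernels, for instance scikit-learn's `SVC(kernel="precomputed")`. Note that kernel addition normalizes each descriptor type by its own scaling factor before summation, whereas concatenation uses a single scaling factor for the joint histogram, which is one plausible reason the two fusion schemes can behave differently.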