In general, ICD classification is a human activity classification problem. There have been several attempts to recognize human activities from video [24, 2]. Aggarwal et al. [2] divide human activity recognition into two classes, namely the single-layered approach and the hierarchical approach. In the single-layered approach, activities are recognized directly from videos, while in the hierarchical approach an action is divided into sub-actions [8], and the action is represented by classifying it into those sub-actions. Wang et al. [26] have used a topic model to model human activity, representing a video by a Bag-of-Words representation. Later, they used a model popular in the object recognition community, called the Hidden Conditional Random Field (HCRF) [27]. They model a human action by a flexible constellation of parts conditioned on image observations, learn the model parameters in a max-margin framework, and name the result the max-margin hidden conditional random field (MMHCRF).

Some researchers use space-time features to classify human actions. Blank et al. represent a human action as a three-dimensional shape induced by the silhouettes in the space-time volume [3]. They use space-time features such as local space-time saliency, action dynamics, and shape structure and orientation to classify the action. In [24], human actions are recognized based on space-time locally adaptive regression kernels and the matrix cosine similarity measure. Kläser et al. localize the action in each frame by obtaining generic spatio-temporal human tracks [9]; they use a sliding-window classifier to detect specific human actions.

Fengjun et al. model human activity as a collection of 2D synthetic human poses rendered from a wide range of viewpoints [15]. For an input video sequence, they match each frame with the synthetic poses and track them using the Viterbi algorithm. In a dance video, on the other hand, there are large variations in poses, so modeling each dance pose is largely an unexplored research area. In [23], human activities are classified into three categories: atomic actions, composite actions, and interactions, and a context-free grammar (CFG) is used to represent complex human actions.

Our contributions in this paper are two-fold. First, we present the ICD classification problem as a new application of computer vision. Second, to solve this problem, we propose a new dance descriptor (or action descriptor). We also build a new dataset of ICD, which may be helpful to the computer vision research community for further research in this domain and will be made available soon. The rest of the paper is organized as follows: the proposed methodology is described in Section 2; Section 3 presents the experimental results; and Section 4 concludes the paper.

2. Proposed Methodology

In our method, we first represent each frame of an ICD video using a pose descriptor. An over-complete dictionary is then learned from the visual words, each consisting of a short sequence of pose descriptors, using a well-known online dictionary learning technique [17]. Based on the learned dictionary, we sparsely represent each visual word and finally build a sparse descriptor for the dance. In the following subsection, we discuss the pose descriptor.

2.1. Pose descriptor

Motion information [7, 11, 14, 6] is one of the dominant features in human activity analysis from video data. We use motion and oriented gradient information to build a pose descriptor for each frame [21]. To get a pose descriptor, we calculate the optical flow $\overrightarrow{OF}$ and the gradient $\vec{G}$, as shown in Figures 1(a) and 1(b), respectively. We combine these two vector fields to get the final motion matrix $\vec{R}$ of the corresponding frame as follows:

$$\vec{R} = \vec{G} \ast \overrightarrow{OF} \qquad (1)$$

where the binary operation $(\ast)$ represents element-wise multiplication of $\vec{G}$ and $\overrightarrow{OF}$, i.e.,

$$R_x = G_x \times OF_x \qquad (2)$$

and

$$R_y = G_y \times OF_y \qquad (3)$$

Hence,

$$\theta = \tan^{-1}\!\left(\frac{R_y}{R_x}\right) \qquad (4)$$

The above equation shows that the weighted optical flow captures information mostly along the outline of the actor present in the video frame, and not from the background, as shown in Figure 1(c). We calculate the feature in a hierarchical fashion: the image (corresponding to a video frame) is divided into $4^l$ parts in the $l$-th layer ($l = 0, 1, \ldots$). The angle histogram (calculated by Eq. 4) of each part is computed in $L$ bins. For the $l$-th layer, we calculate the histogram of each part and concatenate them to form a vector of length $L \times 4^l$. In each layer, we normalize the vector by its L1 norm; for details, please refer to [21]. For our experiments, we take $l$ as 0, 1 and 2, and $L = 8$. Finally, we concatenate the vectors of all layers and get a $168 = (8 + 4 \times 8 + 16 \times 8)$-dimensional feature vector, which is the pose descriptor shown in Figure 2. For the $i$-th training dance video $V_i$ ($i = 1 : M$) with $N_i$ frames, we get $(N_i - 1)$ pose descriptors, each of dimension 168, denoted $f_{i,1}, f_{i,2}, \ldots, f_{i,N_i-1}$. We learn an over-complete dictionary from the set of all pose descriptors $F = \{f_{i,j} : i = 1 : M,\; j = 1 : N_i\}$ using the online dictionary learning method discussed in the next section.
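To make the construction concrete, the sketch below computes one frame's pose descriptor. It is a minimal sketch, not the authors' implementation: OpenCV's Farnebäck optical flow and Sobel gradients stand in for the flow [14] and gradient operators, the angle histogram is weighted by the magnitude of $\vec{R}$ to suppress the flat background (our assumption; the text only says the angles of Eq. 4 are binned), and all function and parameter names are illustrative.

```python
import cv2
import numpy as np

def pose_descriptor(prev_gray, gray, L=8, layers=(0, 1, 2)):
    """Sketch of the Sec. 2.1 pose descriptor (Eqs. 1-4).

    prev_gray, gray: consecutive 8-bit grayscale frames.
    Returns an L1-normalised (per layer) vector of length
    L * (1 + 4 + 16) = 168 for the paper's settings.
    """
    # Optical flow OF and image gradient G (stand-in operators)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)

    rx = gx * flow[..., 0]           # Eq. (2)
    ry = gy * flow[..., 1]           # Eq. (3)
    theta = np.arctan2(ry, rx)       # Eq. (4), in [-pi, pi]
    mag = np.hypot(rx, ry)           # histogram weight (our assumption)

    h, w = gray.shape
    descriptor = []
    for l in layers:                 # layer l has 4**l cells
        n = 2 ** l
        layer_vec = []
        for i in range(n):
            for j in range(n):
                cell = (slice(i * h // n, (i + 1) * h // n),
                        slice(j * w // n, (j + 1) * w // n))
                hist, _ = np.histogram(theta[cell], bins=L,
                                       range=(-np.pi, np.pi),
                                       weights=mag[cell])
                layer_vec.append(hist)
        layer_vec = np.concatenate(layer_vec)
        layer_vec /= (layer_vec.sum() + 1e-8)   # per-layer L1 normalisation
        descriptor.append(layer_vec)
    return np.concatenate(descriptor)           # 8 + 32 + 128 = 168 dims
```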
frames and a representing vector that summarizes a video by
building a temporal pyramid and max pooling on each cell.
Details are given below:
[Figure 3: Architecture of our algorithm based on temporal pyramid matching. The whole video sequence is divided into overlapping visual words $vf_i$. Sparse coding measures the responses of each local descriptor to the dictionary's "visual words"; these responses are pooled across different temporal locations over different temporal scales.]

Let there be $M$ cells/regions of interest on the video (e.g., the $15 = 8 + 4 + 2 + 1$ cells of a four-level temporal pyramid), with $N_m$ denoting the set of frames/indices within region $m$. Let $f$ and $g$ denote some coding and pooling operators, respectively. The vector $z$ representing the whole video is obtained by sequentially coding, pooling over all regions, and concatenating:

$$\alpha_i = f(vf_i), \quad i = 1, \ldots, N \qquad (5)$$

$$h_m = g(\{\alpha_i\}_{i \in N_m}), \quad m = 1, \ldots, M \qquad (6)$$

$$z^T = [h_1^T, h_2^T, \ldots, h_M^T] \qquad (7)$$

In the usual bag-of-features framework [13], $f$ minimizes the distance to a code-book, usually learned by an unsupervised algorithm (e.g., K-means), and $g$ computes the average over the pooling region:

$$\alpha_i \in \{0, 1\}^K, \quad \alpha_{i,j} = 1 \;\text{ iff }\; j = \arg\min_{k \le K} \|vf_i - d_k\|_2^2 \qquad (8)$$

$$h_m = \frac{1}{|N_m|} \sum_{i \in N_m} \alpha_i \qquad (9)$$

where $d_k$ denotes the $k$-th codeword in the code-book. Note that averaging with uniform weighting is equivalent (up to a constant multiplier) to using histograms with weights inversely proportional to the area of the pooling regions.
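The following sketch instantiates the generic pipeline of Eqs. (5)-(9) with hard assignment and average pooling over the four-level temporal pyramid; it serves as the baseline that the sparse-coding variant below replaces. The helper names and the K-means code-book are our own, not the paper's.

```python
import numpy as np

def temporal_pyramid_cells(N, depth=4):
    """Index sets N_m for a temporal pyramid: 1 + 2 + 4 + 8 = 15 cells."""
    cells = []
    for level in range(depth):
        parts = 2 ** level
        cells += [np.arange(p * N // parts, (p + 1) * N // parts)
                  for p in range(parts)]
    return cells

def hard_assign(vf, D):
    """Eq. (8): one-hot code over the K codewords (rows of D)."""
    d2 = ((D - vf) ** 2).sum(axis=1)   # squared distance to each codeword
    alpha = np.zeros(len(D))
    alpha[np.argmin(d2)] = 1.0
    return alpha

def encode_video(VF, D, cells):
    """Eqs. (5)-(7) with average pooling (Eq. 9).

    VF    : (N, d) array of visual words for one video
    D     : (K, d) code-book, e.g. from K-means
    cells : index sets N_m, e.g. temporal_pyramid_cells(len(VF))
    """
    alphas = np.stack([hard_assign(vf, D) for vf in VF])   # Eq. (5)
    pooled = [alphas[idx].mean(axis=0) for idx in cells]   # Eqs. (6), (9)
    return np.concatenate(pooled)                          # Eq. (7): z
```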
Van Gemert et al. [25] have obtained improvements in object classification by replacing hard quantization with soft quantization:

$$\alpha_{i,j} = \frac{\exp(-\beta \|vf_i - d_j\|_2^2)}{\sum_{k=1}^{K} \exp(-\beta \|vf_i - d_k\|_2^2)} \qquad (10)$$

where $\beta$ is a parameter that controls the softness of the soft assignment (hard assignment is the limit as $\beta \to \infty$). This amounts to coding as in the E-step of the expectation-maximization algorithm for learning a Gaussian mixture model, using the codewords of the dictionary as centers.
cations over different temporal scales. αi = arg min L(α, D) ' kvfi − Dαk22 + λkαk1 (11)
α
hm,j = max αi,j , f orj = 1, . . . , K (12)
i∈Nm
there be M cells/regions of interests on the video (e.g., the
15 = 8 + 4 + 2 + 1 cells of a four-level temporal pyramid), where kαk1 denotes the l1 norm of α, λ is a parameter that
with Nm denoting the number of frames/indices within re- controls the sparsity, and D is a dictionary trained by min-
gion m. Let f and g denote some coding and pooling op- imizing the average of L(αi , D) over all samples of visual
erators, respectively. The vector z representing the whole words in the training videos, alternatively over D and αi .
video is obtained by sequentially coding, pooling over all We have used SPAMS software [1] to train the dictionary
regions, and concatenating: and sparse representation of each visual words.
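A sketch of this stage, with scikit-learn's online dictionary learner and lasso encoder standing in for SPAMS [1]; the dictionary size K and sparsity weight λ are illustrative values, not the paper's settings. `cells` is the list of temporal-pyramid index sets N_m from the earlier sketch.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

K, lam = 256, 0.15   # illustrative dictionary size and sparsity weight

def train_dictionary(F):
    """Learn an over-complete dictionary D from all visual words F: (N, d)."""
    learner = MiniBatchDictionaryLearning(n_components=K, alpha=lam)
    return learner.fit(F).components_                    # D: (K, d)

def sparse_video_code(VF, D, cells):
    """Sparse coding (Eq. 11) followed by per-cell max pooling (Eq. 12)."""
    alphas = sparse_encode(VF, D, algorithm='lasso_lars', alpha=lam)
    pooled = [alphas[idx].max(axis=0) for idx in cells]  # h_m, Eq. (12)
    return np.concatenate(pooled)                        # video descriptor V
```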
2.4. Video descriptor

After pooling on each of the cells, a pyramid matching kernel is used. As described in detail in [13], it can be represented by the intersection kernel over the concatenated individual responses of the cells. So, in our case, the summarization of a video, i.e., the video descriptor, can be written as $V^T = [h_1^T, h_2^T, \ldots, h_M^T]$. We use a one-vs-one SVM with the intersection kernel to learn these video descriptors for the classification task.
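A minimal sketch of the classification step, assuming scikit-learn's SVC with a precomputed intersection kernel (SVC handles multi-class problems one-vs-one internally); the function names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram intersection kernel: k(x, y) = sum_j min(x_j, y_j)."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def train_classifier(V_train, y_train):
    clf = SVC(kernel='precomputed')      # multi-class SVC is one-vs-one
    clf.fit(intersection_kernel(V_train, V_train), y_train)
    return clf

def predict(clf, V_test, V_train):
    # at test time the kernel is evaluated between test and train descriptors
    return clf.predict(intersection_kernel(V_test, V_train))
```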
3. Experimental results

Since this is the first attempt to classify ICD in the computer vision domain, there is no standard dataset available for evaluation. To test our algorithm, we have created a dataset from videos downloaded from the YouTube video library. Primarily, the dataset consists of the three oldest and most popular ICDs: Bharatnatyam, Kathak and Odissi. The dataset is manually labeled by dance experts. In our dataset, each class contains 30 video clips of different resolutions (max. 400 × 350). The maximum duration of a
Confusion matrix for the three-class ICD dataset (rows: true class, columns: predicted class, in %):

               Bharatnatyam   Kathak   Odissi
Bharatnatyam       83.33       3.33    13.33
Kathak              6.67      90.00     3.33
Odissi             10.00       3.33    86.67
pose a new action descriptor. The proposed descriptor not only classifies ICD efficiently but also classifies common human actions; the results on the KTH database prove the same. Since this is the first attempt of its kind, the dataset on ICD is relatively small. Though the dataset is small, the variation in terms of the number of persons involved, clothing, acquisition conditions, etc. makes the problem challenging. We are going to make our dataset available soon. We have a plan to increase the number of classes and the number of videos in each class.
References

[1] http://www.di.ens.fr/willow/spams/.
[2] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (to appear), 2011.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, volume 2, pages 1395–1402, October 2005.
[4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, pages 2559–2566, June 2010.
[5] A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, January 2009.
[6] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, pages 1932–1939, June 2009.
[7] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, pages 726–733, October 2003.
[8] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
[9] A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In International Workshop on Sign, Gesture, Activity, 2010.
[10] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. In NIPS, 2010.
[11] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005.
[12] I. Laptev and T. Lindeberg. Velocity adaptation of space-time interest points. In ICPR, pages 52–56, September 2004.
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178, October 2006.
[14] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In 7th IJCAI, pages 674–679, 1981.
[15] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In CVPR, pages 1–8, June 2007.
[16] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, Montreal, Canada, June 2009.
[17] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(9):19–68, 2010.
[18] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, pages 1–8, June 2008.
[19] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, pages 3177–3184, June 2011.
[20] R. Massey. India's Dances. 2004.
[21] S. Mukherjee, S. K. Biswas, and D. P. Mukherjee. Recognizing human action at a distance in video by key poses. IEEE Transactions on Circuits and Systems for Video Technology, 21(9):1228–1241, 2011.
[22] N. Nayak, R. Sethi, B. Song, and A. Roy-Chowdhury. Motion pattern analysis for modeling and recognition of complex human activities. In Visual Analysis of Humans: Looking at People, Springer, 2011.
[23] M. S. Ryoo and J. K. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In CVPR, pages 1709–1718, October 2006.
[24] H. Seo and P. Milanfar. Action recognition from one example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):867–882, 2011.
[25] J. van Gemert, C. Veenman, A. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1271–1283, July 2009.
[26] Y. Wang and G. Mori. Human action recognition by semilatent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Probabilistic Graphical Models in Computer Vision, 31(10):1762–1774, 2009.
[27] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1310–1323, 2011.
[28] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801, June 2009.
[29] W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In CVPR, pages 2030–2037, June 2010.
[30] B. Yao, A. Khosla, and L. Fei-Fei. Classifying actions and measuring action similarity by modeling the mutual context of objects and human poses. In ICML, Bellevue, USA, June 2011.