
Int J Multimed Info Retr (2014) 3:15–28
DOI 10.1007/s13735-013-0042-8

REGULAR PAPER

Multivariate time series modeling of geometric features of spatio-temporal volumes for content based video retrieval

Chiranjoy Chattopadhyay · Amit Kumar Maurya

Received: 30 July 2013 / Revised: 15 August 2013 / Accepted: 16 August 2013 / Published online: 3 September 2013
© Springer-Verlag London 2013

Abstract In this paper, we address the problem of Content Based Video Retrieval using multivariate time series modeling of features. We particularly focus on representing the dynamics of geometric features on the Spatio-Temporal Volume (STV) created from a real world video shot. The STV intrinsically holds the video content by capturing the dynamics of the appearance of the foreground object over time, and hence can be considered as a dynamical system. We capture the geometric property of the parameterized STV using the Gaussian curvature computed at each point on its surface. The change of Gaussian curvature over time is then modeled as a Linear Dynamical System (LDS). Due to its capability to efficiently model the dynamics of a multivariate signal, the Auto Regressive Moving Average (ARMA) model is used to represent the time series data. Parameters of the ARMA model are then used for video content representation. To discriminate between a pair of video shots (time series), we use the subspace angle between a pair of feature vectors formed from the ARMA model parameters. Experiments are done on four publicly available benchmark datasets, shot using a static camera. We present both qualitative and quantitative analysis of our proposed framework. Comparative results with three recent works on video retrieval also show the efficiency of our proposed framework.

Keywords Content Based Video Retrieval · Spatio-temporal volume · Gaussian curvature · Time series · ARMA

C. Chattopadhyay (B) · A. K. Maurya
Indian Institute of Technology Madras, Chennai 600036, India
e-mail: cchatto@cse.iitm.ac.in
A. K. Maurya
e-mail: peaceamit@gmail.com

1 Introduction

The process of Content Based Video Retrieval (CBVR) involves retrieving similar videos in response to users' search queries, using the inherent properties extracted from the video itself. The quality of the retrieval result of any CBVR system depends on the quality of content representation in the feature database. The quality of the features, in turn, depends on the video segments used (i.e., video objects, video shots, scenes, etc.). Hence, representing video based on its content has emerged as an important research area in the field of multimedia retrieval. It has a large number of applications in fields such as surveillance, security, entertainment, and education. One major need of these applications is to store large collections of video data. The challenge is to represent these large video databases in terms of their contents.

In many of these applications, when a query video is posed against a video database, the following two questions need to be addressed:

– To which category does the video shot belong?
– Within that category, which video shots in the database are "most similar" to the given query?

The former is known as video categorization or classification, and the latter is known as the retrieval task. Modeling the intra- and inter-class variations among different categories of video shots in a supervised (classification) or unsupervised (clustering) way answers the first question. On the other hand, identification of the content in a video shot, use of proper features and discriminating criteria, and a proper ranking function are required to answer the second question. In our current work, we are particularly interested in answering the second question by modeling the dynamics of geometric features of the Space Time Volume (STV), created from static camera video shots, as an LDS. Parameters of the Auto Regressive Moving Average (ARMA) model are used to represent the time series data, and video shots are retrieved from the database based on the match-cost between the parameters.

1.1 Related work

Spatio-temporal features and time series modeling have been areas of interest to many researchers in the recent past. In the literature, works related to our proposed framework can be broadly grouped into two categories: (i) use of spatio-temporal features as content descriptors, and (ii) application of time series modeling techniques to various computer vision tasks on video data. In the rest of this section we discuss some of those key contributions in detail.

Spatio-temporal features have been exploited extensively by researchers to represent and retrieve videos, both in the pixel and compressed domains [3,19,25]. In recent years, the use of motion trajectories and object shapes as features has gained importance for representing video shots. Motion trajectory based features have been presented as content descriptors and used for retrieval tasks in [5,9,13,23]. Statistical features have been reported as a spatio-temporal content descriptor in [27]. Existing video retrieval systems [15,16,23] extract low-level features from video shots and have reported promising results. A global motion trajectory representation based video retrieval technique has been proposed in [18]. The video object based retrieval technique proposed in [16] segments video objects and represents the color, shape, texture and motion features of video objects. In one of our earlier works [10], a joint spatio-temporal representation of a video object was analyzed, where shape and trajectory features are combined to generate an STV. A multi-spectral approach was followed to represent the video content as peaks and ridge lines of the EMST-CSS surface. A space time salient object approach has been reported in [21] for video content representation and retrieval. It proposed a fusion method for motion and spatial saliency integration to detect spatio-temporal salient objects, based on both attention analysis and interest point trajectory tracking. MPEG-7 visual descriptors are used as video content descriptors in [25] for the CBVR task. Various approaches exist in the literature that use processing on the STV for human action recognition [7,22,32]. Differential geometry and geometric optimization techniques have been used to design robust vision based algorithms [35,36,39].

Features extracted from video shots and modeled as time series have been utilized in certain computer vision problems. Examples include dynamic textures [17], face recognition [1], gait recognition [6], image- and video-based recognition [37] and activity recognition [8]. A technique for similarity search in multi-dimensional data sequences has been reported in [24]. An algorithm to extract time series from video to characterize the type of motion is discussed in [20]. A framework for finding similar time series with natural relations is proposed in [14]. In [2], a similarity measure for multivariate time series has been proposed using the Euclidean distance based on Vector Autoregressive (VAR) models, for human action classification. Full-featured wavelet transforms for similarity search of time series data were proposed in [31]. Bag-of-patterns approaches [26], statistical modeling [11], and kernel based methods [40] were also explored by researchers in the past for modeling time series for different applications.

The literature review suggests that most existing CBVR systems focus on video analysis, visual feature extraction, and supporting query by example. Few CBVR systems focus on exploiting the spatio-temporal features available in video shots for a unique representation and their efficient retrieval. The dearth of a compact and unified representation of the spatio-temporal information present in a video shot makes most CBVR systems less efficient in terms of computation and less effective in measuring similarity between a pair of video shots.

Contribution In this work we particularly focus on representing video content as a multivariate time series. We first generate a smooth, parameterized STV from a given video shot by combining the shape and trajectory information of the moving foreground blob. Then we model the dynamics of the surface curvature at each point on the STV as a Linear Dynamical System (LDS) using the ARMA model. Parameters of the ARMA model are used for representing the video content. We also propose a framework for CBVR using this content representation technique.

Organization of the paper In Sect. 2, we briefly discuss the overall framework of our proposed method. Section 3 describes the method of smooth, parameterized STV formation from a given video shot. Section 4 discusses our proposed method of modeling the STV as a time series and matching a pair of video shots. The experimental setup, with details of the datasets used for experimental analysis, is given in Sect. 5. In Sect. 6, we demonstrate the application of the proposed content representation technique to the CBVR task, using four benchmark datasets. Finally, concluding remarks are presented in Sect. 7.

2 Brief description of the proposed method

Spatio-temporal features constitute important cues that the human perception system uses to extract salient information from a video shot. The main focus of our proposed CBVR framework is to combine the spatial and temporal features present in a video shot and use them for video retrieval based on content.

Fig. 1 Overall framework of the proposed method of Content Based Video Retrieval

Assumption Our proposed algorithm works under the following assumptions:

– all the videos are shot using a static camera;
– videos are pre-segmented into shots of short duration (approx. 5–10 s);
– each video shot contains a single, predominantly large moving object.

Figure 1 depicts the overall framework of our proposed CBVR technique. The framework is divided into two stages: (i) database population and (ii) retrieval.

In stage (i), given an input video shot, we apply the change detection algorithm reported in [4]. We apply post-processing on the resultant output to detect the prominent foreground blob that exists for the entire duration of the video shot. After that, we extract the silhouette of the blob in a given frame by tracing the boundary of the blob. This process is repeated for all the blobs to get the silhouettes from all the video frames. The centroid of the foreground blob is tracked across all the frames to get the motion trajectory of the moving object. The rationale behind extracting the silhouette and the trajectory is that the former captures the change in object appearance while the latter captures the motion kinematics. Together, these two features represent the overall low-level content in a video shot. Next, we generate a smooth, parameterized STV (u along the spatial and v along the temporal direction) using all the contours and the trajectory. Gaussian curvature is then extracted at each (u, v) location on the STV surface and a feature matrix, which resembles a time series, is created. This time series data is then given as input to the ARMA model to compute the parameters. The ARMA parameters corresponding to each STV (of a video shot) are stored as a feature set in the feature database. Figure 2 shows a schematic illustration of the method proposed in this paper.

During matching (stage (ii)), for a given query video shot we extract ARMA parameters from the query STV. Similarity between two given videos is calculated using the technique proposed in [28]. Retrieved results are rank ordered based on this match cost. Precision, Recall and F-measure are used as metrics to quantitatively measure the performance of the proposed algorithm. The next section describes the process of STV formation in detail.
18 Int J Multimed Info Retr (2014) 3:15–28

Fig. 2 Different stages of processing in our proposed framework, n number of points on the contour, t total number of frames in the video shot.
u parameter represents arc length, v parameter represents time

consider static camera videos for experimental purpose. We Algorithm 1: Prominent foreground blob detection and
use “ViBe” [4], as the algorithm for background subtraction, contour extraction.
which works best with static camera video shots, for extract- Input: A set of n segmented frames of a video shot (output of
ing foreground moving objects. The result of foreground seg- ViBe [4]).
Output: A Set (S) of silhouettes of the prominent foreground
mentation process is represented as a set of foreground pixels
blob extracted from each frame.
within a 2D-frame ( f ) and often contains a lot of noise. The 1 Suppress foreground pixels that do not have any other foreground
pixels marked as foreground need to be combined together pixel in a N 8 neighborhood;
to generate a foreground blob. Each frame f may have either 2 Apply morphological operation (erosion and dilation) on
remaining foreground pixels to get foreground blobs;
produce, single blob or multiple blobs or null (in the absence
3 Obtain forward and backward correspondences between blobs in
of any significant motion). Our aim is to select one signifi- successive frames based on euclidean distance between centroids
cant blob from each frame that represents the predominant of the blobs;
foreground moving object in the video. Since ViBe produces 4 Take intersection of forward and backward correspondences
between blobs in successive frames to obtain final
noisy blobs for most frames, the choice of considering the
correspondences;
largest blob from each frame does not give the desired result. 5 Identify set of path P = { pi : 1 ≤ i ≤ N p } across all frames by
We assume that consecutive positions of the video object in connecting correspondences obtained in step 4, where N p =
successive frames has sufficient overlap. Given the output of total number of paths detected;
N i i
ViBe, Algorithm 1 outlines the steps involved in determining 6 Compute, ∀ pi ∈ P; V (i) = k=1 bk , where N i = number of
th i
blobs on i path, and bk is the k blob on the i th path;
th
the prominent foreground blob and thereafter extracting the
contours (silhouette) of each blob. 7 Select B = {bil : 1 ≤ i ≤ N l } on the l th path such that
l = arg maxi V (i), as the set of prominent foreground blobs;
A systematic illustration of the prominent foreground blob 8 Construct S = {sil : 1 ≤ i ≤ N l } by the tracing the exterior
extraction process is explained in Fig. 3. For multiple non- contours ∀bil ∈ B;
occluding foreground objects present in the video, separate 9 return S;
temporal paths ( pi s) are first obtained for each foreground
object. We individually compute the sum of the size of the
blobs belonging to a particular pi (step 6 of Algorithm 1).
Then we select the path with maximum net sum of blob sizes
and discard all the other paths (step 7 of Algorithm 1). After
the blobs of the prominent foreground objects are extracted,
the smooth outer contour of the foreground object is obtained
as follows.

– Extract the outer contour of the blobs by tracing the


Fig. 3 Extracting the prominent foreground blobs (a, b). Forward and
boundary of the blobs. The contours are re-sampled to backward correspondence between blobs (c). Intersection of forward
generate n = 200 (empirically determined) points per and backward correspondences (d). Final set of selected foreground
contour in each frame. blobs

123
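To make the path-selection step concrete, the following Python sketch implements steps 2–7 of Algorithm 1 on a list of binary foreground masks (one per frame). It is a minimal illustration under simplifying assumptions: paths are seeded only at the first frame, blob "size" is the pixel count, and the function name prominent_blob_masks is ours, not from the paper.

import numpy as np
from scipy import ndimage

def prominent_blob_masks(fg_masks):
    """Select, per frame, the blob lying on the temporal path with the
    largest total blob area (steps 2-7 of Algorithm 1, simplified)."""
    frames = []
    for m in fg_masks:
        clean = ndimage.binary_opening(m, structure=np.ones((3, 3)))  # erosion + dilation
        lbl, n = ndimage.label(clean)
        blobs = []
        for i in range(1, n + 1):
            ys, xs = np.nonzero(lbl == i)
            blobs.append({"centroid": (ys.mean(), xs.mean()),
                          "size": ys.size, "mask": lbl == i})
        frames.append(blobs)

    def nearest(src, dst):
        # index of the nearest (by centroid distance) blob in dst for each blob in src
        out = []
        for s in src:
            if not dst:
                out.append(None)
                continue
            d = [np.hypot(s["centroid"][0] - t["centroid"][0],
                          s["centroid"][1] - t["centroid"][1]) for t in dst]
            out.append(int(np.argmin(d)))
        return out

    links = []
    for t in range(len(frames) - 1):
        fwd = nearest(frames[t], frames[t + 1])   # forward correspondences (step 3)
        bwd = nearest(frames[t + 1], frames[t])   # backward correspondences (step 3)
        links.append({i: fwd[i] for i in range(len(frames[t]))
                      if fwd[i] is not None and bwd[fwd[i]] == i})  # intersection (step 4)

    # grow temporal paths (step 5); seeded at the first frame for simplicity
    paths = [[(0, i)] for i in range(len(frames[0]))]
    for t, link in enumerate(links):
        for p in paths:
            last_t, last_i = p[-1]
            if last_t == t and last_i in link:
                p.append((t + 1, link[last_i]))

    # score each path by total blob size and keep the best one (steps 6-7)
    best = max(paths, key=lambda p: sum(frames[t][i]["size"] for t, i in p))
    return {t: frames[t][i]["mask"] for t, i in best}

The returned per-frame masks of the selected blob would then be passed to the contour tracing and re-sampling step described next.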
Fig. 3 Extracting the prominent foreground blobs (a, b). Forward and backward correspondence between blobs (c). Intersection of forward and backward correspondences (d). Final set of selected foreground blobs

A systematic illustration of the prominent foreground blob extraction process is given in Fig. 3. When multiple non-occluding foreground objects are present in the video, separate temporal paths (p_i's) are first obtained for each foreground object. We individually compute the sum of the sizes of the blobs belonging to a particular p_i (step 6 of Algorithm 1). Then we select the path with the maximum net sum of blob sizes and discard all the other paths (step 7 of Algorithm 1). After the blobs of the prominent foreground object are extracted, the smooth outer contour of the foreground object is obtained as follows.

– Extract the outer contour of the blobs by tracing the boundary of the blobs. The contours are re-sampled to generate n = 200 (empirically determined) points per contour in each frame.
– The contours are smoothed using a Gaussian function to remove minor artifacts present in the contour.

To get the trajectory of the moving foreground object, we track the centroid of the extracted prominent foreground blob (as given in step 7 of Algorithm 1) across the frames of the video. These extracted contours represent the change of the object shape in the given video shot over time, while the trajectory captures the change in the motion kinematics. The process of generating a smooth, parameterized STV using the contour and trajectory points is discussed in the next section.
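As a concrete illustration of the contour post-processing above, the sketch below re-samples a traced boundary to n = 200 points equally spaced by arc length and smooths the coordinate sequences with a Gaussian filter. The smoothing width sigma is our assumption; the paper only states that a Gaussian function is used.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def resample_and_smooth(contour, n=200, sigma=2.0):
    """contour: (m, 2) array of boundary points in traversal order."""
    pts = np.vstack([contour, contour[:1]])              # close the contour
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    u = np.linspace(0.0, s[-1], n, endpoint=False)       # n equally spaced samples
    x = np.interp(u, s, pts[:, 0])
    y = np.interp(u, s, pts[:, 1])
    # wrap-around filtering keeps the smoothed contour closed
    x = gaussian_filter1d(x, sigma, mode="wrap")
    y = gaussian_filter1d(y, sigma, mode="wrap")
    return np.stack([x, y], axis=1)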
3.2 Parameterized STV formation

The process of STV generation resembles the problem of 3-D surface reconstruction in the field of computer graphics, where a 3-D surface needs to be reconstructed from a set of data points. Surface reconstruction from shape contours takes advantage of the approximated shape of the object, but has several challenging issues (for details please refer to [29]). The prime issue is to detect correspondences between points on a pair of shape contours in successive video frames and fit surface patches while ensuring a smoothness constraint. Moreover, while finding point correspondences between the contours, the order of the points must also be preserved. We define two variables for parameterization: u along the contour boundary and v along the trajectory (time axis). Given a set of n contours, the process of STV formation involves three intermediate steps: (i) approximate STV formation, (ii) parameterization, and (iii) smoothing. Figure 4 illustrates a schematic example of the proposed parameterization process. Stage 1 shows three contours Ci (mapped as straight lines where the arc length between a pair of successive points is preserved) with all the points on the contours indexed.

Fig. 4 Different stages of smooth parameterized STV formation

3.2.1 Approximate STV formation

First we find the correspondence between a pair of contours Ci and Ci+1 using the technique proposed in [34]. The output of the point correspondence is shown in Fig. 5. Using these initial correspondences, we stack the contours along the trajectory points to generate the approximate STV. It can be observed from Fig. 4b that we may get three types of mappings: (i) one-to-one, (ii) one-to-many, and (iii) many-to-one. To represent the geometric properties of the STV surface as a time series we need the special case of one-to-one mapping in which the order of the contour points is also preserved. The technique proposed in [34] ensures that the order of the mapping is preserved, but it does not always ensure a one-to-one mapping. To achieve that, we propose a parameterization technique.

Fig. 5 Results of point correspondence between successive contours of (a) a human running and (b) a moving car, using the technique proposed in [34]. Only a few correspondences are shown for visual clarity
3.2.2 STV parameterization

This step ensures that a unique one-to-one correspondence is established between the points on successive contours, and all other correspondences are removed. The point correspondences on the contours generated in the previous step (Sect. 3.2.1) give us the axial paths, all of which originate at the first contour (C1) and terminate at the last contour (Cn). These axial paths and the radial contours are used together for the parameterization of the STV surface. At first, for each successive contour pair Ci and Ci+1, one-to-many point correspondences are removed by keeping only the connections with minimum distance, leaving one-to-one and many-to-one mappings between points (Fig. 4c). Next, all possible axial paths pi are traced from the first contour C1 to the last contour Cn and the lengths of the paths are computed. Many-to-one correspondences will cause some paths to merge mid-way before they reach Cn. Among such paths, we retain only the path with the smallest length and discard the others. If, for some paths, the starting point does not lie on C1, then we remove those paths as well. Thus some of the axial paths are filtered out and the remaining paths run along the STV from C1 to Cn. This creates gaps between contour points (Fig. 4d). We interpolate to create new paths in the gaps between the existing paths.

Let pi and pi+1 be two such adjacent paths with m points between their end points on contour Cn. We generate m paths between pi and pi+1 by retracing from contour Cn to C1. For example, in Fig. 4d, we can observe that between p5 and p7 we need to trace a path from contour C3 to C1. We achieve that by introducing a point equidistant from p5 and p6 on C2 and C1. Then we trace the path from C3 to C1 by joining the p6's on each contour (shown as dotted lines in Fig. 4e). By this process we get a unique one-to-one mapping between all the points on successive contours. However, the axial paths may not be properly distributed (congested in some places and far apart in others) on the STV surface. To uniformly distribute the axial lines on the STV surface, we propose a smoothing technique in the next section.

3.2.3 STV smoothing

In the context of 3D surface reconstruction, smoothing has received a considerable amount of interest and numerous approaches have been proposed for generating smooth surfaces suitable for computing differential surface properties. In our case, to evenly distribute the axial paths, each point with a parameter value u_p is shifted along the contour (u) such that the transformed point satisfies the constraint u_p = w·u_{-1} + (1 − w)·u_{+1}, where u_{-1} and u_{+1} represent the adjacent points on either side of u_p (and w = 0.4). Figure 4f illustrates the effect of our smoothing algorithm. It can be observed that, compared to stage 5 (Fig. 4e), the axial paths are uniformly distributed, and this generates a smooth, parameterized STV.

Figure 6 (I–III) shows the results of our experiments on STV generation for three different video shots taken from different benchmark datasets. Figure 6a shows the segmented foreground blobs (for a few uniformly spaced frames), stacked along the trajectory (indicated by a dashed line). The initial and the final STVs are shown in Fig. 6b and c, respectively. Enlarged views of small corresponding patches on the STV surfaces highlight the importance of the path re-positioning and smoothing process. For example, in Fig. 6 (I), it can be seen that the axial paths are non-uniformly spread over the STV surface. After smoothing they are re-positioned in a uniform manner (in part c). This shows that the simple process of detecting point correspondences and stacking the shape contours of the foreground blob is not sufficient to generate a proper STV. This smooth, parameterized STV surface is processed further for video content representation.
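The following minimal sketch illustrates one re-positioning pass of the constraint u_p = w·u_{-1} + (1 − w)·u_{+1} from Sect. 3.2.3, assuming the contour-parameter values of the axial paths are stored in a 2D array U[time, path]. The wrap-around treatment of the closed contour and the single-pass application are our assumptions; the paper does not state how many passes are applied.

import numpy as np

def reposition_axial_paths(U, w=0.4):
    """One re-positioning pass: every point is moved so that its contour
    parameter equals w*u_{-1} + (1 - w)*u_{+1} of its old neighbours."""
    left = np.roll(U, 1, axis=1)    # u_{-1}: previous axial path (wrap-around)
    right = np.roll(U, -1, axis=1)  # u_{+1}: next axial path (wrap-around)
    return w * left + (1.0 - w) * right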
Fig. 6 Steps of STV creation for three different video shots, having linear and non-linear motion trajectories. a Figure shows the segmented foreground blob and the motion trajectory (superimposed as a dashed line). The enlarged views of small corresponding patches, given in between (b) and (c), highlight the result of the smoothing and parameterization process for STV generation

3.3 Properties of the parameterized STV

In our case, the parameterized STV has been generated using the set of contours extracted from the segmented foreground blobs. We consider the STV as a manifold M(x, y, t), which can be approximated by computing the surface equations for each small surface patch. We can therefore represent the STV using a 2-D parametric representation. If we consider the arc length of the contour as u and the time as v, then we can write M using this (u, v) parameterization as:

M = S(u, v) = [x(u, v), y(u, v), v]   (1)

In this representation, arc length encodes the object shape (contour) and time encodes the motion trajectory. With this parametric representation, fixing the u parameter generates the 2D motion trajectory of a point on the object boundary. Similarly, fixing the v parameter generates the object contour at time v. A parameterized surface (STV) of this kind motivates us to model the time-varying changes of the geometric properties of surface points as a time series. In the next section, we discuss our proposed feature extraction and time series modeling process.

4 Modeling the STV as a time series

We model video content as an LDS, which represents a class of parametric models for time series. This assumes that the video frames are the output of a dynamical system driven by an IID process. We compute the parameters of an ARMA model and use them for matching a pair of video shots.

4.1 Feature extraction from the STV

In our approach, the differential properties of the STV capture the video content in terms of "what is moving?" and "how is it moving?". These properties are implicitly encoded on the STV surface and can be analyzed through the surface geometry. These geometric features encode the shape and motion changes simultaneously.

According to Gauss's famous Theorema Egregium [30], Gaussian curvature is an intrinsic invariant of a surface. The sign of the curvature defines the surface type and the value defines the surface sharpness. Let S(u, v) be our parametric STV, and let S_u and S_v denote the partial derivatives of S with respect to u and v. The normal vector to the STV is perpendicular to the tangent vectors S_u and S_v. Therefore, the unit normal is given by

N = N(u, v) = (S_u × S_v) / |S_u × S_v|   (2)

Gaussian curvature for a parametric surface is defined in terms of the first fundamental form [30]:

I = [ S_u · S_u   S_u · S_v ]
    [ S_v · S_u   S_v · S_v ]   (3)

and the second fundamental form (also known as the shape tensor):

II = [ S_uu · N   S_uv · N ]
     [ S_vu · N   S_vv · N ]   (4)

Since S_u · N = 0, partial differentiation with respect to u yields

∂(S_u · N)/∂u = S_uu · N + S_u · N_u = 0  ⇒  S_uu · N = −S_u · N_u

The other terms in Eq. (4) can be established in a similar manner. Therefore, Eq. (4) can be re-written as:

II = [ −S_u · N_u   −S_u · N_v ]
     [ −S_v · N_u   −S_v · N_v ]   (5)

The Gaussian curvature at a point (u, v) on the smooth parameterized STV is obtained by

K_G = det(II) / det(I)   (6)
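The following hedged numerical sketch evaluates Eqs. (2)–(6) on a sampled parametric surface S(u, v) = [x(u, v), y(u, v), v]. The finite-difference discretization (np.gradient) and the handling of the closed contour are our choices; the paper does not specify a numerical scheme.

import numpy as np

def gaussian_curvature(S):
    """S: array of shape (nu, nv, 3) holding [x, y, t] at each (u, v) sample."""
    Su, Sv = np.gradient(S, axis=0), np.gradient(S, axis=1)   # first derivatives
    Suu = np.gradient(Su, axis=0)
    Suv = np.gradient(Su, axis=1)
    Svv = np.gradient(Sv, axis=1)                              # second derivatives

    N = np.cross(Su, Sv)                                       # Eq. (2): surface normal
    N /= np.linalg.norm(N, axis=-1, keepdims=True) + 1e-12

    E = np.einsum('...k,...k', Su, Su)                         # first fundamental form, Eq. (3)
    F = np.einsum('...k,...k', Su, Sv)
    G = np.einsum('...k,...k', Sv, Sv)
    L = np.einsum('...k,...k', Suu, N)                         # second fundamental form, Eq. (4)
    M = np.einsum('...k,...k', Suv, N)
    P = np.einsum('...k,...k', Svv, N)

    return (L * P - M ** 2) / (E * G - F ** 2 + 1e-12)         # Eq. (6): det(II)/det(I)

Plotting the returned K_G grid over the STV surface gives a Fig. 7-style visualization of the curvature dynamics.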
Fig. 7 Gaussian curvature plotted on the STV generated for four different actions (bend, running, walking and waving two hands) performed by four different subjects (a–d). Different colors represent the value of the curvature (red positive, blue negative). Videos were taken from [38] and [22] (color figure online)

Figure 7 depicts the Gaussian curvature computed for different STVs created using our proposed algorithm. In Fig. 7, higher values of curvature are coded in 'red' and relatively smoother (flat) regions are coded in 'green'. It can be seen from the figure that similar classes of content depict similar patterns in terms of the structure of the STV. Moreover, the results also reveal that different action sequences have distinct patterns in the dynamics of Gaussian curvature over the STV surface. Intra-class similarity and inter-class dissimilarity in the change of Gaussian curvature over time are present for different categories of videos. For example, in the case of STVs generated from videos where a person is waving both hands, we can observe that the undulation on the STV surface has a specific pattern of Gaussian curvature across time, which is unique to this particular class. At any time instance it represents the action state of the video object. This motivates us to treat the change of Gaussian curvature over time as a time series and represent it using the parameters of an ARMA model.

4.2 Time-series representation of features

As discussed earlier, the parametric representation of the STV surface in (u, v) space helps in encoding the shape and motion of the moving object. With such a representation we can visualize the time-varying change of surface curvature as time series data. The Gaussian curvature of the points on a radial line, i.e., the shape contour (a sequence of u's), represents the state of the system at a particular time instance (v). Therefore, at each point on a given axial line on the STV surface we measure the change in curvature over time. Thus, the STV surface can be considered as a dynamical system and the Gaussian curvature extracted at each point (u, v) on the STV surface can be written as a function of the current state:

state_{v+1} = φ(state_v),   output_v = ψ(state_v)   (7)

Here, φ and ψ are linear functions. In this study, the observation of the geometric features on the STV surface is modeled using the ARMA representation, which is given by:

x_v = − Σ_{i=1}^{p} a_i x_{v−i} + Σ_{i=0}^{q} b_i e_{v−i}   (8)
The response of the system at sample index v, denoted by x_v, is expressed as a function of p previous observations and q residual error terms. p and q are often referred to as the model orders of the AR and MA processes, respectively. The weights of the previous observations and residual error terms, a_i and b_i, are called the AR and MA coefficients, respectively. It is assumed that the residual error of the ARMA model, e_v, is influenced by the unknown excitation to the structure and is modeled by a white-noise process with variance σ².

In this work, we have extracted the Gaussian curvature at each point on the parameterized STV surface (see Sect. 4.1). Features computed for each radial axis are combined row-wise to form the feature matrix. Let F = {f_1, f_2, ..., f_t} be the feature matrix obtained using the sequences of features extracted from each radial axis. The bold face of the elements of the matrix F signifies that they are row vectors. The dimension of each such feature vector is 200 (equal to the number of points on the shape contour). To model the extracted features, the time series model parameters need to be identified. We have adapted the closed-form sub-optimal solution discussed in [17] to extract these model parameter values, and hence use this variant of the ARMA model. For each video shot in the database, these model parameters are estimated and stored in the feature database. The query video undergoes the same feature extraction process and then those features are matched with the features of all the videos in the database. This matching is performed using a time series based matching technique, which is discussed in the next section.
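The sketch below shows a closed-form, sub-optimal LDS fit in the spirit of the procedure of Doretto et al. [17], applied to a feature matrix with one 200-dimensional curvature vector per time sample. The function name fit_lds, the state dimension d, and the mean subtraction are our choices, not details taken from the paper.

import numpy as np

def fit_lds(Y, d=5):
    """Y: feature matrix of shape (200, t); one column per time sample.
    Returns the observation matrix C (200 x d) and transition matrix A (d x d)."""
    Y = Y - Y.mean(axis=1, keepdims=True)           # remove the temporal mean
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                                    # observation matrix
    Z = np.diag(s[:d]) @ Vt[:d, :]                  # estimated state sequence
    # least-squares fit of the state transition: Z[:, 1:] ~= A @ Z[:, :-1]
    A = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])
    return C, A

Under this reading, the pair (C, A) plays the role of the model parameter set stored per video shot in the feature database.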

4.3 Time series based STV matching

In [37], it was shown that the parameters of linear dynamical models can be identified with finite-dimensional linear subspaces of appropriate dimensions. The subspace angle has been defined in [28] as a metric between a pair of ARMA models. Given two matrices H_1 and H_2, the principal angles between the subspaces spanned by H_1 and H_2 are denoted by an n-tuple, where n is the order of the model:

H_1 ∧ H_2 = (θ_1, θ_2, ..., θ_n),   θ_i ≥ θ_{i+1} ≥ 0   (9)

Here, the θ_i are the subspace angles between H_1 and H_2. The subspace angle based distance (d) between two ARMA models H_1 and H_2 of order n is defined [28] as:

d²_{H_1,H_2} = −ln Π_{i=1}^{n} cos²(θ_i)   (10)

The distance is based on the largest principal angle between the two models. The value of d is used to match a pair of video shots. Based upon this match-cost the retrieved videos are rank ordered and the results are displayed. Details of the experiments are discussed next.
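As an illustration of Eqs. (9)–(10), the sketch below computes the principal angles between the subspaces spanned by two model parameter matrices and evaluates the subspace-angle distance. Feeding the raw parameter matrices (e.g., the observation matrices from the previous sketch) directly is a simplification of ours; Martin [28] defines the angles via the models' extended observability subspaces.

import numpy as np
from scipy.linalg import subspace_angles

def subspace_distance(H1, H2):
    theta = subspace_angles(H1, H2)             # principal angles, Eq. (9)
    d2 = -np.log(np.prod(np.cos(theta) ** 2))   # Eq. (10)
    return np.sqrt(d2)

A query would then be compared against every entry in the feature database with this distance, and the retrieved shots rank ordered by increasing d.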
5 Experimental setup

We have verified the performance of our proposed CBVR framework through an extensive experimental study, using four datasets of real-world, static camera video shots. The performance of state-of-the-art foreground segmentation algorithms on moving (and unconstrained) camera video shots is far from satisfactory, while algorithms for static camera shots are stable and accurate. This motivates us to experiment using video shots with static cameras only. The use of moving camera video shots requires further research and is beyond the scope of the current work. In this section, we first describe the datasets on which we performed our experiments.

5.1 Weizman dataset

We tested our proposed framework on the Weizman action dataset [22]. The dataset contains 93 low-resolution video shots, captured with a static camera at a frame rate of 50 fps. It has 10 natural actions, namely: (i) running (run), (ii) walking (walk), (iii) skip, (iv) jumping-jack (jack), (v) jump forward on two legs (jump), (vi) jump in place on two legs (pjump), (vii) gallop sideways (side), (viii) wave two hands (wave2), (ix) wave one hand (wave1), and (x) bend forward (bend). Each action is performed by nine different subjects. To increase the number of data samples, we extend the dataset by augmenting it with the horizontally flipped version of each sequence.

5.2 KTH dataset

The KTH dataset [33] has six types of human actions, namely: (i) walking, (ii) jogging, (iii) running, (iv) boxing, (v) hand waving and (vi) hand clapping. In a given video shot one action is repeatedly performed by a subject. The videos were shot in four different scenarios, with variation in scene (indoor, outdoor), scale, and appearance (different clothes). In total the dataset contains 2,391 sequences, shot with a static camera at a 25 fps frame rate. The average duration of each video shot is approximately 4 s.
5.3 UT-tower dataset

The UT-tower dataset [12] consists of 108 low-resolution (360 × 240) video sequences with a frame rate of 10 fps. It has nine types of actions: (i) pointing in a particular direction (pointing), (ii) standing, (iii) digging, (iv) walking, (v) carrying an object (carrying), (vi) running, (vii) wave one hand (wave1), (viii) wave two hands (wave2), and (ix) jump in place on two legs (jumping), each performed 12 times by six subjects. All the videos were shot in two different environmental settings: (i) concrete square and (ii) lawn. The primary challenge is to recognize the content in low-resolution videos with blurry visual cues. For our experiments we have used the foreground masks provided with this dataset for each video. We performed morphological operations (erosion and dilation) on the binary masks to get the foreground blob. We have extracted the contours and the trajectory using the given foreground masks and then followed the parameterization process.

5.4 VPLab dataset

We have also recorded real-world video shots of different outdoor locations, using a hand-held Sony camcorder held still, and have also downloaded videos from the internet. The dataset in [38] consists of five different categories of video shots: (i) videos shot from the top floor of the department building (aerial), (ii) bicycle and motorbike (bike), (iii) moving car (car), (iv) human walking (walk), and (v) cartoon. All the videos were hand labeled by the authors. As previously discussed, the assignment was done based on the dominant content present in the video. For simplicity, in all the videos used for experimental purposes, there is only one content in a shot (or clip).

Fig. 8 Performance of our proposed algorithm on the Weizman dataset [22]. a Confusion matrix showing pairwise subspace distance between different classes of videos; retrieval results showing the top eight videos (median frames) for two different classes of query, b running, c waving one hand (wave1)

Fig. 9 Performance of our proposed algorithm on the KTH dataset [33]. a Confusion matrix showing pairwise subspace distance between different classes of videos; retrieval results showing the top eight videos (median frames) for two different classes of query, b jogging, c hand clapping

Fig. 10 Performance of our proposed algorithm on the UT-tower dataset [12]. a Confusion matrix showing pairwise subspace distance between different classes of videos; retrieval results showing the top eight videos (median frames) for two different queries, b running, c pointing

Fig. 11 Performance of our proposed algorithm on the VP Lab dataset [38]. a Confusion matrix showing pairwise subspace distance between different classes of videos; retrieval results showing the top eight videos (median frames) for two different queries, b car, c aerial
Fig. 12 Comparative results showing median frames of the top six retrieved video shots for three different queries using a Liang et al. [25], b Gao and Yang [21], c EMST-CSS [10], and d our proposed method. The three query video shots belong to the Weizman [22], VP Lab [38] and KTH [33] datasets, respectively

6 Experimental results

In this section we evaluate, both qualitatively and quantitatively, the performance of our proposed CBVR framework using the four datasets described above, and compare it with three state-of-the-art methods.

6.1 Retrieval performance

We have performed CBVR on all the above mentioned datasets individually. Figures 8–11 depict the retrieval performance of our proposed CBVR framework. In all four figures, part (a) shows the confusion matrix depicting the pairwise distances between videos of different classes. Recall that the distance here refers to the subspace angle based distance between the ARMA parameters computed from the STVs. Parts (b) and (c) depict the top eight retrieved videos (only the median frame is shown) for two different queries. Video shots which are wrongly retrieved are highlighted with a "red" bounding box. It is evident from the confusion matrices that our proposed method is able to discriminate (diagonally dominant similarity and larger off-diagonal terms) among different classes of video shots.
6.2 Comparative study

We have compared our approach with three baseline spatio-temporal methods for CBVR [10,21,25]. Retrieval results are shown for visual comparison in Fig. 12, for three query video shots from the datasets [12,22,33]. Median frames of the top six retrieved results are shown row-wise in all cases. Wrongly retrieved results are indicated by a red bounding box. Since the ARMA model allows for general dynamic relationships among the variables in a system, it is better suited to represent the dynamics of the STV and thus provides more accurate retrieval results than [10,21,25]. However, in cases where the action classes are non-orthogonal, i.e., the inter-class similarity is high, the ranking of our method is also prone to error. For example, consider query 3 in Fig. 12 (d), where a clapping sequence is misclassified as waving two hands. This observation is further emphasized by the quantitative analysis done using the precision-recall and F-measure metrics.

Fig. 13 Comparison of retrieval performance using the Precision-Recall metric, for four different datasets: a Weizman [22], b KTH [33], c UT-tower [12], d VP Lab [38]

Fig. 14 Comparison of retrieval performance using the F-score metric, for four different datasets

6.3 Quantitative analysis

For a retrieval system the most useful performance measures include precision, recall, and F-measure. We computed the precision-recall and F-measure values for the three state-of-the-art methods [10,21,25] and compared them with our proposed method. Figure 13 depicts the precision-recall graphs for experiments on four different benchmark datasets, comparing our results with [21,25] and [10]. For example, in the case of videos from [22] (Fig. 13a), our approach (black line with × marker) clearly outperforms the other three approaches by showing higher precision and recall values. Similarly, the precision curves in Fig. 13b–d clearly exhibit the superior performance of our proposed multivariate time series based approach for CBVR. Since our criterion for better performance is the quality of ranking, it should be noted that the curves for the time series based approach show more relevant videos among the top ranked results.

Figure 14 shows that our approach provides a higher F-measure and outperforms the other three approaches. This supports the claim that the proposed approach is able to efficiently represent the content and produces superior results for video matching.
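For completeness, the short sketch below shows how precision, recall and F-measure can be computed for a ranked retrieval list produced by the matching stage. The rank cut-off k and the function name are our own choices, since the paper reports the metrics without spelling out the computation.

def precision_recall_f(ranked_labels, query_label, n_relevant, k):
    """ranked_labels: class label of each retrieved shot in rank order;
    query_label: class of the query; n_relevant: number of shots of that
    class in the database; k: rank cut-off."""
    top_k = ranked_labels[:k]
    hits = sum(1 for lbl in top_k if lbl == query_label)
    precision = hits / k
    recall = hits / n_relevant
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure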
7 Conclusion and discussion

We draw upon insights from differential geometry and time series modeling to propose a framework for the CBVR task. We construct a smooth, parameterized STV from a given video shot, captured using a static camera, and compute a geometric feature (Gaussian curvature). These geometric features are then represented as the parameters of an LDS, using an ARMA model. Similarity between two videos is computed by measuring the subspace angle based distance between the ARMA parameters. Utilizing the proposed video matching framework, we have achieved very promising and superior performance.

In future, we plan to explore the possibility of extending this framework to moving camera video shots. Foreground segmentation is an intrinsic part of our proposed framework, and the performance of the subsequent steps relies on efficient foreground segmentation from videos shot using a moving camera. Further, we plan to explore the use of interest point descriptor based features for video content representation.

Acknowledgments We are thankful to Prof. Sukhendu Das and the members of the Visualization and Perception Lab for their support and comments. Additional thanks to Debarun Kar and Prahllad Deb for their insightful suggestions.
References

1. Aggarwal G, Chowdhury A, Chellappa R (2004) A system identification approach for video-based face recognition. In: ICPR, pp 175–178
2. Auguste R, El Ghini A, Bilasco M, Ihaddadene N, Djeraba C (2010) Motion similarity measure between video sequences using multivariate time series modeling. In: ICMWI, pp 292–296
3. Babu RV, Ramakrishnan KR (2007) Compressed domain video retrieval using object and global motion descriptors. Multimed Tools Appl 32(1):93–113
4. Barnich O, Van Droogenbroeck M (2011) ViBe: a universal background subtraction algorithm for video sequences. IEEE Trans Image Process 20(6):1709–1724
5. Bashir FI, Khokhar AA, Schonfeld D (2007) Real-time motion trajectory-based indexing and retrieval of video sequences. IEEE Trans Multimed 9:58–65
6. Bissacco A, Chiuso A, Ma Y, Soatto S (2001) Recognition of human gaits. In: CVPR, vol 2, pp 52–57
7. Bobick AF, Davis JW (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267
8. Brendel W, Todorovic S (2010) Activities as time series of human postures. In: ECCV, pp 721–734
9. Chattopadhyay C, Das S (2012) A novel hyperstring based descriptor for an improved representation of motion trajectory and retrieval of similar video shots with static camera. In: Emerging Area in Information Technology (EAIT)
10. Chattopadhyay C, Das S (2012) Enhancing the MST-CSS representation using robust geometric features, for efficient content based video retrieval (CBVR). In: ISM
11. Chellappa R, Sankaranarayanan AC, Veeraraghavan A, Turaga P (2010) Statistical methods and models for video-based tracking, modeling, and recognition. Found Trends Signal Process 3:1–151
12. Chen CC, Ryoo MS, Aggarwal JK (2010) UT-tower dataset: aerial view activity classification challenge. http://cvrc.ece.utexas.edu/SDHA2010/Aerial_View_Activity.html
13. Chen PY, Chen ALP (2003) Video retrieval based on video motion tracks of moving objects. In: Proceedings of SPIE, vol 5307, pp 550–558
14. Cui B, Zhao Z, Tok WH (2012) A framework for similarity search of time series cliques with natural relations. IEEE Trans Knowl Data Eng 24(3):385–398
15. Das S, Chattopadhyay C, Dyana A (2011) Vidlookup: a web-based online CBVR system for query video shots. Demo at ICCV
16. Deng Y, Manjunath BS (1998) NeTra-V: toward an object-based video representation. IEEE Trans Circuits Syst Video Technol 8:616–627
17. Doretto G, Chiuso A, Wu YN, Soatto S (2003) Dynamic textures. Int J Comput Vis 51:91–109
18. Dyana A, Das S (2009) Trajectory representation using Gabor features for motion-based video retrieval. Pattern Recogn Lett 30:877–892
19. Erol B, Kossentini F (2005) Shape-based retrieval of video objects. IEEE Trans Multimed 7:179–182
20. Florez OU, Lim S (2009) Discovery of time series in video data through distribution of spatiotemporal gradients. In: ACM Symposium on Applied Computing
21. Gao HP, Yang ZQ (2010) Content based video retrieval using spatiotemporal salient objects. In: Intelligence Information Processing and Trusted Computing (IPTC)
22. Gorelick L, Blank M, Shechtman E, Irani M, Basri R (2007) Actions as space-time shapes. IEEE Trans Pattern Anal Mach Intell 29(12):2247–2253
23. Hsieh JW, Yu SL, Chen YS (2006) Motion-based video retrieval by trajectory matching. IEEE Trans Circuits Syst Video Technol 16:396–409
24. Lee SL, Chun SJ, Kim DH, Lee JH, Chung CW (2000) Similarity search for multidimensional data sequences. In: Proceedings of ICDE
25. Liang B, Xiao W, Liu X (2012) Design of video retrieval system using MPEG-7 descriptors. Procedia Eng 29:2578–2582
26. Lin J, Li Y (2009) Finding structural similarity in time series data using bag-of-patterns representation. In: SSDBM, pp 461–477
27. Ma Y, Zhang H (2002) Motion texture: a new motion based video representation. In: International Conference on Pattern Recognition
28. Martin R (2000) A metric for ARMA processes. IEEE Trans Signal Process 48(4):1164–1170
29. Meyers D, Skinner S, Sloan K (1992) Surfaces from contours. ACM Trans Graph 11(3):228–258
30. O'Neill B (1997) Elementary differential geometry, 2nd edn. Academic Press, New York
31. Popivanov I, Miller RJ (2002) Similarity search over time series data using wavelets. In: Proceedings of the 18th ICDE, pp 212–221
32. Rodriguez MD, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of CVPR
33. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: ICPR
34. Scott C, Nowak R (2006) Robust contour matching via the order-preserving assignment problem. IEEE Trans Image Process 15(7):1831–1838
35. Srivastava A, Turaga P, Kurtek S (2012) On advances in differential-geometric approaches for 2D and 3D shape analyses and activity recognition. IVC 30:398–416
36. Turaga P, Veeraraghavan A, Srivastava A, Chellappa R (2011) Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Trans Pattern Anal Mach Intell 33(11):2273–2286
37. Turaga PK, Veeraraghavan A, Srivastava A, Chellappa R (2011) Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Trans Pattern Anal Mach Intell 33(11):2273–2286
38. VPLab-VID: http://www.cse.iitm.ac.in/vplab/videos.html
39. Yilmaz A, Shah M (2008) A differential geometric approach to representing the human actions. Comput Vis Image Underst 109(3):335–351
40. Zhang D, Zuo W, Zhang D, Zhang H (2010) Time series classification using support vector machine with Gaussian elastic metric kernel. In: Proceedings of ICPR, pp 29–32