
Proceedings of 2010 IEEE 17th International Conference on Image Processing September 26-29, 2010, Hong Kong

RECOGNIZING OFFENSIVE STRATEGIES FROM FOOTBALL VIDEOS

Ruonan Li and Rama Chellappa

Center for Automation Research, University of Maryland, College Park, MD 20742, USA

ABSTRACT

We address the problem of recognizing offensive play strategies from American football play videos. Specifically, we propose a probabilistic model which describes the generative process of an observed football play and takes into account practical issues in real football videos, such as difficulty in identifying offensive players, view changes, and tracking errors. In particular, we exploit the geometric properties of the nonlinear spaces of the involved variables and design statistical models on these manifolds. Recognition is then performed via an 'analysis-by-synthesis' technique. Experiments on a newly established dataset of American football videos demonstrate the effectiveness of the approach.

Index Terms— Video Analysis, Activity Recognition

1. INTRODUCTION

Analysis of football/soccer videos has been an active research topic for years [1, 2, 3]. These efforts attempted to detect or recognize specific types of semantics in the games using camera motion, color, low-level motion, field markers, lines, texture, and so on. In this paper we address a relatively new problem - the recognition of offensive play strategies solely from the motion trajectories of individual players. Very limited efforts have been made on this task, including [4] and recent papers from our group [5, 6] (also see [7] on soccer event analysis). Though promising results were reported, these preliminary attempts were generally based on one or more of the following assumptions: 1) the offensive players and their roles in the play are already identified; 2) a fixed camera is employed or the view angle is known; 3) the motion trajectories are complete and noise-free. When dealing with real football videos, these assumptions are unrealistic. The offensive players may not be easily identified due to quantization, resolution, or other issues of video quality, where the colors of the sportswear are not discriminative. The games are normally recorded using multiple cameras whose placements and view angles may not always be accurately available. Moreover, the motion trajectories of the players will be noisy, fragmented, and even missing, as they are generated by a tracking algorithm which may be non-robust.

We propose a framework which eliminates all of the above assumptions, to recognize offensive strategies from realistic football play videos. Specifically, we set up a probabilistic generative model to characterize the motion trajectories of players, taking into account play types, motions of individual players, as well as view changes. The recognition, as a result, is realized in an 'analysis-by-synthesis' manner. Experiments on a newly established football play dataset demonstrate the performance of the proposed approach.

2. PROBABILISTIC GENERATIVE MODEL FOR A FOOTBALL PLAY

Given an observed football play 𝑂, we would like to classify it into one of the offensive strategies, denoted as 𝐴. We aim to achieve this by maximizing the posterior probability 𝑃(𝐴∣𝑂), which is usually evaluated as 𝑃(𝑂∣𝐴)𝑃(𝐴). The observed data 𝑂 is essentially a collection of 𝑛 motion trajectories in the image plane, among which 𝑚 come from the offensive team. In football plays, we have 𝑚 = 11 and usually 𝑛 ≥ 22. To model the observation generating process from 𝐴 to 𝑂, we decompose 𝑃(𝑂∣𝐴) into several components corresponding to sequential data generation steps as

𝑃(𝑂∣𝐴) = ∫ 𝑃(𝑂∣𝑇, 𝑣) 𝑃(𝑣) 𝑃(𝑇∣𝐷) 𝑃(𝐷∣𝑓, 𝐴) 𝑃(𝑓∣𝐴) 𝑑𝑓 𝑑𝐷 𝑑𝑇 𝑑𝑣,   (1)

where a Markovian property is assumed as needed.

Now we explain each of the above quantities and the corresponding data generating step. 1) 𝑓 is called the spatial co-occurrence function, which is used to describe the spatial distribution of the different types of individual trajectories involved in the offensive play, and 𝑃(𝑓∣𝐴) characterizes the probabilistic variation of 𝑓. This representation is adopted from [5] and will be introduced in the following subsection. 2) 𝐷 is the collection of 𝑚 trajectories in the real ground area, and 𝑃(𝐷∣𝑓, 𝐴) tells us how to generate the true motion of the offensive group in the field from a distribution 𝑓 of strategy 𝐴. 3) 𝑇 denotes the entire set of 𝑛 trajectories, including those from defensive players and referees, and 𝑃(𝑇∣𝐷) provides the mechanism giving rise to the complete motion information on the ground plane. 4) Finally, 𝑣 is the view transform which brings the trajectories from the ground plane onto the image plane, while 𝑃(𝑣) characterizes the view change. We now elaborate on these probabilistic mechanisms.

1) From Offensive Type to Co-occurrence Function. The spatial co-occurrence function 𝑓(𝑤, Ω) is a two-argument non-negative function with 𝑤 ∈ W as the label of a trajectory type and Ω ∈ Π as the label of a spatial area/partition of the field where the play occurs. If 𝐹(𝑤, Ω) gives the number of occurrences of single-object motion 𝑤 within the spatial area Ω, then 𝑓 is defined as 𝐹 scaled to unit norm [5].



A simple toy example for the case of football is that W = {acceleration, deceleration, left turn, right turn} and Π = {middle of the field, side of the field}. Then 𝐹(acceleration, middle of the field) = 2, or 𝑓(acceleration, middle of the field) = √(2/11), means that there are two players accelerating in the middle of the field during the play. In practice, we do not arbitrarily specify the trajectory types W, but learn a 'vocabulary' of all possible types in an unsupervised manner as in [5]. To effectively describe the spatial distribution of each type of trajectory, we only take into account the minimum square area enclosing all trajectories, and partition it into front, middle, and rear vertically and left, middle, and right horizontally, totalling nine spatial areas. Other partition schemes can also be attempted to trade off performance and complexity, while we currently use a 3 × 3 pattern for simplicity.
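As an illustration only (not the authors' code; names such as `word_labels` and `cell_labels` are ours), the following sketch accumulates the count table 𝐹(𝑤, Ω) over a field partition and scales it to a unit-norm co-occurrence function 𝑓, assuming each offensive trajectory has already been assigned a vocabulary word and a spatial cell, and assuming the unit-norm scaling is the square-root form that places 𝑓 on the unit sphere.

```python
import numpy as np

def cooccurrence(word_labels, cell_labels, n_words, n_cells=9):
    """Build the count table F(w, Omega) and its unit-norm version f.

    word_labels : trajectory-type index for each offensive player (length m)
    cell_labels : partition-cell index for each offensive player (length m)
    """
    F = np.zeros((n_words, n_cells))
    for w, c in zip(word_labels, cell_labels):
        F[w, c] += 1.0                      # one occurrence of motion w in area c
    # Scale so that sum_{w,Omega} f(w,Omega)^2 = 1; taking f = sqrt(F / sum(F))
    # is an assumption consistent with the arccos distance used below.
    f = np.sqrt(F / F.sum())
    return F, f

# Toy example from the text: 2 of the 11 players accelerate in the middle of the field.
F, f = cooccurrence(word_labels=[0] * 2 + [1] * 9, cell_labels=[4] * 2 + [0] * 9, n_words=4)
print(F[0, 4], f[0, 4])   # 2.0, sqrt(2/11)
```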
A particular instance of an offensive play will correspond to a co-occurrence function, which in a sense serves as a 'descriptor' of that offense. Given an offense type 𝐴 and its instances, a collection of co-occurrence functions 𝑓 from these instances is available to characterize this type of offensive play. To statistically capture the variability of 𝑓 for a particular 𝐴, i.e., 𝑃(𝑓∣𝐴), we propose to use a parametric model

𝑃(𝑓∣𝐴) = 𝑝(𝑓; 𝜇𝐴, 𝜎𝐴, 𝑧𝐴) = (1/𝑧𝐴) exp(−𝑑²(𝑓, 𝜇𝐴)/(2𝜎𝐴²)),   (2)

which behaves like a 'Gaussian' distribution. The reason it is not Gaussian is that the space of all 𝑓's for a fixed number of objects is not Euclidean but a curved manifold, on which the intrinsic distance between 𝑓₁ and 𝑓₂ is 𝑑(𝑓₁, 𝑓₂) ≜ cos⁻¹(Σ_{𝑤,Ω} 𝑓₁(𝑤, Ω)𝑓₂(𝑤, Ω)). However, the parameters 𝜇𝐴, 𝜎𝐴, 𝑧𝐴 have physical meanings similar to the 'mean', the 'variance' and the normalizing factor, respectively. Note that we use the compact parametric form above for its simplicity and effectiveness, but more complex and elegant ones can also be considered if needed.

As for learning the parameters 𝜇𝐴, 𝜎𝐴, 𝑧𝐴 from a set of training co-occurrence functions {𝑓𝑖}, 𝑖 = 1, …, 𝑁, of strategy 𝐴, the 'mean' 𝜇𝐴 is obtained iteratively by 𝜇𝐴′⁽ᵍ⁺¹⁾ = (1/𝑁) Σ_{𝑖=1}^{𝑁} L_{𝜇𝐴⁽ᵍ⁾}(𝑓𝑖) and 𝜇𝐴⁽ᵍ⁺¹⁾ = E_{𝜇𝐴⁽ᵍ⁾}(𝜇𝐴′⁽ᵍ⁺¹⁾), where L and E are the logarithmic and exponential maps, respectively (see Appendix). The 'variance' 𝜎𝐴 is then calculated as 𝜎𝐴 = ((1/𝑁) Σ_{𝑖=1}^{𝑁} 𝑑²(𝑓𝑖, 𝜇𝐴))^{1/2}. The normalizing factor 𝑧𝐴 can be estimated by Monte Carlo simulation, which is however not needed in this work.
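To make the learning step concrete, here is a small sketch (our own illustration, with function names of our choosing) of the iterative intrinsic mean and the 'variance' on the sphere of co-occurrence functions, using the exponential and logarithmic maps of the Appendix.

```python
import numpy as np

def sphere_log(mu, f):
    """Logarithmic map L_mu(f): tangent vector at mu pointing toward f."""
    f_star = f - np.sum(f * mu) * mu          # component of f orthogonal to mu
    norm = np.linalg.norm(f_star)
    if norm < 1e-12:
        return np.zeros_like(mu)
    return np.arccos(np.clip(np.sum(f * mu), -1.0, 1.0)) * f_star / norm

def sphere_exp(mu, t):
    """Exponential map E_mu(t): point on the sphere reached from mu along t."""
    norm = np.linalg.norm(t)
    if norm < 1e-12:
        return mu.copy()
    return np.cos(norm) * mu + np.sin(norm) * t / norm

def intrinsic_mean_and_std(fs, n_iter=20):
    """Iterative estimate of mu_A and sigma_A for the curved Gaussian of Eq. (2)."""
    mu = fs[0].copy()
    for _ in range(n_iter):
        tangent_mean = np.mean([sphere_log(mu, f) for f in fs], axis=0)
        mu = sphere_exp(mu, tangent_mean)
    dists = [np.arccos(np.clip(np.sum(f * mu), -1.0, 1.0)) for f in fs]
    sigma = np.sqrt(np.mean(np.square(dists)))
    return mu, sigma
```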
To get a new instance of a co-occurrence function 𝑓 of type 𝐴, we need to generate new samples from 𝑃(𝑓∣𝐴). For this purpose, we generate a function 𝑓′ in T_{𝜇𝐴}, the tangent space at 𝜇𝐴, such that ⟨𝑓′, 𝑓′⟩ = 1. Then we generate a Gaussian random number 𝑟 ∼ N(0, 𝜎𝐴²) and obtain the new co-occurrence function as 𝑓 = cos(𝑟)𝜇𝐴 + sin(𝑟)𝑓′. One issue associated with this approach is that negative values of 𝑓 may occur, in which case we discard the generated sample. Another issue comes from the integer requirement on 𝐹, which may not be satisfied by the generated function. To overcome this we change the generated co-occurrence function to the closest one with integer counts.
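The sampling step might be sketched as follows (an illustrative reading of the text; the rejection of negative values and the rounding to integer counts are handled in the simplest way we could think of, and `total_count` would be 𝑚 = 11 for football).

```python
import numpy as np

def sample_cooccurrence(mu, sigma, total_count=11, rng=np.random.default_rng()):
    """Draw f ~ P(f|A) as f = cos(r) mu + sin(r) f', then enforce validity."""
    while True:
        # Random unit tangent direction f' at mu: orthogonal to mu, unit norm.
        t = rng.standard_normal(mu.shape)
        t -= np.sum(t * mu) * mu
        f_prime = t / np.linalg.norm(t)
        r = rng.normal(0.0, sigma)
        f = np.cos(r) * mu + np.sin(r) * f_prime
        if np.any(f < 0):                    # discard samples with negative values
            continue
        # Round back to an integer-valued count table F closest to the sample,
        # then rescale to the sphere (a simple stand-in for the "closest integer" f).
        F = np.round(f ** 2 * total_count)
        if F.sum() <= 0:
            continue
        return np.sqrt(F / F.sum()), F
```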
2) Generating Ground Plane Motion Pattern. As mentioned, the complete motion trajectories in the ground plane consist of those involving both offensive players and others. We model the generation of these motions in two steps: 1) generating the offensive ones from the co-occurrence function, i.e., 𝑃(𝐷∣𝑓, 𝐴); and 2) generating the entire set of motions by 𝑃(𝑇∣𝐷).

To generate the offensive motion trajectories, we employ a table look-up approach. For each 𝑤 in the index of the table, we maintain all exemplar trajectories in the training set. Then for a specific 𝑓(𝑤, Ω) we randomly pick 𝐹(𝑤, Ω) trajectories from the 𝑤-labeled exemplars, and randomly distribute them over the area specified by Ω to obtain a 𝐷.

We assume that the other motion trajectories are uniformly spread across the entire field. For any 𝐷 generated, we may randomly select a number of motion trajectories from the set and uniformly distribute them over the field in order to reach 𝑇. However, eventually we do not need to generate 𝑇 in the algorithm, as will be explained shortly.
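A minimal sketch of the table look-up generation of 𝐷 might look like the following; the structure of the exemplar table, the region bounds, and the trajectory representation (a T × 2 array of ground-plane points) are assumptions on our side.

```python
import numpy as np

def generate_D(F, exemplar_table, region_bounds, rng=np.random.default_rng()):
    """Sample offensive trajectories D ~ P(D|f, A) by table look-up.

    F              : integer count table, shape (n_words, n_cells)
    exemplar_table : dict word -> list of exemplar trajectories (T x 2 arrays)
    region_bounds  : dict cell -> ((xmin, xmax), (ymin, ymax)) on the ground plane
    """
    D = []
    for w in range(F.shape[0]):
        for cell in range(F.shape[1]):
            for _ in range(int(F[w, cell])):
                idx = rng.integers(len(exemplar_table[w]))
                traj = exemplar_table[w][idx].copy()
                (xmin, xmax), (ymin, ymax) = region_bounds[cell]
                # Re-anchor the exemplar at a random start point inside the cell.
                start = np.array([rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)])
                D.append(traj - traj[0] + start)
    return D
```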
3) Modeling Statistical View Variability. The final step of our generative model is the transformation of the complete motion pattern to the image plane. The issue of view-invariance comes up here, since view variations usually occur among observed realizations. We propose to achieve view-invariance by explicitly and analytically exploiting the statistical view change. Specifically, we build a statistical model for the possible view variations of the static cameras.

Under the pinhole camera model, different views correspond to different coordinate transforms between the ground plane and the image plane, which are exactly characterized by 2-D planar homographies. Therefore, modeling the view variation is equivalent to modeling the variation of the homographies between image planes and the ground plane, i.e., the 𝑣's. Analytically, the 𝑣's are 3 × 3 non-singular matrices which relate the homogeneous coordinates of points in the two planes.

We now build a statistical model 𝑃(𝑣) on the space of 3 × 3 non-singular matrices 𝔾𝕃(3), i.e., the general linear group. As 𝔾𝕃(3) is again a nonlinear manifold, we exploit its intrinsic geometry. To account for a possibly complex distribution of 𝑣, we propose a 'Curved Gaussian Mixture Model' (CGMM) as a parametric distribution on 𝔾𝕃(3). Specifically, the probability density function of a K-component CGMM is

𝑃(𝑣) = Σ_{𝑘=1}^{𝐾} 𝜋𝑘 𝑝_C(𝑣; 𝜇𝑘, Σ𝑘),   (3)

where the 𝜋𝑘's are the mixing probabilities, 𝑝_C is a single 'curved' Gaussian distribution defined on 𝔾𝕃(3), 𝜇𝑘 is the mean of each Gaussian, and Σ𝑘 is the covariance defined in the tangent space at 𝜇𝑘.

We compute the plane homographies from each of the training videos by locating field markers in the image. The CGMM is then learned from a collection of training views {𝑣𝑗}, 𝑗 = 1, …, 𝑀, as follows. We first cluster the 𝑣𝑗's into different components by computing a pairwise intrinsic metric between each pair (𝑣₁, 𝑣₂) as 𝑑(𝑣₁, 𝑣₂) = ∣∣log(𝑣₁⁻¹𝑣₂)∣∣. Given the pairwise similarity metric, one can employ any suitable unsupervised clustering technique to cluster the 𝑣𝑗's; here we make use of the repeated quadratic programming algorithm used in [5]. Once we have obtained 𝐾 clusters (components), each of which contains 𝑀𝑘 samples, we may estimate the 𝑘th mixing probability as 𝜋𝑘 = 𝑀𝑘/𝑀. Then the center of each component is estimated from the samples clustered into that component, using exactly the iterations between the exponential map and the logarithmic map for Lie groups (see Appendix).

Finally, the covariance Σ𝑘 is calculated as normally done, in the tangent space at 𝜇𝑘, which contains the logarithmically mapped component samples from 𝔾𝕃(3). To simulate a new view from the learned CGMM, we randomly select a component according to the mixing probabilities, locate the center, generate a Gaussian random matrix with the covariance in the tangent plane, and finally exponentially map it to 𝔾𝕃(3).
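As an illustration of the two operations this paragraph relies on, the sketch below (ours, not the authors') computes the intrinsic distance 𝑑(𝑣₁, 𝑣₂) = ∣∣log(𝑣₁⁻¹𝑣₂)∣∣ used for clustering and simulates a new homography from a learned CGMM, drawing the tangent-space Gaussian in the coordinates 𝑣′ = 𝜇𝑘 𝜉 (one possible convention).

```python
import numpy as np
from scipy.linalg import logm, expm

def gl3_distance(v1, v2):
    """Intrinsic metric on GL(3): d(v1, v2) = ||log(v1^-1 v2)|| (Frobenius norm)."""
    return np.linalg.norm(logm(np.linalg.solve(v1, v2)), 'fro')

def sample_view(mus, Sigmas, pis, rng=np.random.default_rng()):
    """Simulate a homography from a learned CGMM: pick a component, draw a
    Gaussian tangent vector at its mean, and exponentially map back to GL(3)."""
    k = rng.choice(len(pis), p=pis)                      # mixing probabilities pi_k
    xi = rng.multivariate_normal(np.zeros(9), Sigmas[k]).reshape(3, 3)
    # Exponential map at mu_k (Appendix): E_mu(v') = mu * expm(mu^-1 v'),
    # with the 9-dimensional Gaussian sample xi playing the role of mu^-1 v'.
    return mus[k] @ expm(xi)
```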


The last component involved in the observation generating model, 𝑃(𝑂∣𝑇, 𝑣), is trivial once a view transform 𝑣 is given: the observation 𝑂 is determined by 𝑂 = 𝑣(𝑇), or 𝑇 = 𝑣⁻¹(𝑂). Therefore, 𝑃(𝑂∣𝑇, 𝑣) = 𝛿(𝑂 = 𝑣(𝑇)), or 𝑃(𝑇∣𝑂, 𝑣) = 𝛿(𝑇 = 𝑣⁻¹(𝑂)). This Kronecker delta form reduces (1) to

𝑃(𝑂∣𝐴) = ∫ 𝑃(𝑣⁻¹(𝑂)∣𝐷) 𝑃(𝑣) 𝑃(𝐷∣𝑓, 𝐴) 𝑃(𝑓∣𝐴) 𝑑𝑓 𝑑𝐷 𝑑𝑣.   (4)

3. RECOGNITION BY SYNTHESIS

With all components of the generating model defined, it is straightforward to employ the 'analysis-by-synthesis' method to evaluate the posterior probability 𝑃(𝐴∣𝑂). Specifically, we pick an offensive type 𝐴𝑖 according to the class prior 𝑃(𝐴𝑖) (assumed known), simulate a co-occurrence function 𝑓𝑖𝑗 from 𝑃(𝑓∣𝐴𝑖), generate an offensive motion pattern 𝐷𝑖𝑗 according to 𝑃(𝐷∣𝑓𝑖𝑗, 𝐴𝑖), and randomly select a view 𝑣𝑗 from 𝑃(𝑣). Then the posterior probability can be approximated as

𝑃(𝐴𝑖∣𝑂) ≐ 𝑃(𝐴𝑖) Σ_𝑗 𝑃(𝑣𝑗⁻¹(𝑂)∣𝐷𝑖𝑗) / [ Σ_𝑖 𝑃(𝐴𝑖) ( Σ_𝑗 𝑃(𝑣𝑗⁻¹(𝑂)∣𝐷𝑖𝑗) ) ]   (5)

by the Monte Carlo principle. It is worth noting that view invariance has been realized implicitly by integrating over all possible views.
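Reading Eq. (5) as pseudocode, the outer recognition loop could be sketched like this (our own illustration; `sample_f`, `generate_D`, `sample_view`, and `likelihood` stand in for the samplers and the matching-based likelihood 𝑃(𝑣⁻¹(𝑂)∣𝐷) described in this paper).

```python
import numpy as np

def posterior_by_synthesis(O, classes, prior, sample_f, generate_D, sample_view,
                           likelihood, n_samples=50000, rng=np.random.default_rng()):
    """Approximate P(A_i | O) by analysis-by-synthesis Monte Carlo (Eq. 5)."""
    scores = np.zeros(len(classes))
    for i, A in enumerate(classes):
        total = 0.0
        for _ in range(n_samples):
            f = sample_f(A, rng)             # f_ij ~ P(f | A_i)
            D = generate_D(f, A, rng)        # D_ij ~ P(D | f_ij, A_i)
            v = sample_view(rng)             # v_j  ~ P(v)
            total += likelihood(O, D, v)     # evaluates P(v_j^-1(O) | D_ij)
        scores[i] = prior[i] * total
    return scores / scores.sum()             # normalized posterior over classes
```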
Now it remains to evaluate 𝑃(𝑣⁻¹(𝑂)∣𝐷). Based on the motion generating scheme discussed in Section 2, we achieve this as follows. We look for a one-to-one correspondence of the 𝑚 trajectories 𝑡_{𝐷1}, 𝑡_{𝐷2}, ⋅⋅⋅, 𝑡_{𝐷𝑚} in 𝐷 with 𝑚 trajectories in 𝑣⁻¹(𝑂). In other words, we pick 𝑚 trajectories 𝑡₁, 𝑡₂, ⋅⋅⋅, 𝑡𝑚 from the 𝑛 trajectories in 𝑣⁻¹(𝑂) such that the total distance between the two groups, Σ_{𝑖=1}^{𝑚} 𝑑(𝑡_{𝐷𝑖}, 𝑡𝑖), is minimized. Intuitively, we are finding the subset of 𝑣⁻¹(𝑂) whose motions are the most likely ones corresponding to the offensive play. The distance between two trajectories, 𝑑(𝑡_{𝐷𝑖}, 𝑡𝑖), is simply calculated by summing the Euclidean distances of corresponding point pairs. Then, with all pairwise distances between 𝑣⁻¹(𝑂) and 𝐷 available, the desired correspondence can be determined by running the classical Kuhn-Munkres assignment algorithm [8]. Once the best correspondence 𝑡₁*, 𝑡₂*, ⋅⋅⋅, 𝑡𝑚* is found, 𝑃(𝑣⁻¹(𝑂)∣𝐷) is evaluated as

𝑃(𝑣⁻¹(𝑂)∣𝐷) ≜ (1/𝑁) exp(− Σ_{𝑖=1}^{𝑚} 𝑑²(𝑡_{𝐷𝑖}, 𝑡𝑖*) / (2𝛼²)).   (6)
Practically, strong occlusions among players are common, and the motion of a single player may switch frequently between stationary and non-stationary. As a result, trajectories from tracking are significantly fragmented, and the effective number of tracks may be less than expected at times and may change from time to time. We modify the scheme by which we evaluate 𝑃(𝑣⁻¹(𝑂)∣𝐷) as follows to address these issues.

First we augment the number of tracks when necessary. Specifically, we do this when a significantly large foreground bounding box (area greater than twice that of a single-player box) is produced, enclosing two or more players. In this case, we duplicate the track associated with the large bounding box two or more times, according to its area compared with the area of a regular one. Then we identify continuous temporal durations during each of which the tracker produces a constant number (greater than 𝑚) of non-fragmented tracks. We denote one of these durations as 𝑇𝑐 and assume 𝑁𝑐 tracks 𝑡_{𝑐,1}, 𝑡_{𝑐,2}, ⋅⋅⋅, 𝑡_{𝑐,𝑁𝑐} are computed by the tracker during this period. Then from the 𝑚 simulated trajectories we randomly select 𝑛𝑐 ∈ {⌈𝑚/2 + 1⌉, ⋅⋅⋅, 𝑚} trajectories 𝑡_{𝐷1}, 𝑡_{𝐷2}, ⋅⋅⋅, 𝑡_{𝐷𝑛𝑐} and run the Kuhn-Munkres assignment between {𝑡_{𝑐,1}, 𝑡_{𝑐,2}, ⋅⋅⋅, 𝑡_{𝑐,𝑁𝑐}} and {𝑡_{𝐷1}, 𝑡_{𝐷2}, ⋅⋅⋅, 𝑡_{𝐷𝑛𝑐}}. The best assignment, yielding the minimum average cost per trajectory pair, is recorded as {𝑡*_{𝑐,1}, 𝑡*_{𝑐,2}, ⋅⋅⋅, 𝑡*_{𝑐,𝑛𝑐*}} and {𝑡*_{𝐷1}, 𝑡*_{𝐷2}, ⋅⋅⋅, 𝑡*_{𝐷𝑛𝑐*}}.

For every duration 𝑇𝑐 we find the best assignment and finally evaluate 𝑃(𝑣⁻¹(𝑂)∣𝐷) as

𝑃(𝑣⁻¹(𝑂)∣𝐷) ≜ (1/𝑁) exp(− 𝛽 Σ_{𝑇𝑐} ( (𝑚/𝑛𝑐*) Σ_{𝑖=1}^{𝑛𝑐*} 𝑑²(𝑡*_{𝑐,𝑖}, 𝑡*_{𝐷𝑖}) ) / (2𝛼²)).   (7)

Here the normalizing factor 𝛽 = 𝑙 / Σ_{𝑇𝑐} 𝑙(𝑇𝑐), where 𝑙 is the length of the play and 𝑙(𝑇𝑐) is the length of the duration 𝑇𝑐.

4. EXPERIMENTS

We apply the proposed approach on the GaTech Football Play Dataset, which we used in our previous work [5, 6] with promising performance. The dataset consists of a collection of 155 NCAA football game videos. Each video is readily segmented so that it records exactly one play, i.e., the play starts at the beginning of the video and terminates at its end. Accompanying each video is a ground-truth annotation including the offense type and the object locations in each frame. Currently, annotations for 56 videos are available for use. The annotation includes the image-plane coordinates of all 22 players as well as field landmarks - the intersections of field lines. We show a sample snapshot together with the corresponding ground-truth trajectories (in ground-plane coordinates) in Figure 1(a)(b), where red trajectories denote offensive players, yellow ones denote defensive players, and the background is colored green for better visualization.

Fig. 1. Sample of the GaTech Football Play Dataset and tracking result: (a) sample snapshot; (b) ground-truth trajectories; (c) tracking; (d) trajectories from tracking.

We pre-process samples of varying temporal duration by normalizing their time scales to a fixed value. The 56 play samples are organized into three offensive types: Dropback, Middle&Right Run, and Wideleft Run. This division takes into account both the play type hierarchy and the balance of sample amounts. The algorithm is run multiple times, each run using a random division of the sample collection into training (approximately 80%) and testing (approximately 20%) sets. The homographies corresponding to the view changes are determined by locating the landmark points on the football field. The free parameters 𝛼² and 𝛾 are simply taken as the variation of all pairwise distances between trajectories, and the normalizing factor 𝑁 cancels out. Since the amount of training samples is limited, we augment the size of the training set during each training process as follows. We first learn a trajectory vocabulary for the trajectories in the original training set. Then we perturb each original trajectory to get new ones and search for the new 'word' label of each new trajectory. The perturbation is realized by adding 2-D isotropic Gaussian noise to the ground-plane coordinates at particular time instants (0%, 20%, 40%, 60%, 80%, 100% of the video duration) and polynomially interpolating the other time instants. Moreover, we shift the location of each trajectory (original and generated) entirely with another Gaussian. In this way, for each original training play we get 20 synthetic plays, and thus the eventual size of the training set is 20 times more than the original one.
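The augmentation step might be sketched as follows (our interpretation of the text; the noise scales `sigma_pt` and `sigma_shift` are not specified in the paper and are placeholders here).

```python
import numpy as np

def perturb_trajectory(traj, sigma_pt=0.5, sigma_shift=1.0,
                       rng=np.random.default_rng()):
    """Synthesize a new trajectory from an original one (T x 2 ground-plane points):
    jitter six anchor instants (0%..100% of the duration), polynomially interpolate
    the rest, then shift the whole trajectory by another Gaussian offset."""
    T = len(traj)
    anchors = np.linspace(0, T - 1, 6).astype(int)        # 0%,20%,...,100% instants
    noisy = traj[anchors] + rng.normal(0.0, sigma_pt, size=(6, 2))
    ts = anchors / (T - 1)                                 # normalized anchor times
    t = np.arange(T) / (T - 1)
    new = np.empty_like(traj, dtype=float)
    for dim in range(2):
        # Degree-5 polynomial through the six perturbed anchor points.
        coeffs = np.polyfit(ts, noisy[:, dim], deg=5)
        new[:, dim] = np.polyval(coeffs, t)
    return new + rng.normal(0.0, sigma_shift, size=(1, 2))   # global shift

# Example: 20 synthetic plays per original play.
# synthetic = [[perturb_trajectory(t) for t in play] for _ in range(20)]
```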
For each testing play, we use the multi-object tracker reported in [9] to generate trajectories, owing to its good performance on tracking soccer players. The tracking results are shown in Figure 1(c)(d), with snapshots with bounding boxes and the tracks in the ground plane. We generate 5 × 10⁴ Monte Carlo samples for each testing run and assume a uniform prior class probability. To evaluate the effectiveness of the statistical view-change model and of the optimal assignment, we design two baselines for a comparative study. The first, 'random view selection' (RVS), does not simulate a homography from the learned CGMM, but randomly picks one from all available training views. The second baseline, 'nearest player selection' (NPS), picks the spatially closest player in the testing play to match each of the simulated relevant players, instead of performing the Kuhn-Munkres assignment. The play recognition results are shown in Figure 2. The proposed method outperforms both baselines, and an average recognition rate of approximately 70% is obtained.

Fig. 2. The recognition rates (%): D, M, and W stand for Dropback, Middle&Right Run, and Wideleft Run, respectively.
5. DISCUSSION

We have proposed an algorithm to recognize football play strategies from realistic sports videos, and have shown the preliminary empirical performance of the approach. We believe that techniques for extracting high-level semantics from sports videos are worth continued investigation. For example, temporal detection of a particular play is also of interest. It is also useful to model the interactions between the two groups (e.g., taking the defensive side into account as well). We may also consider incorporating articulated motion features, beyond simple point motion paths, to establish a 'panoramic' characterization of a play, which may help achieve more accurate recognition.

6. APPENDIX

For spatial co-occurrence functions, the exponential map E_{𝑓𝑚}: T_{𝑓𝑚} → F for 𝑓′ ∈ T_{𝑓𝑚} is defined as E_{𝑓𝑚}(𝑓′) = cos(⟨𝑓′, 𝑓′⟩^{1/2}) 𝑓𝑚 + (sin(⟨𝑓′, 𝑓′⟩^{1/2}) / ⟨𝑓′, 𝑓′⟩^{1/2}) 𝑓′. The logarithmic map L_{𝑓𝑚}: F → T_{𝑓𝑚} is then given by L_{𝑓𝑚}(𝑓) = (arccos(⟨𝑓, 𝑓𝑚⟩) / ⟨𝑓*, 𝑓*⟩^{1/2}) 𝑓*, where 𝑓* = 𝑓 − ⟨𝑓, 𝑓𝑚⟩ 𝑓𝑚.

For matrix Lie groups, the exponential map E_{𝑣𝑚}: T_{𝑣𝑚} → 𝔾𝕃(3) for 𝑣′ ∈ T_{𝑣𝑚} is given by E_{𝑣𝑚}(𝑣′) = 𝑣𝑚 exp(𝑣𝑚⁻¹ 𝑣′). The logarithmic map L_{𝑣𝑚}: 𝔾𝕃(3) → T_{𝑣𝑚}, meanwhile, is L_{𝑣𝑚}(𝑣) = 𝑣𝑚 log(𝑣𝑚⁻¹ 𝑣).
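The sphere maps were already sketched earlier; for the matrix Lie group case, the two maps and a quick round-trip check might look like this (illustrative only, using SciPy's matrix exponential and logarithm).

```python
import numpy as np
from scipy.linalg import expm, logm

def exp_v(vm, vp):
    """E_vm(v') = vm * expm(vm^-1 v') for v' in the tangent space at vm."""
    return vm @ expm(np.linalg.solve(vm, vp))

def log_v(vm, v):
    """L_vm(v) = vm * logm(vm^-1 v), mapping v in GL(3) back to the tangent space."""
    return vm @ logm(np.linalg.solve(vm, v))

# Round-trip check on two nearby random homographies.
rng = np.random.default_rng(0)
vm = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
v = vm @ expm(0.1 * rng.standard_normal((3, 3)))
print(np.allclose(exp_v(vm, log_v(vm, v)), v))   # True
```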
Acknowledgment: This research was partially supported by the DARPA VIRAT Phase I program.

7. REFERENCES

[1] M. Lazarescu and S. Venkatesh, "Using camera motion to identify different types of American football plays," in ICME, 2003, pp. 181-184.

[2] T. Liu, W. Ma, and H. Zhang, "Effective feature extraction for play detection in American football video," in MMM, 2005.

[3] C. Huang, H. Shih, and C. Chao, "Semantic analysis of soccer video using dynamic Bayesian network," IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 749-760, 2006.

[4] S. Intille and A. Bobick, "Recognizing planned, multiperson action," Computer Vision and Image Understanding, vol. 81, pp. 414-445, 2001.

[5] R. Li and R. Chellappa, "Recognizing coordinated multi-object activity using a dynamic event ensemble model," in ICASSP, 2009.

[6] R. Li, R. Chellappa, and S. Zhou, "Learning multi-modal densities on discriminative temporal interaction manifold for group activity recognition," in CVPR, 2009.

[7] G. Zhu et al., "Trajectory based event tactics analysis in broadcast sports video," in ACM MM, 2007.

[8] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-97, 1955.

[9] S. W. Joo and R. Chellappa, "A multiple-hypothesis approach for multiobject visual tracking," IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2849-2854, 2007.

