SURA Report
Summer Undergraduate Research Award (SURA), 2009
Submitted By
Saurabh Gupta
(Entry No. 2007CS10185)
Certificate
This is to certify that the project titled Intelligent Video Surveillance System - B is a bona fide work done by Saurabh Gupta (Entry No. 2007CS10185) as part of the Summer Undergraduate Research Award, 2009 in the Department of Computer Science and Engineering at the Indian Institute of Technology, Delhi. This project was carried out by him under my guidance and has not been submitted elsewhere.
Acknowledgement
I am eternally grateful and highly indebted to Professor Subhashis Banerjee for giving me this opportunity to work under him. His fraternal guidance, keen interest and educative discussions have been a cornerstone for the success of this project. His involvement with the project was a source of great motivation and ideas. I would also like to thank the Industrial Research and Development Unit for giving me this unique learning opportunity to work on this project. I also express my sincere gratitude to the Department of Computer Science and Engineering for providing me with all the necessary facilities required for the completion of this project.
This project is part of a larger project done collectively by Ankit Sagwal, Ankit Narang and me. I thank them for keeping up my enthusiasm and for providing me with valuable suggestions. I also thank Ayesha Choudhary for the valuable discussions I have had with her and for the critical test data she provided. I also express my humble gratitude towards my parents and family for their unwavering trust in me and my abilities.
Saurabh Gupta
Contents
1 Introduction
  1.1 Motivation
    1.1.1 Why video surveillance?
    1.1.2 Why unsupervised?
  1.2 Objectives
  1.3 Exact Problem Statement
  1.4 Related Work
2 Theoretical Background
  2.1 What we mean by unsupervised and why is it possible?
  2.2 Background Subtraction
  2.3 Latent Semantic Indexing (LSI)
  2.4 Probabilistic Latent Semantic Analysis (pLSA)
3 Implementation
  3.1 Basic Technique
    3.1.1 How we apply pLSA in the context of videos
    3.1.2 Choice of Words
  3.2 Approach followed
    3.2.1 Clustering of a given set of videos
  3.3 Other Activities undertaken
4 Results
  4.1 Single person activity analysis and classification
  4.2 Multi person activity classification
5 Future Work
  5.1 Unusual activity flagging for single person video
  5.2 Multi person activity analysis
1 Introduction
1.1 Motivation
1.1.1 Why video surveillance ?
Security and surveillance are important issues in today's world. The recent acts of terrorism have highlighted the urgent need for efficient surveillance. Contemporary surveillance systems use digital video recording (DVR) cameras which play host to multiple channels. The major drawback of this model is that it requires continuous manual monitoring, which is infeasible because of factors like human fatigue and the cost of manual labour. Moreover, it is virtually impossible to search the recordings for important past events, since that would require a playback of the entire duration of video footage. Hence, there is indeed a need for an automated video surveillance system which can detect unusual activities on its own.
A system which needs to be programmed according to the location it is to be deployed in would incur a large initial overhead at installation. This overhead includes enumerating the kinds of activities that can happen in the area and then coming up with a Finite State Machine (FSM) model which accurately captures routine activities and flags non-routine ones. Clearly, this overhead is large and makes a programmed approach unsuitable for large scale deployment. Hence there is a need for an unsupervised video surveillance system which is able to learn routine activities on its own from unlabelled learning data. A system with self-learning ability would be easy to deploy and would make large scale monitoring possible.
1.2 Objectives
The objectives for this project are as follows:
1. To develop an unsupervised approach for single person video classification.
2. To extend the approach for single person video classification to multi person activity classification.
1.4 Related Work
1. Unsupervised image classification
One of the first successful attempts at unsupervised classification in the area of computer vision was that of image classification. Sivic, in the paper titled Discovering Object Categories in Image Collections ([8]), used pLSA for unsupervised image classification. The concept used was simple. It involved using features like vector quantized SIFT descriptors, computed on affine covariant regions, as visual words describing images. These words, on being subjected to clustering using pLSA, classified images on the basis of the objects contained in them.
3. Unsupervised video analysis using pLSA for a single object using video epitomes
pLSA has been successfully used for unsupervised video analysis ([1]). This work used epitome subtraction to obtain space time patches as words for the video document, and catered only to single person activity analysis. Moreover, the use of video epitomes made it extremely computationally expensive, making online deployment difficult. The current project is essentially an extension of this work to multi person activity analysis in a manner which is feasible for online deployment.
2 Theoretical Background
2.1 What we mean by unsupervised and why is it possible?
In the context of this project, this implies looking at a large set of unclassified videos,
learning patterns from this data set, and then classifying any new video that is provided as a
query. The learning process is such that videos with similar activities are grouped together into
the same class. The query video is put into a class with which it shares most of its features.
Any routine activity that comes up as a query will easily be mapped to one of the learnt classes. On the contrary, an unusual activity will not be classified cleanly into any one cluster. This distinction can then be used to classify activities as usual or unusual.
Hence the only supervision involved here is feeding in learning data from which usual activity patterns are learnt; whatever is not usual gets classified as unusual, which is what makes such an unsupervised scheme for classification possible.
2.2 Background Subtraction
Figure 1: On the left is the original frame. On the right is the foreground frame obtained after background subtraction.
Any background subtraction method needs to learn the background features of the current setting. This training can be done in various ways. The simplest is to let the system store a large number of frames (say 1000) and take a pixel-wise average of all these frames. This average will serve as the background for subsequent frames. The problem with such an approach is that once the training phase is over, the system fixes the background image and will not incorporate into the background any stationary objects subsequently introduced into the scene. Hence, a better approach is to continue background training as the system runs.
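The averaging scheme described above, together with a running-average update, can be sketched as follows. This is a minimal illustration assuming grayscale frames as numpy arrays; the function names, the threshold of 25 and the learning rate alpha are illustrative assumptions, not values used by the system described in this report.

```python
import numpy as np

def learn_background(frames):
    """Estimate the background as the pixel-wise mean of a stack of frames."""
    return np.mean(np.stack(frames), axis=0)

def foreground_mask(frame, background, threshold=25.0):
    """Binary foreground frame: 1 where the frame deviates from the learned
    background by more than `threshold`, 0 everywhere else."""
    return (np.abs(frame.astype(float) - background) > threshold).astype(np.uint8)

def update_background(background, frame, alpha=0.05):
    """Running-average update, so that newly stationary objects are slowly
    absorbed into the background as the system runs."""
    return (1 - alpha) * background + alpha * frame.astype(float)
```

With this update, a fixed training phase is no longer needed: every incoming frame nudges the background estimate, addressing the problem noted above.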
An improved background model, which better caters to lighting and intensity changes in the scene along with dynamic learning, was proposed by Chris Stauffer and W. E. L. Grimson. This adaptive background mixture model represents each pixel as a mixture of Gaussians (in terms of RGB intensity) and measures each new frame's pixels against the background pixel Gaussians ([9]). This approach is more robust than the previous one. A further improved background modeling technique was proposed by Brendan Klare and Sudeep Sarkar (in [5]). They followed a similar approach of modeling pixels as Gaussians, but instead of Gaussians based only on RGB intensities, they used 13 Gaussians to represent one pixel. This technique handles illumination changes in the scene very efficiently and also adapts rapidly if the background features are altered.
We refer to foreground frames in the rest of the report. By this we mean a binary image which is white (or 1) where there is a foreground object and black (or 0) at all other places.
2.3 Latent Semantic Indexing (LSI)
In LSI, the document-word matrix A is decomposed using the Singular Value Decomposition:
A = U Σ Vᵀ,    where U Uᵀ = I and V Vᵀ = I
A rank-k approximation of A is obtained by keeping only the k largest singular values in Σ, and only the first k columns of the matrices U and V, giving Ak as
Ak = Uk Σk Vkᵀ    (1)
This is the rank-k approximation of A with the least 2-norm deviation from A over all rank-k approximations. What this does is map the original sparse space to a more meaningful space of lower dimension, with the documents located at the coordinates given by Uk Σk. The extent of similarity of two documents is now obtained from the notion of closeness in this k-dimensional reduced space. Moreover, in this process of dimension reduction, noise is effectively removed from the data.
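As a sketch, the rank-k approximation (1) and the reduced document coordinates Uk Σk can be computed with numpy's SVD; the matrix in the usage below is purely illustrative.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A: keep the k largest singular values
    and the first k columns of U and V. Returns A_k and the reduced
    document coordinates U_k @ Sigma_k (one row per document)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    coords = U[:, :k] * s[:k]  # U_k Sigma_k, the k-dimensional document space
    return A_k, coords
```

By the Eckart-Young theorem this truncation is optimal among all rank-k matrices, which is exactly the "least 2-norm deviation" property stated above.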
LSA exploits the observation that similar words occur in similar documents and similar documents contain similar words. Hence, it is easily able to learn synonymy (different words having similar meaning, from their simultaneous occurrence in documents) and polysemy (words with multiple meanings, from their occurrences in multiple different contexts) in a completely unsupervised manner.
2.4 Probabilistic Latent Semantic Analysis (pLSA)
The EM algorithm described in [2] calculates P(w|z) and P(z|d) for a given number of classes k. The term P(z|d) describes how strongly document d expresses the hidden topic z, and the term P(w|z) describes to what extent word w contributes to explaining topic z.
Hence, we obtain clusters wherein similar documents have a high probability of explaining the same hidden topic.
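The EM iteration can be sketched as follows. This is a minimal numpy illustration of the standard pLSA updates (E-step: P(z|d,w) ∝ P(z|d) P(w|z); M-step: re-estimate P(w|z) and P(z|d) from the expected counts); it is not the project's actual Matlab implementation, and the toy count matrix in the test is an assumption.

```python
import numpy as np

def plsa(n_dw, k, iters=100, seed=0):
    """Fit pLSA by EM on a document-word count matrix n_dw (docs x words).
    Returns P(z|d) with shape (docs, k) and P(w|z) with shape (k, words)."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, k)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((k, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior over topics, P(z|d,w) proportional to P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # D x k x W
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate the two conditionals from expected counts
        weighted = n_dw[:, None, :] * p_z_dw                   # D x k x W
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

On a corpus whose documents split cleanly into two word groups, the rows of P(z|d) concentrate on one topic each, which is the clustering behaviour relied on throughout this report.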
Besides having all the features that LSI offers, pLSA has the additional benefit of a stronger statistical foundation.
3 Implementation
3.1 Basic Technique
3.1.1 How we apply pLSA in the context of videos
The basic technique we follow is to treat videos as text documents. We extract some features from a video clip, which are treated as words describing this video document. Hence we obtain a set of words, and a set of documents containing these words.
Next we apply pLSA on these documents in exactly the same way as is done for normal text documents, and obtain clusters. Any new video is then classified into one of the learned clusters.
The most important aspect for this scheme to work well is the choice of words describing the videos. The chosen words must capture sufficient information about the video; moreover, they must capture the information that yields the clusters we are looking for.
3.1.2 Choice of Words
The choice of words that we work with is space time patches. Consider the space time volume made by consecutive foreground frames (of, say, a 20 frame clip) stacked along the z-axis (the first image frame lying parallel to the XY plane at z = 0, the second one at z = 1, and so on). Let (x, y, t) denote a point in this space, where (x, y) is the location in frame t of the 20 frame clip.
A patch P(dx, dy, dt) in this space time volume is the region of volume dx × dy × dt contained in {(a, b, c) : a ∈ (x, x + dx), b ∈ (y, y + dy), c ∈ (t, t + dt)}. A patch P(dx, dy, dt) is characterized completely by its starting coordinate (x, y, t).
A foreground patch is a patch having more than a fraction d of its pixels belonging to the foreground. Such foreground patches occur throughout the space time volume of the foreground frames.
The words describing a video document are all the foreground patches contained in the space time volume of this video.
If SD is the set of words for document D, then SD = {(x, y, t) : the patch with starting coordinate (x, y, t) is a foreground patch}.
This choice of words was arrived at after trying out many different possibilities. It captures the information we require for clustering videos based on the activities contained in them. The spatial information is contained in the x and y coordinates, and the flow of the action is captured by the frame number.
Different activities have different patterns of (x, y, t) patches, while similar activities have similar patterns. Two videos of the same activity may not have exactly the same pattern, but over a large training data set a document shares patches with documents containing a similar activity, and hence gets clustered with the group of documents containing this activity.
Thus this choice of words gives us the kind of clustering that we are looking for.
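Extracting these words from a binary space time volume can be sketched as below. The patch sizes dx, dy, dt, the threshold d, and the use of non-overlapping patches are illustrative assumptions; the report does not specify the actual values used.

```python
import numpy as np

def foreground_patches(volume, dx=4, dy=4, dt=4, d=0.5):
    """Enumerate the words of a video document: the starting coordinates
    (x, y, t) of dx x dy x dt patches whose fraction of foreground pixels
    exceeds d. `volume` is a binary space-time volume indexed [t, y, x]."""
    T, H, W = volume.shape
    words = []
    for t in range(0, T - dt + 1, dt):          # non-overlapping tiling
        for y in range(0, H - dy + 1, dy):
            for x in range(0, W - dx + 1, dx):
                patch = volume[t:t + dt, y:y + dy, x:x + dx]
                if patch.mean() > d:            # fraction of foreground pixels
                    words.append((x, y, t))
    return words
```

Counting how often each (x, y, t) word occurs across clips then yields the document-word matrix fed to pLSA.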
apply the GMM algorithm not on the raw RGB image but on a normalized RGB image, as this reduces the effects of shadows and intensity changes.
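The normalization mentioned above can be sketched as dividing each channel by the pixel's total intensity (chromaticity coordinates). This exact form is an assumption, since the report does not spell it out; it does, however, have the stated property that a shadowed pixel normalizes to the same value as its unshadowed counterpart.

```python
import numpy as np

def normalize_rgb(image):
    """Normalized RGB: divide each channel by the pixel's total intensity,
    so that uniform scaling of brightness (e.g. a shadow) leaves the
    result unchanged. `image` is an H x W x 3 array."""
    total = image.sum(axis=2, keepdims=True).astype(float)
    return image / (total + 1e-6)  # epsilon avoids division by zero
```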
6. Running pLSA on this document-word matrix to obtain video clusters (in Matlab)
The document × word matrix thus obtained is processed through pLSA to obtain n clusters. The algorithm gives us the probability of each document and each word belonging to a particular cluster. Similar documents and words are the ones which have a high probability of belonging to the same cluster.
4 Results
4.1 Single person activity analysis and classification
Figure 2: Class 1 - Activity at the near end of the corridor.
Figure 7: Class 6 - Cluster of trajectories along the left edge of the corridor.
4.2 Multi person activity classification
The space time patch approach on multi person videos forms classes such that each class corresponds to a single person activity happening in the video. Videos with a single activity in them are assigned to the corresponding class. Videos which have more than one activity happening in them are given likelihoods of belonging to the classes of the single person activities happening in them.
pLSA is able to learn the single person activities happening in multi person videos based on the observation that the words describing a particular single person activity occur as a group across different multi person videos (in much the same way as topic discovery happens in sets containing multi topic documents).
Consider for example the classes obtained on a learning video in the coffee shop area near
the library.
Figure 8: Class 1 - Trajectories along the path in mid right.
Figure 10: Class 3 - Trajectories from the bottom left to the center of frame.
Figure 11: Class 4 - People turning from main building side to coffee shop.
Figure 12: Class 5 - Trajectories from left top corner to right top corner and vice versa.
5 Future Work
References
[1] Ayesha Choudhary, Manish Pal, Subhashis Banerjee, and Santanu Chaudhury. Unusual activity analysis using video epitomes and pLSA. In ICVGIP, pages 390–397, 2008.
[2] Thomas Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50–57, 1999.
[3] S. Hongeng and R. Nevatia. Multi-agent event recognition. In ICCV, volume 2, pages 84–91, 2001.
[4] Fan Jiang, Ying Wu, and Aggelos K. Katsaggelos. A dynamic hierarchical clustering
method for trajectory-based unusual video event detection. IEEE Trans. Image Process.,
18:907–913, 2009.
[5] B. Klare and S. Sarkar. Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. In CVPR Workshops, pages 66–73, 2009.
[6] Dhruv Mahajan, Nipun Kwatra, Sumit Jain, Prem Kalra, and Subhashis Banerjee. A frame-
work for activity recognition and detection of unusual activities. In ICVGIP, pages 15–21,
2004.
[7] C. Piciarelli, G.L. Foresti, and L. Snidaro. Trajectory clustering and its applications for video surveillance. In AVSS, pages 40–45, 2005.
[8] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman. Discovering object categories in image collections. Technical report, MIT AI Laboratory, 2005.
[9] Chris Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, Fort Collins, CO, 1999.