
SUMMER UNDERGRADUATE RESEARCH AWARD

(SURA), 2009

Progress report for the project titled

Intelligent Video Surveillance System- B

Submitted By
Saurabh Gupta
(Entry No. 2007CS10185)

Under the guidance of:


Professor Subhashis Banerjee

Department of Computer Science and Engineering


Indian Institute of Technology Delhi
October 2009

Certificate
This is to certify that the project titled Intelligent Video Surveillance System- B is a bona fide work done by Saurabh Gupta (Entry No. 2007CS10185) as part of the Summer Undergraduate Research Award, 2009 in the Department of Computer Science and Engineering at the Indian Institute of Technology, Delhi. This project was carried out by him under my guidance and has not been submitted elsewhere.

Professor Subhashis Banerjee


Department of Computer Science and Engineering
Indian Institute of Technology, Delhi
New Delhi, India

Acknowledgement
I am eternally grateful and highly indebted to Professor Subhashis Banerjee for giving me this opportunity to work under his guidance. His keen interest and educative discussions have been a cornerstone of the success of this project, and his involvement was a source of great motivation and ideas. I would also like to thank the Industrial Research and Development Unit for giving me this unique learning opportunity, and I express my sincere gratitude to the Department of Computer Science and Engineering for providing all the necessary facilities required for the completion of this project.
This project is part of a larger project done collectively by Ankit Sagwal, Ankit Narang and me. I thank them for keeping up my enthusiasm and for providing valuable suggestions. I also thank Ayesha Choudhary for the valuable discussions I have had with her and for the critical test data she provided. Finally, I express my humble gratitude to my parents and family for their unwavering trust in me and my abilities.

Saurabh Gupta

Contents
1 Introduction
  1.1 Motivation
    1.1.1 Why video surveillance?
    1.1.2 Why unsupervised?
  1.2 Objectives
  1.3 Exact Problem Statement
  1.4 Related Work

2 Theoretical Background
  2.1 What do we mean by unsupervised and why is it possible?
  2.2 Background Subtraction
  2.3 Latent Semantic Indexing (LSI)
  2.4 Probabilistic Latent Semantic Analysis (pLSA)

3 Implementation
  3.1 Basic Technique
    3.1.1 How do we apply pLSA in the context of videos?
    3.1.2 Choice of Words
  3.2 Approach followed
    3.2.1 Clustering of a given set of videos
  3.3 Other Activities undertaken

4 Results
  4.1 Single person activity analysis and classification
  4.2 Multi person activity classification

5 Future Work
  5.1 Unusual activity flagging for single person video
  5.2 Multi person activity analysis
1 Introduction

1.1 Motivation
1.1.1 Why video surveillance?

Security and surveillance are important issues in today's world. Recent acts of terrorism have highlighted the urgent need for efficient surveillance. Contemporary surveillance systems use digital video recording (DVR) cameras which play host to multiple channels. The major drawback of this model is that it requires continuous manual monitoring, which is infeasible because of factors like human fatigue and the cost of manual labour. Moreover, it is virtually impossible to search through recordings for important events in the past, since that would require a playback of the entire duration of the video footage. Hence, there is indeed a need for an automated video surveillance system which can detect unusual activities on its own.

1.1.2 Why unsupervised?

A system which needs to be programmed according to the location it is deployed in requires a large initial overhead at installation. This overhead includes enumerating the kinds of activities which happen in the area and then coming up with a Finite State Machine (FSM) model which accurately captures routine activities and flags non-routine ones. Clearly, this overhead is large and makes a programmed approach unsuitable for large scale deployment. Hence there is a need for an unsupervised video surveillance system which is able to learn routine activities on its own from unlabelled training data. A system with such self-learning ability would be easy to deploy and would make large scale monitoring possible.

1.2 Objectives
The objectives of this project are as follows:

1. To design a model for unsupervised classification of videos of single person activities and hence detect unusual activities.

2. To extend the approach for single person video classification to multi person activity classification.

1.3 Exact Problem Statement


Treating videos as documents, the task is to come up with an appropriate choice of words describing each video document such that analysis using pLSA (probabilistic Latent Semantic Analysis) logically clusters the videos based on the activity happening in them.

1.4 Related Work
1. Unsupervised image classification
One of the first successful attempts at unsupervised classification in the area of computer vision was image classification. Sivic et al., in the paper titled Discovering Object Categories in Image Collections ([8]), used pLSA for unsupervised image classification. The concept used was simple: features like vector quantized SIFT descriptors computed on affine covariant regions serve as visual words describing the images. This choice of words, when subjected to clustering using pLSA, classified images on the basis of the objects contained in them.

2. Attempts at supervised video classification
There have been numerous attempts at supervised learning based on programmed models of activities (as described in [6], [3]). These are fairly successful at identifying the particular activities they are programmed for, but are not deployable in a general scenario. They suffer from poor scalability in terms of deployment and installation (as mentioned in the motivation for unsupervised surveillance).
There have been other attempts at unsupervised activity classification (as described in [7], [4]), but these have primarily focused on specific attributes of activities, such as trajectory or shape. They provide a very sound model for unsupervised learning of the specific attribute they are designed for, but fail to give a generic model for overall activity analysis.

3. Unsupervised video analysis using pLSA for a single object using video epitomes
pLSA has been successfully used for unsupervised video analysis ([1]). This work used epitome subtraction to obtain space-time patches as words for the video document and catered only to single person activity analysis. Moreover, the use of video epitomes made it extremely computationally expensive, making online deployment difficult. The current project is essentially an extension of this work to multi person activity analysis in a manner which is feasible for online deployment.

2 Theoretical Background

2.1 What do we mean by unsupervised and why is it possible?


The notion of unsupervised learning means learning and deducing information from unclassified raw training videos, and then using this information to classify any new query video into one of the classes that have been learnt.

In the context of this project, this implies looking at a large set of unclassified videos,
learning patterns from this data set, and then classifying any new video that is provided as a
query. The learning process is such that videos with similar activities are grouped together into
the same class. The query video is put into a class with which it shares most of its features.
Any routine activity that comes up as a query will easily be mapped to one of the learnt classes. An unusual activity, on the contrary, will not classify cleanly into any one cluster. This distinction can then be used to label activities as usual or unusual.
Hence the only supervision involved here is feeding in training data, from which usual activity patterns are learnt; whatever is not usual gets classified as unusual. This is what makes such an unsupervised scheme for classification possible.

2.2 Background Subtraction


For any kind of computer vision processing it is important (and often the first step) to separate out objects from the image we are viewing. This process is called background subtraction. More formally, it is the process of removing static background features from an image. The features of an image which are not part of the background are called foreground objects. For example, if we have a camera fixed at a traffic light, the vehicles serve as foreground objects while features like the zebra crossing, lamp posts, etc. serve as background features. Consider for example the following original image frame and the corresponding foreground frame obtained after background subtraction.

Figure 1: On the left is the original frame. On the right is the foreground frame obtained after background subtraction.

Any background subtraction method needs to learn the background features in the current setting. This training can be done in various ways. The simplest one is to let the system store a large number of frames (say 1000) and take a pixel-wise average of all these frames; this average serves as the background for subsequent frames. The problem with such an approach is that once the training phase is over, the system fixes the background image and will not incorporate into the background any stationary objects introduced into the scene later. Hence, a better approach is to carry on background training as the system runs. A minimal sketch of this averaging idea follows.
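The following is a minimal numpy sketch of the averaging approach just described (not the actual code used in the project); the array layout and the threshold value are illustrative assumptions.

    import numpy as np

    # Minimal sketch of the averaging approach described above, assuming
    # `frames` is an array of N training frames of shape (N, H, W). The
    # threshold is illustrative, not a tuned value from the project.
    def average_background(frames: np.ndarray) -> np.ndarray:
        # The pixel-wise average of the training frames is the background.
        return frames.mean(axis=0)

    def foreground_mask(frame: np.ndarray, background: np.ndarray,
                        thresh: float = 30.0) -> np.ndarray:
        # A pixel is foreground if it deviates sufficiently from the model.
        return (np.abs(frame.astype(float) - background) > thresh).astype(np.uint8)

Updating the average incrementally with each new frame (a running average) gives the adaptive variant mentioned above.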
An improved background model, which better caters to lighting and intensity changes in the scene along with dynamic learning, was proposed by Chris Stauffer and W. E. L. Grimson. This adaptive background mixture model represents each pixel as a mixture of Gaussians (in terms of RGB intensity) and measures each new frame's pixel values against the background pixel Gaussians ([9]). This approach is more robust than the previous one. A further improved background modeling technique was proposed by Brendan Klare and Sudeep Sarkar (in [5]). They followed a similar approach of modeling pixels as Gaussians, but instead of Gaussians based only on RGB intensities, they used 13 Gaussians to represent each pixel. This technique handles illumination changes in the scene very efficiently and also adapts rapidly if the background features are altered.
We refer to foreground frames in the rest of the report. By this we mean a binary image which is white (or 1) wherever there is a foreground object and black (or 0) at all other places.
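As a hedged sketch, the adaptive mixture model of [9] can be driven through OpenCV's Python bindings roughly as follows. MOG2 is a later refinement of the Stauffer-Grimson model, and the file name and parameter values are illustrative assumptions, not the project's settings.

    import cv2

    # GMM-based background subtraction ([9]) via OpenCV's MOG2 variant.
    # The history and learningRate values here are illustrative.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

    cap = cv2.VideoCapture("corridor.avi")  # hypothetical input video
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # The project applied the model to normalized RGB (Section 3.2.1);
        # that preprocessing step is omitted in this sketch.
        mask = subtractor.apply(frame, learningRate=0.005)
        # MOG2 marks shadows as 127 and foreground as 255; keeping only
        # 255 yields the binary foreground frame defined above.
        foreground = (mask == 255).astype("uint8")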

2.3 Latent Semantic Indexing (LSI)


Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical
technique called Singular Value Decomposition (SVD) to identify patterns in the relationships
between the terms and concepts contained in an unstructured collection of text. LSI is based on
the principle that words that are used in the same contexts tend to have similar meanings. A key
feature of LSI is its ability to extract the conceptual content of a body of text by establishing
associations between terms that occur in similar contexts. It is called Latent Semantic Indexing
because of its ability to correlate semantically related terms that are latent in a collection of
text.
This technique considers a document as a bag of words; it takes into account only the occurrence of words in a document and not their relative orderings. Hence a set of d documents over w words can be described as a d × w matrix A, with the (i, j) entry representing the number of occurrences of word j in document i. LSI works by performing a Singular Value Decomposition of this document × word matrix to obtain a term-concept vector matrix U, a singular values matrix Σ, and a concept-document vector matrix V, which satisfy the following relations:

A = U Σ V^T
U U^T = I
V V^T = I

A rank-k approximation of A is obtained by keeping only the k largest singular values of Σ and only the first k columns of the matrices U and V, giving

A_k = U_k Σ_k V_k^T    (1)

This is the rank-k approximation of A with the least 2-norm deviation from A over all rank-k approximations. The effect is to map the original sparse space to a more meaningful space of lower dimension, with the documents located at the coordinates given by U_k Σ_k. The extent of similarity of two documents is then obtained from the notion of closeness in this k-dimensional reduced space. Moreover, in this process of dimension reduction, noise is effectively removed from the data.
LSA exploits the observation that similar words occur in similar documents and similar documents contain similar words. Hence it is easily able to learn synonymy (different words having similar meaning, through their simultaneous occurrence in documents) and polysemy (words with multiple meanings, through their occurrence in multiple different contexts) in a completely unsupervised manner.
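A minimal numpy sketch of the rank-k reduction in equation (1) follows; the function names and the use of cosine similarity are illustrative assumptions, not details from the report.

    import numpy as np

    # LSI sketch: rank-k SVD of the d x w document-word count matrix A,
    # as in equation (1).
    def lsi(A: np.ndarray, k: int):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
        # Documents sit at coordinates Uk @ Sk in the reduced space.
        return Uk @ Sk, Vkt

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        # Similarity of two documents = closeness of their coordinate vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))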

2.4 Probabilistic Latent Semantic Analysis (pLSA)


Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard Latent Semantic Analysis, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, this method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.
Corresponding to coordinates in a reduced vector space in LSA, here we have likelihoods of documents belonging to a particular latent topic, and of how well a particular topic is explained by a given set of words. From probability theory we have

P(d, w) = P(d) Σ_{z∈Z} P(w|z) P(z|d)

The EM algorithm described in [2] calculates P(w|z) and P(z|d) for a given number of classes k. The term P(z|d) describes how well document d explains the hidden topic z, and the term P(w|z) describes to what extent word w contributes to explaining topic z. Hence we obtain clusters wherein similar documents have a high probability of explaining the same hidden topic.
Besides having all the features that LSI offers, pLSA has the additional benefit of a stronger statistical foundation.
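As a hedged illustration of the model above, the EM iterations of [2] can be written in a few lines of numpy; the random initialization, the fixed iteration count and the function name are illustrative assumptions (the project itself worked in Matlab).

    import numpy as np

    # EM sketch for pLSA ([2]): N is the d x w count matrix, K the
    # number of latent topics.
    def plsa(N: np.ndarray, K: int, iters: int = 100, seed: int = 0):
        rng = np.random.default_rng(seed)
        D, W = N.shape
        Pw_z = rng.random((K, W))
        Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pz_d = rng.random((D, K))
        Pz_d /= Pz_d.sum(axis=1, keepdims=True)
        for _ in range(iters):
            # E-step: P(z|d,w) proportional to P(z|d) * P(w|z).
            Pz_dw = Pz_d[:, :, None] * Pw_z[None, :, :]   # shape (D, K, W)
            Pz_dw /= Pz_dw.sum(axis=1, keepdims=True) + 1e-12
            # M-step: re-estimate from expected counts N[d,w] * P(z|d,w).
            C = N[:, None, :] * Pz_dw
            Pw_z = C.sum(axis=0)
            Pw_z /= Pw_z.sum(axis=1, keepdims=True) + 1e-12
            Pz_d = C.sum(axis=2)
            Pz_d /= Pz_d.sum(axis=1, keepdims=True) + 1e-12
        return Pw_z, Pz_d

In practice the iterations would be run until the data log-likelihood converges rather than for a fixed count.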

3 Implementation

3.1 Basic Technique


3.1.1 How do we apply pLSA in the context of videos?

The basic technique we follow is to treat videos as text documents. We extract some features from each video clip, which are treated as words describing this video document. Hence we obtain a set of words and a set of documents containing these words.
Next, we apply pLSA to these documents in exactly the same way as is done for ordinary text documents and obtain clusters. Any new video is then classified into one of the learned clusters.

3.1.2 Choice of Words

The most important aspect of making this scheme work well is the choice of words describing the videos. The chosen words must capture sufficient information about the video; moreover, they must capture the kind of information that yields the clusters we are looking for.
The choice of words that we work with is space-time patches. Consider the space-time volume made by stacking the consecutive foreground frames (of, say, a 20-frame clip) along the z-axis (the first image frame lying parallel to the XY plane at z = 0, the second at z = 1, and so on). Let (x, y, t) denote a point in this space, where (x, y) is a location in frame t of the 20-frame clip.
A patch P_{dx,dy,dt} in this space-time volume is the region of volume dx × dy × dt contained in {(a, b, c) : a ∈ (x, x + dx), b ∈ (y, y + dy), c ∈ (t, t + dt)}; it is characterized completely by its starting coordinate (x, y, t).
A foreground patch is a patch having more than a fraction d of its pixels as part of the foreground. The space-time volume of the foreground frames will contain such foreground patches wherever motion occurs.
The words describing a video document are all the foreground patches contained in the space-time volume of that video.
If S_D is the set of words for document D, then

S_D = {(x, y, t) : (x, y, t) is a foreground patch}    (2)

This choice of words was arrived at after trying out many different possibilities. It captures the information we require for clustering videos based on the activities contained in them: the spatial information is contained in the x and y coordinates, and the flow of the action is captured by the frame number t.
Different activities produce different patterns of (x, y, t) patches, and similar activities produce similar patterns. Two instances of the same activity may not have exactly the same pattern, but over a large training data set they will share patches with documents containing the similar activity and hence get clustered with the group of documents containing that activity.
Thus this choice of words is able to give us the kind of clustering that we are looking for. A sketch of the patch extraction appears below.
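The following is a minimal sketch of extracting these words, assuming `volume` is the binary space-time volume of one 20-frame clip with shape (T, H, W); the grid steps and the fraction d are illustrative (the values actually used are given in Section 3.2.1).

    import numpy as np

    # Enumerate the foreground patch words of one clip. `volume` is a
    # binary (T, H, W) array of foreground frames; dx, dy, dt and the
    # foreground fraction d are illustrative parameters.
    def extract_words(volume: np.ndarray, dx=5, dy=5, dt=2, d=0.5):
        T, H, W = volume.shape
        words = []
        for t in range(0, T - dt + 1, dt):
            for y in range(0, H - dy + 1, dy):
                for x in range(0, W - dx + 1, dx):
                    patch = volume[t:t + dt, y:y + dy, x:x + dx]
                    # Keep the word if more than a fraction d of the
                    # patch's pixels are foreground.
                    if patch.mean() > d:
                        words.append((x, y, t))
        return words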

3.2 Approach followed


In principle, there are two phases in a generic machine learning solution: first, learning from a given data set, and then answering queries based on what was learnt. Hence, even in the context of pLSA there is a learning phase and a classification phase (where queries are actually answered). The learning phase learns classes and their characteristics, and the classification phase gives the likelihood of a new document belonging to each of the learnt classes. If the likelihoods do not suggest that it belongs to any of the learnt classes, it is flagged as an unusual activity. The theory for classifying a novel video into the already learnt classes is described in [1]; a hedged sketch of such a classification step appears below.
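As a hedged illustration, one standard way to classify a new document is the fold-in heuristic of [2]: re-run EM on the new document's word counts while holding the learnt P(w|z) fixed. The sketch below assumes that approach; the function name and iteration count are illustrative.

    import numpy as np

    # Fold-in sketch: estimate P(z|d_new) for a new video document from
    # its word counts n_new (a length-W vector), with the learnt P(w|z)
    # held fixed.
    def fold_in(n_new: np.ndarray, Pw_z: np.ndarray, iters: int = 50) -> np.ndarray:
        K = Pw_z.shape[0]
        Pz_d = np.full(K, 1.0 / K)
        for _ in range(iters):
            # E-step: P(z|d,w) proportional to P(z|d) * P(w|z).
            Pz_dw = Pz_d[:, None] * Pw_z                  # shape (K, W)
            Pz_dw /= Pz_dw.sum(axis=0, keepdims=True) + 1e-12
            # M-step: re-estimate P(z|d) from the expected counts.
            Pz_d = (Pz_dw * n_new[None, :]).sum(axis=1)
            Pz_d /= Pz_d.sum() + 1e-12
        # A sharply peaked Pz_d suggests a learnt activity class; a flat,
        # confused distribution suggests an unusual activity.
        return Pz_d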
So, for the purpose of this project we focus only on obtaining these clusters for a broad set of settings. The kind of clustering obtained depends largely on the choice of words used to capture the information contained in the documents; hence an appropriate choice of words is essential for obtaining good clusters. We focus our effort on obtaining a good set of words which gives us good logical clustering.
We describe below the process we followed to obtain clusters from the given video documents.

3.2.1 Clustering of a given set of videos

1. Dividing the video stream into short clips (our video documents)
The video stream is obtained as a regular sequence of frames. The entire video is divided into short 20-frame clips, each of which makes up one document. So the entire video gives us a set of documents which we need to cluster into classes based on their similarity.

2. Background subtraction to obtain foreground blobs (in OpenCV)
We then perform background subtraction on each of these clips to obtain a foreground frame corresponding to each video frame. The background subtraction module used is based on the Gaussian Mixture Model (as presented in [9]), which is noise resistant and adapts well to changes in the background. We use the standard OpenCV implementation of this algorithm with appropriate learning rate parameters. Also, we apply the GMM algorithm not to the raw RGB image but to a normalized RGB image, as this reduces the effects of shadows and intensity changes.

3. Obtaining words from foreground images (in Matlab)
As described above, each volume of size dx × dy × dt located at (x, y, t) (with x, y ∈ {0, 5, 10, 15, ...} and t ∈ {0, 2, 4, ...}) having more than a fraction d of its pixels as foreground pixels is treated as a patch, and the word (x, y, t) is said to belong to this video document. This search is done over the entire space of (x, y, t) and valid words are added to the document.

4. Obtaining a dictionary of words contained in this document set (in Matlab)
Once the words describing each individual document have been identified for all the training videos, we cluster this set of words to come up with a dictionary of the top k words common to all documents. The universe of words that we have defined is large and captures only a limited amount of spatial closeness in (x, y, t) space. Clustering the words into k classes groups similar words (those that are relatively nearby, as indicated by their (x, y, t) coordinates) and captures common features across different documents.
Consider for example the set of words (5,10,1), (5,15,1), (10,10,1), (40,10,2), (45,15,2) and (50,10,3). Clustering this set to obtain the top 2 words would map (5,10,1), (5,15,1) and (10,10,1) to word 1, and (40,10,2), (45,15,2) and (50,10,3) to word 2. So two different documents containing the words (5,10,1) and (5,15,1) (which are very similar but not exactly the same) would be said to contain the same word, word 1. This captures common features across different documents.
pLSA treats all words as unrelated to one another (in fact it tries to learn the relationships between them). Hence, without this step, two documents would be classified as the same only if they were exactly the same; even very similar documents would be classified as different, since the words they contain are similar but not identical (and pLSA sees them as completely different words).
This clustering was done using the kmeans function in Matlab (a sketch of steps 4 to 6 appears after this list).

5. Generation of the document × word matrix (in Matlab)
Once the dictionary of words is obtained, the raw words in each document are mapped to one of the k common words from the dictionary and a document × word matrix is obtained (the entry N_{i,j} of the matrix indicates the number of occurrences of word j in document i). This matrix is much the same as the one obtained for the classification of literary documents (as used by any web search engine).

6. Running pLSA on the document × word matrix to obtain video clusters (in Matlab)
The document × word matrix thus obtained is processed through pLSA to obtain n clusters. The algorithm gives us the probability of each document and each word belonging to a particular cluster; similar documents and words are the ones which have a high probability of belonging to the same cluster.
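As referenced in step 4, the following is a hedged Python sketch of steps 4 to 6 (the project itself used Matlab's kmeans and a Matlab pLSA implementation); the dictionary size k, the number of clusters n and the helper names are illustrative assumptions.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # Sketch of steps 4-6. `docs` is a list of per-document word lists,
    # e.g. as produced by the extract_words sketch in Section 3.1.2, and
    # `plsa` can be the EM sketch from Section 2.4.
    def cluster_videos(docs, plsa, k=200, n=6):
        # Step 4: cluster all raw (x, y, t) words into a k-word dictionary.
        all_words = np.array([w for doc in docs for w in doc], dtype=float)
        centroids, _ = kmeans2(all_words, k, minit="points")

        # Step 5: map each raw word to its nearest dictionary word and
        # accumulate the document x word count matrix N.
        N = np.zeros((len(docs), k))
        for i, doc in enumerate(docs):
            for w in doc:
                dists = np.linalg.norm(centroids - np.asarray(w, dtype=float), axis=1)
                N[i, int(np.argmin(dists))] += 1

        # Step 6: run pLSA on N to obtain P(w|z) and P(z|d) for n clusters.
        return plsa(N, n)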

3.3 Other Activities undertaken


Some other activities were undertaken in the course of the project to obtain insight into the techniques being used for learning. These are briefly described below.

1. Book title experiment - classification of books based on LSA and pLSA
In the process of understanding and getting a feel for how exactly pLSA and LSA work, I carried out an experiment to classify books based on their titles. For the purpose of this experiment, I collected the titles of a large number of books on 3 different topics in computer science (from the department library). I then selected words from the titles to construct a dictionary (essentially eliminating stop words like of, for, and, the, etc.), and a document × word matrix was obtained. When subjected to LSA and pLSA (independently), this clustered the books into 3 clusters, putting books on the same topic into exactly one cluster. I observed how these techniques were able to eliminate noise and correlate words with similar meanings based on their co-occurrence in documents. Coding for this was done in Perl and Matlab.

2. Evaluating pLSA performance on multi topic documents
When we come to multi person activity clustering, we need pLSA to learn from videos containing more than one activity. To evaluate how pLSA treats multi topic documents, I conducted an experiment in which pLSA was given documents pertaining to 2 topics. Some documents discussed only one of the topics while some discussed both. On doing a 2-way clustering, pLSA returns clusters such that cluster 1 pertains to topic 1 and cluster 2 pertains to topic 2, with the documents discussing both topics having their likelihood split between the two clusters.

4 Results

4.1 Single person activity analysis and classification


The space-time patch words work well for single person activity analysis. They are able to cluster videos according to the activity contained in them, and bring out common trajectories and common areas of activity from the videos in an unsupervised manner. As an example, the clusters obtained on a training video in the Bharti Building corridor are shown in Figures 2–7.

Figure 2: Class 1 - Activity in the near end of corridor.

Figure 3: Class 2 - Cluster of activities in the far end of corridor.

Figure 4: Class 3 - Cluster of activities around the door.

Figure 5: Class 4 - Cluster of trajectories along the right edge of corridor.

Figure 6: Class 5 - Cluster of trajectories along the center of the corridor.

Figure 7: Class 6 - Cluster of trajectories along the left edge of the corridor.

4.2 Multi person activity classification
The space-time patch approach on multi person videos forms classes such that each class corresponds to a single person activity happening in the video. Videos with a single activity in them are assigned to the corresponding class, while videos in which more than one activity is happening are given likelihoods of belonging to the classes of the single person activities happening in them.
pLSA is able to learn the single person activities happening in multi person videos based on the observation that the words describing a particular single person activity occur as a group in different multi person videos (in much the same way as topic discovery happens in sets of multi topic documents).
Consider for example the classes obtained on a learning video of the coffee shop area near the library, shown in Figures 8–13.

Figure 8: Class 1 - Trajectories along the path in mid right.

Figure 9: Class 2 - Trajectories along the bottom edge.

Figure 10: Class 3 - Trajectories from the bottom left to the center of frame.

Figure 11: Class 4 - People turning from main building side to coffee shop.

Figure 12: Class 5 - Trajectories from left top corner to right top corner and vice versa.

Figure 13: A video with likelihoods for classes 1, 3 and 5.

5 Future Work

5.1 Unusual activity flagging for single person video


So far we have only learnt classes of videos from a given training data set. Though novel video classification has been studied thoroughly in [1], we have not explicitly implemented it here. We intend to implement novel video classification based on this choice of words and to start flagging unusual activities.

5.2 Multi person activity analysis


Though we have presented multi person activity classification using pLSA, we need to thoroughly investigate the suitability of this choice of words for multi person activity analysis. Moreover, as mentioned, the likelihood of a video with multiple activities gets split among the classes describing the activities contained in it. This may pose a problem in flagging an unusual activity, because we rely on a confused match between the unusual word sequence and the already learnt classes. Hence, we need to investigate how to differentiate between videos with multiple activities and videos with unusual activities.

References
[1] Ayesha Choudhary, Manish Pal, Subhashis Banerjee, and Santanu Chaudhury. Unusual activity analysis using video epitomes and pLSA. In ICVGIP, pages 390–397, 2008.

[2] Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, pages 289–296, 1999.

[3] S. Hongeng and R. Nevatia. Multi-agent event recognition. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 2, pages 84–91, 2001.

[4] Fan Jiang, Ying Wu, and Aggelos K. Katsaggelos. A dynamic hierarchical clustering method for trajectory-based unusual video event detection. IEEE Trans. Image Process., 18:907–913, 2009.

[5] B. Klare and S. Sarkar. Background subtraction in varying illuminations using an ensemble based on an enlarged feature set. In Computer Vision and Pattern Recognition Workshops, pages 66–73, 2009.

[6] Dhruv Mahajan, Nipun Kwatra, Sumit Jain, Prem Kalra, and Subhashis Banerjee. A framework for activity recognition and detection of unusual activities. In ICVGIP, pages 15–21, 2004.

[7] C. Piciarelli, G. L. Foresti, and L. Snidaro. Trajectory clustering and its applications for video surveillance. In IEEE Conference on Advanced Video and Signal Based Surveillance, pages 40–45, 2005.

[8] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections. In Proceedings of the International Conference on Computer Vision, 2005.

[9] Chris Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, Fort Collins, CO, 1999.
