Vis Comput
DOI 10.1007/s00371-012-0752-6

ORIGINAL ARTICLE

A survey on activity recognition and behavior understanding in video surveillance

Sarvesh Vishwakarma · Anupam Agrawal

© Springer-Verlag Berlin Heidelberg 2012

Abstract  This paper provides a comprehensive survey of activity recognition in video surveillance. It starts with a description of simple and complex human activities and of various applications; the applications of activity recognition are manifold, ranging from visual surveillance through content-based retrieval to human-computer interaction. The paper covers all aspects of a general framework for human activity recognition, and it summarizes and categorizes recently published research under that framework. Finally, it provides an overview of benchmark databases for activity recognition, a market analysis of video surveillance, and future directions for work on this application.

Keywords  Image processing · Automated surveillance · Video tracking · Human activity

S. Vishwakarma · A. Agrawal
Indian Institute of Information Technology, Allahabad, India
e-mail: rs51@iiita.ac.in
A. Agrawal
e-mail: anupam@iiita.ac.in
Present address:
S. Vishwakarma
N-297 Third Phase, Shivalik Nagar, BHEL, Haridwar, India
e-mail: sarvesh.vish@gmail.com

1 Introduction

In a traditional surveillance system, it is not possible to detect and prevent suspicious activities at public places, important buildings, etc. A human operator cannot prevent a dangerous situation by keeping a close eye on the monitor screen, owing to the limits of human stamina and to shortages of CCTV cameras and security forces; police guards cannot keep watch continuously. The question arises whether it is possible to make public places safe from any type of terrorist activity or suspicious event, and many scientists and researchers are working toward a solution. This paper discusses the methods and techniques implemented by researchers to perform automatic surveillance. In automatic surveillance, a network of computers processes the input video continuously, thoroughly analyzing the image frames to sense any unusual findings, which are reported to the supervisor for alertness. The procedure is performed at three levels: low-, mid-, and high-level image data processing. In modern civilization, the threats of theft, accidents, terrorist attacks, and riots are ever increasing. Because of the high amount of useful information that can be extracted from a video sequence, video surveillance has emerged as an effective tool to forestall these security problems. Automated surveillance research experiments with video surveillance data to improve the image processing task by developing more accurate and robust algorithms for object detection, tracking, and human activity recognition.

1.1 Previous surveys

This paper identifies possible research directions in vision-based human action recognition and provides an overview of the framework of visual surveillance in dynamic scenes, addressing issues such as environment modeling, motion detection, classification of objects, occlusion handling, feature extraction, unusual activity, the combination of two- and three-dimensional tracking, the combination of motion analysis and biometrics, anomaly detection and behavior prediction, content-based retrieval of surveillance videos, behavior understanding and natural language description, fusion of information from multiple sensors, remote surveillance, etc.

Table 1 Related literature survey summary

First author Yr Topic Ref.

Aggarwal 11 Human activity analysis: A review [2]


Daniel 11 A survey of vision-based methods for [204]
action representation, segmentation and
recognition
Ronald 10 A survey on vision-based human action [142]
recognition
Zhan 08 Crowd analysis [224]
Kang 07 Intelligent visual surveillance [76]
Forsyth 05 Computational studies of human motion [46]
Yilmaz 06 Object tracking: A survey [216]
Valera 05 Intelligent distributed surveillance [187]
Weiming 04 Motion and tracking for surveillance [59]
Moeslund 01 Human motion capture [108]
Aggarwal 99 Motion analysis of human body [1]

Fig. 1 A general framework of human motion analysis

The role of an automated surveillance system is to sense and monitor unusual activity and to alert the human operator to take action against it.

The objective of this article is to focus on the techniques and methodology used to recognize human activity. Previous surveys in related areas are identified in Table 1. This paper provides a comprehensive survey of recent developments in human activity analysis, covering research papers ranging mainly from 2008 to 2012; it thus contains many new references not found in previous surveys. The organization of this paper covers all aspects of the general framework of human activity recognition, such as low-level tasks and intermediate- and high-level tasks, and provides a comparison among a few existing datasets. The paper focuses on three main issues of vision-related tasks. In contrast to past surveys, we provide a detailed discussion of object tracking and its activity-related taxonomy, and we provide more detailed discussions of the open issues and research challenges involved in human activity analysis than earlier reviews.

To discuss the issues related to automated surveillance more conveniently, we will focus on the general analysis system shown in Fig. 1. The remainder of this paper is organized as follows. Section 2 reviews work on low-level tasks, including segmentation and moving object classification. Section 3 covers human tracking, which is divided into six categories of methods: model-based, region-based, active contour-based, feature-based, hybrid, and optical flow-based. The paper then extends the discussion to human activity recognition in image sequences in Sect. 4. Section 5 discusses machine learning methods to represent high-level activity. In Sect. 6, we discuss various types of datasets used for testing activity recognition methods. Some surveys of the surveillance market are presented in Sect. 7. Open issues and challenges are discussed in Sect. 8, and conclusions are drawn in Sect. 9.

1.2 Simple and complex human activity

High-level processing deals with reasoning and perception to recognize human activities (suspicious/unusual actions), and it gets more and more difficult as the complexity level of the activity increases. Figure 2 illustrates an overview of the various types of human activities. Human activity, according to its complexity, is divided into four levels:

(i) Gesture: An elementary movement of a human body part defines a gesture. It is performed in a very short span of time and its complexity remains low. Waving a hand, stretching an arm, and bending are used to define the meaning of the full motion of a person.
(ii) Action: Actions are single-person activities in which multiple gestures (atomic actions) are temporally organized in the time domain. Running, walking, and jumping are good examples of actions.
(iii) Interaction: An activity involving two or more humans and/or objects is called an interaction. Human-human interaction and human-object interaction both come in this category. A carried/abandoned bag, a person stealing a bag from another, and pointing a gun are examples. These types of human activity are lengthy.
(iv) Group activity: Group activities are activities performed by groups composed of multiple persons or objects. A group of soldiers marching, two parties misbehaving in parliament, and groups of people protesting or fighting are examples.

1.3 Applications

In this section, we present a few application areas that highlight the potential impact of human activity recognition systems.
A survey on activity recognition & behavior understanding in video surveillance

Fig. 2 Types of human activities

1.3.1 Behavioral biometrics

Biometrics involves the study of approaches and algorithms for uniquely recognizing humans based on physical or behavioral cues. Traditional approaches are based on fingerprints, face, or iris, and can be classified as physiological biometrics, i.e., they rely on physical attributes for recognition. These methods require cooperation from the subject for collection of the biometric. Recently, "behavioral biometrics" have been gaining popularity, where the premise is that behavior is as useful a cue to recognize humans as their physical attributes. The advantage of this approach is that subject cooperation is not necessary and it can proceed without interrupting or interfering with the subject's activity. Since observing behavior implies longer-term observation of the subject, approaches for action recognition extend naturally to this task. Currently, the most promising example of a behavioral biometric is the human gait [161].

1.3.2 Content-based video analysis

Video has become a part of our everyday life. With video-sharing websites experiencing relentless growth, it has become necessary to develop efficient indexing and storage schemes to improve the user experience. This requires learning patterns from raw video and summarizing a video based on its content. Content-based video summarization has been gaining renewed interest with corresponding advances in content-based image retrieval (CBIR) [151]. Summarization and retrieval of consumer content such as sports videos is one of the most commercially viable applications of this technology [17].

1.3.3 Security and surveillance

Security and surveillance systems have traditionally relied on a network of video cameras monitored by a human operator who needs to be aware of the activity in the camera's field of view. With recent growth in the number of cameras and deployments, the efficiency and accuracy of human operators has been stretched. Hence, security agencies are seeking vision-based solutions to these tasks, which can replace or assist a human operator. Automatic recognition of anomalies in a camera's field of view is one such problem that has attracted attention from vision researchers [190, 229]. A related application involves searching for an activity of interest in a large database by learning patterns of activity from long videos [60, 180].

1.3.4 Interactive applications and environments

Understanding the interaction between a computer and a human remains one of the enduring challenges in designing human-computer interfaces. Visual cues are the most important mode of nonverbal communication. Effective utilization of this mode, such as gestures and activity, holds the promise of helping to create computers that can better interact with humans. Similarly, interactive environments such as smart rooms [137] that can react to a user's gestures can benefit from vision-based methods. However, such technologies are still not mature enough to stand the "Turing test," and thus continue to attract research interest.

1.3.5 Animation and synthesis

The gaming and animation industry relies on synthesizing realistic humans and human motion. Motion synthesis finds wide use in the gaming industry, where the requirement is to produce a large variety of motions with some compromise on quality. The movie industry, on the other hand, has traditionally relied more on human animators to provide high-quality animation. However, this trend is fast changing [46].

Fig. 3 Taxonomy for object detection approaches (background subtraction, statistical methods, temporal differencing, and optical flow) and the lists of selected publications corresponding to each category

With improvements in algorithms and hardware, much more realistic motion synthesis is now possible. A related application is learning in simulated environments; examples include training military soldiers, fire-fighters, and other rescue personnel for hazardous situations with simulated subjects.

An automated surveillance system attempts to detect, recognize, and track objects of interest from video obtained by a camera, which may be either fixed or moving [82]. In aerial surveillance, target sizes are small and targets must be acquired and tracked through changing terrain, evolving appearance, and frequent occlusion. In order to extract any useful information about moving objects, we need to detect and track them over long durations. This is usually performed in the context of higher-level applications that require the location and/or shape of the object in every frame. The proliferation of high-powered computers, the availability of high-quality and inexpensive video cameras, and the increasing need for automated video analysis have generated a great deal of interest in object tracking algorithms.

2 Motion detection

Nearly every human activity recognition system starts with human detection. The task of human detection is to segment the regions corresponding to persons from the rest of an image. It is a significant issue, as subsequent processes such as the tracking and action recognition stages depend on it. This process usually involves segmentation and object classification. There are four conventional approaches to moving object detection: background subtraction, statistical methods, temporal differencing, and optical flow.

2.1 Motion segmentation

Motion segmentation in video images is known to be a difficult problem. The objective of this section is to discuss the various types of approaches involved in detecting the regions corresponding to moving objects. The taxonomy of motion detection in Fig. 3 shows the list of selected publications under each approach. Table 2 presents a detailed comparison of characteristics and a brief description of the four approaches and state-of-the-art methods.

2.1.1 Background subtraction

The background subtraction method detects the target in stationary scenes, but is sensitive to dynamic scenes, which keep changing due to illumination and other events. Morphological operations need to be applied after performing the raw subtraction and thresholding.
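To make this pipeline concrete, the following is a minimal sketch of raw subtraction, thresholding, and morphological cleanup using OpenCV, assuming a static camera; the file name, frame count, and threshold values are illustrative placeholders, not values taken from the surveyed papers.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")  # placeholder clip

# Build a static background model from the median of the first frames.
frames = []
for _ in range(25):
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
background = np.median(np.stack(frames), axis=0).astype(np.uint8)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Raw subtraction and thresholding ...
    diff = cv2.absdiff(gray, background)
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    # ... followed by a morphological opening to suppress speckle noise.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```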
2.1.2 Statistical

The statistical method builds a more advanced form of background subtraction by applying a statistical approach to individual pixels or groups of pixels. It is popular due to its robustness to noise, shadows, changes in lighting conditions, etc.
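A widely used member of this family models each pixel as a mixture of Gaussians updated online (cf. [14, 139] in Table 2). The sketch below relies on OpenCV's built-in MOG2 subtractor; the input clip and parameter values are placeholders.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")  # placeholder clip
# Per-pixel mixture of Gaussians, updated online; detectShadows marks
# shadow pixels with an intermediate gray value in the output mask.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow
    mask[mask == 127] = 0           # drop shadow pixels
```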
2.1.3 Temporal differencing

To identify a moving target across a sequence of image frames, the current image frame is differenced with either the previous frame or the next frame of the image sequence. This is termed temporal differencing. Thresholds are applied to the difference images to eliminate pixel changes due to camera noise, small illumination changes, etc. The method is very adaptive to dynamic scenes, but is not good at extracting all relevant feature pixels.
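The sketch below illustrates simple temporal differencing, extended to the double-difference variant (cf. [75] in Table 2), which suppresses the trailing "ghost" that a single frame difference leaves behind a moving object; the threshold is a placeholder.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")  # placeholder clip

def grab_gray(capture):
    ok, frame = capture.read()
    return None if not ok else cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

f0, f1 = grab_gray(cap), grab_gray(cap)
while True:
    f2 = grab_gray(cap)
    if f2 is None:
        break
    # Threshold two successive frame differences ...
    _, d01 = cv2.threshold(cv2.absdiff(f1, f0), 25, 255, cv2.THRESH_BINARY)
    _, d12 = cv2.threshold(cv2.absdiff(f2, f1), 25, 255, cv2.THRESH_BINARY)
    # ... and AND them: only pixels that changed in both differences
    # survive, which removes the ghost left at the object's old position.
    motion = cv2.bitwise_and(d01, d12)
    f0, f1 = f1, f2
```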
A survey on activity recognition & behavior understanding in video surveillance

Table 2 Performance details of some object detection techniques

Ref.  Model  Scene  Spectral  Camera  Detect  Remarks

Key: Scene: outdoor (o), indoor (i); Spectral: color (c), grayscale (g); Camera: single (s), stereo (t), multiple (m); Detect: single (s), multiple (m), group (g)

[56] blob-based BGS o g s m handle occlusion


[14] mixture of Gaussians i,o g,c m s,m tolerate varying background/gradual
illumination changes
[170] pixel-wise o c s s –
labelling/trajectory removal
[21] discrete wavelet transform i,o g s m not suited for clutter environment
[148] patch by patch texture i,o c s s,m deal with noise, illumination, dynamic
descriptor background
[188] feature histogram i c s s varying illumination
[81] salient motion detection o g s m accuracy matter
[111] autoregression i,o g s m handle position change in light source and
local/global illumination changes and fail to
detect moving object in same direction of
wind flow
[185] independent component i g s s computationally fast, significant illumination
analysis changes
[104] foreground adaptive o g s s,m significant to illumination changes
[106] Markov random fields o c s s global illumination changes, complex motion
[131] topology free MDL-splitting i g s s computationally efficient, handle
HMM sudden/gradual global illumination
[143] dual foreground i,o c s s robust to illumination changes, tolerate
occlusion, noise, difficulty to adjust parameter
[79] single general Gaussian i,o c s m low/static background
[4] mixture of general Gaussian i c s g sudden/or shadow
[42] kernel density estimation i g – s little/background
[94] support vector machine i,o g s s –
[199] support vector regression o c s m tracking-by-detection and slow computation
[181] support vector data description i g s s low contrast and quasi-stationary background
[136] support vector regression i c s s change in position of illumination source and
sudden change in illumination
[127] principal component analysis o g s s varying illumination, weather
[214] independent component i c s s partial illumination changes
analysis
[11] independent non-negative o g t m sudden illumination changes
matrix factorization
[91] incremental rank o c,g s s,g illumination changes, shadow movements
(R1 , R2 , R3 ) tensor
[139] mixture of Gaussians i,o c,g s s,m local/global illumination changes
[196] eigenbackground based i c s s sudden illumination changes
statistical illumination
[133] illumination transfer o g s g local/global illumination changes and low
function computation
[228] candid covariance o g s g extreme illumination changes (in night)
incremental (PCA) capture tiny appearance
[5] sub-sample frame i g m m support partial/total occlusion, detect
differencing stationary foreground region
[92] mask sampling + hough i c s g extreme illumination changes (in night)
transform capture tiny appearance
S. Vishwakarma, A. Agrawal


[29] adaptive BGS + three frame o c t s accommodate slow lighting changes, noise in
differencing imagery
[96] temporal differencing + o g s s partial occlusion, cluttered background,
image template matching ambiguous pose
[75] double difference image o g s s computationally slow, less accurate
[174] optical flow distortion i c t s static/moving object detection
[65] extrapolation of two images i c s m detect occlusion
[38] optical flow on pixel o c s s improved speed/accuracy, detect occlusion
resolution between objects
[67] gradient-based + FPGA i c s s investigation of top’s spinning and a human’s
pitching motion
[18] frame straddling + FPGA i,o c s s less computation cost, high speed, less delay
involve
[97] coarse-to-fine + space o c s s detect texture among similar structures
feature
[20] optical flow context i,o c s m,g suited for fight scene
histogram
[144] Horn–Schunck method o g s s suited for infrared images

2.1.4 Optical flow

The optical flow method is used to detect moving humans or objects independently, based on their individual velocities, even from a camera installed on a movable platform. The method is computationally intensive and takes more time to segment the foreground objects from the scene, which is why specialized hardware (such as FPGAs or GPU cards) is needed to overcome the slow computation. Optical flow is generally used as a feature for both object detection and object tracking; this feature is known as the optical flow vectors. Optical flow vectors represent the apparent velocities of movement of brightness patterns in an image. Optical flow can give important information about the spatial arrangement of the objects viewed and the rate of change of this arrangement. Discontinuities in the optical flow can help in segmenting the image frame into object regions, since an object is detected by its features and tracked by matching the same features in subsequent image frames. The optical flow method is suitable for object tracking, which we discuss in more detail in Sect. 3.
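As an illustration of dense optical flow used as a detection feature, here is a minimal sketch using OpenCV's Farneback estimator, one classical dense-flow method among several; the input clip and the motion threshold are placeholders.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")  # placeholder clip
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow: one (dx, dy) vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                        pyr_scale=0.5, levels=3,
                                        winsize=15, iterations=3,
                                        poly_n=5, poly_sigma=1.2, flags=0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Pixels with large flow magnitude are candidate moving regions;
    # discontinuities in the field suggest object boundaries.
    moving = (magnitude > 1.0).astype(np.uint8) * 255
    prev = gray
```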
2.2 Object classification

Object classification refers to the task of automatically distinguishing a moving target of interest from other moving objects, such as vehicles, flying birds, and clouds, across successive frames in an image sequence. Classifying the object is an essential step in further tracking and analyzing its behavior for high-level tasks. Small object sizes and low-resolution imagery occur for far-field camera views. Several classification methods have been proposed in the literature. The main categories that we identify are shape-based classification and motion-based classification, or a combination of both. Shape-based classification methods classify moving objects by identifying shape information such as representations of points, boxes, blobs, and silhouettes of motion regions in the image. Motion-based methods use the periodic property of moving entities to categorize both rigid and nonrigid articulated motion. In this section, we provide an overview of some selected papers on object classification techniques. Table 3 shows the performance details of some classification techniques for shape- and motion-based methods.
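To make the shape-based cues concrete, the following sketch computes the metrics that appear in Table 3 (blob area, dispersedness as perimeter squared over area, and bounding-box aspect ratio, cf. [29, 96]) from a binary foreground mask; the noise threshold is an assumed placeholder.

```python
import cv2

def shape_descriptors(mask):
    """Shape cues used by classifiers such as [29, 96]: blob area,
    dispersedness (perimeter^2 / area), and bounding-box aspect ratio.
    `mask` is a binary foreground image from motion segmentation."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    feats = []
    for c in contours:
        area = cv2.contourArea(c)
        if area < 100:  # ignore tiny noise blobs (placeholder threshold)
            continue
        perimeter = cv2.arcLength(c, closed=True)
        x, y, w, h = cv2.boundingRect(c)
        feats.append({
            "area": area,
            # Humans tend to score higher than vehicles because limbs
            # make the silhouette more dispersed.
            "dispersedness": perimeter ** 2 / area,
            "aspect_ratio": w / h,
        })
    return feats
```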
3 Techniques used for object tracking

Object detection and tracking are certainly important in the captured video frames/images. Captured videos can contain strong parallax; parallax actually carries rich information about the 3D nature of the scene as well as about moving ground objects, but it still makes it challenging to extract meaningful information for scene understanding and object tracking. A geo-referenced depth estimation algorithm [210] is used on input aerial video sequences to compute the depth from the video frames, and the whole video is registered with respect to one global reference to generate corresponding aligned images.
A survey on activity recognition & behavior understanding in video surveillance

Table 3 Performance details of some object classification techniques

Study Method Classification Classifier tools Classification metrics Remarks

Collins et al. [29] shape single human, vehicles, neural network image blob area, dispersedness, classifier results kept in
human groups, clutter aspect ratio of blob bounding box histogram
Lipton et al. [96] shape human, vehicles, clutter neural network image blob area, dispersedness temporal consistency was
used for preciseness
Kuno et al. [83] shape human, vehicles, silhouette/shape shape parameter of human mean/standard deviation of
butterflies silhouettes pattern shape parameters
Cutler & Davis [32] motion human self-similarity, periodicity of moving objects –
time-frequency
analysis
Lipton [95] motion human, vehicles optical flow residual flow of rigidity, –
periodicity of moving objects
Stauffer [179] motion human, vehicles time color, velocity –
cooccurrence
matrix
Mohan et al. [110] motion component of human haar wavelets – –
body(head, legs, left arm,
right arm)
Zhang et al. [227] shape person, car, van, people multi-block local visual features work well for small size
groups, truck, bus binary pattern objects or low resolution
video
Javed & Shah [71] motion pedestrian, vehicles Recurrent size, compactness, aspect ratio, work well for restricted
Motion Images simple descriptor of shape or settings
motion
Bose & Grimson [9] motion human, vehicles scene-invariant image position & direction of fine for low resolution
classification motion of objects imagery & projective
image distortion
Brown [10] motion human, vehicles two-phase size, speed, color, texture just recognize human from
classification vehicles
Ma & Grimson [103] motion types of vehicles edge point sift repeatable/discriminative not recognize vehicle
descriptor features identity, not fine under view
changes, occlusion
Tsuchiya & Fujiyoshi shape single human, vehicles, adaboost shape, texture, speed computationally
[186] motion human groups, bike inexpensive, invariant to
lighting condition or
viewpoint

Then, based on the estimated depth, nonground regions are segmented, and the planar fitting plus depth extension approach is applied to extract the structure of buildings, the shapes of trees, moving cars/vehicles, pedestrians, water tanks, etc. Due to varying image quality, the estimated depth map is highly variable; to obtain a consistent depth map over the video frame sequences, bilateral depth fusion techniques are used to refine the depth map by fusing low-quality depth information [114]. A great variety of object tracking approaches is found in the published literature. The best known and most successful algorithms include model-based tracking using geometric models, sum of squared differences of optical flow, Bayesian random-sampling techniques [115], particle [226] and Kalman filtering [43], adaptive background subtraction [101], multiple hypothesis tracking approaches, affine image registration and local motion estimation, and mean shift tracking algorithms [8]. To achieve real-time performance in motion tracking, Jing Huang et al. have described a technique based on GPU-accelerated computation using the CUDA framework [61]. In the majority of tracking applications, the measurements from the scene of interest are acquired at discrete times referred to as scans [15]. The measurements are then used to deduce information about the objects being tracked, i.e., the targets. The measurements may originate from targets, other objects, or just false detections in noisy signals; all unwanted measurements are usually referred to as clutter. The task of tracking objects as they move in substantial clutter, and doing it at or close to the video frame rate, is challenging because in the most severe cases the background may consist of objects similar to the foreground object(s), e.g., when a person is moving past another person, a group of people, or a crowd. Figure 4 depicts the taxonomy for object tracking. In [114], methods for tracking a moving object are divided into six major categories:

Fig. 4 Taxonomy for object tracking approaches (region-based, contour-based, feature-based, model-based, hybrid, and optical flow-based tracking) and the lists of selected publications corresponding to each category

a. Region-based tracking
b. Contour-based tracking
c. Feature-based tracking
d. Model-based tracking
e. Hybrid tracking
f. Optical flow-based tracking

3.1 Region-based tracking

These algorithms track objects according to the variation of the image regions corresponding to the moving objects. Meyer and Bouthemy [105] tried to track the shape and position of an object in successive frames using a recursive algorithm in complex outdoor scenes; this algorithm was based on a first-order motion model. In [43], the motion regions are detected by subtracting the background from the current images. Schmaltz et al. [162] focused on the separation of the object from its background using a localized mixture model, partitioning the foreground and background into several subregions. Salembier and Marques [160] discussed various types of algorithms used to convert pixel-based representations into region-based representations for multimedia applications. Jong Kim and Joon Kim [80] pursued another route to segment a moving object in a traffic scene: they first detected the position of the moving object by applying adaptive thresholding, and then obtained the segmentation of the moving pixels using the k-means clustering algorithm. In Sclaroff and Isidoro's work [167], both the appearance and the shape of the object were integrated and processed in a color texture map and a triangular mesh, respectively, to track nonrigid motion. The recent work by Schmaltz et al. [163] focused on the occlusion and self-occlusion problems found in multiobject tracking; they attempted to handle occlusions and self-occlusions using a probability density function by tracking multiple objects and object parts simultaneously.
3.2 Contour-based tracking

In this method, instead of tracking the whole set of pixels comprising an object, the algorithm tracks only the contour of the object [182]. An interesting edge-based feature approach for detecting and tracking nonrigid objects was proposed by Yokoyama and Poggio [220]. It assumes that the image intensity at times t and t + dt remains the same. They extracted the contour of the moving object in four steps (line restoration, line-based background subtraction, clustering, and active contours), and the method works in real time. A contour-based nonrigid object tracking method via a contour energy function was proposed by Alper Yilmaz et al. [217], who tracked the complete region of nonrigid objects and recovered the occluded object parts. Chiverton et al. [23] used a shape-based level set active contour framework to segment and track objects that undergo frequent shape deformation, by defining the outline of the object in the first frame. In another paper, the same authors proposed the concept of a finite-sized shape memory to continuously remember relevant shape information and automatically eliminate unnecessary shape information [24]. Objects were tracked through sequences under low-contrast scenes by integrating both region and boundary features in the algorithm proposed by Ling Cai et al. [12]. Based on kernels and active contours, Qiang Chen et al. [19] located objects in complex situations and precisely tracked the object contour. Paragios and Deriche [130] applied geodesic active contours to detect and track moving objects in a sequence of images; they determined detection and tracking boundaries using probabilistic edge detectors, and level set theory was applied to reduce the computational cost. Niethammer et al. [121] used the geodesic active contour originated by Paragios and Deriche [130], in which the state information (normal velocity) of every particle on a contour is represented by means of level set functions.

3.3 Feature-based tracking

In this section, we discuss feature-based object tracking involving both static and dynamic features. Static features are extracted from the appearance of the object, such as color, geometry, texture, and image templates, while dynamic features are derived from mathematical tools such as the particle filter, the Kalman filter, and Bayesian networks. Motion and behavior are the two important elements of dynamic features.
A survey on activity recognition & behavior understanding in video surveillance

Fig. 5 Feature-based tracking decision conceptual diagram

Figure 5 shows a conceptual diagram of a feature-based tracking model. Coifman et al. [28] worked on a video image processing system to detect and track passenger cars, motorcycles, buses, and trucks in real time. They took the corner points of vehicles as the relevant feature and tracked them using the Kalman filter, which makes the system less sensitive to partial occlusion. YoungJoon Chai et al. [16] carried out research on tracking objects using a particle filter based on an integral histogram with a kinematic chain model. Jang and Choi [70] applied a greedy algorithm to minimize an energy function that monitors shape, size, and color features; they tracked humans moving in different directions while preventing them from getting mixed up with other persons. Shaohua Kevin Zhou et al. [230] proposed a particle-filter-based adaptive state transition paradigm for tracking with the following properties: it handles appearance changes between frames and between the frame and gallery images, occlusion analysis is embedded in the particle filter, and it stabilizes the tracker by embedding linear prediction into stochastic diffusion. Dorin Comaniciu et al. [30] used a Kalman filter framework based on the color histogram, in conjunction with a kernel-based target localization method, for tracking human beings in the subway. Ren and Chua [149] employed the spatial and color information of the target object in the expectation maximization (EM) framework to tackle illumination changes and shadow reflection problems effectively. Zhiqiang Wen and Zixing Cai [205] used the "weighted color kernel histogram" approach to eliminate background pixels from the object model; the mean shift algorithm was then applied to track the object robustly against occlusion in an outdoor environment. Feng Lin et al. [93] developed an approach for a target tracking system using a set of static and dynamic features. The system starts tracking the target with its color and shape, and after that it associates motion features and controls their values using a discriminant function for robust target tracking in a complex environment.
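As an illustration of the color-histogram trackers discussed above (cf. [30, 205]), here is a minimal mean-shift sketch in OpenCV; the initial window is hand-picked and the clip name is a placeholder, and the cited papers' exact kernels and weighting schemes are not reproduced.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")      # placeholder clip
ok, frame = cap.read()
x, y, w, h = 300, 200, 60, 120                  # hand-picked initial window

# Model the target by a hue histogram, as in kernel/color-histogram trackers.
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-project the model histogram, then let mean shift climb to the
    # nearest mode of the resulting likelihood image.
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    _, window = cv2.meanShift(backproj, window, term)
```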
3.4 Model-based tracking

Models are usually constructed off-line with manual measurements or computer vision techniques. The more complex the model structure, the more expensive the computation. Strictly speaking, model-based tracking is an example of feature-based tracking; the reason it is described independently is the requirement of grouping, reasoning, and rendering, which distinguishes it from feature-based tracking. In addition, prior knowledge about the investigated models is normally required; for example, in the case of multiple-object tracking, the binary representations (models) of the targets must be obtained a priori. This may be followed by applying a stage of model recognition. The Hough transform was utilized to achieve a similar idea in [208]. Model-based tracking schemes share the same challenges as feature-based trackers; for example, occlusion is a significant cause of instabilities, resulting in poor tracking performance. To track human motion in activities such as walking, running, and jumping, Gavrila and Davis [48] matched edges in the image with those of an appearance model using distance transforms. A decomposition approach and a best-first technique were used to search through the high-dimensional pose parameter space, and a robust variant of Chamfer matching was used as a fast similarity measure between synthesized and real edge images. Youding Zhu et al. [231] reconstructed the 3D pose of humans with the help of time-of-flight data to track them reliably in a complex environment. Ong and Gong [129] adopted the condensation algorithm to track the 3D skeleton of the human body in their dynamical linear framework. Daniel Vlasic et al. [194] estimated the skeleton pose to obtain the closest mesh points on the contour and captured detailed geometric information about the skeleton and shape; they achieved high-quality tracking by providing full correspondence and correct topology in an animated mesh structure. Kong-man Cheung et al. [22] used markers and sensors on the human body to capture its skeleton motion; they experimented with skin-tight clothing and manual adjustment so that the markers moved rigidly with the limbs.

3.5 Hybrid tracking

In Alexander Ladikos et al. [84], the authors used both template and feature-based approaches to build a real-time tracking model. It tracked small patches on the moving object in order to cope with illumination changes and partial occlusions. Constantinos Lalos et al. [85] designed a hybrid approach by combining region- and feature-based techniques. They tracked moving objects using the Rao-Blackwellized particle filter in order to provide an improved description and extract initial information for ongoing events. HOG parameters were studied for a moving object using a set of two thousand extracted patches.

3.6 Optical flow-based tracking

Optical flow is the vector field that describes how the image changes with time.

The two-dimensional projection of the three-dimensional velocity field observed by the camera needs to be estimated. However, this has proved extremely difficult to achieve due to problems such as the aperture effect. To find the optical flow in an image sequence, one can use feature-based, gradient-based, or correlation-based approaches. Most of these approaches are computationally intensive, and hence computational optimization is in demand. Optical flow methods are normally used for generating dense flow fields by computing the flow vector of each pixel under the brightness constancy constraint [166]; this computation is often carried out in the neighborhood of the pixel either algebraically [100] or geometrically [165]. Extending optical flow methods to compute the translation of a rectangular region is somewhat achievable. One example was reported in [172], where Shi and Tomasi proposed the well-known Shi-Tomasi-Kanade (STK) tracker, which iteratively computes the translation of a region centered on an interest point. Once the new location of the interest point is obtained, the STK tracker evaluates the quality of the tracked patch by computing the affine transformation. This scheme works effectively and is fast in most circumstances; further work is needed to reduce incorrect point correspondences.
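The following sketch shows the same iterative patch-translation idea in the spirit of the STK tracker, using OpenCV's pyramidal Lucas-Kanade implementation rather than the original algorithm; the clip name and parameters are placeholders.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")      # placeholder clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Select corner-like interest points to track.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)
while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Iteratively solve for each point's translation on an image pyramid.
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                                winSize=(15, 15), maxLevel=2)
    # Drop points whose match failed: a crude guard against the
    # incorrect point correspondences mentioned above.
    good = status.ravel() == 1
    pts = nxt[good].reshape(-1, 1, 2)
    prev_gray = gray
```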
Girisha et al. [51] calculated optical flow from the silhouette region of a moving human using the two-way ANOVA method. Simon Denman et al. [36] proposed a system in which the optical flow calculation is done within adaptive background segmentation using the YCbCr image format instead of gray-scale images. They analyzed only those pixels of the images that are in motion, rather than the whole image; this avoids unnecessary CPU computation and enables detection of horizontal and vertical movement in real time. In [37], the same authors provided a feedback mechanism to add additional information into the existing object tracking algorithm. Serdar Ince et al. [65] worked on the problem of spatial discontinuities by detecting occlusions and extrapolating optical flow in occluded areas; they used more than two frames, rather than two consecutive frames, to exploit the temporal correlation. Hidetomo Sakaino [159] focused on solving large displacements of semitransparent objects between frames using a two-step optimization of the optical flow estimation model. A Gaussian mixture motion model based on the optical flow method is presented in [168]; the authors implemented a short-time and long-time, speed- and direction-independent motion descriptor to detect and track carried baggage under changing light conditions.

4 Techniques used for action recognition

Human activity recognition was classified by Aggarwal and Ryoo [2] into two categories, nonhierarchical and hierarchical approaches, based on the direct or indirect recognition of human activities from the input video. Based on the interpretation, the nonhierarchical approach is divided into two classes, space-time approaches and sequential approaches, which are further classified into three and two subclasses, respectively. Similarly, the hierarchical approach can be classified into three categories: statistical, syntactic, and description-based approaches. Figure 6 shows a detailed taxonomy of the human activity recognition approaches covered in this section, together with a number of publications corresponding to each category.

4.1 Nonhierarchical approaches

These cover simple and short activities such as primitive actions and periodic activities (running, jumping, waving, etc.).

Fig. 6 Taxonomy for human activity recognition approaches (nonhierarchical: space-time volume, space-time trajectories, space-time features, sequential exemplar, and state-based; hierarchical: statistical, syntactic, and description-based) and the lists of selected publications corresponding to each category
A survey on activity recognition & behavior understanding in video surveillance

Fig. 7 Space-time approaches

The nonhierarchical approach recognizes human activity from unknown image sequences by running a matching algorithm on them with reference to a training set containing the predefined activity classes.

4.1.1 Space-time

The space-time approach represents activity as a volume, as trajectories, or as a set of features. A matching algorithm is used to match the input correctly with its representative model to recognize the activity class. As shown in Fig. 7, activity recognition can be performed by matching similarity in shape and appearance through any of these three types of models.

Space-time volume  Bobick and Davis [7] focused on recognizing activity in real time by using two components of vector images: the motion-energy image (MEI) and the motion-history image (MHI). For unknown video frames, the method first constructs a vector image, which is then matched against stored representations of known movements. It works successfully against static backgrounds, where the motion of the moving object can be separated easily. Instead of reconstructing a three-dimensional model of the person, it describes the motion spatially using the motion-energy image and then investigates its evolution using the motion-history image. Prior foreground/background segmentation is required in this approach. It represents and recognizes simple actions like sitting, arm waving, and crouching. Shechtman and Irani [169] avoid explicit flow computation by employing a rank-based constraint directly on the intensity information of spatio-temporal cuboids to enforce consistency between a template and a target. The method is used to detect "behaviors of interest" in a video database and was checked on the Weizmann dataset. It detects similar behaviors and activities in video sequences despite differences in appearance due to different clothing, backgrounds, and illumination; no prior modeling or learning of activities is needed. It can handle video sequences of complex dynamic scenes, such as dynamic events that contain unstructured objects like running water, flickering fire, and complex ballet movements. It is not restricted to a small set of predefined activities and can tolerate small template deformations; the system was able to detect different types of multiple activities occurring simultaneously in the camera's field of view in the presence of cluttered dynamic backgrounds. Rodriguez et al. [150] devised a method in which a MACH filter operates on spatiotemporal volumes and vector-valued data; it lowers the computational cost by analyzing the frequency response of the filter. Yan Ke et al. [77] fused shape- and flow-based techniques for event matching in video; the method does not require a background model. Gorelick, Blank et al. [53] represented actions as space-time shapes, which contain both spatial and dynamic information; the method was fast, did not require prior video alignment, and achieved reliable performance in action recognition.
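A minimal NumPy sketch of the MEI/MHI update rule follows; the motion mask here is crude frame differencing rather than the segmentation used in [7], and the clip name, threshold, and decay constant are placeholders.

```python
import cv2
import numpy as np

TAU = 0.5                                   # history length in seconds
cap = cv2.VideoCapture("action.avi")        # placeholder clip
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
mhi = np.zeros(prev.shape, np.float32)

t = 0.0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t += 1.0 / fps
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    moving = cv2.absdiff(gray, prev) > 30   # crude motion mask
    prev = gray
    # MHI update rule: moving pixels take the current timestamp, and
    # pixels that have not moved within the last TAU seconds are zeroed,
    # so recent motion appears brighter than older motion.
    mhi[moving] = t
    mhi[~moving & (mhi < t - TAU)] = 0
    # The MEI is simply the thresholded MHI (where any motion occurred).
    mei = (mhi > 0).astype(np.uint8) * 255
```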
Trajectories  In 1995, Campbell and Bobick published a paper [13] in which they recognized nine atomic movements of a ballet dancer by tracking trajectories of joint positions in a 3-D XYT plane. Rao and Shah [147] proposed a view-invariant trajectory matching method to learn human actions without any training model and to avoid ambiguity between actions and their trajectories. In [218], Yilmaz and Shah constructed a spatiotemporal volume by stacking the contours of object regions in consecutive frames; the check points of the contour were tracked in a 4-D XYZT space. Sheikh et al. [171] proposed a matching algorithm that works on a set of 13 joint trajectories in a 4-D XYZT space. This technique recognized simple actions (i.e., sitting, standing, and dancing) despite changes in viewpoint, anthropometry, and execution rate, while Khan et al. [78] targeted recognizing complex behavior in group activities (e.g., people parades) through a 3D polygon. Every individual in the group was represented by corner points of the 3D polygon; each individual entity was tracked in consecutive frames and recorded to obtain corner-point trajectories in a 4-D XYZT space. Oikonomopoulos et al. [125] introduced a representation of human action as a collection of short trajectories extracted by a particle filtering tracking scheme in space and time; they used a longest common subsequence algorithm to compare different sets of trajectories corresponding to different actions.

Space-time features  A space-time feature is a local feature of the 3D volume on a space-time scale. Local features often serve as a good approximation for representing and recognizing human activities: methods based on space-time features assume that the 3-D space-time volume is a rigid 3-D object and describe each action's 3D volume by performing an object-matching procedure. Space-time features thus provide a concise description of a human action's 3D volume by solving the object-matching issue. Zelnik-Manor and Irani [223] proposed an event-based distance measurement approach in which they utilized local features at multiple temporal scales. It does not require any prior knowledge about the event model or background segmentation. Due to the intensity-gradient operation on video volumes at multiple temporal scales, the system shows inferior accuracy as the complexity of the video increases, and it is not useful for recognizing multiple activities in a video. Laptev and Lindeberg [87] proposed a generalization of the Harris and Forstner interest point detectors to localize a compact representation of the event. Similarly, Gilbert et al. [50] used a 2D Harris corner detector and a data mining approach to localize multiple actions in scale-invariant real time, but this required a lot of training samples. Furthermore, Schuldt et al. [164] classified multiple actions by applying SVMs to [87] and illustrated their impact on activity recognition; a new database called the "KTH actions dataset," containing action videos (e.g., jogging and hand waving), was introduced and has been widely adopted. Dollar et al. [39] proposed a new "spatio-temporal feature" method with cuboid prototypes for the recognition of human (and animal) actions; it does not support view invariance or multiple activities. Chomat et al. [25] proposed a method for recognizing human actions directly from image measurements. They modeled a segment of video as an (x, y, t) spatio-temporal volume and computed local appearance models at each pixel using Gabor filters at various orientations, spatial scales, and temporal scales. Niebles et al. [120] also built on the interest-point feature extraction approach and recognized multiple activities in a single video sequence, robustly to scale changes, together with their localization. However, it does not support view invariance, as it models an action class by clustering cuboids into a set of video code words, and the recognition of a simple action does not require a spatial-temporal configuration of the local features extracted from the 3D space-time volume. Wong et al. [207] captured the global information (spatiotemporal locations) of a sparse set of interest points by constructing a pLSA model with an implicit shape model (pLSA-ISM). In contrast to the pLSA used by [120], their pLSA-ISM captures the relative spatio-temporal location information of the features from the activity center, successfully recognizing and localizing activities in the KTH dataset. They reliably recognized actions and gestures, but the method is not suited to multiple-action recognition. In a slightly different approach, Laptev et al. [89] considered a 3D XYT space as a sequence of grids, where each grid consists of the portion found in the spatiotemporal histogram. This approach recognizes natural human actions but has less tolerance toward noise; it has been checked on realistic videos (e.g., movie scenes), similar to [150]. Ryoo and Aggarwal [156] presented an algorithm to estimate the structural correspondence between two continuous videos; their algorithm is modeled by spatiotemporal relationship matching and can recognize complicated human activities such as human-human, human-object, and human-object-human interactions. In [193], the authors considered multiclass activities fused in a three-dimensional (spatial and time) coordinate activity recognition system to achieve maximum accuracy. In a recent work of Vishwakarma and Agrawal [192], they quantized the feature vectors of interest points using a histogram; their approach delivered the same performance with a smaller number of features, worked well in semantically varying events, and was robust to scale and view changes.
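As a sketch of this histogram-quantization idea, the following builds a k-means codebook over hypothetical local descriptors and turns each video into a fixed-length visual-word histogram; the data shapes, cluster count, and scikit-learn dependency are assumptions, not details taken from [192].

```python
import numpy as np
from sklearn.cluster import KMeans  # third-party package, assumed available

def bow_histogram(descriptors, codebook):
    """Quantize local space-time descriptors against a learned codebook
    and return a normalized histogram of visual-word counts."""
    words = codebook.predict(descriptors)
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)

# Hypothetical data: one descriptor matrix per training video, e.g.,
# cuboid descriptors computed around detected interest points.
train_descriptors = [np.random.rand(120, 64) for _ in range(10)]

# Learn the codebook (visual vocabulary) by k-means over all descriptors.
codebook = KMeans(n_clusters=50, n_init=10, random_state=0)
codebook.fit(np.vstack(train_descriptors))

# Each video becomes a fixed-length histogram, ready for an SVM or k-NN.
features = np.array([bow_histogram(d, codebook) for d in train_descriptors])
```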
4.1.2 Sequential

Sequential approaches recognize human activities by analyzing the sequences of features extracted from the input video frames. These sequences of features are called observation sequences for a particular class of activity in videos. The idea is to identify similarity measures associated with the observation sequences extracted from an input video.

Exemplar  Exemplar-based recognition approaches aim at directly extracting sequences of feature vectors from the input video and then comparing them with template sequences using an algorithm such as dynamic time warping (DTW).

Darrell and Pentland [35] proposed a DTW algorithm and used a view model to recognize gesture actions and effectively handle variation in the execution of actions; in their work, "hello," "good-bye," and "come closer" gestures were recognized successfully. Gavrila and Davis [48] proposed a 3D joint-angle model and associated it with the DTW algorithm to recognize human movements at the gesture level (e.g., waving a hand for a hello gesture, or beckoning someone to come near). Yacoob and Black [212] considered a model in which different actions were recognized with different types of features (sets of eigenvectors); they used PCA-based modeling and the singular value decomposition technique to represent activities. Efros et al. [41] presented a methodology (a motion descriptor along with optical flow) to track people's activities in public places where each person's height is around 30 pixels. Lublinerman et al. [99] detected simple actions using a linear time-variant system; their system was sensitive to noise, and its performance depended on the correctness of the background subtraction. Veeraraghavan et al. [191] developed a model that extends the DTW matching algorithm with a time-function parameter to monitor the execution speed of an action over time; this model helped to represent pickup, throwing, pushing, and waving actions while accommodating variation in intra- and interperson speed. Jiang et al. [73] used a geometrical model of local human parts to characterize an action as a sequence of postures and then applied a sequence matching method for action recognition. In [189], Vaswani et al. use nonrigid shapes and a dynamic model to represent activity trajectories.
A survey on activity recognition & behavior understanding in video surveillance

State-based State based approaches are the sequential ap- 4.2.1 Statistical
proaches, which represent a human activity as a model. The model is trained statistically on feature vectors extracted from a particular class of activity. Yamato et al. [213] adopted a hidden Markov model (HMM) to represent and recognize activities; HMMs had originally been widely used for speech recognition. Starner and Pentland [178] also used standard HMMs and further extended their application to gesture-level action recognition. In order to recognize American Sign Language (ASL), they modeled each word as an HMM and generated a sequence of features describing the location and shape of a hand. However, this HMM was not designed to handle the large number of combinations needed to represent ASL. Vogler and Metaxas [195] employed parallel HMMs (PHMMs) to lower the number of combinations used in the recognition of ASL. Bobick and Wilson [8] also recognized gestures using state models. Here, a gesture was represented as a 2-D XY trajectory describing the location changes of a hand; each curve is decomposed into sequential vectors, which can be interpreted as a sequence of states computed from a training example. Oliver et al. [127] introduced coupled hidden Markov models (CHMMs) for modeling complex individual activities; the traditional HMM could not recognize interactions between more than two people due to the large size of the parameter set, and experimental results demonstrated that CHMMs are better than HMMs from the learning point of view. Park and Aggarwal [135] estimated human body gestures using Bayesian networks and modeled the evolution of two-person interactions (such as turning the head in a left-right direction or a gesture used in greeting each other) by dynamic Bayesian networks (DBNs). In Natarajan and Nevatia's work [116], coupled hidden semi-Markov models (CHSMMs) were proposed for modeling and recognizing the duration of subevents and representing the interaction between multiple people; the major limitation of this model is its complexity. Gupta and Davis [54] proposed a probabilistic model that exploits contextual information for visual action analysis to improve object recognition as well as activity recognition. Hidden Markov models and Bayesian relations are used by Moore et al. [113] to classify objects and identify human motion, but their method is limited to an overhead view and only tracks hands. Peursum et al. [138] suggest the use of human actions to infer object class based on the location and identity of objects.

4.2 Hierarchical approaches

Hierarchical approaches define recognition methodologies for complex human activities such as human-object interactions and group activities.

4.2.1 Statistical

Statistical approaches use statistical state-based models to recognize activities. In the case of hierarchical statistical approaches, multiple layers of state-based models (usually two layers), such as HMMs and DBNs, are used to recognize activities with sequential structures. At the bottom layer, atomic actions are recognized from sequences of feature vectors, just as in single-layered sequential approaches; as a result, a sequence of feature vectors is converted into a sequence of atomic actions. The second-level models treat this sequence of atomic actions as their observations. For each model, the probability of the model generating the sequence of observations (i.e., atomic-level actions) is calculated to measure the likelihood between the activity and the input image sequence. Oliver et al. [128] developed layered hidden Markov models (LHMMs) to model real-time office activities. LHMMs can be regarded as a cascade of HMMs, in which the upper layer's HMM and the bottom layer's HMM are connected via inferential results. Zhang et al. [225] used layered HMMs to recognize group actions from recognized individual actions. Nguyen et al. [119] used hierarchical HMMs (HHMMs), which recognized simple behaviors correctly but were not well suited to complex behaviors. Similar to [128], Shi et al. [173] proposed a hierarchical approach using a propagation network (P-net); their work focused on a common task for elderly people who develop late-stage diabetes. Yu and Aggarwal [221] used a block-based discrete hidden Markov model to recognize and analyze observation sequences containing a set of multiple actions; however, it failed to recognize multiple concurrent actions. Cupillard et al. carry out event recognition by means of a graph combination mechanism; in [31], an approach for recognizing the behavior of groups of people using multiple cameras is presented. Dai et al. [33] indicated that a multilevel dynamic Bayesian network (DBN) model enables detailed online analysis to detect multilevel events. Damen and Hogg [34] constructed Bayesian networks using AND-OR grammars to encode pairwise event constraints. Gong and Xiang [52] proposed a dynamically multi-linked hidden Markov model, with the structure determined by the Bayesian Information Criterion (BIC), to automatically summarize or categorize activities within video; it avoids the difficulties associated with tracking under occlusion in noisy scenes.
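To make the two-layer idea concrete, the following minimal sketch (our illustration, not code from [128]) scores a sequence of atomic-action symbols, assumed to be produced by a hypothetical bottom layer, against two discrete top-layer HMMs using the scaled forward algorithm; all symbols and parameters are invented for the example.

# Minimal sketch of top-layer HMM scoring in a layered model: a bottom
# layer (not shown) has already converted feature vectors into atomic
# actions; each candidate activity is a discrete HMM over those symbols
# and the most likely model wins. All parameters are illustrative.
import numpy as np

def hmm_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete HMM.
    pi: (S,) initial probs, A: (S, S) transitions, B: (S, V) emissions."""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik

# Atomic-action vocabulary: 0 = approach, 1 = raise_hand, 2 = shake
activities = {
    "hand_shake": (np.array([1.0, 0.0]),
                   np.array([[0.6, 0.4], [0.1, 0.9]]),
                   np.array([[0.7, 0.2, 0.1], [0.05, 0.25, 0.7]])),
    "wave":       (np.array([0.5, 0.5]),
                   np.array([[0.5, 0.5], [0.5, 0.5]]),
                   np.array([[0.3, 0.6, 0.1], [0.2, 0.7, 0.1]])),
}

observed = [0, 0, 1, 2, 2]  # output of the (hypothetical) bottom layer
scores = {name: hmm_log_likelihood(*params, observed)
          for name, params in activities.items()}
print(max(scores, key=scores.get))  # -> most likely activity label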
4.2.2 Syntactic

Syntactic approaches model human activities as a string of symbols, where each symbol corresponds to an atomic-level action. Ivanov and Bobick [68] suggested using stochastic context-free grammars (SCFGs) to model visual activities
and introduced it on an upper layer to compute the probability of temporally consistent sequences of primitive actions. This worked well for temporally extended behaviors and interactions between multiple objects; however, modeling temporally explicit behavior among more than two interacting objects remained a difficult task. Joo and Chellappa [74] designed an attribute grammar for recognition, which is an extension of the SCFG. Their grammar attaches semantic tags and conditions to the production rules of the SCFG, enabling the recognition of more descriptive activities; that is, it is able to describe feature constraints as well as temporal constraints of atomic actions. Moore and Essa [112] further extended the work of [68] by incorporating an error detection and recovery mechanism, focusing on multitask activities. One of the major drawbacks was the inability to compensate for failures on complicated human activities. Moreover, these grammars are not equally suited to visual modeling tasks that involve essentially non-sequential data, for example the spatial relationships in a single visual image. Minnen et al. [107] present a system that uses human-specified grammars (i.e., SCFGs) to recognize a person performing a wood-workshop activity in a video sequence by analyzing object interaction behaviors separately within the structure of the specified grammar. For example, using a drill machine is a whole activity that is divided into three subevents (e.g., "switch," "drill," and "switch off"); all three subevents are analyzed separately in detail to yield corresponding subactivities, which are then composed sequentially to recognize the whole activity.
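As a sketch of how such a grammar operates (a toy example of ours, not the grammar from [68] or [107]), the snippet below parses a stream of atomic-action symbols with a small SCFG using NLTK's Viterbi parser; the subevent names, production rules, and probabilities are all invented for illustration.

# Toy SCFG for a "drilling" activity: atomic actions from a low-level
# recognizer are terminals, and the most probable parse explains the
# stream as subevents of the composite activity.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    DRILLING -> SWITCH_ON DRILL SWITCH_OFF [1.0]
    SWITCH_ON -> 'approach' 'press_button' [0.8] | 'press_button' [0.2]
    DRILL -> 'lower_arm' 'hold' [0.6] | 'hold' [0.4]
    SWITCH_OFF -> 'press_button' [1.0]
""")

# Symbol stream from a hypothetical atomic-action detector.
atomic_actions = ['approach', 'press_button', 'lower_arm', 'hold', 'press_button']

parser = ViterbiParser(grammar)
for tree in parser.parse(atomic_actions):
    print(tree.prob())   # probability of the best derivation (0.48 here)
    tree.pretty_print()  # shows how subevents compose the activity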
4.2.3 Description-based

A description-based approach is a recognition approach that explicitly maintains the spatiotemporal structure of human activities. It represents a high-level human activity in terms of the simpler activities composing it (i.e., subevents), describing their temporal, spatial, and logical relationships. Based on the IA-network concept, Pinhanez and Bobick [140] proposed the PNF-network and showed that their method is computationally tractable even when one of the atomic actions is not provided. However, the scope of their methodology was limited to temporal information, and one of its drawbacks is that a subnetwork has to be specified redundantly against occurrences of its subevents. Intille and Bobick [66] designed a more complex Bayesian network model to identify the actions of a football player in a crowded scene. They described the temporal structure in a programming-language format instead of a network form. It can recognize highly structured yet uncertain visual perception (e.g., players continuously interacting with each other and simultaneously changing their behavior) by representing it in a three-level hierarchy (atomic-level activities, individual-level activities, and team-level activities) to deal with the uncertain and incomplete nature of real-world applications. Siskind [175] also proposed a hierarchical description-based approach for human activity recognition. Notably, it decomposed high-level activities into primitive events and applied force dynamics to recognize simple actions; event logic is then used to combine the primitive events of participating humans/objects to recognize complicated semantic-level complex events. However, computational models of language grounding remain a challenging problem. The representation language called Video Event Representation Language (VERL) [117] for composite events was presented by Nevatia et al. [118]. They not only constructed a formal description by classifying human activities into a three-level hierarchy (e.g., primitive events, single-thread composite events, and multi-thread composite events), but also illustrated a heuristic algorithm to detect ongoing human activities from input images. One of the main drawbacks is the inability to describe complex compositions of activities due to the single/multiple-thread terminology. Significant progress has been made by Vu et al. [197] in recognizing long-term events: they adopted temporal constraint propagation techniques, exploiting spatial-temporal knowledge to tackle the activity recognition problem as a constraint satisfaction problem.

Ghanem et al. [49] applied Petri nets to model complicated activities and interactions in the traffic domain. In [55], work on activity classification focused on the recognition of atomic-level actions using 2D video; Gupta et al. used a context-free (AND-OR) grammar in their approach. However, the use of 2D video leads to relatively low accuracy even in the absence of clutter. Ryoo and Aggarwal [152, 154] presented a framework for human action modeling using Bayesian networks and context-free grammars to recognize composite activities (such as poses of body parts). Ryoo and Aggarwal [155] describe a method for recognizing complex human activities using a context-free grammar (CFG) based representation scheme, derived from [152]. In [153], object context is used for the recognition of human-object interaction.
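The temporal predicates that such approaches evaluate between subevent intervals can be illustrated with a few of Allen's interval relations (a small sketch of ours, not the PNF-network of [140]); the "push-and-fall" composite event below is hypothetical.

# Checking temporal predicates between detected subevent intervals,
# the basic operation of a description-based recognizer.
from typing import NamedTuple

class Interval(NamedTuple):
    start: float
    end: float

def before(a: Interval, b: Interval) -> bool:
    return a.end < b.start

def overlaps(a: Interval, b: Interval) -> bool:
    return a.start < b.start < a.end < b.end

def during(a: Interval, b: Interval) -> bool:
    return b.start < a.start and a.end < b.end

# Toy composite event: the push must precede or overlap the fall, and
# both must lie inside the tracked encounter of the two persons.
def push_and_fall(push, fall, encounter):
    ordered = before(push, fall) or overlaps(push, fall)
    contained = during(push, encounter) and during(fall, encounter)
    return ordered and contained

print(push_and_fall(Interval(2, 4), Interval(3.5, 6), Interval(0, 10)))  # True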
5 Human behavior understanding

Behavior understanding involves the analysis and recognition of motion patterns and the production of high-level descriptions of actions and interactions between or among objects. It is necessary to analyze the behaviors of people and determine whether their behaviors are normal or abnormal [82]. An automated visual surveillance system that can understand and learn behavior by observing activities in a video sequence requires a reliable combination of image processing techniques and artificial intelligence techniques [69, 209].
Detection of suspicious human behavior involves the modeling and classification of human activities with certain rules [26]. The idea is to partition the observed human movement into discrete states and then classify them appropriately. Partitioning of the observed movements is very application specific, and it is generally hard to predict what will constitute suspicious or endangering behavior [26, 69].

Action recognition is governed by the movement of gaits, and these movements can be described at several levels of abstraction. Different taxonomies have been proposed; here we consider the hierarchy used by Moeslund et al. [109], i.e., action primitive, action, and activity. An action primitive is an atomic movement that can be described at the limb level. An action consists of action primitives and describes a whole-body movement. Finally, activities contain a number of subsequent actions and yield an interpretation of the movement that is being performed. For instance, "raising the left hand" is an action primitive, whereas "waving the hand" is an action; "punching someone" is an activity that contains raising, waving, and punching actions.

Efros et al. [41] introduced a novel motion descriptor based on optical flow measurements in a spatiotemporal volume for each stabilized human figure and demonstrated the use of this descriptor in a nearest-neighbor querying framework.
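A hedged sketch of this flow-channel idea is given below (our illustration of the descriptor in [41], not the authors' code): dense optical flow on a person-centred window is split into half-wave rectified channels and blurred, and descriptors are compared by normalized cross-correlation. Person tracking and window stabilization are assumed to happen elsewhere, and the blur parameter is illustrative.

# Efros-style motion descriptor from dense optical flow.
import cv2
import numpy as np

def flow_descriptor(prev_gray, next_gray, blur_sigma=3):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification: four non-negative motion channels.
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    blurred = [cv2.GaussianBlur(c, (0, 0), blur_sigma) for c in channels]
    return np.concatenate([c.ravel() for c in blurred])

def similarity(desc_a, desc_b):
    """Normalized cross-correlation for nearest-neighbor querying."""
    a = (desc_a - desc_a.mean()) / (desc_a.std() + 1e-8)
    b = (desc_b - desc_b.mean()) / (desc_b.std() + 1e-8)
    return float(a @ b) / len(a)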
Yilmaz and Shah [218] proposed spatiotemporal volumes (STVs) to solve the point correspondence problem between consecutive frames, and then recognized actions based on descriptors computed by analyzing the differential geometric properties of the STV. Gorelick et al. [53] represented actions as space-time shapes, which contain both spatial and dynamic information. This method is fast, does not require prior video alignment, and achieves reliable action recognition performance.

Laptev and Lindeberg [86] extended the 2D Harris corner detector to a 3D Harris detector, which detects regions having significant local variation in both the spatial and temporal dimensions. This representation has been successfully applied to human action recognition in combination with an SVM classifier [164]. Oikonomopoulos et al. [126] extended the concept of saliency from the spatial to the spatiotemporal domain by using a sparse set of spatiotemporal features. Dollar et al. [39] improved on the 3D Harris detector using a convolution with a Gabor filter in time. Combining with some global information, Wong and Cipolla [206] presented a method to extract interest points using global information such as the organization of pixels in a whole video. Nowozin et al. [122] used a sequential representation that retains the temporal order. A set of partial motion models was combined into a global model of human motion for motion recognition by Filipovych et al. [45].
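The temporal-Gabor response underlying detectors of this family can be sketched compactly (our reading of the detector in [39]; filter support, frequency, and threshold values here are illustrative, not the published settings):

# Spatio-temporal interest points via a quadrature pair of 1D temporal
# Gabor filters on a spatially smoothed video volume; local maxima of
# the response R mark candidate interest points.
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter, maximum_filter

def cuboid_response(video, sigma=2.0, tau=1.5):
    """video: (T, H, W) float array; returns the response volume R."""
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial only
    t = np.arange(-6, 7, dtype=float)            # temporal filter support
    omega = 4.0 / tau
    envelope = np.exp(-t ** 2 / tau ** 2)
    h_even = -np.cos(2 * np.pi * t * omega) * envelope
    h_odd = -np.sin(2 * np.pi * t * omega) * envelope
    even = convolve1d(smoothed, h_even, axis=0)  # temporal filtering
    odd = convolve1d(smoothed, h_odd, axis=0)
    return even ** 2 + odd ** 2                  # quadrature energy

def interest_points(video, threshold):
    R = cuboid_response(np.asarray(video, dtype=float))
    peaks = (R == maximum_filter(R, size=5)) & (R > threshold)
    return np.argwhere(peaks)                    # (t, y, x) coordinates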
A probabilistic context-free grammar (PCFG) was adopted by Ogale et al. [123] for view-invariant action recognition; the PCFG was constructed after background removal, silhouette extraction, and keyframe extraction. Parameswaran [132] developed a high-level representation and an efficient recognition algorithm for human actions that is resistant to variation in viewpoint, using projective invariants of landmark points on the human body; however, this method needed 3D motion capture for creating the action model database. Motion History Volumes (MHV) [201, 203] were introduced as a free-viewpoint representation for human actions in the case of multiple calibrated and background-subtracted video cameras. This representation was used to recognize basic human actions independently of gender, body size, and viewpoint. Weinland et al. [202] proposed a 3D exemplar-based hidden Markov model (HMM); the exemplars were represented in 3D as visual hulls computed using a system of 5 calibrated cameras, and the method can recognize a single video or multiple videos. Lv and Nevatia [102] modeled actions as a graph model (Action Net) and used an enhanced Pyramid Match Kernel algorithm for fast matching between two similar feature sets.

After successfully tracking the movement of a human subject from one frame to another in an image sequence, the problem of understanding human behavior from the image sequence becomes apparent. Behavior understanding involves action recognition and description, and may simply be considered a classification problem over time-varying feature data, i.e., matching an unknown test sequence with a group of labelled reference sequences representing typical human actions. However, when a more complete description is needed, other approaches are convenient: dynamic time warping [72], hidden Markov models [6], or neural networks [47], followed by an action recognition step and a semantic description phase. Existing techniques can be grouped into the following types based on the nature of the algorithms used: naive Bayes probabilistic models [40], hidden Markov models [6], fuzzy logic, K-NN (K-Nearest Neighbors) [3], dynamic time warping [72], Sequential Minimal Optimization (SMO) [141], trees and decision rules such as CART (Classification and Regression Trees) [219], programs for machine learning [145], RIPPER [27], and neural networks [47].
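The template-matching view of this classification problem is easy to make concrete. The sketch below (a minimal illustration, not code from [72]) computes the standard DTW alignment cost between feature sequences and labels an unknown sequence by its nearest labelled reference, i.e., a 1-NN classifier over DTW distance.

# Dynamic time warping distance and 1-NN action labelling.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a: (m, d), seq_b: (n, d) feature sequences."""
    m, n = len(seq_a), len(seq_b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def classify(query, references):
    """references: list of (label, sequence) pairs; returns nearest label."""
    return min(references, key=lambda r: dtw_distance(query, r[1]))[0]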
5.1 Supervised model

Abnormal behavior is detected in the video based on predefined activities and their classes; only those abnormal and normal behaviors already existing in the database can be identified. In reality, however, abnormal behavior is difficult to define clearly for certain domains of activity.

5.2 Unsupervised model

More recently, a number of techniques have been proposed for unsupervised learning of behavior models. They
can be further categorized into two different types according to whether an explicit model is built. Approaches that do not model behavior explicitly perform clustering on observed patterns and label those forming small clusters as abnormal. Learning from unlabeled data still remains one of the most challenging problems in the fields of computer vision and machine learning; the challenge mainly results from the fact that unlabeled images contain various objects, which change in pose, scale, and degree of occlusion. Significant progress has been made in [44, 57, 200], which propose several unsupervised techniques for learning object models as constellations of features. Weber et al. [200] represent an object as a constellation of parts, and Fergus et al. [44] extend the model to account for variability in appearance. Their results are encouraging but were achieved on a prepared set of images with the object of interest always in the foreground and covering a large part of the image. Cooccurrences of features are introduced to separate different object classes, as in [58, 90, 134, 146, 176]. In [134, 176], the authors define neighbors of features to represent objects. Hoiem et al. [58] and Rabinovich et al. [146] model relationships for robust object detection or image labeling, but require manually localizing the objects in training images. Leordeanu et al. [90] link temporally dependent entities by observing their cooccurrence, which makes the method more appropriate for object tracking. In [215], the authors propose a maximum likelihood algorithm for unsupervised shared structure learning, where shared structures are represented as strongly connected clusters of consistent pairwise spatial relationships.
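The "small clusters are abnormal" idea mentioned above can be sketched in a few lines (our hedged illustration; the features, cluster count, and rarity threshold are all invented for the example):

# Flag motion patterns that fall into rarely populated clusters.
import numpy as np
from sklearn.cluster import KMeans

def flag_abnormal(patterns, n_clusters=8, min_fraction=0.05):
    """patterns: (N, d) array of motion-pattern feature vectors."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(patterns)
    counts = np.bincount(labels, minlength=n_clusters)
    rare = counts < min_fraction * len(patterns)
    return rare[labels]  # boolean mask: True = pattern in a small cluster

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 4))    # dense, frequently seen patterns
outliers = rng.normal(6, 0.5, size=(5, 4))  # a small, isolated cluster
mask = flag_abnormal(np.vstack([normal, outliers]))
print(mask[-5:])  # the isolated patterns are likely flagged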
5.3 Semisupervised model

A recent work by Xu et al. [211] focused on semisupervised learning techniques to recognize both simple and complex activities. Compared to supervised techniques, the major contribution of [211] is to alleviate the requirement of a large training data set. Hue Thi et al. [183] described a framework called the weakly supervised approach, which attempts to recognize human actions by integrating local features both spatially and temporally. They tested their technique on the KTH, HoHA1, and TRECVid datasets and compared results with existing supervised techniques.

6 Description of dataset

In this section, we discuss some of the datasets that are currently used by many action recognition approaches as benchmarks (see Table 4).

6.1 Dataset defined for controlled environment

Several public datasets have been introduced in the past 10 years [158], encouraging researchers to explore various action recognition directions. The KTH dataset [164] and the Weizmann dataset [53] are typical examples. These two single-camera datasets have been designed for research purposes, providing a standard for researchers to compare their action classification performances. The datasets are composed of videos of relatively simple periodic actions, such as walking, jogging, and running. The videos are segmented temporally so that each clip contains no more than one action of a single person.

Table 4 Comparison of characteristics of datasets

Specification | KTH | Weizmann | HOHA1 | TRECVID | VIRAT | PETS | i-Lids | UT-Int.a

# of event types | 6 | 10 | 8 | 10 | 23 | 3 | 4 | 6
Avg. # of samples per class | 100 | 9 | 85 | 3–1670 | 10–1500 | N/A | N/A | 8
Max. resolution | 160 × 120 | 180 × 144 | 540 × 240 | 720 × 576 | 1920 × 1080 | 768 × 576 | 720 × 576 | 720 × 480
Human height in pixels | 80–100 | 60–70 | 100–1200 | 20–200 | 20–180 | 20 | N/A | 200
Human to video height ratio | 65–85 % | 42–50 % | 50–500 % | 4–36 % | 2–20 % | N/A | N/A | N/A
# of scenes | N/A | N/A | many | 5 | 16 | 8 | 7 | 20
Viewpoint type | side | side | varying | 5/varying | varying | varying | side | side
Natural background clutter | No | No | Yes | Yes | Yes | Yes | No | Yes
Incidental objects/activities | No | No | Yes, varying | Yes | Yes | Yes | Yes | Yes
End-to-end activities | No | No | Yes, varying | Yes | Yes | Yes | N/A | N/A
Tight bounding boxes | cropped | cropped | No | No | Yes | No | N/A | N/A
Multiple annotations on movers | No | No | No | No | Yes | Yes | N/A | Yes
Camera motion | No | No | varying | No | varying | varying | N/A | No

a UT-Interaction
They were taken in a controlled environment; their backgrounds and lighting conditions are mostly uniform. In general, they have good image resolution and little camera jitter.

Fig. 8 Example KTH dataset corresponding to different types of actions and scenarios (based on Schuldt's work [164])

Fig. 9 Example of WEIZMANN dataset (derived from Gorelick's work [53])

6.1.1 KTH dataset

The KTH dataset contains six actions, performed by 25 actors under four different scenarios of illumination, appearance, and scale changes; in total, it contains 598 video sequences. The six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) are performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3), and indoors (s4), as illustrated in Fig. 8. There are 25 × 6 × 4 = 600 combinations of 25 subjects, 6 actions, and 4 scenarios. All videos were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The videos were down-sampled to a spatial resolution of 160 × 120 pixels and have a duration of 4 seconds on average.
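In practice this 25 × 6 × 4 structure can be recovered directly from file names; the sketch below assumes the naming convention of the public KTH release (to our knowledge, personXX_action_dY_uncomp.avi), which should be verified against the downloaded copy.

# Index KTH clips by (subject, action, scenario) from file names.
import os
import re

PATTERN = re.compile(r"person(\d{2})_([a-z]+)_d(\d)_uncomp\.avi")

def index_kth(root):
    """Map (subject, action, scenario) -> path for every clip under root."""
    index = {}
    for name in os.listdir(root):
        m = PATTERN.match(name)
        if m:
            subject, action, scenario = int(m.group(1)), m.group(2), int(m.group(3))
            index[(subject, action, scenario)] = os.path.join(root, name)
    return index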
6.1.2 Weizmann

This dataset contains 90 low-resolution (180 × 144, 50 fps) video sequences showing nine different people, each performing 10 natural actions: "run," "walk," "skip," "jumping-jack" (or shortly "jack"), "jump-forward-on-two-legs" (or "jump"), "jump-in-place-on-two-legs" (or "pjump"), "gallop-sideways" (or "side"), "wave-two-hands" (or "wave2"), "wave-one-hand" (or "wave1"), and "bend." Figure 9 shows examples of the Weizmann dataset.

6.2 Dataset defined for realistic environment

Recently, more challenging datasets have been constructed by collecting realistic videos from movies [77, 88, 150]. These movie scenes are taken from varying viewpoints with complex backgrounds, in contrast to the previous public datasets [53, 164]. These datasets encourage the development of recognition systems that are reliable under noise and viewpoint changes. However, even though these videos were taken in more realistic environments, the complexity of the actions themselves is similar to [53, 164]: the datasets contain simple instantaneous actions such as kissing and hitting. They were not designed to test recognition of high-level human activities from continuous sequences.

6.2.1 HOHA1

The Hollywood Human Action (HoHA1) dataset contains 8 action classes: Lying, StandUp, HugPerson, Kiss, GetOutCar, HandShake, Sitdown, and SitUp. The action classes are distributed over 430 training and 448 testing videos. The background of this dataset is highly complex in nature and contains cluttered scenarios. Figure 10 shows examples from the HoHA1 dataset of Laptev et al. [89].

Fig. 10 Eight example scenes in HOHA1 Dataset [89]

6.2.2 TRECVid

The TREC Video Retrieval Evaluation (TRECVid) [177] is an international benchmarking activity to encourage research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Figure 11 shows a snapshot of this dataset.
Fig. 11 Snapshots from the Event Detection track TRECVid dataset [184]

6.2.3 VIRAT

This is a new large-scale video dataset [124] designed to assess the performance of diverse visual event recognition algorithms, with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [53, 164]. Datasets have been developed for movies [88] and sports [98], but these actions and scene conditions do not transfer effectively to surveillance videos. The VIRAT dataset consists of many outdoor scenes with actions occurring naturally, performed by nonactors, in continuously captured videos of the real world. It includes large numbers of instances for 23 event types distributed throughout 29 hours of video. The data is accompanied by detailed annotations, which include both moving object tracks and event examples, providing a solid basis for large-scale evaluation. Examples extracted from the VIRAT dataset can be seen in Fig. 12.

Fig. 12 Six example scenes in VIRAT Video Dataset (total 16 scenes) from [124]

6.2.4 PETS

The performance evaluation of tracking and surveillance (PETS) [62] program organizes regular workshops and provides a benchmark dataset (see Fig. 13). This dataset addresses the problem of group activities such as crowd image analysis (i.e., crowd count and density estimation, tracking of individual(s) within a crowd, and detection of separate flows and specific crowd events) within a public space. PETS 2009 allows researchers to evaluate new or existing detection techniques on a dataset captured in a real-world environment. The dataset scenarios were filmed from multiple cameras and involve multiple actors.

Fig. 13 Example of PETS 2009 dataset containing crowd scenarios with increasing scene complexity (taken from PETS [62]). First row: the left image concerns person count and density estimation, and the right image addresses people tracking. Second row: the left and right images involve event recognition and flow analysis, respectively

6.2.5 i-Lids

The image library for intelligent detection systems (i-LIDS) (see Fig. 14, [63]) is the UK government's benchmark for Video Analytics (VA) systems, developed in partnership with the Center for the Protection of National Infrastructure (CPNI).

Fig. 14 Snapshot of i-Lids dataset (similar to CPNI's work [63])
There are currently five scenarios within i-LIDS: sterile zone monitoring, parked vehicle detection, abandoned baggage detection, doorway surveillance, and a multiple camera tracking scenario. Each of these scenarios is made up of three datasets. The datasets for the event detection scenarios each contain approximately 24 hours of footage, filmed to represent all weather, time-of-day, and scene densities expected within the scenario. The multiple camera tracking scenario datasets each contain approximately 50 hours of real-world footage.

Each dataset consists of two or three camera views, referred to as stages, and is further segmented into shorter video clips of 30 to 60 minutes. The training dataset is further split into individual events. Each dataset is supplied with a user guide detailing the library structure, the user interface, and the procedure used to evaluate systems against the relevant scenario.

6.2.6 UT-Interaction

The UT-Interaction dataset is designed to encourage the detection of high-level human activities (e.g., hand shaking), which are more complex than the previous simple actions. In addition, it encourages spatial and temporal localization of multiple activities involving multiple persons and objects [156, 198, 222] from continuous video streams. The UT-Interaction dataset [157] contains videos of continuous executions of 6 classes of human-human interactions: hand-shake, point, hug, push, kick, and punch. Figure 15 shows example snapshots of these multiperson activities. Ground truth labels for all interactions in the dataset videos are provided, including time intervals and bounding boxes. There are a total of 20 video sequences whose lengths are around 1 minute. Each video contains at least one execution per interaction, providing about 8 executions of human activities per video on average. Several actors with more than 15 different clothing conditions appear in the videos. The videos are taken at a resolution of 720 × 480 and 30 fps, and the height of a person in the video is about 200 pixels.

Fig. 15 Example of UT-interaction dataset (based on Ryoo's work [157])

7 Economical impact on video surveillance market

The projected compound annual growth of the video surveillance market is $67.07 million in the United States of America and about $188.3 million in Europe (see Fig. 16).

Fig. 16 Projection of the compound annual growth rate of the video surveillance market (Courtesy IMS Research [64])

In the face of a global recession, the security and video surveillance industries continue to remain unaffected and are even growing. According to IMS Research, the demand for CCTV and video surveillance equipment should remain strong throughout 2012, with potential growth for the surveillance equipment market predicted to exceed 25 %. In 2010, analog sales were notably low, but leaps and bounds were made in network video surveillance technology, causing the market that year to increase by more than 300 %. Two probable causes for this are:

– The demand for security and video surveillance continues to be very relevant, if not increasing in relevancy. The undeniable return on investment and government stimulus spending have spurred video surveillance industry sales, which continued to grow throughout 2010 and 2011. Video surveillance has proven itself in decreasing crime as well as in compiling valuable evidence in the aftermath.
– The recent economic downturn did not affect all countries worldwide equally. Technological transitions from lower-value video surveillance equipment to more advanced network-based surveillance continued to remain strong and even bolstered the video surveillance market as a whole.

As shown in Fig. 17, it is clear that activity recognition is an active research topic. In fact, there have been three times as many publications in the last 5 years as the number of all publications found before 2005. As we enter 2012 amidst a sea of predictions, forecasts, and top-ten trends for the security surveillance industry, one thing that remains constant is the incredible technological innovation that we have seen in the past year, and which will evolve in 2012.
S. Vishwakarma, A. Agrawal

Fig. 17 Increasing interest in human activity recognition research is shown through a comparison of the number of papers published up to 2011

8 Practical issues in real world scenario

Despite much advancement in the field of automated surveillance, there are still challenges to realizing it in practical real-world conditions, due to the following issues.

8.1 Robustness

Real-world scenarios are characterized by sudden or gradual changes in the input statistics. A major challenge for real-world object detection and tracking is the dynamic nature of real-world conditions with respect to illumination, motion, visibility, weather changes, etc. Achieving robust algorithms is a challenge especially (a) under illumination variation due to weather conditions or lighting changes, for example, in an outdoor scene due to the movement of clouds in the sky, and in an indoor scene due to the opening of doors or windows; (b) under view changes; (c) in the case of multiple objects with partial or complete occlusion or deformation; (d) in the presence of articulated or nonrigid objects; (e) in the case of shadows, reflections, and clutter; and (f) with video noise (e.g., Gaussian white noise). Other difficult scenarios include (a) low light and illumination variation, (b) a boat among moving waves, and (c) a car against a moving background of vegetation. Significant research and advancement in solving these difficulties has been achieved, but the problem remains unsolved in generic situations with dynamically varying environmental conditions, and there is a lack of a generic multimodal framework for achieving system robustness by data fusion.

8.2 Intelligent

With the advances in sensor technology, surveillance cameras and sound recording systems are already available in banks, hotels, stores, highways, and shopping centers, and the captured video data are monitored by security guards and stored in archives for forensic evaluation. In a typical system, a security guard watches 16 video channels at the same time and may miss many important events. There is a need for intelligent, (semi-)automated video analysis paradigms to assist the operators in scene analysis and event classification. Event detection is a key component for providing timely warnings to alert security personnel. It deals with mapping motion patterns to semantics (e.g., benign and suspicious events). However, detecting semantic events from low-level video features is a major challenge in real-world situations due to the unlimited possibilities of motion patterns and behaviors, leading to the well-known semantic gap issue. Furthermore, suspicious motion events in surveillance videos happen rather infrequently, and the limited amount of training data poses additional difficulties in detecting these so-called rare events.

8.3 Real timeliness

A useful processing algorithm for surveillance systems should be real time, i.e., it should output information, such as events, as they occur in the real scene. Requirements of accuracy and robustness result in computationally intensive and complex algorithm designs, which makes real-time implementation of a system a difficult task.

8.4 Cost effective

A cost-effective framework is required for feasible deployment in a wide variety of real-world surveillance applications, ranging from indoor intrusion detection to outdoor surveillance of important buildings.

9 Conclusion

Automated detection of ongoing activities and behavior analysis has become an active research area in the computer vision field. It is strongly driven by many promising applications such as smart surveillance, intelligent robots, virtual reality, action-based human-computer interfaces, detection of unusual activities and prevention of terrorism, and personal safety of
passengers and the public. Technical development in this field has demonstrated how it deals with challenges involving complex human movements. It is exciting to see many researchers gradually spreading their achievements into more intelligent practical applications.

We have presented a general processing framework for human activity recognition systems and discussed the recent developments explored at the various stages of such a system. The state of the art of existing methods for each key issue is described, with a focus on three major tasks: detection, tracking, and activity recognition or behavior understanding. We have also discussed the publicly available datasets built for the uniform testing of the methodologies proposed by authors, providing a brief description of the datasets and a comparison of their characteristics.

In this survey, a brief overview of all preprocessing steps (i.e., detection, classification, and tracking) has been included. There are many limitations and open challenges, which we have highlighted by providing the comparison. Motion detection in dynamic scenes is a difficult task in the presence of illumination and weather changes, shadow detection, self-occlusion, and complete occlusion. Fast and accurate segmentation methods are still needed, as they affect the performance of the later stages.

We have discussed the various object tracking techniques for humans, groups of people, moving vehicles, etc. The following six approaches have been studied intensively in past works: region-based, contour-based, feature-based, model-based, hybrid, and optical flow. Tracking represents a class of problems that are both computation and data intensive. During the study, we found that the hybrid approach may be a better choice for overcoming occlusions of tracked entities, clutter in the background, and losing the object due to rapid movements. More research is needed on tracking humans in crowded scenes, and robust improvements are required to re-acquire a tracked human or group of humans across multiple perspectives. Tracking multiple persons or a group of people is difficult due to crowded environments, poor illumination, noisy images, and camera movements. We have explored various methods for the recognition of activities of a single person and activities of a crowd as a whole (or as subgroups), and found that space-time approaches have proven to yield good results for periodic actions and gestures. They usually do not require background subtraction; however, their applicability is limited to slow-speed motion variation, and they cannot deal with occlusions. The space-time volume representation is restricted in recognizing actions when multiple people appear in the scene, and these methods are computationally very expensive for accurate localization of actions, as they do not support view invariance. Space-time trajectories do a better job of representing and recognizing human movements from different viewing angles, but 3-D body modeling is still in the research pipeline, and most of the reported work has been restricted to fixed and known viewpoints. Multiple view-dependent actions require greater training complexity. Space-time feature approaches are reliable under noise and illumination changes and can recognize multiple activities, but they are not suited to modeling complex, nonperiodic, and view-invariant representations of activities. While several works (e.g., [50]) have addressed the recognition of complex activities, it still remains a challenge.

Sequential approaches are able to detect complex and nonperiodic activities by using sequential features. Exemplar-based methods are more flexible and require less training data in comparison to state-based methods; the problem with state-based approaches is that as the complexity level of the activities increases, they need greater amounts of training data.

Description-based approaches do well at recognizing high-level activities whose subevents are organized concurrently as well as sequentially, in comparison to the statistical or syntactic approaches. The statistical and syntactic approaches can effectively handle activity videos polluted with noise.

References

1. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73(3), 428–440 (1999)
2. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: a review. ACM Comput. Surv. 43(3), 1–43 (2011)
3. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 06, 37–66 (1991)
4. Allili, M.S., Bouguila, N., Ziou, D.: A robust video foreground segmentation by using generalized Gaussian mixture modeling. In: 4th Canadian Conf. on Computer and Robot Vision, pp. 503–509 (2007)
5. Bayona, A., SanMiguel, J.C., Martínez, J.M.: Stationary foreground detection using background subtraction and temporal difference in video surveillance. In: IEEE 17th Int. Conf. on Image Processing, pp. 1–4 (2010)
6. Blunsom, P.: Hidden Markov models. Tech. rep, Human Language Technology University of Melbourne, Victoria, Australia (2004). http://www.cs.mu.oz.au/460/2004/materials/hmm-tutorial.pdf
7. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23(3), 257–267 (2001)
8. Bobick, A.F., Wilson, A.D.: A state-based approach to the representation and recognition of gesture. IEEE Trans. Pattern Anal. Mach. Intell. 19(12), 1325–1337 (1997)
9. Bose, B., Grimson, E.: Improving object classification in far-field video. In: Proc. of the Int. Conf. on Computer Vision and Pattern Recognition, pp. 181–188. IEEE Computer Society, Washington (2004)
10. Brown, L.M.: View independent vehicle/person classification. In: Proc. of the ACM 2nd Int. Workshop on Video Surveillance & Sensor Networks, pp. 114–123. ACM Press, New York (2004)
11. Bucak, S.S., Gunsel, B., Gursoy, O.: Incremental nonnegative matrix factorization for background modeling in surveillance video. In: IEEE 15th Signal Processing and Communications Applications (SIU), pp. 1–4 (2007)
12. Cai, L., He, L., Yamashita, T., Xu, Y., Zhao, Y., Yang, X.: Robust contour tracking by combining region and boundary information. IEEE Trans. Circuits Syst. Video Technol. 21(12), 1784–1794 (2011)
13. Campbell, L., Bobick, A.: Recognition of human body motion using phase space constraints. In: ICCV, pp. 624–630 (1995)
14. Camplani, M., Salgado, L.: Adaptive background modeling in multicamera system for real-time object detection. Opt. Eng. 50(12), 1–17 (2011)
15. Cavallaro, A., Steiger, O., Ebrahimi, T.: Tracking video objects in cluttered background. IEEE Trans. Circuits Syst. Video Technol. 15(4), 575–584 (2005)
16. Chai, Y., Shin, S., Chang, K., Kim, T.: Real-time user interface using particle filter with integral histogram. IEEE Trans. Consum. Electron. 56(2), 510–515 (2010)
17. Chang, S.F.: The holy grail of content-based media analysis. IEEE Multimed. 9(2), 6–10 (2002)
18. Chen, L., Yang, H., Takaki, T., Ishii, I.: Real-time frame-straddling-based optical flow detection. In: Proc. of IEEE Int. Conf. on Robotics and Biomimetics, pp. 2447–2452 (2011)
19. Chen, Q., Sun, Q.S., Heng, P.A., Xia, D.S.: Two-stage object tracking method based on kernel and active contour. IEEE Trans. Circuits Syst. Video Technol. 20(4), 605–609 (2010)
20. Chen, Y., Zhang, L., Lin, B., Xu, Y., Ren, X.: Fighting detection based on optical flow context histogram. In: Proc. of IEEE 2nd Int. Conf. on Innovations in Bio-inspired Computing and Applications, pp. 95–98 (2011)
21. Cheng, F.H., Chen, Y.L.: Real time multiple objects tracking and identification based on discrete wavelet transform. Pattern Recognit. 39, 1126–1139 (2006)
22. Cheung, K., Baker, S., Kanade, T.: Shape-from-silhouette across time part II: applications to human modeling and markerless motion tracking. Int. J. Comput. Vis. 63(3), 225–245 (2005)
23. Chiverton, J., Mirmehdi, M., Xie, X.: On-line learning of shape information for object segmentation and tracking. In: Proc. of British Machine Vision Conference, pp. 1–11 (2009)
24. Chiverton, J., Xie, X., Mirmehdi, M.: Automatic bootstrapping and tracking of object contours. IEEE Trans. Image Process. 21(3), 1231–1245 (2012)
25. Chomat, O., Crowley, J.L.: Probabilistic recognition of activity using local appearance. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 637–663 (1999)
26. Cohen, C.J., Morelli, F., Scott, K.A.: A surveillance system for recognition of intent within individuals and crowds. In: Conf. on Technologies for Homeland Security, Waltham, MA, pp. 559–565. IEEE Press, New York (2008)
27. Cohen, W.W.: Fast effective rule induction. In: Proc. of 12th Int. Conf. on Machine Learning, pp. 115–123. Morgan Kaufmann, San Mateo (1995)
28. Coifman, B., Beymer, D., McLauchlan, P., Malik, J.: A real-time computer vision system for vehicle tracking and traffic surveillance. Transp. Res., Part C, Emerg. Technol. 6(4), 271–288 (1998)
29. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A system for video surveillance and monitoring. Tech. rep, Robotics Institute at Carnegie Mellon University (2000)
30. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564–577 (2003)
31. Cupillard, F., Bremond, F., Thonnat, M.: Group behavior recognition with multiple cameras. In: Proc. 6th IEEE Workshop on Applications of Computer Vision, pp. 177–183 (2002)
32. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 781–796 (2000)
33. Dai, P., Di, H., Dong, L., Tao, L., Xu, G.: Group interaction analysis in dynamic context. IEEE Trans. Syst. Man Cybern. 38(1), 275–282 (2008)
34. Damen, D., Hogg, D.: Recognizing linked events: searching the space of feasible explanations. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 927–934 (2009)
35. Darrell, T., Pentland, A.: Space-time gestures. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 335–340 (1993)
36. Denman, S., Chandran, V., Sridharan, S.: Adaptive optical flow for person tracking. In: Proc. of the Digital Imaging Computing: Techniques and Applications, DICTA '05, pp. 1–7 (2005)
37. Denman, S., Chandran, V., Sridharan, S.: An adaptive optical flow technique for person tracking systems. Pattern Recognit. Lett. 28(10), 1232–1239 (2007)
38. Denman, S., Fookes, C., Sridharan, S.: Improved simultaneous computation of motion detection and optical flow for object tracking. In: IEEE Digital Image Computing: Techniques and Applications, pp. 175–182 (2009)
39. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Int. Conf. on Computer Communications and Networks, vol. 14, pp. 65–72. IEEE Press, New York (2005)
40. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, Stanford Research Institute, Menlo Park (1973)
41. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. 9th IEEE Int. Conf. on Computer Vision, vol. 2, pp. 726–733 (2003)
42. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Frame-Rate Workshop, pp. 751–767. IEEE Press, New York (2000)
43. Fazli, S., Pour, H.M., Bouzari, H.: Multiple object tracking using improved GMM based motion segmentation. In: IEEE ECTI-CON, vol. 2, pp. 1130–1133 (2009)
44. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 264–271 (2003)
45. Filipovych, R., Ribeiro, E.: Combining models of pose and dynamics for human motion recognition. In: 3rd International Springer Symposium on Advances in Visual Computing, Aberdeen, Scotland, pp. 21–32 (2007)
46. Forsyth, D.A., Arikan, O., Ikemoto, L., O'Brien, J., Ramanan, D.: Computational studies of human motion: part 1, tracking and motion synthesis. Found. Trends Comput. Graph. Vis. 1(02/03), 77–254 (2005)
47. Gallagher, M., Downs, T.: Visualization of learning in multilayer perceptron networks using principal component analysis. IEEE Trans. Syst. Man Cybern. 33, 28–34 (2003)
48. Gavrilla, D., Davis, L.: 3D Model-based tracking of humans in action: a multi-view approach. In: Int. Proc. of the Computer Vision and Pattern Recognition, pp. 73–80 (1996)
49. Ghanem, N., DeMenthon, D., Doermann, D., Davis, L.: Representation and recognition of events in surveillance video using Petri nets. In: Conf. on Computer Vision and Pattern Recognition Workshop, pp. 112–121 (2004)
50. Gilbert, A., Illingworth, J., Bowden, R.: Fast realistic multi-action recognition using mined dense spatio-temporal features. In: IEEE 12th Int. Conf. on Computer Vision, pp. 925–931 (2009)
51. Girisha, R., Murali, S.: Tracking humans using novel optical flow algorithm for surveillance videos. In: Proceedings of the 4th Annual ACM Bangalore Conf., COMPUTE '11, pp. 1–8 (2011)
52. Gong, S., Xiang, T.: Recognition of group activities using dynamic probabilistic networks. In: Proc. 9th IEEE Int. Conf. on Computer Vision, vol. 2, pp. 742–749 (2003)
53. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2247–2253 (2007)
54. Gupta, A., Davis, L.S.: Objects in action: an approach for combining action understanding and object perception. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
55. Gupta, A., Srinivasan, P., Shi, J., Davis, L.S.: Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2012–2019 (2009)
56. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 309–330 (2000)
57. Heisele, B., Ho, P., Wu, J., Poggio, T.: Face recognition: component-based versus global approaches. Comput. Vis. Image Underst. 91, 6–21 (2003)
58. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. Int. J. Comput. Vis. 80, 3–15 (2008)
59. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 34(3), 334–352 (2004)
60. Hu, W., Xie, D., Tan, T., Maybank, S.: Learning activity patterns using fuzzy self-organizing neural network. IEEE Trans. Syst. Man Cybern. 34(3), 1618–1626 (2004)
61. Huang, J., et al.: GPU-accelerated computation for robust motion tracking using the CUDA framework. In: Int. Conf. on Visual Information Engineering, vol. 5, pp. 437–442 (2008)
62. 11th IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance (2009). http://www.cvg.rdg.ac.uk/PETS2009/authors.html
63. Imagery Library for Intelligent Detection Systems (2010). http://www.ilids.co.uk
64. IMS Research. http://www.imsresearch.com/
65. Ince, S., Konrad, J.: Occlusion-aware optical flow estimation. IEEE Trans. Image Process. 17(8), 1443–1451 (2008)
66. Intille, S.S., Bobick, A.F.: A framework for recognizing multi-agent action from visual evidence. In: AAAI-99, pp. 518–525. AAAI Press, Menlo Park (1999)
67. Ishii, I., Taniguchi, T., Yamamoto, K., Takaki, T.: 1000 fps real-time optical flow detection system. Proc. SPIE 7538, 1–11 (2010)
68. Ivanov, Y.A., Bobick, A.F.: Recognition of visual activities and interactions by stochastic parsing. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 852–872 (2000)
69. Jan, T.: Neural network based threat assessment for automated visual surveillance. In: Int. Joint Conf. on Neural Networks, vol. 2, pp. 1309–1312. IEEE Press, New York (2004)
70. Jang, D.S., Choi, H.I.: Active models for tracking moving objects. Pattern Recognit. 33(7), 1135–1146 (2000)
71. Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: Proc. of the 7th European Conference on Computer Vision, pp. 343–357. Springer, London (2002)
72. Jeong, Y.S., Jeong, M.K., Omitaomu, O.A.: Weighted dynamic time warping for time series classification. Pattern Recognit. 44, 2231–2240 (2011)
73. Jiang, H., Drew, M.S., Li, Z.N.: Successive convex matching for action detection. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 1646–1653 (2006)
74. Joo, S.W., Chellappa, R.: Attribute grammar-based event recognition and anomaly detection. In: Conference on Computer Vision and Pattern Recognition Workshop, CVPRW '06, pp. 107–114 (2006)
75. Kameda, Y., Minoh, M.: A human motion estimation method using 3-successive video frames. In: Proc. of Int. Conf. on Virtual Systems, pp. 135–140 (1996)
76. Kang, W., Deng, F.: Research on intelligent visual surveillance for public security. In: 6th Int. Conf. Comput. and Inf. Sci, pp. 824–829. IEEE/ACIS, Melbourne (2007)
77. Ke, Y., Sukthankar, R., Hebert, M.: Spatio-temporal shape and flow correlation for action recognition. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
78. Khan, S.M., Shah, M.: Detecting group activities using rigidity of formation. In: Proc. of the 13th Annual ACM Int. Conf. on Multimedia, pp. 403–406 (2005)
79. Kim, H., Sakamoto, R., Kitahara, I., Toriyama, T., Kogure, K.: Robust silhouette extraction technique using background subtraction. In: 10th Meeting on Image Recognition and Understand (MIRU), Hiroshima, Japan, pp. 1–6 (2007)
80. Kim, J.B., Kim, H.J.: Efficient region-based motion segmentation for a video monitoring system. Pattern Recognit. Lett. 24(1/3), 113–128 (2003)
81. Kim, T.K., Im, J.H., Paik, J.K.: Video object segmentation and its salient motion detection using adaptive background generation. IEEE Power Electron. Lett. 45(11), 542–543 (2009)
82. Ko, T.: A survey on behavior analysis in video surveillance for homeland security applications. In: AIPR, pp. 1–8. IEEE Press, New York (2008)
83. Kuno, Y., Watanabe, T., Shimosakoda, Y., Nakagawa, S.: Automated detection of human for visual surveillance system. In: Proc. of the Int. Conf. on Pattern Recognition, ICPR '96, pp. 865–869. IEEE Computer Society, Washington (1996)
84. Ladikos, A., Benhimane, S., Navab, N.: A realtime tracking system combining template-based and feature-based approaches. In: VISAPP (2007)
85. Lalos, C., Anagnostopoulos, V.: Hybrid tracking approach for assistive environments. In: In Int. Conf. Proc. Series, 05, vol. 39/64. ACM Press, New York (2009)
86. Laptev, I.: On space-time interest points. Int. J. Comput. Vis. 64(2–3), 107–123 (2005)
87. Laptev, I., Lindeberg, T.: Space-time interest points. In: Proc. 9th IEEE Int. Conf. on Computer Vision, pp. 432–439 (2003)
88. Laptev, I., Perez, P.: Retrieving actions in movies. In: Proc. of the 11th IEEE Int. Conf. on Computer Vision, pp. 1–8 (2007)
89. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2008)
90. Leordeanu, M., Collins, R.: Unsupervised learning of object features from video sequences. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, Washington, DC, USA, vol. 1, pp. 1142–1149 (2005)
91. Li, X., Hu, W., Zhang, Z., Zhang, X.: Robust foreground segmentation based on two effective background models. In: Proc. of the 1st ACM Int. Conf. on Multimedia Information Retrieval, MIR '08, pp. 223–228. ACM Press, New York (2008)
92. Liao, H.H., Chang, J.Y., Chen, L.G.: A localized approach to abandoned luggage detection with foreground-mask sampling. In: Proc. of the IEEE 5th Int. Conf. on Advanced Video and Signal Based Surveillance, AVSS'08, pp. 132–139. IEEE Computer Society, Washington (2008)
93. Lin, F., Chen, B.M., Lee, T.H.: Robust vision-based target tracking control system for an unmanned helicopter using feature fusion. In: 9th IAPR Int. Conf. on Machine Vision Applications, vol. 13, pp. 398–401 (2009)
94. Lin, H.H., Liu, T.L., Chuang, J.H.: A probabilistic svm approach for background scene initialization. In: Int. Conf. on Image Processing, vol. 3, pp. 893–896 (2002)
95. Lipton, A.J.: Local application of optic flow to analyse rigid versus non-rigid motion. http://www.eecs.lehigh.edu/FRAME/Lipton/ieevframe.html
96. Lipton, A.J., Fujiyoshi, H., Patil, R.S.: Moving target classification and tracking from real-time video. In: Proc. of the 4th IEEE Workshop on Applications of Computer Vision, pp. 8–14. IEEE Computer Society, Washington (1998)
97. Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: Sift flow: dense correspondence across different scenes. In: Proc. of the 10th European Conference on Computer Vision: Part III, pp. 28–42. Springer, Berlin, Heidelberg (2008)
98. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2009)
99. Lublinerman, R., Ozay, N., Zarpalas, D., Camps, O.: Activity recognition from silhouettes using linear systems and model (in)validation techniques. In: 18th Int. Conf. on Pattern Recognition, vol. 1, pp. 347–350 (2006)
100. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Int. Joint Conf. on Artificial Intelligence, pp. 674–679. AAAI Press, Menlo Park (1981)
101. Luo, R., Li, L., Gu, I.Y.: Efficient adaptive background subtraction based on multi-resolution background modeling and updating. In: Springer-PCM, pp. 118–127. Springer, Berlin (2007)
102. Lv, F., Nevatia, R.: Single view human action recognition using key pose matching and Viterbi path searching. In: CVPR, Minneapolis, Minnesota, USA, pp. 1–7. IEEE Computer Society, Washington (2007)
103. Ma, X., Grimson, W.E.L.: Edge-based rich representation for vehicle classification. In: Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1185–1192. IEEE Computer Society, Washington (2005)
104. McHugh, J.M., Konrad, J., Saligrama, V., Jodoin, P.M.: Foreground-adaptive background subtraction. IEEE Signal Process. Lett. 16(5), 390–393 (2009)
105. Meyer, F., Bouthemy, P.: Region-based tracking using affine motion models in long image sequences. CVGIP, Image Underst. 60(2), 119–140 (1994)
106. Migdal, J., Grimson, W.E.L.: Background subtraction using Markov thresholds. In: Proc. of the IEEE Workshop on Motion and Video Computing (WACV/MOTION'05), vol. 2, pp. 58–65. IEEE Computer Society, Washington (2005)
107. Minnen, D., Essa, I., Starner, T.: Expectation grammars: leveraging high-level expectations for activity recognition. In: Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, vol. 2, pp. 626–632 (2003)
108. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Comput. Vis. Image Underst. 81(03), 231–268 (2001)
109. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2–3), 90–126 (2006)
110. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Trans. Pattern Anal. Mach. Intell. 23(4), 349–361 (2001)
111. Monnet, A., Mittal, A., Paragios, N., Ramesh, V.: Background modeling and subtraction of dynamic scenes. In: Proc. 9th IEEE Int. Conf. on Computer Vision, vol. 2, pp. 1305–1312 (2003)
112. Moore, D., Essa, I.: Recognizing multitasked activities from video using stochastic context-free grammar. In: Proc. AAAI National Conf. on AI, pp. 770–776. AAAI Press, Menlo Park (2002)
113. Moore, D.J., Essa, I.A., Hayes, M.H.: Exploiting human actions and object context for recognition tasks. In: Proc. of 7th IEEE Int. Conf. on Computer Vision, vol. 1, pp. 80–86 (1999)
114. Morris, B.T., Trivedi, M.M.: A survey of vision-based trajectory learning and analysis for surveillance. IEEE Trans. Circuits Syst. Video Technol. 18(08), 1114–1127 (2008)
115. Narayana, M., Haverkamp, D.: A Bayesian algorithm for tracking multiple moving objects in outdoor surveillance video. In: CVPR, pp. 1–8. IEEE Press, New York (2007)
116. Natarajan, P., Nevatia, R.: Coupled hidden semi Markov models for activity recognition. In: IEEE Workshop on Motion and Video Computing, pp. 1–8 (2007)
117. Nevatia, R., Hobbs, J., Bolles, B.: An ontology for video event representation. In: IEEE Conf. on Computer Vision and Pattern Recognition Workshop, pp. 119–128 (2004)
118. Nevatia, R., Zhao, T., Hongeng, S.: Hierarchical language-based representation of events in video streams. In: Conf. on Computer Vision and Pattern Recognition Workshop, vol. 4, pp. 39–47 (2003)
119. Nguyen, N.T., Phung, D.Q., Venkatesh, S., Bui, H.: Learning and detecting activities from movement trajectories using the hierarchical hidden Markov model. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 955–960 (2005)
120. Niebles, J.C., Wang, H., Fei-fei, L.: Unsupervised learning of human action categories using spatial-temporal words. In: Proc. British Machine Vision Conference (BMVC) (2006)
121. Niethammer, M., Tannenbaum, A., Angenent, S.: Dynamic active contours for visual tracking. IEEE Trans. Autom. Control 51(4), 562–579 (2006)
122. Nowozin, G.S., Bakir, G., Tsuda, K.: Discriminative subsequence mining for action classification. In: ICCV, vol. 11, pp. 1–8. IEEE Press, New York (2007)
123. Ogale, A.S., Karapurkar, A., Aloimonos, Y.: View-invariant modeling and recognition of human actions using grammars. In: 10th Conf. on Category Curve of Long Video, vol. 10, pp. 115–126, Beijing, China. IEEE Press, New York (2005)
124. Oh, S., Hoogs, A., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 3153–3160 (2011)
125. Oikonomopoulos, A., Patras, I., Pantic, M., Paragios, N.: Trajectory-based representation of human actions. In: Artificial Intelligence for Human Computing, vol. 4451, pp. 133–154. Springer, Berlin (2007)
126. Oikonomopoulos, A., Patras, I., Pantic, M.: Spatiotemporal salient points for visual recognition of human actions. IEEE Trans. Syst. Man Cybern. 36(3), 710–719 (2006)
127. Oliver, N.M., Rosario, B., Pentland, A.P.: A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 831–843 (2000)
128. Oliver, N., Horvitz, E., Garg, A.: Layered representations for human activity recognition. In: Proc. 4th IEEE Int. Conf. on Multimodal Interfaces, pp. 3–8 (2002)
129. Ong, E.J., Gong, S.: The dynamics of linear combinations: tracking 3d skeletons of human subjects. Image Vis. Comput. 20(5/6), 397–414 (2002)
130. Paragios, N., Deriche, R.: Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 22(3), 266–280 (2000)
131. Paragios, R., Stenger, B., Ramesh, V., Paragios, N., Buhmann, F.C.J.: Topology free hidden Markov models: application to background modeling. In: IEEE Int. Conf. on Computer Vision, pp. 294–301 (2001)
132. Parameswaran, V., Chellappa, R.: View invariance for human action recognition. Int. J. Comput. Vis. 66(1), 83–101 (2006)
133. Parameswaran, V., Singh, M., Ramesh, V.: Illumination compensation based change detection using order consistency. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1982–1989 (2010)
134. Parikh, D., Zitnick, C.L., Chen, T.: Unsupervised learning of hierarchical spatial structures in images. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2009)
135. Park, S., Aggarwal, J.K.: A hierarchical Bayesian network for event recognition of human actions and interactions. ACM Multimedia Systems Journal, 164–179 (2004)
136. Paruchuri, J.K., Sathiyamoorthy, E.P., Cheung, S.C., Chen, C.H.: Spatially adaptive illumination modeling for background subtraction. In: IEEE Int. Conf. on Computer Vision Workshops (ICCV Workshops), pp. 1745–1752 (2011)
137. Pentland, A.: Smart rooms, smart clothes. In: Proc. 14th Int. Conf. on Pattern Recognition, vol. 2, pp. 949–953 (1998)
138. Peursum, P., West, G., Venkatesh, S.: Combining image regions and human activity for indirect object recognition in indoor wide-angle views. In: 10th IEEE Int. Conf. on Computer Vision, vol. 1, pp. 82–89 (2005)
139. Pilet, J., Strecha, C., Fua, P.: Making background subtraction robust to sudden illumination changes. In: Proc. European Conf. on Computer Vision, pp. 1–14 (2008)
140. Pinhanez, C.S., Bobick, A.F.: Human action detection using PNF propagation of temporal constraints. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 898–904 (1998)
141. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods—Support Vector Learning, pp. 185–208. Microsoft Research, Redmond (1998)
142. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28, 976–990 (2010)
143. Porikli, F., Ivanov, Y., Haga, T.: Robust abandoned object detection using dual foregrounds. EURASIP J. Adv. Signal Process. 2008, 197875 (2008)
144. Qi, Y., An, G.: Infrared moving targets detection based on optical flow estimation. In: Proc. of IEEE Int. Conf. on Computer Science and Network Technology, pp. 2452–2455 (2011)
145. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
146. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. In: Proc. of the 11th IEEE Int. Conf. on Computer Vision, pp. 1–8 (2007)
147. Rao, C., Shah, M.: View-invariance in action recognition. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 316–322 (2001)
148. Reddy, V., Sanderson, C., Sanin, A., Lovell, B.C.: Adaptive patch-based background modelling for improved foreground object segmentation and tracking. In: 7th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (AVSS), pp. 172–179 (2010)
149. Ren, Y., Chua, C.S.: Bilateral learning for color-based tracking. Image Vis. Comput. 26(11), 1530–1539 (2008)
150. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: CVPR. IEEE Press, New York (2008)
151. Rui, Y., Huang, T.S.: Image retrieval: current techniques, promising directions and open issues. J. Vis. Commun. Image Represent. 10, 39–62 (1999)
152. Ryoo, M.S., Aggarwal, J.K.: Recognition of composite human activities through context-free grammar based representation. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 1709–1718 (2006)
153. Ryoo, M.S., Aggarwal, J.K.: Hierarchical recognition of human activities interacting with objects. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
154. Ryoo, M.S., Aggarwal, J.K.: Recognition of high-level group activities based on activities of individual members. In: IEEE Workshop on Motion and Video Computing, pp. 1–8 (2008)
155. Ryoo, M.S., Aggarwal, J.K.: Semantic representation and recognition of continued and recursive human activities. Int. J. Comput. Vis. 82, 1–24 (2009)
156. Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: IEEE 12th Int. Conf. on Computer Vision, pp. 1593–1600 (2009)
157. Ryoo, M.S., Aggarwal, J.K.: UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA). http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html (2010)
158. Ryoo, M.S., Chen, C.C., Aggarwal, J.K., Roy-Chowdhury, A.: An overview of contest on semantic description of human activities 2010. In: Proc. Int. Conf. Pattern Recognition Contests, pp. 1–16 (2010)
159. Sakaino, H.: A semitransparency-based optical-flow method with a point trajectory model for particle-like video. IEEE Trans. Image Process. 21(2), 441–450 (2012)
160. Salembier, P., Marques, F.: Region-based representations of image and video: segmentation tools for multimedia services. IEEE Trans. Circuits Syst. Video Technol. 9(8), 1147–1169 (1999)
161. Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The humanID gait challenge problem: data sets, performance, and analysis. IEEE Trans. Pattern Anal. Mach. Intell. 27(2), 162–177 (2005)
162. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J.: Localised mixture models in region-based tracking. In: Proc. of the 31st DAGM Symposium on Pattern Recognition, pp. 21–30. Springer, Berlin (2009)
163. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J.: Region-based pose tracking with occlusions using 3D models. Mach. Vis. Appl. 23(3), 557–577 (2012)
164. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proc. 17th Int. Conf. on Pattern Recognition, vol. 3, pp. 32–36. IEEE Computer Society Press, Los Alamitos (2004)
165. Schunck, B.: The image flow constraint equation. Comput. Vis. Graph. Image Process. 35(1), 20–46 (1986)
166. Schunck, B., Horn, B.: Determining optical flow. In: Proc. DARPA Image Understanding Workshop, pp. 144–156 (1981)
167. Sclaroff, S., Isidoro, J.: Active blobs: region-based, deformable appearance models. Comput. Vis. Image Underst. 89(2/3), 197–225 (2003)
168. Senst, T., Evangelio, R.H., Sikora, T.: Detecting people carrying objects based on an optical flow motion model. In: IEEE Workshop on Applications of Computer Vision, pp. 301–306 (2011)
169. Shechtman, E., Irani, M.: Space-time behavior based correlation. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 405–412 (2005)
170. Sheikh, Y., Javed, O., Kanade, T.: Background subtraction for freely moving cameras. In: IEEE 12th Int. Conf. on Computer Vision, pp. 1219–1225 (2009)
171. Sheikh, Y., Sheikh, M., Shah, M.: Exploring the space of a human action. In: 10th IEEE Int. Conf. on Computer Vision, vol. 1, pp. 144–149 (2005)
172. Shi, J., Tomasi, C.: Good features to track. In: CVPR, pp. 593–600. IEEE Computer Society, Washington (1994)
173. Shi, Y., Huang, Y., Minnen, D., Bobick, A., Essa, I.: Propagation networks for recognition of partially ordered sequential action. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 862–869 (2004)
174. Shibata, M., Yasuda, Y., Ito, M.: Moving object detection for active camera based on optical flow distortion. In: Proc. of the 17th World Congress of the International Federation of Automatic Control, Seoul, Korea, pp. 14720–14725 (2008)
175. Siskind, J.M.: Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. Artif. Intell. Res. 15, 31–90 (2001)
176. Sivic, J., Zisserman, A.: Video data mining using configurations of viewpoint invariant regions. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Washington, DC, pp. 1–8 (2004)
177. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In: Proc. of the 8th ACM Int. Workshop on Multimedia Information Retrieval, Santa Barbara, California, USA, pp. 321–330 (2006)
178. Starner, T., Pentland, A.: Real-time American sign language recognition from video using hidden Markov models. In: Proc. Int. Symposium on Computer Vision, pp. 265–270 (1995)
179. Stauffer, C.: Automatic hierarchical classification using time-based co-occurrences. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 333–339 (1999)
180. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 747–757 (2000)
181. Tavakkoli, A., Nicolescu, M., Bebis, G.: A novelty detection approach for foreground region detection in videos with quasi-stationary backgrounds. In: Proc. of the 2nd Int. Symposium on Visual Computing, pp. 40–49. Springer, Berlin, Heidelberg (2006)
182. Techmer, A.: Contour-based motion estimation and object tracking for real-time applications. In: Proc. IEEE Int. Conf. on Image Processing, vol. 3, pp. 648–651 (2001)
183. Thi, T.H., Zhang, J., Cheng, L., Wang, L., Satoh, S.: Semi-supervised human action recognition and localization using spatially and temporally integrated local features. http://huetuan.net/semiaction.html (2009)
184. TREC Video Retrieval Evaluation official website. http://trecvid.nist.gov/
185. Tsai, D.M., Lai, S.C.: Independent component analysis-based background subtraction for indoor surveillance. IEEE Trans. Image Process. 18(1), 158–167 (2009)
186. Tsuchiya, M., Fujiyoshi, H.: Evaluating feature importance for object classification in visual surveillance. In: Proc. of the 18th Int. Conf. on Pattern Recognition, vol. 2, pp. 978–981. IEEE Computer Society, Washington (2006)
187. Valera, M., Velastin, S.A.: Intelligent distributed surveillance systems: a review. IEE Proc., Vis. Image Signal Process. 152(2), 192–204 (2005)
188. Varcheie, P.D.Z., Sills-Lavoie, M., Bilodeau, G.A.: A multiscale region-based motion detection and background subtraction algorithm. Sensors 10, 1041–1061 (2010)
189. Vaswani, N., Chowdhury, A.R., Chellappa, R.: Activity recognition using the dynamics of the configuration of interacting objects. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 633–640 (2003)
190. Vaswani, N., Chowdhury, A.R., Chellappa, R.: Shape activity: a continuous state HMM for moving/deforming shapes with application to abnormal activity detection. IEEE Trans. Image Process. 14(10), 1603–1616 (2005)
191. Veeraraghavan, A., Chellappa, R., Roy-Chowdhury, A.K.: The function space of an activity. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 959–968 (2006)
192. Vishwakarma, S., Agrawal, A.: A novel approach for feature quantization using one-dimensional histogram. In: Annual IEEE India Conference (INDICON), pp. 1–4 (2011)
193. Vishwakarma, S., Sapre, A., Agrawal, A.: Action recognition using cuboids of interest points. In: IEEE Int. Conf. on Signal Processing, Communications and Computing (ICSPCC), pp. 1–6 (2011)
194. Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. 27(3), 97:1–97:9 (2008)
195. Vogler, C., Metaxas, D.: Parallel hidden Markov models for American sign language recognition. In: IEEE Int. Conf. on Computer Vision, vol. 1, pp. 224–228 (1999)
196. Vosters, L., Shan, C., Gritti, T.: Background subtraction under sudden illumination changes. In: 7th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (AVSS), pp. 384–391 (2010)
197. Vu, V.-T., Bremond, F., Thonnat, M.: Automatic video interpretation: a novel algorithm for temporal scenario recognition. In: Proc. 18th Int. Joint Conf. on Artificial Intelligence, pp. 9–15 (2003)
198. Waltisberg, D., Yao, A., Gall, J., Gool, L.V.: Variations of a Hough-voting action recognition system. In: Proc. of Int. Conf. on Pattern Recognition, pp. 1–7 (2010)
199. Wang, J., Bebis, G., Miller, R.: Robust video-based surveillance by integrating target detection with tracking. In: Proc. Conf. on Computer Vision and Pattern Recognition Workshop (CVPRW '06), pp. 137–144. IEEE Computer Society, Washington (2006)
200. Weber, M.: Unsupervised learning of models for object recognition. Ph.D. thesis, California Institute of Technology, Pasadena, California (2000)
201. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views using 3D exemplars. In: 11th IEEE Int. Conf. on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp. 1–7. IEEE Computer Society Press, Los Alamitos (2007)
202. Weinland, D., Ronfard, R., Boyer, E.: Automatic discovery of action taxonomies from multiple views. In: CVPR, vol. 2, pp. 1639–1645. IEEE Computer Society, Washington (2006)
203. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
204. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241 (2011)
205. Wen, Z., Cai, Z.: A robust object tracking approach using mean shift. In: 3rd IEEE Int. Conf. on Natural Computation, vol. 2, pp. 170–174 (2007)
206. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: 11th IEEE Int. Conf. on Computer Vision (ICCV), pp. 1–8. IEEE Press, New York (2007)
207. Wong, S.F., Kim, T.K., Cipolla, R.: Learning motion categories using both semantic and structural information. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–6 (2007)
208. Wunsch, P., Hirzinger, G.: Real-time visual tracking of 3D objects with dynamic handling of occlusion. In: Int. Conf. on Robotics and Automation, Albuquerque, New Mexico, USA, vol. 4, pp. 2868–2879 (1997)
209. Xiang, T.: Video behavior profiling for anomaly detection. IEEE Trans. Pattern Anal. Mach. Intell. 30(5), 893–908 (2008)
210. Xiao, J., Cheng, H., Han, F., Sawhney, H.: Geo-spatial aerial video processing for scene understanding and object tracking. In: CVPR, pp. 1–8. IEEE Press, New York (2008)
211. Xu, M., Zuo, L., Iyengar, S., Goldfain, A., DelloStritto, J.: A semi-supervised hidden Markov model-based activity monitoring system. In: 33rd Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC), Boston, Massachusetts, USA, pp. 1794–1797 (2011)
212. Yacoob, Y., Black, M.J.: Parameterized modeling and recognition of activities. In: 6th Int. Conf. on Computer Vision, pp. 120–127 (1998)
213. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden Markov model. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 379–385 (1992)
214. Yamazaki, M., Xu, G., Chen, Y.W.: Detection of moving objects by independent component analysis. In: Proc. of the 7th Asian Conf. on Computer Vision (ACCV '06), vol. 2, pp. 467–478. Springer, Berlin, Heidelberg (2006)
215. Yang, F., Li, B.: Unsupervised learning of spatial structures shared among images. Vis. Comput. 28(2), 175–180 (2011)
216. Yilmaz, A., Javed, O., Shah, M.: Object tracking: a survey. ACM Comput. Surv. 38(4), 1–45 (2006)
217. Yilmaz, A., Li, X., Shah, M.: Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1531–1536 (2004)
218. Yilmaz, A., Shah, M.: Actions sketch: a novel action representation. In: CVPR, vol. 1, pp. 984–989. IEEE Computer Society, Washington (2005)
219. Yohannes, Y., Hoddinott, J.: Classification and regression trees. Tech. rep., International Food Policy Research Institute, Washington, DC, USA (1999)
220. Yokoyama, M., Poggio, T.: A contour-based moving object detection and tracking. In: 2nd Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 271–276 (2005)
221. Yu, E., Aggarwal, J.K.: Detection of fence climbing from monocular video. In: 18th Int. Conf. on Pattern Recognition, vol. 1, pp. 375–378 (2006)
222. Yu, T.H., Kim, T.K., Cipolla, R.: Real-time action recognition by spatiotemporal semantic and structural forests. In: Proc. of British Machine Vision Conference, pp. 1–7 (2010)
223. Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 123–130 (2001)
224. Zhan, B., Monekosso, D.N., Remagnino, P., Velastin, S.A., Xu, L.Q.: Crowd analysis: a survey. Mach. Vis. Appl. 19(5–6), 345–357 (2008)
225. Zhang, D., Gatica-Perez, D., Bengio, S., McCowan, I.: Modeling individual and group actions in meetings with layered HMMs. IEEE Trans. Multimed. 8(3), 509–520 (2006)
226. Zhang, J., Tian, Y., Yang, Y.: Adaptive dynamic model particle filter for visual object tracking. In: ISECS International Colloquium, vol. 1, pp. 333–336. IEEE Press, New York (2009)
227. Zhang, L., Li, S.Z., Yuan, X., Xiang, S.: Real-time object classification in video surveillance based on appearance learning. In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
228. Zhao, Y., Gong, H., Lin, L., Jia, Y.: Spatio-temporal patches for night background modeling by subspace learning. In: 19th Int. Conf. on Pattern Recognition, pp. 1–4 (2008)
229. Zhong, H., Shi, J., Visontai, M.: Detecting unusual activity in video. In: Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 819–826 (2004)
230. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Trans. Image Process. 13(11), 1491–1506 (2004)
231. Zhu, Y., Dariush, B., Fujimura, K.: Kinematic self retargeting: a framework for human pose estimation. Comput. Vis. Image Underst. 114(12), 1362–1375 (2010)

Sarvesh Vishwakarma received the B.Tech. degree in electronics and communication engineering from the University Institute of Engineering and Technology, Kanpur, India, in 2001, and the M.Tech. degree in computer science and engineering from the Indian Institute of Technology, Roorkee, India, in 2003. He is currently working toward the Ph.D. degree in information technology at the Indian Institute of Information Technology, Allahabad, India. His research interests include computer vision, image processing, pattern recognition, and artificial intelligence applied to unusual activity analysis and surveillance applications. He is a member of the IEEE Computer Society.

Anupam Agrawal received his M.Sc. degree in computer science from the J.K. Institute of Applied Physics, Allahabad University, in 1988, his M.Tech. degree in computer science and engineering from the Indian Institute of Technology Madras, Chennai, in 1995, and his Ph.D. degree in information technology from the Indian Institute of Information Technology, Allahabad, in 2006. He was a postdoctoral researcher in the Department of Computing and Information Systems, University of Bedfordshire, UK. Earlier, he worked as a Scientist "D" at DEAL, DRDO, Govt. of India, Dehradun. He is presently a Professor in the Department of Information Technology at the Indian Institute of Information Technology, Allahabad, India. His research interests include computer vision, image processing, medical image processing, multimedia, and graphics. He has more than 75 publications in these areas in international journals and conference proceedings, and has authored one book. He is a senior member of the IEEE, a fellow of the Institution of Electronics and Telecommunication Engineers, and Chairman of the ACM Chapter, IIIT-A.