
Backpack: Detection of People Carrying Objects Using Silhouettes

Ismail Haritaoglu, Ross Cutler, David Harwood and Larry S. Davis


Computer Vision Laboratory
University of Maryland, College Park, MD 20742
{hismail,rgc,harwood,lsd}@umiacs.umd.edu

Abstract
We describe a video-rate surveillance algorithm to detect and track people from a stationary camera, and to determine if they are carrying objects or moving unencumbered. The contribution of the paper is the shape analysis algorithm that both determines if a person is carrying an object and segments the object from the person so that it can be tracked, e.g., during an exchange of objects between two people. As the object is segmented, an appearance model of the object is constructed. The method combines periodic motion estimation with static symmetry analysis of the silhouettes of a person in each frame of the sequence. Experimental results demonstrate robustness and real-time performance of the proposed algorithm.

1 Introduction

Visual surveillance is both a challenging scientific problem and an important application in computer vision. With increasing processor power, more attention has been given to developing real-time “smart” surveillance systems. Surveillance cameras are already installed in many locations such as highways, streets, stores, ATMs, homes and offices. The ability to detect and track people is a key element of such systems. The problem is more challenging when we want to monitor interactions between people and objects, and detect unusual events such as depositing an object (unattended baggage in airports), exchanging bags, or removing an object (theft). It requires an ability to detect people carrying objects, to segment the objects from people, and to construct appearance models for the objects so they can be identified subsequently.

Backpack combines two basic observations to analyze people carrying objects: the human body shape is symmetric, and people exhibit periodic motion while they are moving unencumbered. During tracking, the periodic motion of a person and his parts is estimated, and the regions on the silhouette which systematically violate the symmetry constraints are determined. Those results are combined to determine if a person is carrying an object and to segment the object from the silhouette. We construct an appearance model for each carried object, so that when people exchange objects, we can detect “who” carries “which” object via an analysis of the segmentation.

Backpack employs a global shape constraint derived from the requirement that the human body shape is symmetric around its body axis. Backpack uses that constraint to segment outlier regions from the silhouette. The expected shape model of a person is compared with the current person silhouette to determine the outlier regions (non-symmetric regions). One can observe that because of the motion of people’s arms, legs, and hands, outliers are periodically detected in the vicinity of those body parts. However, outliers are detected continuously in the vicinity of a sufficiently large carried object because of continued symmetry constraint violations. Therefore, Backpack uses periodicity analysis to classify whether outlier regions belong to an object or a body part.

Backpack has been designed to work under the control of W4 [5, 7]. W4 is a real-time visual surveillance system for detecting and tracking people and their body parts, and monitoring their activities in an outdoor environment. It operates on monocular grayscale video imagery, or on video imagery from an infrared camera. W4 detects objects through a background subtraction process. Before Backpack, W4 simply classified those objects as people, vehicles, or “other” on the basis of static size and shape properties, and dynamic analysis of shape periodicities. If an object was classified as a person, then W4 segmented the shape into body parts (head, torso, feet and hands), built appearance models for the entire person and the parts, and tracked the person and his parts. Backpack allows W4 to analyze people carrying objects. We are interested in interactions between people and objects; e.g., people exchanging objects, leaving objects in the scene, taking objects from the scene. Backpack forms a basis for developing algorithms to reason about activities involving people and objects.

The remainder of this paper is organized as follows. After a brief literature review in Section 1.1, Section 2 describes the silhouette model used in this work. Section 3 focuses on detection of the shape periodicity of moving objects. The symmetry-based region segmentation and dynamic appearance of non-symmetric regions are explained in Section 4.

 


1.1 Literature Review

Many real-time systems have been proposed in the past few years to detect and track people. Each system uses a particular sensor type (single or multiple camera, color or grayscale, moving or stationary), and has different functionalities (track single person, multiple people, handle occlusion, body part detection and tracking). None of the previous real-time surveillance systems attempts to determine whether a person is carrying an object or not. Pfinder [11] is a real-time system for tracking a person which uses a multi-class statistical model of color and shape to segment a person from a background scene. It finds and tracks people’s head and hands under a wide range of viewing conditions. CMU’s system [9] extracts moving targets from a real-time video stream, classifies them into pre-defined categories and tracks them. SRI’s system [8] and the other extension of W4 [6] use real-time stereo to detect and track multiple people. MIT’s system [4] uses real-time color based detection and motion tracking algorithms to classify detected objects and to learn common patterns of activity. KidsRoom [1] is a color-based multi camera tracking system based on “closed-world regions”, which allows people to interact with each other. Lehigh/Columbia’s system [2] uses an omnidirectional camera to detect and track multiple blobs in real time. Hebrew University’s system [10] has the ability to detect moving objects from a moving camera in real time.

2 Silhouette Model

Backpack generates a set of shape and appearance features for each person’s silhouette. Backpack determines a major axis of a person’s silhouette by applying a principal component analysis to the silhouette pixels. The best fitting axis constrained to pass through the median silhouette coordinate is computed by minimizing the sum of absolute perpendicular distances to that axis. The direction of the major axis is given by an eigenvector associated with the largest eigenvalue of its covariance matrix (Figure 2(b)).

The shape of a 2D binary silhouette is represented by its projection histogram. We compute the 1D vertical and horizontal projection histograms of the silhouettes in each frame. Vertical and horizontal projection histograms are computed by projecting the binary foreground region onto axes perpendicular to and along the major axis, respectively (Figure 2(f) (g)). Projection histograms are normalized by rescaling them to a fixed length, and aligning the median coordinate at the center.

3 Shape Periodicity Analysis

People exhibit periodic motion while they are moving. We previously introduced a robust, image-correlation based technique to compute the periodicity of a moving object [3]. It was used during tracking to determine whether the region is a vehicle or a person. The method computes image intensity self-similarities as the object appearance evolves over time. A computationally inexpensive version of [3], which also requires less memory, is employed by Backpack to determine shape periodicity. Periodic motion is determined by self-similarity of silhouettes over time using the silhouette’s projection histograms.
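
The silhouette features of Section 2, on which the periodicity analysis operates, can be illustrated with a short NumPy sketch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the default `length=64`, the use of plain PCA instead of the absolute-distance fit through the median, and the scaling of each histogram to sum to 1 are all illustrative choices.

```python
import numpy as np


def major_axis(mask):
    """Return (unit major-axis vector, median pixel) of a binary silhouette.

    Assumption: plain PCA on the foreground coordinates; the eigenvector of
    the largest covariance eigenvalue gives the axis direction.
    """
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    median = np.median(pts, axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(pts.T))  # eigenvalues in ascending order
    return eigvecs[:, -1], median


def projection_histograms(mask, length=64):
    """Return (vertical, horizontal) projection histograms of fixed length.

    vertical:   pixel counts binned along the axis perpendicular to the major axis.
    horizontal: pixel counts binned along the major axis.
    Each histogram spans a symmetric range around the median coordinate, so
    the median sits at the centre, and is scaled to sum to 1 (assumption).
    """
    axis, median = major_axis(mask)
    normal = np.array([-axis[1], axis[0]])
    ys, xs = np.nonzero(mask)
    centred = np.column_stack([xs, ys]).astype(float) - median

    def fixed_length_hist(coord):
        half = max(np.abs(coord).max(), 1.0)
        counts, _ = np.histogram(coord, bins=length, range=(-half, half))
        return counts / counts.sum()

    return fixed_length_hist(centred @ normal), fixed_length_hist(centred @ axis)
```

Binning over a symmetric range around the median handles both normalization steps at once: every silhouette yields a histogram of the same length, and the median coordinate always falls at the centre bin.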
  
The vertical and horizontal projection histograms of a walking person are shown in Figure 4. Note that the person completes one walking cycle in 17 frames. Shape periodicity is obtained by using similarities of the last $N$ projections. The projections are aligned and normalized during tracking. A similarity plot $S(t_1, t_2)$ between the projection histogram $p_{t_1}$ at time $t_1$ and $p_{t_2}$ at time $t_2$ is computed as follows:

$$ S(t_1, t_2) = \min_{|r| < R} \sum_{i=L}^{U} \left| p_{t_1}(i + r) - p_{t_2}(i) \right| \tag{1} $$

where $L$ and $U$ are the lower and upper bounds of the projection histograms. In order to account for tracking error, the minimum similarity is computed by translating $p_{t_1}$ over a small search window $R$.

A row-based autocorrelation method is applied to $S$ to detect periodicity. Note that the similarity plot has a very distinctive pattern when the shape is changing periodically. In Figure 3, the similarity plots for a walking person and a moving car are shown. The similarity values in the plot have been scaled to a grayscale intensity range where darker regions indicate higher similarity. Periodic motions will have dark lines or curves parallel to the diagonal, which represents the self-similarity of each projection histogram to itself. For each row $i$ of $S$, a period value $T_i$ is determined where the absolute autocorrelation of that row has a peak. Among all peaks $T_i$, the most frequently occurring peak is selected as the fundamental frequency, $T$, of the motion. A confidence value is computed based on the number of rows in $S$ that have a similar frequency to $T$. In Figure 4, similarity plots obtained by using vertical and horizontal projection histograms of a walking person and their determined fundamental frequencies are shown. A detailed description of periodic motion detection algorithms can be found in [3].

4 Symmetry Analysis

Silhouettes of humans are typically close to symmetric about the body axis while standing, walking or running. Let $l_s$ be the symmetry axis constructed for a given silhouette [7]. Each pixel is classified as symmetric or non-symmetric using the following simple procedure. Let $p_l$ and $p_r$ be a pair of pixels on the silhouette boundary such that the line segment from $p_l$ to $p_r$ is perpendicular to $l_s$ and intersects $l_s$ at $p_s$ (shown in Figure 5). Let $q_l$ and $q_r$ be the lengths of the line segments $[p_l, p_s]$ and $[p_s, p_r]$, respectively. A pixel $x$ lying on the line segment $[p_l, p_r]$ is classified as follows:

$$ x \text{ is } \begin{cases} \text{Non-Symmetric} & \text{if } q_x > \min\{q_l, q_r\} + \varepsilon \\ \text{Symmetric} & \text{otherwise} \end{cases} \tag{2} $$

where $q_x$ is the length of the line segment from pixel $x$ to $p_s$, and $\varepsilon$ is a constant. Figure 6 shows examples of symmetry-based segmentation results for people with and without an object by showing their detected head location, computed hypothetical symmetry axis, and non-symmetric region segmentation.

W4 constructs a dynamic template, called a temporal textural template [5], while it is tracking and segmenting individual people. A similar appearance model is generated and updated for non-symmetric regions over time in Backpack. The temporal textural template for an object is defined by:

$$ \Psi^t(x) = \frac{I(x) + w^{t-1}(x)\,\Psi^{t-1}(x)}{w^{t-1}(x) + 1} \tag{3} $$

Here, $I(x)$ refers to the intensity of pixel $x$, which is classified as a non-symmetric pixel, and all coordinates are represented relative to the median coordinate of the person.
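
Equations (1) to (3) and the row-wise period vote translate fairly directly into array code. The NumPy sketch below is illustrative only: the function names, the zero-padding of shifted histograms, the `min_lag` cutoff, the exact-match confidence vote, and the signed-offset representation of a scan line are assumptions made for this example and are not taken from the Backpack implementation.

```python
import numpy as np


def similarity_plot(hists, search=2):
    """Equation (1) for every pair of frames.

    hists: array of shape (frames, bins), one normalized projection
    histogram per frame. Shifted histograms are zero-padded (assumption).
    """
    n, m = hists.shape
    padded = np.pad(hists, ((0, 0), (search, search)))
    S = np.full((n, n), np.inf)
    for k in range(2 * search + 1):        # shift r = k - search
        shifted = padded[:, k:k + m]       # p_a(i + r) for every frame a
        dist = np.abs(shifted[:, None, :] - hists[None, :, :]).sum(axis=2)
        S = np.minimum(S, dist)
    return S


def dominant_period(S, min_lag=2):
    """Most frequent row-wise autocorrelation peak lag and its vote share.

    Because the absolute value is used, a strong anti-correlation at half
    a cycle also counts as a peak.
    """
    n = S.shape[0]
    lags = []
    for row in S:
        x = row - row.mean()
        ac = np.abs(np.correlate(x, x, mode="full")[n - 1:])  # lags 0..n-1
        lags.append(min_lag + int(np.argmax(ac[min_lag:n // 2])))
    values, counts = np.unique(lags, return_counts=True)
    return int(values[np.argmax(counts)]), counts.max() / n


def non_symmetric(offsets, eps=2.0):
    """Equation (2) on one scan line perpendicular to the symmetry axis.

    offsets: signed distances of the line's silhouette pixels from the axis
    (negative on one side, positive on the other).
    """
    offsets = np.asarray(offsets, dtype=float)
    q_l = max(-offsets.min(), 0.0)         # length of [p_l, p_s]
    q_r = max(offsets.max(), 0.0)          # length of [p_s, p_r]
    return np.abs(offsets) > min(q_l, q_r) + eps


def update_template(psi, w, intensity, non_sym):
    """Equation (3): running weighted average over non-symmetric pixels.

    All arrays are assumed to be aligned to the person's median coordinate.
    """
    psi_new = np.where(non_sym, (intensity + w * psi) / (w + 1.0), psi)
    w_new = w + non_sym.astype(w.dtype)
    return psi_new, w_new
```

In this sketch a template pixel and its weight change only where the current frame marks the pixel as non-symmetric, so pixels that are flagged in nearly every frame (a carried object) accumulate large weights, while pixels flagged only intermittently (swinging limbs) accumulate them more slowly.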

The weights $w^t(x)$ are the number of times that a pixel in $\Psi$ is classified as a non-symmetric pixel during tracking. The initial weights $w^t(x)$ of $\Psi$ are zero and are incremented each time that the corresponding location (relative to the median template coordinate) is detected as a foreground pixel in the input image. Note that a temporal textural template has two components that can be used for subsequent identification: a textural component which represents the appearance of the object (Figure 7d,f); and a shape component ($w^t$) which represents weighted shape information (Figure 7c,e). Examples of temporal textural templates for the entire body and for non-symmetric regions of a person while they are walking with and without an object are shown in Figure 7. Backpack segments the shape component of a temporal textural template to determine the regions where periodic motion analysis should be applied. Periodic motion analysis is applied to a non-symmetric pixel $x$ if

$$ w^t(x) > T + \kappa \tag{4} $$

where $T$ is the fundamental frequency of shape periodicity for the entire body, and $\kappa$ is a constant.

Non-symmetric pixels are grouped together into regions, and the shape periodicity of each non-symmetric region is computed individually. The horizontal projection histogram segment bounded by a non-symmetric region is used to compute the shape periodicity of the corresponding non-symmetric region. A non-symmetric region which does not exhibit significant periodicity is classified as an object carried by a person, while a non-symmetric region which has significant periodicity is classified as a body part. In Figure 8, the final classification results are shown for a walking person who is not carrying an object, and a person who is carrying an object.

In the first example, a person is walking with 1 Hz frequency (15 frames per half period with 100% confidence value); the similarity plot of the vertical projection histogram for the entire body is shown in Figure 8(a)(right). Note that the legs and arms of the person violate the symmetry constraint periodically during walking. The pixels around the legs and arms are detected as non-symmetric pixels and grouped into two non-symmetric regions (region 1 around legs, and region 2 around arms). Then, the similarity plots for region 1 and region 2 are obtained as shown in Figure 8(a). Note that the shape periodicity algorithm is applied only to the horizontal projection histogram segments bounded by regions 1 and 2. Periodicity is detected for region 1 at 1.1 Hz and for region 2 at 1.03 Hz, which are very similar to the shape periodicity of the entire body. Therefore those regions are classified as body parts (shown in green).

In the second example, a person is walking and carrying a bag with 0.85 Hz frequency (17.9 frames per half period with 98% confidence value); its similarity plot from the vertical projection histogram of the entire body is shown in Figure 8(b). The legs of the person and the bag violate the symmetry constraint during walking, and the regions around the legs (region 1) and the backpack (region 2) are grouped into non-symmetric regions. Shape periodicity is detected for region 1 at 0.84 Hz with high confidence and for region 2 at 2.5 Hz with low confidence. The periodicity of region 1 is very similar to the periodicity of the entire body, and it is classified as a body part. However, region 2 does not have a significant fundamental frequency similar to the entire body, so it is classified as a carried object.

The symmetry and shape periodicity analyses used in Backpack are view-based techniques; the results depend on the direction of motion of the person, and the location of the object on the silhouette. Figure 9 shows detection results where a person is carrying an object in his hand while moving in different directions. We ran a series of experiments using 100 sequences where a person is moving in different directions (people carry an object in 62 sequences, and do not carry one in 38 sequences). We estimated the Receiver Operating Characteristic (ROC) curve, which plots the probability of detection along the y-axis and the probability of false detection along the x-axis (Figure 10). An ideal recognition algorithm would produce results near the top left of the graph (low false alarm and high detection probability). For different periodicity confidence thresholds, we computed the number of instances that are correctly classified as person-with-object (true positive), and the number of instances that are misclassified as person-with-object (false positive). For the optimal choice of thresholds, Backpack successfully determined whether a person is carrying an object in 91/100 sequences. It generally failed on sequences where there is not a large enough non-symmetric region (5/100) or where there are insufficient shape changes (4/100) (causing a low periodicity confidence value), e.g., when a person is moving towards the camera. In those cases, Backpack uses a non-global 2D-intensity based periodicity analysis [3] to compute periodicity to decrease the false positive rate (yielding a 95/100 success rate). Backpack uses appearance (shape, intensity, and position) information embedded into its temporal textural templates to track carried objects once they have been detected and their temporal textural templates generated.

[Figure 10: ROC graph plotting detection (true positive) rate against false positive rate, both axes from 0 to 1.]
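
The region classification rule and the threshold sweep behind the ROC curve can be sketched as follows. This is a hedged illustration, not the evaluation code used for the experiments: `classify_region`, `roc_points`, the defaults `min_conf=0.5` and `rel_tol=0.15`, and the notion of a single per-sequence "object score" are hypothetical choices made for this example.

```python
import numpy as np


def classify_region(region_freq, region_conf, body_freq, min_conf=0.5, rel_tol=0.15):
    """Label a non-symmetric region as 'body part' or 'carried object'.

    A region counts as periodic when its periodicity confidence is high
    enough and its frequency is close to that of the entire body.
    min_conf and rel_tol are assumed values, not taken from the paper.
    """
    periodic = (region_conf >= min_conf
                and abs(region_freq - body_freq) <= rel_tol * body_freq)
    return "body part" if periodic else "carried object"


def roc_points(object_scores, carries_object, thresholds):
    """Return one (false positive rate, detection rate) pair per threshold.

    object_scores:  per-sequence score, higher meaning "more likely carrying"
                    (assumption: derived from the periodicity confidence).
    carries_object: ground-truth flag per sequence.
    """
    scores = np.asarray(object_scores, dtype=float)
    truth = np.asarray(carries_object, dtype=bool)
    n_pos = max(int(truth.sum()), 1)
    n_neg = max(int((~truth).sum()), 1)
    points = []
    for t in thresholds:
        predicted = scores >= t
        true_pos = int(np.sum(predicted & truth))
        false_pos = int(np.sum(predicted & ~truth))
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points
```

With the frequencies reported for the second example, and confidence values invented here only to stand in for "high" and "low", `classify_region(0.84, 0.9, 0.85)` labels the leg region a body part and `classify_region(2.5, 0.2, 0.85)` labels the bag region a carried object.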
[Figure: two plots for Person 0 and Person 1 over time (frames 200 to 500). Top panel: “Trajectory of People” (X position). Bottom panel: “People Carrying Object” (carry-object indicator).]

Table 1: Benchmarks for the components of Backpack.

Component                      Time
Detection-Tracking             11.92 ms
Shape Analysis                  2.63 ms
Symmetric segmentation          0.27 ms
Similarity Plot Computation     1.50 ms
Periodicity Computation         0.32 ms
Temporal Textural Template      1.17 ms

5 Conclusion and Discussion

We have described a silhouette-based method to determine if a person is carrying an object, and to segment the object from the silhouette. We construct an appearance model for each carried object, so that when people exchange objects, we can detect “who” carries “which” object via an analysis of the segmentation. Backpack has been implemented in C++ and runs under the Windows NT operating system. Currently, for 320x240 resolution grayscale images, Backpack runs at 30 Hz on a Pentium II 400 MHz PC. Table 1 gives benchmarks for the different components of Backpack.

There are several directions that we are pursuing to improve the performance of Backpack and extend its capabilities. We are studying how to improve detection results using local shape and appearance information, such as, “Is there an object near a person’s hand that looks like a briefcase?”. We are also interested in interactions between people and objects; e.g., people exchanging objects, leaving objects in the scene, taking objects from the scene. The description of people, their positions, and their motions developed by Backpack is designed to support such activities. Backpack forms a basis for developing algorithms to reason about activities involving people and objects, such as depositing/removing objects, and exchanging objects (Figure 12).

References

[1] A. Bobick, J. Davis, S. Intille, F. Baird, L. Campbell, Y. Ivanov, C. Pinhanez, and A. Wilson. KidsRoom: Action recognition in an interactive story environment. Technical Report 398, M.I.T. Perceptual Computing, 1996.
[2] T. Boult. Frame-rate omnidirectional surveillance and tracking of camouflaged and occluded targets. In Second Workshop of Visual Surveillance at CVPR, pages 48-58, 1999.
[3] R. Cutler and L. Davis. Real-time periodic motion detection, analysis, and applications. In Computer Vision and Pattern Recognition Conference, (2) pages 326-331, 1999.
[4] E. Grimson, C. Stauffer, R. Romano, and L. Lee. Using adaptive tracking to classify and monitor activities in a site. In Computer Vision and Pattern Recognition Conference, pages 22-29, 1998.
[5] I. Haritaoglu, D. Harwood, and L. Davis. W4: Who, when, where, what: A real time system for detecting and tracking people. In Third Face and Gesture Recognition Conference, pages 222-227, 1998.
[6] I. Haritaoglu, D. Harwood, and L. Davis. W4S: A real time system for detecting and tracking people in 2.5 D. In European Conference on Computer Vision, pages 877-892, 1998.
[7] I. Haritaoglu, D. Harwood, and L. Davis. Hydra: Multiple people detection and tracking using silhouettes. In Second Workshop of Visual Surveillance at CVPR, pages 6-13, 1999.
[8] K. Konolige. Small vision systems: hardware and implementation. In International Symposium of Robotics Research, pages 111-116, 1997.
[9] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target detection and classification from real-time video. In Proceedings of IEEE Workshop on Application of Computer Vision, 1998.
[10] A. Peleg, et al. Multi sensor representation of extended scenes using multi-view geometry. In DARPA Image Understanding Workshop, 1998.
[11] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), pages 780-785, 1997.
