Hand Models and Systems For Hand Detection, Shape Recognition and Pose Estimation in Video
Contents

1 Introduction
1.1 Applications
1.2 Approaches
1.3 Problems and Challenges
3 System Comparison
3.1 Recognition Performance
3.2 Robustness
3.3 View Independence
3.4 User Independence
3.5 Image Quality and Resolution
3.6 Training Effort
3.7 Computational Costs
4 Conclusion
C Condensation Algorithm
Chapter 1
Introduction
Humans can do a lot with their hands. Many different hand poses can be assumed to manipulate objects or convey information. The human hand can therefore serve as a very valuable input device for a computer, far beyond simply moving a mouse in a 2D plane. Think of manipulating objects in a virtual world or using gestures to communicate.
Several approaches exist to obtain this information. Besides using a glove with sensors or using a camera and markers on the hand, it is also possible to approach the problem from a purely computer vision based side. This has the desirable property of being non-obtrusive for the user. It is, however, complicated by the articulated nature of the hand and a possibly complex and changing environment.
In this report several solutions for hand detection, shape recognition and pose estimation from video are discussed and compared. The focus is on how the presence of a hand and the possible hand poses are modelled within the available systems, and on the possible choices within these systems.
In the remainder of this chapter the possible applications and approaches
will be discussed in more detail. Thereafter possible systems and solutions are
discussed in chapter 2. These systems are then compared with respect to several
important criteria in chapter 3. This is followed by the conclusion in chapter 4.
1.1 Applications
There are a lot of possible ways to use hands in the interaction between human
and computer. The shape, orientation and location of the hand can convey a
lot of information.
Virtual Reality is a very natural application for using the hand pose as input. First of all, the hand can be reconstructed in the virtual world for a more natural experience. Furthermore, the hand can then be used to execute actions in that world, like grabbing objects, pushing, pulling, etc. In such an application it is important to obtain as much information as possible about the configuration of the hand in all its 27 degrees of freedom.
A related application is to use the handshape to communicate commands
using gestures. Think of controlling a robot in a situation where other means
of communication are hard to use.
Finally, handshape recognition can be used as part of a sign language recognition system. The handshape is an important aspect of the gestures in a sign language and is therefore essential in such a system. Systems can for example be designed as a translator between sign language and written (or spoken) text. Another option is to create a learning environment where the computer verifies the correctness of a gesture.
1.2 Approaches
Several approaches exist to obtain the desired information about the handshape of the user. The currently most reliable and accurate way is to use a glove with sensors for all joint angles (often referred to as a Dataglove or Cyberglove). Some systems even provide touch sensors. Although these systems are very accurate, they are also expensive and, more importantly, very obtrusive: the user always has to wear the glove. Whether this is a problem depends on the situation. Wireless systems do exist to improve the mobility of the user when wearing such a glove.
Another approach is to use cameras to track colored markers on the hand, from which the pose and position of the hand are derived. This however still forces the user to wear the markers, while introducing the problem of (self-)occlusion.
To overcome the disadvantage of having to wear something like a glove or markers, purely computer vision based systems can be used. A vision based system is much less obtrusive, but it has to deal with ambiguous situations, because in some cases it is impossible to observe the full state of the hand. This is, however, also true for human vision. A vision based system can only be less obtrusive when it does not impose other restrictions on the user and/or the environment.
In the remainder of this report only methods based on computer vision will
be discussed. It is however important to note that other approaches exist and
can be used to obtain more accurate data about the hand pose.
1.3 Problems and Challenges

Figure 1.1: Structure of the human hand (taken from [WLH05]).

The human hand is a complex, highly articulated object: not only can it assume many different poses, the appearance of each pose is different for different viewing angles. As shown in figure 1.1, each finger has 3 joints giving 4 DoF per finger. The thumb is even more flexible and has 5 DoF. The configuration of the fingers alone therefore has 21 DoF. Combined with the global orientation and position this results in 27 DoF [WLH05], all of which have to be taken into account during recognition.
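This bookkeeping can be written out directly as a quick sanity check; a trivial sketch (the finger labels are merely illustrative):

```python
# Degrees of freedom of the kinematic hand model: 4 per finger,
# 5 for the more flexible thumb, plus 6 for the global position
# and orientation of the palm.
finger_dof = {"index": 4, "middle": 4, "ring": 4, "little": 4, "thumb": 5}
global_dof = 6  # 3 for position + 3 for orientation

total_dof = sum(finger_dof.values()) + global_dof
assert total_dof == 27  # matches the count reported in [WLH05]
```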
Not only is there a huge number of possible variations for a single hand; there are also differences between humans: the shape, size, color and possible configurations can differ between users. For example, the hand of a child is very different from the hand of an adult male.
Furthermore, some parts of the hand may be occluded by other parts, resulting in self-occlusion. Occlusion can also be caused by the other hand, an arm or even other objects in the scene. This regularly makes it impossible to observe the full state of the hand.
Some problems depend on the specific conditions of an application. In the
case of gesture recognition one can expect several problems. First of all, hands
are moving. The position of the hand can change from frame to frame. The
movement could also result in motion blur.
When video of the whole upper body is used to recognize the gesture, the hands only occupy a small part of the image, resulting in a low level of detail. This makes it harder to distinguish shapes and makes the features more easily corrupted by noise.
To be widely usable, a system cannot impose restrictions on the background. Other, possibly moving, objects in the background should be tolerated. Particularly in sign language gestures, the hands are regularly positioned in front of the face and sometimes even touch the face. This makes it harder to rely on assumptions based on e.g. a difference in color between the hand and the background.
It is possible that during a gesture the hands touch or even grab each other. This does not only result in a lot of occlusion, it can also force the hands into unnatural configurations.
Because of the large number of configurations, views and differences between users, the number of possibilities is huge. In the ideal case all these possibilities are covered by the system. When training examples are used to model these variations, an enormous amount of examples is required. Especially when all these examples have to be labeled with the desired outputs, this can become a problem in itself.
The systems discussed in chapter 2 demonstrate different solutions to some
of these problems. How successful these solutions are in relation to the posed
problems is discussed in chapter 3.
Chapter 2
Systems for handshape recognition from video can be composed in many ways. Not only the chosen solutions, but also how these are combined, influences the properties and performance of the system. Choices can be influenced by the specific application and the problems it tries to solve.
Two important architectures can be distinguished: the data-driven architecture and the model-driven architecture. In both, the ultimate goal is to map the image to the desired output, which can be shape classes or pose parameters.
In data-driven systems the main objective is to find a direct mapping from
the image to the desired output, based on the measurements. In model-driven
systems a model is used that captures the possible valid instances of the hand.
Because the model only provides a mapping from the parameters to an object
instance, a fitting procedure based on local image features is used.
Data-Driven Systems
The basic architecture of a data-driven system consists of several steps, shown in figure 2.1. First, preprocessing is done and the region of interest is selected; then features are extracted. These features are mapped to a reduced representation, which is used as input for the recognition. The recognition can be seen as a mapping from input to output, which can be learned from examples. Important aspects of these systems are listed in table 2.1 and are discussed in more detail in the following sections.

Figure 2.1: Architecture of a data-driven system (input image → features → reduced features → output).
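The pipeline can be summarized as a composition of four stages; the sketch below is a generic skeleton in which the concrete choices (segmentation method, feature type, reduction, classifier) are placeholder functions, since these vary between the systems discussed in this chapter:

```python
def recognize(image, select_region, extract_features, reduce_features, classify):
    """One pass through the data-driven pipeline of figure 2.1.

    The stages are passed in as functions because the concrete choices
    (skin-color segmentation, contour or intensity features, PCA
    reduction, nearest-neighbour or boosted classifier, ...) differ
    per system.
    """
    region = select_region(image)        # preprocessing + region of interest
    features = extract_features(region)  # raw feature measurements
    reduced = reduce_features(features)  # reduced representation
    return classify(reduced)             # learned mapping to classes/parameters
```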
Data-Driven Detectors
A variation of the data-driven architecture is to use knowledge about the appearance of the hand to locate the hand in the image. These methods do not distinguish between handshapes; they detect the presence or absence of a hand. With such a detector, the whole image has to be searched at multiple scales. This can be viewed as a large set of parallel detectors that each act on a different region in the image, as shown in figure 2.2.

Because a window is slid across the image at multiple scales, and detection has to be executed for each position and scale, a single detection step has to be fast. Therefore different constraints on features and recognition lead to different choices in these systems.
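A minimal sketch of this search pattern is given below; the base size, scale factor and stride are illustrative assumptions, not values taken from the cited systems:

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) for every sub-window a detector must classify.

    Every position at every scale is visited, which is why a single
    detection step has to be extremely fast.
    """
    size = base
    while size <= min(img_w, img_h):
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        size = int(round(size * scale_step))
```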
Figure 2.3: Architecture of a model-driven system (model, projection, feature extraction, parameter estimation).
Model-Driven Systems
Model-driven systems use a model which captures the possible valid variations
of the object. A model can for example describe the possible positions of the
fingertip relative to the palm. Because the model only provides a mapping
from the parameters to an object instance, a fitting procedure is always used to
estimate the parameters.
The process as shown in figure 2.3 starts with an initial model. Features are
then extracted to estimate new model parameters based on the evidence found
in the image. These local features depend on the current state of the model.
Because initial model parameters are needed to extract the features, some kind
of estimation and initialization has to be used to be able to project the model
onto the input image.
Important variations between these systems can be found in table 2.2.
Table 2.2: Properties of model-driven systems.

System | Output | Features | Model | Tracking and Fitting

Appearance Models
[HS95] | position | local edges | Active Shape Model | previous frame + iterative fit
[HW96] | k classes | local gray-level profile | multiple Active Shape Models | previous frame + iterative fit
[FAK03] | 3+6 parameters | local texture (gray-level and structure) | multiple Active Appearance Models | previous frame + iterative fit
[TvdM02] | 10 classes | Gabor jets | multiple Elastic Graphs | iterative fit
[HH96] | position | local edges | 3D Active Shape Model | previous frame + iterative fit
[BJ96] | 4 classes + position | pixel intensity | multiscale eigenspace | iterative fit

Physical Models
[RK95] | 3+3+21 parameters | SSD error | 3D model | gradient descent
[WLH05] | 3+3+21 parameters | local edges | 3D model (cardboard) | sequential Monte Carlo (importance sampling based)
[SMC01] | 3+3+1 parameters | local edges | 3D model (39 truncated quadrics) | Unscented Kalman filter
[LC04] | 2+3+20 parameters | local edges | 3D model | particle filter
Figure 2.4: A 3D model of the human hand used in [SMC01].
2.1 Models
Each system captures the knowledge about possible handshapes and the cor-
responding appearances in some part of the system. Several types of models
exist.
In a data-driven system the hand is modelled in terms of features. Such a
model can be seen as a direct mapping from features to the desired output. This
mapping is discussed in more detail in section 2.5.
In model-driven systems two important types of models can be distinguished:
Appearance based models and Physical models. Appearance based models cap-
ture the variation of the object in the 2D image and therefore depend on the
view. Physical models represent the possible variations in the real object and
therefore need a mapping between the 2D image and the 3D model.
Physical Models
A physical model describes the possible variations in the real three dimensional
object. In most cases the hand is modelled by a set of linked rigid bodies.
The possible configurations are described in a kinematic model corresponding
to structure in figure 1.1.
In [RK95] the hand is modelled by a palm and three links per finger. [SMC01,
LC04, WLH05] use very similar models for the structure of the hand. The model
of [SMC01] is shown in figure 2.4.
It is possible to model for example the constraints between fingers. In
[WLH05] the configuration space is first captured using examples recorded with
a CyberGlove and reduced to 7 dimensions with PCA. It is then further reduced
to 28 basis configurations. All other configurations are then described as a linear
combination of two of these basis configurations.
A physical model can also model the dynamics of the hand. In [SMC01] these dynamics are modelled using a second order process. In [WLH05] a simpler random walk model is used.

The usage of the physical models is discussed in section 2.7.
Appearance Models
The basic principle of all appearance models is that one can create a statistical model of the appearance of an object that can then be used to extract that object from the image.

Such a model captures the correspondence between important landmarks in the set of examples and the possible variations between these points. For example, the tip of the thumb in one image corresponds to the tip of the thumb in another image, although the appearance and position might be slightly different.
Active Shape Models are the easiest to understand. To create a model, examples are manually labelled with the landmarks. The points in the training set are then aligned by scaling, rotating and translating the examples, and the mean shape is determined. PCA (Principal Component Analysis) is then used to determine the most important variations in the training set. The model is formed by the mean shape and the shape parameters obtained using PCA. This model, which describes the spatial distribution of the landmarks, is commonly referred to as the Point Distribution Model. Active Shape Models are used in [HS95, HW96].
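The construction can be sketched compactly; the code below assumes the landmark shapes have already been aligned and is a schematic illustration, not the exact procedure of [HS95, HW96]:

```python
import numpy as np

def build_pdm(shapes, n_modes):
    """Point Distribution Model from aligned landmark shapes.

    shapes: (n_examples, 2 * n_landmarks) array of aligned landmark
    coordinates. Returns the mean shape, the n_modes most important
    variation modes and their variances.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]  # keep the largest modes
    return mean, eigvecs[:, order], eigvals[order]

def shape_instance(mean, modes, b):
    # A model instance is the mean shape plus a linear combination of
    # the variation modes: x = mean + P b, with each entry of b typically
    # limited to +/- 3 standard deviations of its mode.
    return mean + modes @ b
```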
In [HS95] a 3D Point Distribution Model is used; it is acquired automatically from 3D MRI data. The model is created in a similar way, based on the distribution of the landmarks.
Active Appearance Models [CET98] are a generalization of Active Shape Models. They do not only model the shape variation but also the texture in the area between the landmarks. In [CET98] the extra model points to capture this textured surface are added using triangulation. The variations in texture are expressed in texture parameters using PCA. PCA is again applied to the combination of shape and texture to obtain the appearance parameters.

A variation is used in [FAK03]. The mesh of points covering the hand is shown in figure 2.5(a).
Elastic Graphs [TvdM02] also model a set of points with specific properties.
Instead of modelling the spatial distribution of all points in a linear relationship,
the model has a set of edges connecting the points as shown in figure 2.5(b).
Each edge poses a constraint on the distance between two points.
Figure 2.5: (a) Active Appearance Model from [FAK03]; (b) Elastic Graph from [TvdM02].
Some systems make use of multiple cameras: the 3D model is simply projected differently for each camera, and the features from all images are then used together for further parameter estimation.
2.3 Region Selection
It is important to measure only properties of the hand. For data-driven systems the relevant part of the image has to be selected before the actual features are extracted.
Several systems [HSS00, CP04, CSW95, HT05] simply restrict the hand to be darker or lighter than the background. Simple thresholding is then used to obtain the hand region.
A somewhat more robust solution is used in [IMT+00, RASS01, FAK03, WLH05, GCC+06]. These systems rely on skin-color models to either remove the background from the image or to create a binary silhouette. In [GCC+06] an additional active contour extraction algorithm is used to provide a more robust extraction of the hand region.
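As an illustration, a very crude skin-color silhouette in normalized rg color space could look as follows; the threshold box is a made-up placeholder, whereas the cited systems use trained (and sometimes adaptive) color distributions:

```python
import numpy as np

def skin_silhouette(rgb, r_lo=0.35, r_hi=0.55, g_lo=0.25, g_hi=0.37):
    """Binary hand silhouette from a fixed box in normalized rg space.

    rgb: (H, W, 3) image array. The thresholds are illustrative only;
    real systems learn a skin-color model (histograms, Gaussian
    mixtures, ...) and often track it over time.
    """
    c = rgb.astype(float) + 1e-6         # avoid division by zero
    s = c.sum(axis=2)
    r, g = c[..., 0] / s, c[..., 1] / s  # normalized chromaticities
    return (r > r_lo) & (r < r_hi) & (g > g_lo) & (g < g_hi)
```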
In [KT04, STTC04, OB04] a detector is used to scan the whole image to
find the hand region. This region can then be used for further recognition. In
[AS03] a clean segmented hand region is not required. A cluttered background
is allowed within the bounding box that marks the hand region.
Figure 2.6: A simple contour description based on the distances (d_i) between contour points (P_i) and the center of gravity (G) (taken from [HSS00]).
Figure 2.8: The log-polar histogram used for Shape Context features (taken
from [OB04]).
A possible measure for the similarity between edge images is the chamfer distance, which can be expressed as:

c(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert

where X and Y are the two sets of edge points.
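A direct brute-force implementation of this definition is straightforward; practical systems typically precompute a distance transform of the edge image instead, but the sketch below follows the formula literally:

```python
import numpy as np

def chamfer_distance(X, Y):
    """Directed chamfer distance between two edge-point sets.

    X: (n, 2) and Y: (m, 2) arrays of edge-pixel coordinates. For every
    point in X the nearest point in Y is found and the distances are
    averaged, exactly as in the formula above.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    pairwise = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()
```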
Low-level Features
[KT04] and [OB04] both use Haar-like features as proposed by [VJ01] for object detection. These are simple features calculated from the difference between sums over two or more rectangular areas in the image. These features can be computed efficiently because the sums can be computed in constant time after preprocessing the image. This is done using a so-called integral image ii(x, y), which is the sum of all pixels to the top left of pixel (x, y). The sum over any rectangle (x_1, y_1, x_2, y_2) can then be calculated as ii(x_1, y_1) + ii(x_2, y_2) − ii(x_1, y_2) − ii(x_2, y_1). Thousands of Haar-like features are possible by varying the size and position of the rectangles.
Figure 2.9: A few possible block differences for Haar-like features (taken from
[VJ01]).
An even simpler way of using image intensity is to use the pixel intensity values directly as features [IMT+00, CSW95, BJ96, STTC04]. These are very sensitive to all kinds of variations. In [FAK03] the normalized gray-level and the structural values (angle and magnitude) at specific model points are used to describe texture.
Local autocorrelation patterns are used in [HT05]. The second-order local autocorrelation is given by

R(a_1, a_2) = \sum_{r} f(r) f(r + a_1) f(r + a_2)

where f(r) is the intensity of the image at position r. When considering a 3x3 neighbourhood in the image, this results in 25 distinct patterns for different values of a_1 and a_2. In [HT05] these patterns are calculated at three different scales for each of 64 rectangular sections in the image.
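For one displacement pair the sum is easy to evaluate; a small sketch (wrap-around shifts are used for brevity, where a real implementation would restrict r so that r + a_1 and r + a_2 stay inside the image):

```python
import numpy as np

def second_order_autocorr(f, a1, a2):
    """R(a1, a2) = sum over r of f(r) * f(r + a1) * f(r + a2).

    f: 2D intensity array; a1, a2: (dy, dx) displacements, e.g. taken
    from a 3x3 neighbourhood as in [HT05].
    """
    s1 = np.roll(f, (-a1[0], -a1[1]), axis=(0, 1))
    s2 = np.roll(f, (-a2[0], -a2[1]), axis=(0, 1))
    return float((f * s1 * s2).sum())
```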
Local Features
A feature used in a lot of model-driven systems is the location of edges near the projected model contours, as shown in figure 2.10. To find the strongest edge in the neighbourhood, points are sampled along the line perpendicular to the model contour. In each point, the derivative in the direction of the line is calculated. The point with the strongest edge is then chosen [HS95, HH96]. This location can then be used to propose a new location for the model points.
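A sketch of this search for a single model point is given below; the search range is an arbitrary assumption:

```python
import numpy as np

def strongest_edge_along_normal(gray, point, normal, search_range=10):
    """Return the strongest-edge position along a model contour normal.

    gray: 2D intensity image; point: (x, y) model point; normal: unit
    normal of the model contour at that point. Intensities are sampled
    at offsets along the normal and the offset with the largest
    absolute derivative is selected.
    """
    offsets = np.arange(-search_range, search_range + 1)
    xs = np.clip(np.round(point[0] + offsets * normal[0]).astype(int),
                 0, gray.shape[1] - 1)
    ys = np.clip(np.round(point[1] + offsets * normal[1]).astype(int),
                 0, gray.shape[0] - 1)
    profile = gray[ys, xs].astype(float)
    k = np.argmax(np.abs(np.gradient(profile)))  # strongest edge response
    return np.asarray(point, float) + offsets[k] * np.asarray(normal, float)
```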
A slightly more complex approach is to learn a gray-level profile for the neighbourhood of each point during training. Instead of simply finding the strongest edge, the best match for this profile along the normal is used. This is applied in [HW96].
In [TvdM02] the local image description is based on a wavelet transform with a complex Gabor-based kernel, which is invariant to constant offsets in grey-level value. The combination of several convolutions with different filters forms a descriptor called a jet. Multiple jets are combined in a so-called bunch-jet, which makes it possible to model variations of corresponding points in different images.
classifier. The systems that output parameters simply output the parameters
corresponding to the nearest neighbour instead of a single class label.
[AS03] and [GCC+ 06] do not retrieve a single closest item. They retrieve a
set of k best candidates and try to maximize the number of relevant items in
this set.
In [RASS01] neural networks are used to map features to output. This specialized mapping method uses multiple neural networks, each specialized in mapping a certain part of the input space. All mappings are applied to the input; thereafter a heuristic is used to choose the best hypothesis from the ones generated by each mapping.
In [KT04] a cascade of boosted classifiers is used. Such a cascade forms a
detector as shown in figure 2.11(a). Each level is formed by a strong classifier
trained using the AdaBoost algorithm. These strong classifiers are a linear
combination of weak classifiers, each determined by a single Haar-like feature.
The training process consists of iteratively selecting a weak classifier to add
to the pool of classifiers. Each time the error is weighted, giving more weight
to previously wrongly classified examples. This means that the newly selected
weak classifier is chosen based on its capability of discriminating the hardest
cases. This process is described in more detail in appendix A.
Each level in the cascade is trained in such a way that it has a very high detection rate, taking a high number of false positives for granted. Each level further reduces the number of false positives using a stronger classifier while detecting (almost) all positive examples. Because of this structure, the top-level classifier can use only a few features to quickly weed out a lot of clearly negative examples.
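Evaluating such a cascade on a single sub-window then amounts to a short loop; a schematic sketch in which each stage is represented as a boolean strong classifier:

```python
def cascade_detect(window, stages):
    """Run one sub-window through a cascade of increasingly strong classifiers.

    Early stages use only a few features and are tuned for a near-100%
    detection rate at a high false-positive rate, so most negative
    windows are rejected cheaply; only promising windows reach the
    later, more expensive stages.
    """
    for stage in stages:
        if not stage(window):
            return False  # rejected: later stages are never evaluated
    return True           # passed every stage: report a detection
```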
A similar idea is used in [OB04]: a tree structure of detectors creates a multi-class classifier which also provides location information. A general hand detector formed by a cascade structure similar to the one in [KT04] is the root of the tree (figure 2.11(b)). If a possible hand region is detected by this detector, it is fed to a set of shape-specific detectors. These detectors then determine the specific hand shape. The detectors are trained using FloatBoost, which is similar to AdaBoost but can also remove those weak classifiers that no longer contribute positively to the result. A smaller set of weak classifiers with similar performance remains.
In [STTC04] a tree structure is also used (figure 2.11(c)). Easy-to-compute features are used near the top, while more expensive and precise features are used near the leaves. Each level in the tree acts on a more limited range of configurations. Detection and recognition are combined in the same tree, as opposed to [OB04] where detection and recognition are clearly separate steps.
Figure 2.11: (a) A cascade of classifiers [KT04]; (b) the detector tree of [OB04]; (c) the detector tree of [STTC04]. At each level a sub-window is either rejected or passed on; a hand is detected only when all levels are passed.
Training images can also be clustered based on some similarity measure [IMT+00, OB04]; these clusters can then be used as output labels. To obtain the real parameters corresponding to a training image, a Dataglove can be used. Wearing such a glove influences the image; to minimize this, the glove can be covered with a second glove. This approach is used in [HT05] to obtain a labeled test set.
An alternative way to obtain input-output examples is to generate the input examples using a physical model. In [RASS01, AS01, AS03, HT05, GCC+06] a 3D computer graphics model of the hand is used to create examples. This not only avoids the manual labeling, it also gives the possibility to obtain examples for any configuration from any viewpoint. These models are very similar to the physical models discussed in section 2.1.
With a small number of models it is possible to simply fit each model on the image and then select the best fit ([TvdM02]). This however becomes problematic when a lot of models are used.

Several ad-hoc rules are used in [HW96] to switch between the models based on the location of model points and shape parameters. In [FAK03] a complex network structure is used to switch between models. For each frame the model from the previous frame is fitted first. Then the two most likely neighbours are also fitted and the one with the lowest error is chosen. Which neighbour is most likely depends on the specific error vector; this relation is learned during training.
To fit a physical 3D model to the image, it has to be projected onto the 2D image plane. This is much more complex than for appearance models. An important aspect is the handling of self-occlusion.
In [RK95] deformable templates are used to model each link. These are then combined, taking into account the occlusion relations. In [SMC01] the links are modelled using 3D quadrics, whose contours are easily projected onto a 2D plane. A simple cardboard representation is used in [WLH05] because it is much simpler to project than the quadrics; each link is modelled by a simple plane.
In [RK95] the SSD error between templates and image is used; a gradient descent algorithm is then used to minimize the error.

All other systems use edges in the neighbourhood of the projected model. These features are used to estimate the parameters, taking into account the previous state and the constraints given by the model.
In [SMC01] an Unscented Kalman filter is used. [LC04] and [WLH05] use different variations of particle filtering. In [WLH05] a sequential Monte Carlo algorithm based on importance sampling is used; furthermore, an alternative approach using the Condensation algorithm, a computer-vision specific particle filtering algorithm, is also discussed. Both particle filtering and the Unscented Kalman filter are nonlinear estimation techniques that use sampling to approximate complex distributions in the system. The Condensation algorithm is discussed in detail in appendix C and shows the general principle of sampling-based estimation. The algorithm in [WLH05] is a variation where the main difference lies in the generation of the samples.
2.8 Tracking
In data-driven systems, temporal coherence between the handshapes in two successive frames is only exploited in [HSS00]. A transition network is used to limit the search space: similarity is only evaluated for the templates that are a likely successor to the previous state. The network is learned from a series of example sequences. A reduction of 90% in processing time is claimed.
All systems with an appearance model use the current model as the initialization for the next iteration and/or frame. Because some overlap between the model and the hand in the image is required to fit the model, only local changes can be tracked in this way. A separate global hand search [HS95] can be used to handle the case where tracking is lost. [FAK03] suggests the use of a separate tracking mechanism when changes between frames are too large. When certain variables cannot be estimated directly from the appearance model, [FAK03] uses the value from the previous frame.

Systems using a physical model require a more robust tracking mechanism.
This tracking is tightly integrated with the estimation of model parameters.
These estimation methods have been discussed in section 2.7.
Initialization
With tracking, information from the past is used for the current estimation. This poses a problem for the first frame in a sequence, because there is no information from the past yet; an initialization step becomes necessary.

The simplest solution is to put the initialization in the hands of the user by requiring a predefined initial state [RK95]. The hand model in [LC04] is initialized automatically; this even includes the user-specific parameters.
It is also possible to use a global detector or a data-driven recognition system to initialize the model. This is proposed in [HS95, AS03, KT04]. Such an approach can also be used to recover when tracking is lost due to, for example, complete occlusion.
2.9 Output
Depending on the objective of the application different kinds of output can be
desired.
Most methods that use a physical model during training or fitting output the
3D model parameters. These parameters can then be used for a reconstruction
or further recognition.
[FAK03] maps the 2D model parameters to a simplified 3D model using
linear regression. The model has only six parameters for the hand configuration
and three for the hand pose. The extension of each finger is represented by
a single parameter between 0 and 1. An additional parameter represents the
spread of all fingers.
[HSS00] and [CP04] directly output the features for gesture recognition using
for example a Hidden Markov Model.
Other systems simply output a single class label. In some cases these classes are based on visual appearance; others are based on the real hand configuration. The systems described in [AS03] and [GCC+06] provide additional estimates for the global hand pose besides the class label.

[OB04] and [IMT+00] output labels that correspond to clusters. These clusters are learned in a training phase based on a visual similarity measure. It is however unclear whether these clusters are accurate and useful for recognition.
Chapter 3
System Comparison
A perfect method for handshape recognition does not yet exist; each solution has specific strengths and weaknesses, which may be more or less important depending on the application.

In the following sections several criteria related to the problems discussed in section 1.3 are discussed. An overview containing all discussed systems is shown in table 3.1.
Table 3.1: Overview of all systems with respect to the reported robustness (section 3.2), real-time execution (section 3.7), handling of low resolution (section 3.5), view independence (section 3.3) and performance (section 3.1). Columns: CB = robust to complex backgrounds; OV = robust to overlap with skin-colored objects; OC = robust to occlusion; VL = robust to varying lighting; RT = real-time execution; LR = works with a low-resolution hand region; VI = view-independent recognition; n/r = not reported.

System | CB | OV | OC | VL | RT | LR | VI | Performance

Data-driven recognition
[HSS00] | no[1] | no | n/r | no[1] | n/r | n/r | no | 95.9% correct (260 model images, 74 input images)
[CP04] | no[1] | no | n/r | no | n/r | yes (320x240 upper-body video) | no | 98.6% on 500 instances of 20 gestures with HMM
[CSW95] | no[3] | no | no | could be, with correct training | outdated | yes (32x32 hand region) | no | 96% with 4% false rejection
[IMT+00] | yes[2] | yes, separate handling | n/r | n/r | yes | yes (64x64 hand region) | no | 93% correct, 72% correct in front of face
[RASS01] | yes[2] | no | no | yes | yes | n/r | yes | see [7]
[AS03] | yes | expected[8] | n/r | n/r | no | no, significant performance drop reported | yes | see [7]
[HT05] | no[3] | no | n/r | no | yes | n/r | yes | 7 degrees mean error on joint angles
[GCC+06] | n/r | yes[4] | n/r | n/r | yes | n/r | yes | see [7]

Data-driven detectors
[KT04] | yes | yes | n/r | yes | yes | yes | no | see [7]
[OB04] | expected[6] | yes | n/r | n/r | n/r | yes | no | 99.8% correct detection, 97.4% correct shape recognition
[STTC04] | yes | yes | n/r | expected | no | n/r | yes | see [7]

Appearance models
[HS95] | yes, with special lighting | yes | n/r | no | yes | n/r | no | n/r
[HW96] | yes | yes | n/r | n/r | yes | n/r | no | n/r
[FAK03] | yes[2] | no | n/r | yes | expected[5] | claimed, but unclear | yes | 10% mean error in parameters
[TvdM02] | yes | expected[8] | n/r | yes | no | n/r | no | 86.2% correct with complex background
[HH96] | yes | yes | no | n/r | yes | n/r | no | n/r
[BJ96] | no[3] | no | n/r | yes | outdated | expected[6] | only affine transformations | n/r

Physical models
[RK95] | n/r | n/r | n/r | n/r | yes | n/r | yes | n/r
[WLH05] | yes[2] | no | n/r | yes | yes | n/r | yes | n/r
[SMC01] | n/r | n/r | n/r | n/r | expected[5] | n/r | yes | n/r
[LC04] | yes[2] | no | no (planned) | yes | expected[5] | n/r | yes | n/r

[1] A threshold on the brightness of the hand region is used to extract it; this heavily restricts the allowed background.
[2] The extraction of the hand region depends on skin color; it is therefore robust to most complex backgrounds.
[3] A uniform background is used to simplify extraction of the hand region; complex backgrounds are not considered.
[4] An active contour is used to extract the hand region.
[5] Results are reported on outdated machines; based on the advance in computational power, real-time execution is expected.
[6] Not reported, but expected based on the results of similar systems.
[7] Results are reported but cannot be expressed as a single number; see section 3.1 for details.
[8] Not reported, but expected because the features do not depend on skin color.
For some systems, performance is reported as the number of relevant items among the k best-ranked items. This is the case in [AS03] and [GCC+06]. These results are hard to interpret and are therefore only suitable for comparing options within the systems.
The model-driven systems are clearly less developed than the data-driven systems: except for [FAK03] and [TvdM02], no objective errors are reported.

All methods that use multiple cameras (as discussed in section 2.2) report a significant increase in performance. It might however not always be possible to arrange multiple cameras to acquire images from multiple, significantly different views. This is probably most suitable for a relatively controlled environment.

Performance in a real-life situation depends a lot on the robustness of the system, which is discussed in the next section.
3.2 Robustness
Perhaps the most important aspect of a recognition system is its robustness. A summary is shown in table 3.1, listing robustness to cluttered backgrounds, overlap with skin-colored objects, occlusion and varying lighting.
As discussed in section 2.3, some methods assume a simple background to be able to extract the hand. These methods fail when there are objects in the background. Systems that use a skin-color model are more robust to complex backgrounds. This however creates a problem if the hand is in front of a skin-colored object. This is handled in [IMT+00] by detecting this situation and then scanning the relevant area with a correlation-based detector; this does however result in a drop in performance.
Because color is influenced by lighting, the robustness to lighting conditions
becomes very dependent on the color model. In [RASS01] changes in the color
distribution are constantly tracked to handle changes in lighting.
[AS03] and [STTC04] both try to combine multiple features using color and
edge information to obtain a more robust system.
[KT04] and [OB04] also do not rely on color information. [KT04] is shown to be robust to complex backgrounds and different lighting conditions; the detection is however limited to only six view-dependent shapes. It is unclear whether systems like [IMT+00] can really handle different lighting conditions, because the system relies directly on the intensity values as features.
Several methods like [HSS00, CP04] rely on perfectly segmented hands. This is not expected to be possible in practical systems.

Model-driven systems using local edge information should also be capable of handling cluttered backgrounds. [HW96, HH96] report positively about this; it is however not quantified. [TvdM02] does show quantitative results comparing cluttered backgrounds (86.2% recognition) with the results on a uniform background (93.3%).
How the discussed systems handle occlusion by other objects is mostly unclear. It is however expected that data-driven systems that use shape features cannot handle this, because occlusion changes the shape. The use of multiple cameras does not solve this, but it does limit the occluded area.

Model-driven systems with tracking should in principle be able to handle partial occlusion. It is however unclear whether this is actually the case for hand models. If tracking is lost due to the occlusion, it must be reinitialized.
A data-driven detector does not have this problem because it can find the hand
as soon as it is visible again.
The Viola-Jones inspired systems [KT04, OB04] can clearly handle low-resolution images: the base resolution is in the order of 20x30 pixels. This is also true for [CSW95], where the extracted hand area is only 32x32 pixels. In [IMT+00] the extracted hand area is normalized to 64x64 pixels, which should correspond to a video resolution for the whole upper body of about 640x480 pixels.

In [FAK03] the system is claimed to properly handle low resolution; it is however unclear how this is defined and tested.

All other systems either report full-size hand images of at least 100x100 pixels or do not mention the required resolution at all. Especially for features describing the contour, it is not clear how they are influenced by low resolution.

The methods that are reported to work with fairly low resolution use low-level intensity features and take the whole hand surface into account.
Efficient calculation of sums is also used in [STTC04]. In [AS03] the chamfer distance is embedded in Euclidean space to simplify the comparison between items while retaining most of the performance.
Fast reduction of the search space can also greatly reduce the computational cost. The cascade structure used in [KT04] first weeds out unlikely candidates with a very simple classifier, and each following level further reduces the search space. [OB04] and [STTC04] employ a tree structure to achieve similar characteristics. In [HSS00] a transition network is used to limit the possible candidates; a reduction of 90% is reported. [AS03] applies a two-stage retrieval, limiting the search space in the first step using cheap features.
In model-driven systems the search is more naturally reduced to local changes. Because measurements are only executed at specific locations depending on the state of the model, costs are relatively low. Multiple iterations per frame are however often required to get accurate results.

The use of tracking and the modelling of pose and motion constraints also reduces the search space [SMC01, WLH05].

In [FAK03] the relation between the error vector and the parameter updates is learned in advance, making it possible to use many more model points because the error minimization is less costly.
Chapter 4
Conclusion
Appendix A
Viola-Jones Object Detection
A popular method for object detection was proposed by Viola and Jones in [VJ01]. It has been successful for face detection and is applied to hand detection in [KT04]. The particular combination of features, learning algorithm and classifier structure is often referred to as Viola-Jones object detection.
The basis is formed by the Haar-like features. These are computed as the difference between the sums of pixels of rectangular areas in the detector window. In [VJ01] these areas are restricted to be adjacent and of similar shape. As can be seen in figure 2.9, features can be composed of two, three or four rectangles.
To calculate the sum over an area efficiently, the integral image is proposed. The integral image ii(x, y) is defined as

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')

where i(x, y) is the original image. It gives the sum of all pixels to the top-left of pixel (x, y). The integral image can then be used to calculate the sum of pixels over any rectangular area (x_1, y_1, x_2, y_2) in the image as ii(x_1, y_1) + ii(x_2, y_2) − ii(x_1, y_2) − ii(x_2, y_1). Due to this optimization the calculation of any feature at any scale and position can be done in constant time.
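In code, the integral image is two cumulative sums, after which any rectangle sum is four table lookups; a minimal sketch using the same corner convention as the formula above (the rectangle excludes row y1 and column x1):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels with y' <= y and x' <= x."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of the pixels in rectangle (x1, y1, x2, y2) in constant time."""
    return ii[y1, x1] + ii[y2, x2] - ii[y1, x2] - ii[y2, x1]
```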
Given that the image is scanned with a window with a base resolution of 24x24 pixels, this results in 45,396 possible features. It is impossible to consider all of these for each sub-window at each scale. Therefore a selection of features is made during training using the AdaBoost algorithm. This algorithm is designed to compose a strong classifier from a combination of weak classifiers, where each weak classifier is based on one feature and a threshold. The AdaBoost algorithm is presented as algorithm 1.
The last important aspect of a Viola-Jones object detector is the cascade
structure of multiple classifiers. Instead of evaluating each sub-window using
a classifier that considers hundreds of features, first a classifier is used to find
those sub-windows that are likely to contain the object based on only a limited
set of features. This reduces the number of sub-windows to consider with the
stronger classifier. Because there is a trade-off between detection rate and false
Algorithm 1 AdaBoost
• Given examples (x_1, y_1), ..., (x_n, y_n), with y_i = 0 for negative and y_i = 1 for positive examples.
• Initialize the weights:
  w_{1,i} = 1/(2m) if y_i = 0, with m the number of negatives;
  w_{1,i} = 1/(2l) if y_i = 1, with l the number of positives.
• For t = 1..T:
  – Normalize the weights so that they sum to one.
  – Select the weak classifier h_t with the lowest weighted error ε_t = Σ_i w_{t,i} |h_t(x_i) − y_i|.
  – Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is classified correctly and e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).
• The final strong classifier is h(x) = 1 if Σ_t α_t h_t(x) ≥ ½ Σ_t α_t and 0 otherwise, where α_t = log(1/β_t), which gives weights inversely proportional to the training error.
positives, to avoid missing candidates a high number of false positives has to be accepted.

The same principle can then be applied to the remaining sub-windows, resulting in multiple levels, each further reducing the number of false positives. These later levels need more features to achieve better results; there are however also far fewer sub-windows left to consider. Each level is trained using those examples that have passed through all previous levels; this training procedure is given in algorithm 2.
Appendix B

Curvature Scale Space

In the Curvature Scale Space (CSS) representation [MM92], a closed planar curve is smoothed with Gaussian kernels of increasing width σ, and the shape is described by the points where the curvature of the evolved curve vanishes: κ(u, σ) = 0.
Figure B.1: Koch snowflake curve (original) and evolved versions with increasing σ (σ = 2, 5, 10, 20) (taken from [MM92]).
Figure B.2: CSS image for the curve shown in figure B.1 (taken from [MM92]).
Appendix C
Condensation Algorithm
Dynamical model
The dynamics of the process can be described in terms of a stochastic differential equation; in [IB98] a linear model driven by Gaussian noise is used.
Algorithm 3 Condensation
• The state density at time t−1 is described by a set of N weighted samples {(s_{t−1}^{(n)}, π_{t−1}^{(n)}), n = 1, ..., N}.
• for n = 1...N:
  – Select a sample s'_t^{(n)} from this set with replacement, where each sample s_{t−1}^{(j)} has a probability π_{t−1}^{(j)} of being selected.
  – Predict a new sample value s_t^{(n)} by sampling from p(x_t | x_{t−1} = s'_t^{(n)}).
  – Measure and weight the new sample using the observation density: π_t^{(n)} = p(z_t | x_t = s_t^{(n)}).
• Finally, normalize the weights so that Σ_n π_t^{(n)} = 1.
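One iteration translates directly into a generic particle-filter step; in the sketch below, dynamics and likelihood are placeholder functions standing in for the dynamical model and the observation density described in this appendix:

```python
import numpy as np

rng = np.random.default_rng()

def condensation_step(samples, weights, dynamics, likelihood, z):
    """One Condensation iteration: select, predict, measure.

    samples: (N, d) array of state samples; weights: (N,) normalized
    sample weights; dynamics(s) draws new states from p(x_t | x_{t-1})
    row-wise; likelihood(z, s) evaluates p(z | x) row-wise.
    """
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights)  # select with replacement
    predicted = dynamics(samples[idx])      # sample from the dynamical model
    new_weights = likelihood(z, predicted)  # weight by observation density
    new_weights = new_weights / new_weights.sum()  # normalize
    return predicted, new_weights
```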
Figure C.1: A single iteration of the Condensation algorithm (taken from [IB98]). The size of each sample corresponds to its weight.
Observation model
The weight of the samples depends on the observation density p(z|x).

In [IB98] a model is proposed for tracking a contour using edges near the contour. At M points the edges along a line normal to the contour are detected. Each edge is seen as a measurement that comes either from the true target or from clutter.

When it is assumed that the clutter is a Poisson process with parameter λ and the true measurement comes from an unbiased normal distribution with standard deviation σ, the observation can be modelled for the one-dimensional case as:

p(z|x) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma\alpha} \sum_k \exp\left( -\frac{v_k^2}{2\sigma^2} \right)

where v_k is the distance of the k-th detected edge to the contour.
The observation density for the measurements along all M normals can be calculated as the product of the one-dimensional measurements. This product can be computed using the following discrete approximation:

p(z|x) \propto \exp\left\{ -\sum_{m=1}^{M} \frac{1}{2rM} \min\left( (z_1(m) - r(m))^2, \mu^2 \right) \right\}
Bibliography
[HW96] C. Huang and M. Wu. A model-based complex background gesture
recognition system. 1996.
[IB98] M. Isard and A. Blake. Condensation – conditional density propaga-
tion for visual tracking. International Journal of Computer Vision,
29(1):5–28, 1998.
[IMT+00] K. Imagawa, H. Matsuo, R. Taniguchi, D. Arita, S. Lu, and S. Igi. Recognition of local features for camera-based sign language recognition system. In Proc. International Conference on Pattern Recognition (ICPR), 2000.
[KT04] M. Kolsch and M. Turk. Robust hand detection. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FGR), 2004.
[LC04] S.U. Lee and I. Cohen. 3D hand reconstruction from a monocular view. In Proc. International Conference on Pattern Recognition (ICPR), volume 3, pages 310–313, 2004.
[MM92] F. Mokhtarian and A.K. Mackworth. A theory of multiscale,
curvature-based shape representation for planar curves. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 14(8):789–
805, 1992.
[OB04] E. Ong and R. Bowden. A boosted classifier tree for hand shape detection. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FGR), 2004.
[RASS01] R. Rosales, V. Athitsos, L. Sigal, and S. Sclaroff. 3D hand pose reconstruction using specialized mappings. In Proc. IEEE International Conference on Computer Vision (ICCV), 2001.
[RK95] J.M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 612–617, 1995.