Hand Models and Systems For Hand Detection, Shape Recognition and Pose Estimation in Video
Contents

1 Introduction
1.1 Applications
1.2 Approaches
1.3 Problems and Challenges
3 System Comparison
3.1 Recognition Performance
3.2 Robustness
3.3 View Independence
3.4 User Independence
3.5 Image Quality and Resolution
3.6 Training Effort
3.7 Computational Costs
4 Conclusion
C Condensation Algorithm
Chapter 1
Introduction
Humans can do a lot with their hands. Many different hand poses can be assumed to manipulate objects or convey information. The human hand can therefore serve as a very valuable input device for a computer, far beyond simply moving a mouse in a 2D plane. Think of manipulating objects in a virtual world or using gestures to communicate.
Several approaches exist to obtain this information. Besides using a glove with sensors or using a camera and markers on the hand, it is also possible to approach the problem from a purely computer vision based side. This has the desirable property of being non-obtrusive for the user. It is, however, complicated by the articulated nature of the hand and a possibly complex and changing environment.
In this report several solutions for hand detection, shape recognition and pose estimation from video are discussed and compared. The focus is on how the presence of a hand and the possible hand poses are modelled within the available systems, and on the possible choices within these systems.
In the remainder of this chapter the possible applications and approaches
will be discussed in more detail. Thereafter possible systems and solutions are
discussed in chapter 2. These systems are then compared with respect to several
important criteria in chapter 3. This is followed by the conclusion in chapter 4.
1.1 Applications
There are a lot of possible ways to use hands in the interaction between human
and computer. The shape, orientation and location of the hand can convey a
lot of information.
Virtual Reality is a very natural application for using the hand pose as input. First of all, the hand can be reconstructed in the virtual world for a more natural experience. Furthermore, the hand can then be used to execute actions in that world, like grabbing objects, pushing, pulling, etc. In such an application it is important to obtain as much information as possible about the configuration of the hand in all its 27 degrees of freedom.
A related application is to use the handshape to communicate commands
using gestures. Think of controlling a robot in a situation where other means
of communication are hard to use.
Finally, handshape recognition can be used as part of a sign language recognition system. The handshape is an important aspect of the gestures in a sign language and is therefore essential in such a system. Systems can for example be designed as a translator between sign language and written (or spoken) text. Another option is to create a learning environment where the computer verifies the correctness of a gesture.
1.2 Approaches
Several approaches exist to obtain the desired information about the handshape of the user. The currently most reliable and accurate way is to use a glove with sensors for all joint angles (often referred to as a Dataglove or Cyberglove). Some systems even provide touch sensors. Although these systems are very accurate, they are also expensive and, more importantly, very obtrusive: the user always has to wear the glove. Whether this is a problem depends on the situation. Wireless systems do exist to improve the mobility of the user when wearing such a glove.
Another approach is to use cameras to track colored markers on the hand, from which the pose and position of the hand are derived. This however still forces the user to wear the markers, while introducing the problem of (self-)occlusion.
To overcome the disadvantage of having to wear something like a glove or markers, purely computer vision based systems can be used. A vision based system is much less obtrusive, but it has to deal with ambiguous situations, because in some cases it is impossible to observe the full state of the hand. This is, however, also true for human vision. A vision based system can only be less obtrusive when it does not impose other restrictions on the user and/or the environment.
In the remainder of this report only methods based on computer vision will
be discussed. It is however important to note that other approaches exist and
can be used to obtain more accurate data about the hand pose.
1.3 Problems and Challenges

Figure 1.1: Structure of the human hand (taken from [WLH05]).

The human hand is a complex, highly articulated object: not only can it assume many different poses, the appearance of each pose is different for different viewing angles. As shown in figure 1.1, each finger has 3 joints giving 4 DoF per finger. The thumb is even more flexible and has 5 DoF. The configuration of the fingers alone therefore has 21 DoF. Combined with the global orientation and position this results in 27 DoF [WLH05], all of which have to be taken into account during recognition.
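This bookkeeping can be written out directly as a quick sanity check; a trivial sketch (the finger labels are merely illustrative):

```python
# Degrees of freedom of the kinematic hand model: 4 per finger,
# 5 for the more flexible thumb, plus 6 for the global position
# and orientation of the palm.
finger_dof = {"index": 4, "middle": 4, "ring": 4, "little": 4, "thumb": 5}
global_dof = 6  # 3 for position + 3 for orientation

total_dof = sum(finger_dof.values()) + global_dof
assert total_dof == 27  # matches the count reported in [WLH05]
```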
Not only is there a huge number of possible variations for a single hand; there are also differences between humans: the shape, size, color and possible configurations can differ between users. For example, the hand of a child is very different from the hand of an adult male.
Furthermore, some parts of the hand may be occluded by other parts, resulting in self-occlusion. Occlusion can also be caused by the other hand, an arm or even other objects in the scene. This regularly makes it impossible to observe the full state of the hand.
Some problems depend on the specific conditions of an application. In the
case of gesture recognition one can expect several problems. First of all, hands
are moving. The position of the hand can change from frame to frame. The
movement could also result in motion blur.
When video of the whole upper body is used to recognize the gesture, the hands only occupy a small part of the image, resulting in a low level of detail. This makes it harder to distinguish shapes and makes the features more easily corrupted by noise.
To be widely usable, a system cannot impose restrictions on the background. Other, possibly moving, objects in the background should be tolerated. Particularly in sign language gestures, the hands are regularly positioned in front of the face and sometimes even touch the face. This makes it harder to rely on assumptions based on e.g. a difference in color between the hand and the background.
It is possible that during a gesture the hands touch or even grab each other. This does not only result in a lot of occlusion, it can also force the hands into unnatural configurations.
Because of the large number of configurations, views and differences between users, the number of possibilities is huge. In the ideal case all these possibilities are covered by the system. When training examples are used to model these variations, an enormous amount of examples is required. Especially when all these examples have to be labeled with the desired outputs, this can become a problem in itself.
The systems discussed in chapter 2 demonstrate different solutions to some
of these problems. How successful these solutions are in relation to the posed
problems is discussed in chapter 3.
Chapter 2
Systems for handshape recognition from video can be composed in many ways. Not only the chosen solutions, but also how these are combined, influences the properties and performance of the system. Choices can be influenced by the specific application and the problems it tries to solve.
Two important architectures can be distinguished: the data-driven architecture and the model-driven architecture. In both, the ultimate goal is to map the image to the desired output, which can be shape classes or pose parameters.
In data-driven systems the main objective is to find a direct mapping from
the image to the desired output, based on the measurements. In model-driven
systems a model is used that captures the possible valid instances of the hand.
Because the model only provides a mapping from the parameters to an object
instance, a fitting procedure based on local image features is used.
Data-Driven Systems
The basic architecture of a data-driven system consists of several steps, shown in figure 2.1. First, preprocessing is done and the region of interest is selected; then features are extracted. These features are mapped to a reduced representation, which is used as input for the recognition. The recognition can be seen as a mapping from input to output, which can be learned from examples. Important aspects of these systems are listed in table 2.1 and are discussed in more detail in the following sections.

Figure 2.1: Architecture of a data-driven system (input image → features → reduced features → output).
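The pipeline can be summarized as a composition of four stages; the sketch below is a generic skeleton in which the concrete choices (segmentation method, feature type, reduction, classifier) are placeholder functions, since these vary between the systems discussed in this chapter:

```python
def recognize(image, select_region, extract_features, reduce_features, classify):
    """One pass through the data-driven pipeline of figure 2.1.

    The stages are passed in as functions because the concrete choices
    (skin-color segmentation, contour or intensity features, PCA
    reduction, nearest-neighbour or boosted classifier, ...) differ
    per system.
    """
    region = select_region(image)        # preprocessing + region of interest
    features = extract_features(region)  # raw feature measurements
    reduced = reduce_features(features)  # reduced representation
    return classify(reduced)             # learned mapping to classes/parameters
```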
Data-Driven Detectors
A variation of the data-driven architecture is to use knowledge about the appearance of the hand to locate the hand in the image. These methods do not distinguish between handshapes; they detect the presence or absence of a hand. With such a detector, the whole image has to be searched at multiple scales. This can be viewed as a large set of parallel detectors that each act on a different region in the image, as shown in figure 2.2.

Because a window is slid across the image at multiple scales, and detection has to be executed for each position and scale, a single detection step has to be fast. Therefore different constraints on features and recognition lead to different choices in these systems.
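A minimal sketch of this search pattern is given below; the base size, scale factor and stride are illustrative assumptions, not values taken from the cited systems:

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) for every sub-window a detector must classify.

    Every position at every scale is visited, which is why a single
    detection step has to be extremely fast.
    """
    size = base
    while size <= min(img_w, img_h):
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield x, y, size
        size = int(round(size * scale_step))
```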
Figure 2.3: Architecture of a model-driven system (model, projection, feature extraction, parameter estimation).
Model-Driven Systems
Model-driven systems use a model which captures the possible valid variations
of the object. A model can for example describe the possible positions of the
fingertip relative to the palm. Because the model only provides a mapping
from the parameters to an object instance, a fitting procedure is always used to
estimate the parameters.
The process as shown in figure 2.3 starts with an initial model. Features are
then extracted to estimate new model parameters based on the evidence found
in the image. These local features depend on the current state of the model.
Because initial model parameters are needed to extract the features, some kind
of estimation and initialization has to be used to be able to project the model
onto the input image.
Important variations between these systems can be found in table 2.2.
Table 2.2: Properties of model-driven systems.

System | Output | Features | Model | Tracking and Fitting

Appearance Models
[HS95] | position | local edges | Active Shape Model | previous frame + iterative fit
[HW96] | k classes | local gray-level profile | multiple Active Shape Models | previous frame + iterative fit
[FAK03] | 3+6 parameters | local texture (gray-level and structure) | multiple Active Appearance Models | previous frame + iterative fit
[TvdM02] | 10 classes | Gabor jets | multiple Elastic Graphs | iterative fit
[HH96] | position | local edges | 3D Active Shape Model | previous frame + iterative fit
[BJ96] | 4 classes + position | pixel intensity | multiscale eigenspace | iterative fit

Physical Models
[RK95] | 3+3+21 parameters | SSD error | 3D model | gradient descent
[WLH05] | 3+3+21 parameters | local edges | 3D model (cardboard) | sequential Monte Carlo (importance sampling based)
[SMC01] | 3+3+1 parameters | local edges | 3D model (39 truncated quadrics) | Unscented Kalman filter
[LC04] | 2+3+20 parameters | local edges | 3D model | particle filter
Figure 2.4: A 3D model of the human hand used in [SMC01].
2.1 Models
Each system captures the knowledge about possible handshapes and the cor-
responding appearances in some part of the system. Several types of models
exist.
In a data-driven system the hand is modelled in terms of features. Such a
model can be seen as a direct mapping from features to the desired output. This
mapping is discussed in more detail in section 2.5.
In model-driven systems two important types of models can be distinguished:
Appearance based models and Physical models. Appearance based models cap-
ture the variation of the object in the 2D image and therefore depend on the
view. Physical models represent the possible variations in the real object and
therefore need a mapping between the 2D image and the 3D model.
Physical Models
A physical model describes the possible variations in the real three dimensional
object. In most cases the hand is modelled by a set of linked rigid bodies.
The possible configurations are described in a kinematic model corresponding
to structure in figure 1.1.
In [RK95] the hand is modelled by a palm and three links per finger. [SMC01,
LC04, WLH05] use very similar models for the structure of the hand. The model
of [SMC01] is shown in figure 2.4.
It is possible to model for example the constraints between fingers. In
[WLH05] the configuration space is first captured using examples recorded with
a CyberGlove and reduced to 7 dimensions with PCA. It is then further reduced
to 28 basis configurations. All other configurations are then described as a linear
combination of two of these basis configurations.
A physical model can also model the dynamics of the hand. In [SMC01] these dynamics are modelled using a second order process. In [WLH05] a simpler random walk model is used.

The usage of the physical models is discussed in section 2.7.
Appearance Models
The basic principle of all appearance models is that one can create a statistical model of the appearance of an object that can then be used to extract that object from the image.

Such a model captures the correspondence between important landmarks in the set of examples and the possible variations between these points. For example, the tip of the thumb in one image corresponds to the tip of the thumb in another image, although the appearance and position might be slightly different.
Active Shape Models are the easiest to understand. To create a model, examples are manually labelled with the landmarks. The points in the training set are then aligned by scaling, rotating and translating the examples, and the mean shape is determined. PCA (Principal Component Analysis) is then used to determine the most important variations in the training set. The model is formed by the mean shape and the shape parameters obtained using PCA. This model, which describes the spatial distribution of the landmarks, is commonly referred to as the Point Distribution Model. Active Shape Models are used in [HS95, HW96].
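The construction can be sketched compactly; the code below assumes the landmark shapes have already been aligned and is a schematic illustration, not the exact procedure of [HS95, HW96]:

```python
import numpy as np

def build_pdm(shapes, n_modes):
    """Point Distribution Model from aligned landmark shapes.

    shapes: (n_examples, 2 * n_landmarks) array of aligned landmark
    coordinates. Returns the mean shape, the n_modes most important
    variation modes and their variances.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]  # keep the largest modes
    return mean, eigvecs[:, order], eigvals[order]

def shape_instance(mean, modes, b):
    # A model instance is the mean shape plus a linear combination of
    # the variation modes: x = mean + P b, with each entry of b typically
    # limited to +/- 3 standard deviations of its mode.
    return mean + modes @ b
```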
In [HS95] a 3D Point Distribution Model is used; it is acquired automatically from 3D MRI data. The model is created in a similar way, based on the distribution of the landmarks.
Active Appearance Models [CET98] are a generalization of Active Shape Models. They do not only model the shape variation but also the texture in the area between the landmarks. In [CET98] the extra model points to capture this textured surface are added using triangulation. The variations in texture are expressed in texture parameters using PCA. PCA is again applied to the combination of shape and texture to obtain the appearance parameters.

A variation is used in [FAK03]. The mesh of points covering the hand is shown in figure 2.5(a).
Elastic Graphs [TvdM02] also model a set of points with specific properties.
Instead of modelling the spatial distribution of all points in a linear relationship,
the model has a set of edges connecting the points as shown in figure 2.5(b).
Each edge poses a constraint on the distance between two points.
Figure 2.5: (a) Active Appearance Model from [FAK03]; (b) Elastic Graph from [TvdM02].
Some systems make use of multiple cameras: the 3D model is simply projected differently for each camera, and the features from all images are then used together for further parameter estimation.
2.3 Region Selection
It is important to measure only properties of the hand. For data-driven systems the relevant part of the image has to be selected before the actual features are extracted.
Several systems [HSS00, CP04, CSW95, HT05] simply restrict the hand to be darker or lighter than the background. Simple thresholding is then used to obtain the hand region.
A somewhat more robust solution is used in [IMT+00, RASS01, FAK03, WLH05, GCC+06]. These systems rely on skin-color models to either remove the background from the image or to create a binary silhouette. In [GCC+06] an additional active contour extraction algorithm is used to provide a more robust extraction of the hand region.
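As an illustration, a very crude skin-color silhouette in normalized rg color space could look as follows; the threshold box is a made-up placeholder, whereas the cited systems use trained (and sometimes adaptive) color distributions:

```python
import numpy as np

def skin_silhouette(rgb, r_lo=0.35, r_hi=0.55, g_lo=0.25, g_hi=0.37):
    """Binary hand silhouette from a fixed box in normalized rg space.

    rgb: (H, W, 3) image array. The thresholds are illustrative only;
    real systems learn a skin-color model (histograms, Gaussian
    mixtures, ...) and often track it over time.
    """
    c = rgb.astype(float) + 1e-6         # avoid division by zero
    s = c.sum(axis=2)
    r, g = c[..., 0] / s, c[..., 1] / s  # normalized chromaticities
    return (r > r_lo) & (r < r_hi) & (g > g_lo) & (g < g_hi)
```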
In [KT04, STTC04, OB04] a detector is used to scan the whole image to
find the hand region. This region can then be used for further recognition. In
[AS03] a clean segmented hand region is not required. A cluttered background
is allowed within the bounding box that marks the hand region.
Figure 2.6: A simple contour description based on the distances (d_i) between contour points (P_i) and the center of gravity (G) (taken from [HSS00]).
Figure 2.8: The log-polar histogram used for Shape Context features (taken
from [OB04]).
A possible measure for the similarity between edge images is the chamfer distance, which can be expressed as:

c(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert

where X and Y are the two sets of edge points.
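A direct brute-force implementation of this definition is straightforward; practical systems typically precompute a distance transform of the edge image instead, but the sketch below follows the formula literally:

```python
import numpy as np

def chamfer_distance(X, Y):
    """Directed chamfer distance between two edge-point sets.

    X: (n, 2) and Y: (m, 2) arrays of edge-pixel coordinates. For every
    point in X the nearest point in Y is found and the distances are
    averaged, exactly as in the formula above.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    pairwise = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()
```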
Low-level Features
[KT04] and [OB04] both use Haar-like features as proposed by [VJ01] for object detection. These are simple features calculated from the difference between sums over two or more rectangular areas in the image. These features can be computed efficiently because the sums can be computed in constant time after preprocessing the image. This is done using a so-called integral image ii(x, y), which is the sum of all pixels to the top left of pixel (x, y). The sum over any rectangle (x_1, y_1, x_2, y_2) can then be calculated as ii(x_1, y_1) + ii(x_2, y_2) − ii(x_1, y_2) − ii(x_2, y_1). Thousands of Haar-like features are possible by varying the size and position of the rectangles.
Figure 2.9: A few possible block differences for Haar-like features (taken from
[VJ01]).
An even simpler way of using image intensity is to use the pixel intensity values directly as features [IMT+00, CSW95, BJ96, STTC04]. These are very sensitive to all kinds of variations. In [FAK03] the normalized gray-level and the structural values (angle and magnitude) at specific model points are used to describe texture.
Local autocorrelation patterns are used in [HT05]. The second-order local autocorrelation is given by

R(a_1, a_2) = \sum_{r} f(r) f(r + a_1) f(r + a_2)

where f(r) is the intensity of the image at position r. When considering a 3x3 neighbourhood in the image, this results in 25 distinct patterns for different values of a_1 and a_2. In [HT05] these patterns are calculated at three different scales for each of 64 rectangular sections in the image.
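For one displacement pair the sum is easy to evaluate; a small sketch (wrap-around shifts are used for brevity, where a real implementation would restrict r so that r + a_1 and r + a_2 stay inside the image):

```python
import numpy as np

def second_order_autocorr(f, a1, a2):
    """R(a1, a2) = sum over r of f(r) * f(r + a1) * f(r + a2).

    f: 2D intensity array; a1, a2: (dy, dx) displacements, e.g. taken
    from a 3x3 neighbourhood as in [HT05].
    """
    s1 = np.roll(f, (-a1[0], -a1[1]), axis=(0, 1))
    s2 = np.roll(f, (-a2[0], -a2[1]), axis=(0, 1))
    return float((f * s1 * s2).sum())
```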
Local Features
A feature used in a lot of model-driven systems is the location of edges near the projected model contours, as shown in figure 2.10. To find the strongest edge in the neighbourhood, points are sampled along the line perpendicular to the model contour. In each point, the derivative in the direction of the line is calculated. The point with the strongest edge is then chosen [HS95, HH96]. This location can then be used to propose a new location for the model points.
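A sketch of this search for a single model point is given below; the search range is an arbitrary assumption:

```python
import numpy as np

def strongest_edge_along_normal(gray, point, normal, search_range=10):
    """Return the strongest-edge position along a model contour normal.

    gray: 2D intensity image; point: (x, y) model point; normal: unit
    normal of the model contour at that point. Intensities are sampled
    at offsets along the normal and the offset with the largest
    absolute derivative is selected.
    """
    offsets = np.arange(-search_range, search_range + 1)
    xs = np.clip(np.round(point[0] + offsets * normal[0]).astype(int),
                 0, gray.shape[1] - 1)
    ys = np.clip(np.round(point[1] + offsets * normal[1]).astype(int),
                 0, gray.shape[0] - 1)
    profile = gray[ys, xs].astype(float)
    k = np.argmax(np.abs(np.gradient(profile)))  # strongest edge response
    return np.asarray(point, float) + offsets[k] * np.asarray(normal, float)
```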
A slightly more complex approach is to learn a gray-level profile for the neighbourhood of each point during training. Instead of simply finding the strongest edge, the best match for this profile along the normal is used. This is applied in [HW96].
In [TvdM02] the local image description is based on a wavelet transform with a complex Gabor-based kernel, which is invariant to constant offsets in grey-level value. The combination of several convolutions with different filters forms a descriptor called a jet. Multiple jets are combined in a so-called bunch-jet, which makes it possible to model variations of corresponding points in different images.
classifier. The systems that output parameters simply output the parameters
corresponding to the nearest neighbour instead of a single class label.
[AS03] and [GCC+ 06] do not retrieve a single closest item. They retrieve a
set of k best candidates and try to maximize the number of relevant items in
this set.
In [RASS01] neural networks are used to map features to output. This specialized mapping method uses multiple neural networks, each specialized in mapping a certain part of the input space. All mappings are applied to the input; thereafter a heuristic is used to choose the best hypothesis from the ones generated by each mapping.
In [KT04] a cascade of boosted classifiers is used. Such a cascade forms a
detector as shown in figure 2.11(a). Each level is formed by a strong classifier
trained using the AdaBoost algorithm. These strong classifiers are a linear
combination of weak classifiers, each determined by a single Haar-like feature.
The training process consists of iteratively selecting a weak classifier to add
to the pool of classifiers. Each time the error is weighted, giving more weight
to previously wrongly classified examples. This means that the newly selected
weak classifier is chosen based on its capability of discriminating the hardest
cases. This process is described in more detail in appendix A.
Each level in the cascade is trained in such a way that it has a very high detection rate, taking a high number of false positives for granted. Each level further reduces the number of false positives using a stronger classifier while detecting (almost) all positive examples. Because of this structure, the top-level classifier can use only a few features to quickly weed out a lot of clearly negative examples.
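Evaluating such a cascade on a single sub-window then amounts to a short loop; a schematic sketch in which each stage is represented as a boolean strong classifier:

```python
def cascade_detect(window, stages):
    """Run one sub-window through a cascade of increasingly strong classifiers.

    Early stages use only a few features and are tuned for a near-100%
    detection rate at a high false-positive rate, so most negative
    windows are rejected cheaply; only promising windows reach the
    later, more expensive stages.
    """
    for stage in stages:
        if not stage(window):
            return False  # rejected: later stages are never evaluated
    return True           # passed every stage: report a detection
```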
A similar idea is used in [OB04]: a tree structure of detectors creates a multi-class classifier which also provides location information. A general hand detector formed by a cascade structure similar to the one in [KT04] is the root of the tree (figure 2.11(b)). If a possible hand region is detected by this detector, it is fed to a set of shape-specific detectors. These detectors then determine the specific hand shape. The detectors are trained using FloatBoost, which is similar to AdaBoost but can also remove those weak classifiers that no longer contribute positively to the result. A smaller set of weak classifiers with similar performance remains.
In [STTC04] a tree structure is also used (figure 2.11(c)). Easy-to-compute features are used near the top, while more expensive and precise features are used near the leaves. Each level in the tree acts on a more limited range of configurations. Detection and recognition are combined in the same tree, as opposed to [OB04] where detection and recognition are clearly separate steps.
Figure 2.11: (a) A cascade of classifiers [KT04]; (b) the detector tree of [OB04]; (c) the detector tree of [STTC04]. At each level a sub-window is either rejected or passed on; a hand is detected only when all levels are passed.
Training images can also be clustered based on some similarity measure [IMT+00, OB04]; these clusters can then be used as output labels. To obtain the real parameters corresponding to a training image, a Dataglove can be used. Wearing such a glove influences the image; to minimize this, the glove can be covered with a second glove. This approach is used in [HT05] to obtain a labeled test set.
An alternative way to obtain input-output examples is to generate the input examples using a physical model. In [RASS01, AS01, AS03, HT05, GCC+06] a 3D computer graphics model of the hand is used to create examples. This not only avoids the manual labeling, it also gives the possibility to obtain examples for any configuration from any viewpoint. These models are very similar to the physical models discussed in section 2.1.
With a small number of models it is possible to simply fit each model on the image and then select the best fit ([TvdM02]). This however becomes problematic when a lot of models are used.

Several ad-hoc rules are used in [HW96] to switch between the models based on the location of model points and shape parameters. In [FAK03] a complex network structure is used to switch between models. For each frame the model from the previous frame is fitted first. Then the two most likely neighbours are also fitted and the one with the lowest error is chosen. Which neighbour is most likely depends on the specific error vector; this relation is learned during training.
To fit a physical 3D model to the image, it has to be projected onto the 2D image plane. This is much more complex than for appearance models. An important aspect is the handling of self-occlusion.
In [RK95] deformable templates are used to model each link. These are then combined, taking into account the occlusion relations. In [SMC01] the links are modelled using 3D quadrics, whose contours are easily projected onto a 2D plane. A simple cardboard representation is used in [WLH05] because it is much simpler to project than the quadrics; each link is modelled by a simple plane.
In [RK95] the SSD error between templates and image is used; a gradient descent algorithm is then used to minimize the error.

All other systems use edges in the neighbourhood of the projected model. These features are used to estimate the parameters, taking into account the previous state and the constraints given by the model.
In [SMC01] an Unscented Kalman filter is used. [LC04] and [WLH05] use different variations of particle filtering. In [WLH05] a sequential Monte Carlo algorithm based on importance sampling is used; furthermore, an alternative approach using the Condensation algorithm, a computer-vision specific particle filtering algorithm, is also discussed. Both particle filtering and the Unscented Kalman filter are nonlinear estimation techniques that use sampling to approximate complex distributions in the system. The Condensation algorithm is discussed in detail in appendix C and shows the general principle of sampling-based estimation. The algorithm in [WLH05] is a variation where the main difference lies in the generation of the samples.
2.8 Tracking
In data-driven systems, temporal coherence between the handshapes in two successive frames is only exploited in [HSS00]. A transition network is used to limit the search space: similarity is only evaluated for the templates that are a likely successor to the previous state. The network is learned from a series of example sequences. A reduction of 90% in processing time is claimed.
All systems with an appearance model use the current model as the initialization for the next iteration and/or frame. Because some overlap between the model and the hand in the image is required to fit the model, only local changes can be tracked in this way. A separate global hand search [HS95] can be used to handle the case where tracking is lost. [FAK03] suggests the use of a separate tracking mechanism when changes between frames are too large. When certain variables cannot be estimated directly from the appearance model, [FAK03] uses the value from the previous frame.

Systems using a physical model require a more robust tracking mechanism.
This tracking is tightly integrated with the estimation of model parameters.
These estimation methods have been discussed in section 2.7.
Initialization
With tracking, information from the past is used for the current estimation. This poses a problem for the first frame in a sequence, because there is no information from the past yet; an initialization step becomes necessary.

The simplest solution is to put the initialization in the hands of the user by requiring a predefined initial state [RK95]. The hand model in [LC04] is initialized automatically; this even includes the user-specific parameters.
It is also possible to use a global detector or a data-driven recognition system to initialize the model. This is proposed in [HS95, AS03, KT04]. Such an approach can also be used to recover when tracking is lost due to, for example, complete occlusion.
2.9 Output
Depending on the objective of the application different kinds of output can be
desired.
Most methods that use a physical model during training or fitting output the
3D model parameters. These parameters can then be used for a reconstruction
or further recognition.
[FAK03] maps the 2D model parameters to a simplified 3D model using
linear regression. The model has only six parameters for the hand configuration
and three for the hand pose. The extension of each finger is represented by
a single parameter between 0 and 1. An additional parameter represents the
spread of all fingers.
[HSS00] and [CP04] directly output the features for gesture recognition using
for example a Hidden Markov Model.
Other systems simply output a single class label. In some cases these classes are based on visual appearance; others are based on the real hand configuration. The systems described in [AS03] and [GCC+06] provide additional estimates for the global hand pose besides the class label.

[OB04] and [IMT+00] output labels that correspond to clusters. These clusters are learned in a training phase based on a visual similarity measure. It is however unclear whether these clusters are accurate and useful for recognition.
Chapter 3
System Comparison
A perfect method for handshape recognition does not yet exist; each solution has specific strengths and weaknesses, which may be more or less important depending on the application.

In the following sections several criteria related to the problems discussed in section 1.3 are discussed. An overview containing all discussed systems is shown in table 3.1.
Table 3.1: Overview of all systems with respect to the reported robustness (section 3.2), real-time execution (section 3.7), handling of low resolution (section 3.5), view independence (section 3.3) and performance (section 3.1). Columns: CB = robust to complex backgrounds; OV = robust to overlap with skin-colored objects; OC = robust to occlusion; VL = robust to varying lighting; RT = real-time execution; LR = works with a low-resolution hand region; VI = view-independent recognition; n/r = not reported.

System | CB | OV | OC | VL | RT | LR | VI | Performance

Data-driven recognition
[HSS00] | no[1] | no | n/r | no[1] | n/r | n/r | no | 95.9% correct (260 model images, 74 input images)
[CP04] | no[1] | no | n/r | no | n/r | yes (320x240 upper-body video) | no | 98.6% on 500 instances of 20 gestures with HMM
[CSW95] | no[3] | no | no | could be, with correct training | outdated | yes (32x32 hand region) | no | 96% with 4% false rejection
[IMT+00] | yes[2] | yes, separate handling | n/r | n/r | yes | yes (64x64 hand region) | no | 93% correct, 72% correct in front of face
[RASS01] | yes[2] | no | no | yes | yes | n/r | yes | see [7]
[AS03] | yes | expected[8] | n/r | n/r | no | no, significant performance drop reported | yes | see [7]
[HT05] | no[3] | no | n/r | no | yes | n/r | yes | 7 degrees mean error on joint angles
[GCC+06] | n/r | yes[4] | n/r | n/r | yes | n/r | yes | see [7]

Data-driven detectors
[KT04] | yes | yes | n/r | yes | yes | yes | no | see [7]
[OB04] | expected[6] | yes | n/r | n/r | n/r | yes | no | 99.8% correct detection, 97.4% correct shape recognition
[STTC04] | yes | yes | n/r | expected | no | n/r | yes | see [7]

Appearance models
[HS95] | yes, with special lighting | yes | n/r | no | yes | n/r | no | n/r
[HW96] | yes | yes | n/r | n/r | yes | n/r | no | n/r
[FAK03] | yes[2] | no | n/r | yes | expected[5] | claimed, but unclear | yes | 10% mean error in parameters
[TvdM02] | yes | expected[8] | n/r | yes | no | n/r | no | 86.2% correct with complex background
[HH96] | yes | yes | no | n/r | yes | n/r | no | n/r
[BJ96] | no[3] | no | n/r | yes | outdated | expected[6] | only affine transformations | n/r

Physical models
[RK95] | n/r | n/r | n/r | n/r | yes | n/r | yes | n/r
[WLH05] | yes[2] | no | n/r | yes | yes | n/r | yes | n/r
[SMC01] | n/r | n/r | n/r | n/r | expected[5] | n/r | yes | n/r
[LC04] | yes[2] | no | no (planned) | yes | expected[5] | n/r | yes | n/r

[1] A threshold on the brightness of the hand region is used to extract it; this heavily restricts the allowed background.
[2] The extraction of the hand region depends on skin color; it is therefore robust to most complex backgrounds.
[3] A uniform background is used to simplify extraction of the hand region; complex backgrounds are not considered.
[4] An active contour is used to extract the hand region.
[5] Results are reported on outdated machines; based on the advance in computational power, real-time execution is expected.
[6] Not reported, but expected based on the results of similar systems.
[7] Results are reported but cannot be expressed as a single number; see section 3.1 for details.
[8] Not reported, but expected because the features do not depend on skin color.
For some systems, performance is reported as the number of relevant items among the k best-ranked items. This is the case in [AS03] and [GCC+06]. These results are hard to interpret and are therefore only suitable for comparing options within the systems.
The model-driven systems are clearly less developed than the data-driven systems: except for [FAK03] and [TvdM02], no objective errors are reported.

All methods that use multiple cameras (as discussed in section 2.2) report a significant increase in performance. It might however not always be possible to arrange multiple cameras to acquire images from multiple, significantly different views. This is probably most suitable for a relatively controlled environment.

Performance in a real-life situation depends a lot on the robustness of the system, which is discussed in the next section.
3.2 Robustness
Perhaps the most important aspect of a recognition system is its robustness. A summary is shown in table 3.1, listing robustness to cluttered backgrounds, overlap with skin-colored objects, occlusion and varying lighting.
As discussed in section 2.3, some methods assume a simple background to be able to extract the hand. These methods fail when there are objects in the background. Systems that use a skin-color model are more robust to complex backgrounds. This however creates a problem if the hand is in front of a skin-colored object. This is handled in [IMT+00] by detecting this situation and then scanning the relevant area with a correlation-based detector; this does however result in a drop in performance.
Because color is influenced by lighting, the robustness to lighting conditions
becomes very dependent on the color model. In [RASS01] changes in the color
distribution are constantly tracked to handle changes in lighting.
[AS03] and [STTC04] both try to combine multiple features using color and
edge information to obtain a more robust system.
[KT04] and [OB04] also do not rely on color information. [KT04] is shown to be robust to complex backgrounds and different lighting conditions; the detection is however limited to only six view-dependent shapes. It is unclear whether systems like [IMT+00] can really handle different lighting conditions, because the system relies directly on the intensity values as features.
Several methods like [HSS00, CP04] rely on perfectly segmented hands. This is not expected to be possible in practical systems.

Model-driven systems using local edge information should also be capable of handling cluttered backgrounds. [HW96, HH96] report positively about this; it is however not quantified. [TvdM02] does show quantitative results comparing cluttered backgrounds (86.2% recognition) with the results on a uniform background (93.3%).
How the discussed systems handle occlusion by other objects is mostly unclear. It is however expected that data-driven systems that use shape features cannot handle this, because occlusion changes the shape. The use of multiple cameras does not solve this, but it does limit the occluded area.

Model-driven systems with tracking should in principle be able to handle partial occlusion. It is however unclear whether this is actually the case for hand models. If tracking is lost due to the occlusion, it must be reinitialized.
A data-driven detector does not have this problem because it can find the hand
as soon as it is visible again.
The Viola-Jones inspired systems [KT04, OB04] can clearly handle low-resolution images: the base resolution is in the order of 20x30 pixels. This is also true for [CSW95], where the extracted hand area is only 32x32 pixels. In [IMT+00] the extracted hand area is normalized to 64x64 pixels, which should correspond to a video resolution for the whole upper body of about 640x480 pixels.

In [FAK03] the system is claimed to properly handle low resolution; it is however unclear how this is defined and tested.

All other systems either report full-size hand images of at least 100x100 pixels or do not mention the required resolution at all. Especially for features describing the contour, it is not clear how they are influenced by low resolution.

The methods that are reported to work with fairly low resolution use low-level intensity features and take the whole hand surface into account.
Efficient calculation of sums is also used in [STTC04]. In [AS03] the chamfer distance is embedded in Euclidean space to simplify the comparison between items while retaining most of the performance.
Fast reduction of the search space can also greatly reduce the computational cost. The cascade structure used in [KT04] first weeds out unlikely candidates with a very simple classifier, and each following level further reduces the search space. [OB04] and [STTC04] employ a tree structure to achieve similar characteristics. In [HSS00] a transition network is used to limit the possible candidates; a reduction of 90% is reported. [AS03] applies a two-stage retrieval, limiting the search space in the first step using cheap features.
In model-driven systems the search is more naturally reduced to local changes. Because measurements are only executed at specific locations depending on the state of the model, costs are relatively low. Multiple iterations per frame are however often required to get accurate results.

The use of tracking and the modelling of pose and motion constraints also reduces the search space [SMC01, WLH05].

In [FAK03] the relation between the error vector and the parameter updates is learned in advance, making it possible to use many more model points because the error minimization is less costly.
Chapter 4
Conclusion
Appendix A
Viola-Jones Object Detection
A popular method for object detection was proposed by Viola and Jones in [VJ01]. It has been successful for face detection and is applied to hand detection in [KT04]. The particular combination of features, learning algorithm and classifier structure is often referred to as Viola-Jones object detection.
The basis is formed by the Haar-like features. These are computed as the difference between the sums of pixels of rectangular areas in the detector window. In [VJ01] these areas are restricted to be adjacent and of similar shape. As can be seen in figure 2.9, features can be composed of two, three or four rectangles.
To calculate the sum over an area efficiently, the integral image is proposed. The integral image ii(x, y) is defined as

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')

where i(x, y) is the original image. It gives the sum of all pixels to the top-left of pixel (x, y). The integral image can then be used to calculate the sum of pixels over any rectangular area (x_1, y_1, x_2, y_2) in the image as ii(x_1, y_1) + ii(x_2, y_2) − ii(x_1, y_2) − ii(x_2, y_1). Due to this optimization the calculation of any feature at any scale and position can be done in constant time.
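In code, the integral image is two cumulative sums, after which any rectangle sum is four table lookups; a minimal sketch using the same corner convention as the formula above (the rectangle excludes row y1 and column x1):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels with y' <= y and x' <= x."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of the pixels in rectangle (x1, y1, x2, y2) in constant time."""
    return ii[y1, x1] + ii[y2, x2] - ii[y1, x2] - ii[y2, x1]
```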
Given that the image is scanned with a window with a base resolution of 24x24 pixels, this results in 45,396 possible features. It is impossible to consider all of these for each sub-window at each scale. Therefore a selection of features is made during training using the AdaBoost algorithm. This algorithm is designed to compose a strong classifier from a combination of weak classifiers, where each weak classifier is based on one feature and a threshold. The AdaBoost algorithm is presented as algorithm 1.
The last important aspect of a Viola-Jones object detector is the cascade
structure of multiple classifiers. Instead of evaluating each sub-window using
a classifier that considers hundreds of features, first a classifier is used to find
those sub-windows that are likely to contain the object based on only a limited
set of features. This reduces the number of sub-windows to consider with the
stronger classifier. Because there is a trade-off between detection rate and false
Algorithm 1 AdaBoost
• Given examples (x_1, y_1), ..., (x_n, y_n), with y_i = 0 for negative and y_i = 1 for positive examples.
• Initialize the weights:
  w_{1,i} = 1/(2m) if y_i = 0, with m the number of negatives;
  w_{1,i} = 1/(2l) if y_i = 1, with l the number of positives.
• For t = 1..T:
  – Normalize the weights so that they sum to one.
  – Select the weak classifier h_t with the lowest weighted error ε_t = Σ_i w_{t,i} |h_t(x_i) − y_i|.
  – Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is classified correctly and e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).
• The final strong classifier is h(x) = 1 if Σ_t α_t h_t(x) ≥ ½ Σ_t α_t and 0 otherwise, where α_t = log(1/β_t), which gives weights inversely proportional to the training error.
positives, to avoid missing candidates a high number of false positives has to be accepted.

The same principle can then be applied to the remaining sub-windows, resulting in multiple levels, each further reducing the number of false positives. These later levels need more features to achieve better results; there are however also far fewer sub-windows left to consider. Each level is trained using those examples that have passed through all previous levels; this training procedure is given in algorithm 2.
Appendix B

Curvature Scale Space

In the Curvature Scale Space (CSS) representation [MM92], a closed planar curve is smoothed with Gaussian kernels of increasing width σ, and the shape is described by the points where the curvature of the evolved curve vanishes: κ(u, σ) = 0.
Figure B.1: Koch snowflake curve (original) and evolved versions with increasing σ (σ = 2, 5, 10, 20) (taken from [MM92]).
Figure B.2: CSS image for the curve shown in figure B.1 (taken from [MM92]).
Appendix C
Condensation Algorithm
Dynamical model
The dynamics of the process can be described in terms of a stochastic differential equation; in [IB98] a linear model driven by Gaussian noise is used.
Algorithm 3 Condensation
• The state density at time t−1 is described by a set of N weighted samples {(s_{t−1}^{(n)}, π_{t−1}^{(n)}), n = 1, ..., N}.
• for n = 1...N:
  – Select a sample s'_t^{(n)} from this set with replacement, where each sample s_{t−1}^{(j)} has a probability π_{t−1}^{(j)} of being selected.
  – Predict a new sample value s_t^{(n)} by sampling from p(x_t | x_{t−1} = s'_t^{(n)}).
  – Measure and weight the new sample using the observation density: π_t^{(n)} = p(z_t | x_t = s_t^{(n)}).
• Finally, normalize the weights so that Σ_n π_t^{(n)} = 1.
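One iteration translates directly into a generic particle-filter step; in the sketch below, dynamics and likelihood are placeholder functions standing in for the dynamical model and the observation density described in this appendix:

```python
import numpy as np

rng = np.random.default_rng()

def condensation_step(samples, weights, dynamics, likelihood, z):
    """One Condensation iteration: select, predict, measure.

    samples: (N, d) array of state samples; weights: (N,) normalized
    sample weights; dynamics(s) draws new states from p(x_t | x_{t-1})
    row-wise; likelihood(z, s) evaluates p(z | x) row-wise.
    """
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights)  # select with replacement
    predicted = dynamics(samples[idx])      # sample from the dynamical model
    new_weights = likelihood(z, predicted)  # weight by observation density
    new_weights = new_weights / new_weights.sum()  # normalize
    return predicted, new_weights
```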
Figure C.1: A single iteration of the Condensation algorithm (taken from [IB98]). The size of each sample corresponds to its weight.
Observation model
The weight of the samples depends on the observation density p(z|x).

In [IB98] a model is proposed for tracking a contour using edges near the contour. At M points the edges along a line normal to the contour are detected. Each edge is seen as a measurement that comes either from the true target or from clutter.

When it is assumed that the clutter is a Poisson process with parameter λ and the true measurement comes from an unbiased normal distribution with standard deviation σ, the observation can be modelled for the one-dimensional case as:

p(z|x) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma\alpha} \sum_k \exp\left( -\frac{v_k^2}{2\sigma^2} \right)

where v_k is the distance of the k-th detected edge to the contour.
The observation density for the measurements along all M normals can be calculated as the product of the one-dimensional measurements. This product can be computed using the following discrete approximation:

p(z|x) \propto \exp\left\{ -\sum_{m=1}^{M} \frac{1}{2rM} \min\left( (z_1(m) - r(m))^2, \mu^2 \right) \right\}
Bibliography
[HW96] C. Huang and M. Wu. A model-based complex background gesture
recognition system. 1996.
[IB98] M. Isard and A. Blake. Condensation – conditional density propaga-
tion for visual tracking. International Journal of Computer Vision,
29(1):5–28, 1998.
[IMT+00] K. Imagawa, H. Matsuo, R. Taniguchi, D. Arita, S. Lu, and S. Igi. Recognition of local features for camera-based sign language recognition system. In Proc. International Conference on Pattern Recognition (ICPR), 2000.
[KT04] M. Kolsch and M. Turk. Robust hand detection. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FGR), 2004.
[LC04] S.U. Lee and I. Cohen. 3D hand reconstruction from a monocular view. In Proc. International Conference on Pattern Recognition (ICPR), volume 3, pages 310–313, 2004.
[MM92] F. Mokhtarian and A.K. Mackworth. A theory of multiscale,
curvature-based shape representation for planar curves. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 14(8):789–
805, 1992.
[OB04] E. Ong and R. Bowden. A boosted classifier tree for hand shape detection. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FGR), 2004.
[RASS01] R. Rosales, V. Athitsos, L. Sigal, and S. Sclaroff. 3D hand pose reconstruction using specialized mappings. In Proc. IEEE International Conference on Computer Vision (ICCV), 2001.
[RK95] J.M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 612–617, 1995.