Human Identification Based On Gait (International Series On Biometrics)
HUMAN
IDENTIFICATION
BASED ON GAIT
International Series on Biometrics
Consulting Editors
Professor David D. Zhang
Department of Computer Science
Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong
email: csdzhang@comp.polyu.edu.hk

Professor Anil K. Jain
Dept. of Computer Science & Eng.
Michigan State University
3115 Engineering Bldg.
East Lansing, MI 48824-1226, U.S.A.
Email: jain@cse.msu.edu
Additional information about this series can be obtained from our website:
http://www.springeronline.com
HUMAN
IDENTIFICATION
BASED ON GAIT
by
Mark S. Nixon
University of Southampton, UK
Tieniu Tan
Chinese Academy of Sciences, Beijing, P. R. China
Rama Chellappa
University of Maryland, USA
Springer
Mark S. Nixon
School of Electronics & Computer Science
University of Southampton
UK

Tieniu Tan
Institute of Automation
Chinese Academy of Sciences
Beijing, P. R. China

Rama Chellappa
Dept. of Electrical & Computer Engineering
Center for Automation Research
University of Maryland
USA
ISBN-13: 978-0-387-24424-2
ISBN-10: 0-387-24424-7
e-ISBN-13: 978-0-387-29488-9
e-ISBN-10: 0-387-29488-0
springeronline.com
Contents
Preface vii
1 Introduction 1
1.1 Biometrics and Gait 1
1.2 Contexts 2
1.2.1 Immigration and Homeland Security 2
1.2.2 Surveillance 2
1.2.3 Human ID at a Distance (HiD) Program 3
1.3 Book Structure 3
3 Gait Databases 17
3.1 Early Databases 17
3.1.1 UCSD Gait Data 17
3.1.2 Early Soton Gait Data 18
3.2 Current Databases 20
3.2.1 Overall Design Considerations 20
3.2.2 NIST/USF Database 21
3.2.3 Soton Database 22
Overview 22
Laboratory Layout 24
Outdoor Data Design Issues 27
Acquisition Set-up Procedure 29
Filming Issues 29
Recording Procedure 30
Ancillary Data 31
3.2.4 CASIA Database 32
3.2.5 UMD Database 33
Learning Motion Model and Motion Constraints 117
Experiments and Discussions 125
6.4 Other Approaches 131
6.4.1 Structure by Body Parameters 132
6.4.2 Structural Model-based Recognition 132
References 157
Literature 157
Medicine and Biomechanics 157
Covariate factors 158
Psychology 159
Computer Vision-Based Analysis of Human Motion 160
Databases 161
Early work 162
Current approaches 163
Further Analysis 166
Other Related Work 169
General 169
9 Appendices 171
Appendix 9.1 Southampton Data Acquisition Forms 171
Appendix 9.1.1 Laboratory Set-up Forms 171
Appendix 9.1.2 Camera Set-up Forms 175
Appendix 9.1.3 Session Coordinator's Instructions 180
Appendix 9.1.4 Subject Information Form 182
Index 185
Preface
It is a great honor to be associated with subjects at their inception. It is
certainly early in the cycle for gait - as it is for biometrics. It is then a great honor to be
part of the first ever series on biometrics, as it is to be amongst the first
researchers in gait as a biometric. It has been great fun too - a challenge indeed
since gait concerns not just recognizing objects, but moving objects at that, so we
have had to develop new techniques before we saw the first results that people can
indeed be recognized by the way they walk.
In terms of setting the scene, and the context of this book with others in the
same series, it has been fascinating to see the rise in prominence of biometrics,
from what was originally an academic interest, to one that is on the lips of leading
politicians. This is because biometrics has the capability to solve current problems
of international concern. These essentially center on verification of identity at
speed and with assured performance and biometrics has a unique capability here
since we carry our own identity. As can be found elsewhere in the series, the
earliest biometrics were palm prints - these suited computational facilities
available in the 1970s. Then, there has been interest in the more popular
biometrics: the fingerprint given its long forensic use; the face given that it is non-
invasive and can be captured without a subject's knowledge or interaction; and
the iris. Iris recognition has proved quite an inspiration in biometrics, providing
some of the largest biometric deployments and with some excellent performance.
The fingerprint is now used in products such as mobile phones, computers and
access control. Face recognition has a more checkered history, but it is the
biometric favored by many in view of its practical advantages. These of course
make face recognition more difficult to deploy, as can be found in other volumes
in the International Series on Biometrics. Visitors to the US now routinely find
their fingerprints and faces recorded at portals of entry. Our context here is to set
the scene, not to contrast merit and advantage - that comes later. One of the main
reasons for the late entry of gait onto the biometrics stage was not just the idea, but
also the technology. Recognition by gait requires processing sequences of images
and this imposes a large computational burden; only recent advances in speed
and memory have made gait practicable as a biometric.
Rather than coordinate an edited book, we chose to author this text. We
provide a snapshot of all the biometric work in human identification by gait and
all major centers for research are indicated in the text. To complete the picture, we
have added studies from medicine, psychology and other areas wherein we will
find not only justification for the use of gait as a biometric, but also pointers to
techniques and to analysis. We have collocated the references at the end of the
book, itemized by the area covered and cross referenced to the text. There are of
course many other references we could have included since gait is innate to
human movement so we have aimed here to provide a set of references which
serve as a complete picture of current research in gait for identification, and as
pointers to the richer literature to be found in this topic.
As academics, we know well that this book would not have been possible
without the contributions of colleagues and students who have conducted research
1 Introduction
1.2 Contexts
1.2.1 Immigration and Homeland Security
Biometrics has risen to prominence quickly, even with its short history. The
current political agendas of many countries are permeated by questions that
biometrics might answer, including security and immigration. Now, the U.S.
Citizenship and Immigration Services requires applicants for immigration benefits to
be fingerprinted for the purpose of conducting FBI criminal background checks;
US-VISIT requires that most foreign visitors traveling to the U.S. on a visa have
their two index fingers scanned and a digital face photograph taken to verify their
identity at the port of entry. In the Enhanced Border Security and Visa Entry
Reform Act of 2002, the U.S. Congress mandated the use of biometrics with U.S.
visas. This law required that Embassies and Consulates abroad must issue to
international visitors "only machine-readable, tamper-resistant visas and other
travel and entry documents that use biometric identifiers," not later than October
26, 2004. From a topic that was largely on a University research agenda in 2002,
biometrics have moved fast.
The move was largely due to performance: biometrics offer a combination of
speed and security, ideal in any mass transit scenario. Also, since they are part of
a human subject, they are in principle difficult to counterfeit. Not only this, but
they are amenable to electronic storage and checking, and devices with such
capability continue to proliferate. It is for these reasons that face, iris and
fingerprint have found evaluation in security and immigration. Other biometrics
have not enjoyed this. This is because some do not lend themselves well to that
application scenario, others - like gait - were simply too new to be considered at
that time.
1.2.2 Surveillance
sequences from an arbitrary viewpoint will be shown later. The ongoing trend is
that deployment of surveillance systems will continue to increase, suggesting
wider deployment of gait recognition techniques.
1.2.3 Human ID at a Distance (HiD) Program
The main single contributor to progress in automatic recognition by gait has been
the Defense Advanced Research Projects Agency's (DARPA's) Human ID at a
Distance research program led by Dr. Jonathon Phillips from the National Institute of
Standards and Technology (NIST). This program embraced three main areas: face,
gait and new technologies, initially aimed to improve security at US embassies
following some terrorist acts in 1998. The Human ID at a Distance program
started in 2000 and finished in 2004 (ironically, privacy concerns in the US led to
its closure). Gait is a natural contender for recognition at a distance, given its
unique capabilities. In each of the program's three areas there were new techniques,
new data, and evaluation. The aim of the gait program was essentially to progress from
laboratory-based studies on small populations to large scale populations of real
world data. Of the current approaches to recognition by gait and data that can be
used to analyze performance, those from MIT, Georgia Institute of Technology
(GaTech), NIST and the Universities of Maryland (UMD), Southampton (Soton),
Carnegie Mellon (CMU) and South Florida (USF) were originally associated with
the Human ID at a Distance program. The program achieved many of its initial
objectives: gait achieved capability concurrent in research extent and depth with
research in face recognition.
described in Chapter 5. The alternative is to analyze shape and dynamics of the
moving human body, usually by deployment of a model, and these approaches are
described in Chapter 6. We then describe further application potential for the new
biometric approaches before concluding with an analysis of the potential for this
new, unique and intriguing biometric. You will find an extensive selection of
references on human identification by gait, on gait analysis and on general factors
relevant to this new technology. These have been grouped at the end of the book
for convenience.
2 Subjects Allied to Gait
2.1 Overview
There is considerable support for the notion that each person's gait is unique. As
we shall see, it has been observed in literature that people can be recognized by
the way they walk. The same notion has been observed in medicine and
biomechanics, though not in the context of biometrics but more as an assertion of
individuality. Perhaps driven by these notions, though without reference to them,
there has been work in psychology on the human ability to recognize each other by
using gait. Those suffering myopia often state that they can use gait as a way of
recognizing people. There is other evidence too, which suggests that each person's
gait is unique. People have also studied walking from medical and biomechanical
perspectives, and this gives insight into how its properties can change which is of
general interest in any biometric deployment. We shall start with literature, with
definitions of meaning.
2.2 Literature
Perhaps the oldest gait analysis is due to Aristotle [1] though the word "gait" was
only to arrive some time later. Its usual meaning is "manner of walking" [2] though
this is sometimes given as a "manner of moving on foot" [3] since this can subsume
running as well. It is variously given either as derived from gang which means gait
in German, or from the Middle English gate [3], meaning path or gait, as derived
from the Old Norse gata, meaning path. In this respect it is interesting that one
'English' word for a double is doppelganger which derives from "a double" and
"goer", the latter given in this case as from Middle High German [3].
Shakespeare made several references to the individuality of gait, e.g. in The
Tempest [Act 4 Scene 1], Ceres observes "High'st Queen of state, Great Juno
comes; I know her by her gait"; even more, in Twelfth Night [Act 2 Scene 3] Maria
observes of Malvolio "wherein, by the colour of his beard, the shape of his leg, the
manner of his gait, the expressure of his eye, forehead, and complexion, he shall
find himself most feelingly personated"; and in Henry IV Part II [Act 2, Scene 3] "To
seem like him: so that, in speech, in gait, in diet, in affections of delight, in military
rules, humours of blood, he was the mark and glass, copy and book". Shakespeare's
works actually preceded the first complete English dictionary, which was only to
appear in 1755, so it is worth checking that Shakespeare's definition accords
with our own understanding of the meaning of the word gait. In a curious - but
rather expected - circular reference, in Johnson's English dictionary gait was
defined [4] to be the manner of walking and Shakespeare was quoted as an
exemplar of its meaning. Interestingly, Johnson also suggested it derived from gat
in Dutch, but the current meaning of gat in Dutch concerns an aperture and not gait.
Similar anecdotes can be found in more contemporary literature such as "I
noticed this figure coming, and I realized it was John Eubanks from the way he
walked" in the Band of Brothers [5], which is important since it describes
parachutists in the Normandy landings, operating in twilight when few biometrics
can be observed except gait, and in a critical scenario too.
In terms of history, Aristotle was one of the earliest in this area (he was the son of a
physician). Other notable names include Leonardo da Vinci who studied force
vectors and Galileo was a pioneer in mechanics who translated those interests to
biomechanics. Borelli (1608-1679) was an early pioneer in the study of human
locomotion who was interested in the mechanical principles of locomotion,
representing the starting point for the study of biomechanics of locomotion. Later,
the Weber brothers (1836) investigated human gait, both walking and running with
simple instrumentation, and suggested that the lower limbs act like a pendulum.
However, these awaited scientific justification. More advanced mathematical
techniques and reliable instrumentation were necessary to probe into the study of
locomotion. Muybridge (1830-1894) was the first to employ photographic
techniques extensively to record locomotion. Since those early times there has been
much medical and biomechanical research since gait is fundamental to human
activities.
The aim of medical research has been to classify the components of gait for the
treatment of pathologically abnormal patients. Murray et al. [6] produced standard
movement patterns for pathologically normal people which were used to compare
the gait patterns for pathologically abnormal patients [7]. These studies again
suggested that gait appeared unique to each subject. The data collection system used
required markers to be attached to the subject. This is typical of most of the data
collection systems used in the medical field, and although practical in that domain,
they are not suitable for identification purposes.
Fig. 2.1 illustrates the terms involved in a gait cycle. A gait cycle is the time
interval between successive instances of initial foot-to-floor contact 'heel strike' for
the same foot. Each leg has two distinct periods: a stance phase, when the foot is in
contact with the floor, and a swing phase, when the foot is off the floor moving
forward to the next step. The cycle begins with the heel strike of one foot which
marks the start of the stance phase. The ankle flexes to bring the foot flat on the
floor and the body weight is transferred onto it. The other leg swings through in
front as the heel lifts off the ground. As the body weight moves onto the other foot,
the supporting knee flexes. The remainder of the foot, which is now behind, lifts off
the ground ending the stance phase.
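The definitions above reduce to simple arithmetic: given time-stamped heel-strike and toe-off events for one foot, the cycle duration and the stance/swing split follow directly. A minimal sketch, not code from any system described in this book; the function name and example timings are illustrative:

```python
def gait_cycle_stats(heel_strikes, toe_offs):
    """Cycle duration and stance/swing split for one foot.

    heel_strikes: sorted times (s) of successive heel strikes of the same
    foot; toe_offs[i] is the time that foot leaves the ground in cycle i.
    """
    cycles = []
    for i in range(len(heel_strikes) - 1):
        start, end = heel_strikes[i], heel_strikes[i + 1]
        duration = end - start                    # one full gait cycle
        stance = toe_offs[i] - start              # foot in contact with floor
        swing = end - toe_offs[i]                 # foot swinging forward
        cycles.append({"duration": duration,
                       "stance_fraction": stance / duration,
                       "swing_fraction": swing / duration})
    return cycles

# heel strikes at 0.0 s and 1.1 s, toe-off at 0.66 s:
stats = gait_cycle_stats([0.0, 1.1], [0.66])     # stance ≈ 60% of the cycle
```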
Figure 2.1 The Gait Cycle (right/left heel strike, swing, single- and double-limb support, stride and step lengths)
Figure 2.2 Leg Rotation Angles (a) hip and (b) knee (from [85])
The normal hip rotation pattern of the angle of the thigh illustrated in Fig. 2.2(a)
is characterized by one period of extension and one period of flexion in every gait
cycle. Fig. 2.3 gives the average rotation pattern as presented by [7]. The upper and
lower lines indicate the standard deviation from the mean. In the first half of the gait
cycle, the hip is in continuous extension as the trunk moves forward over the
supporting limb. In the second phase of the cycle, once the weight has been passed
onto the other limb, the hip begins to flex in preparation for the swing phase. This
flexing action accelerates the hip so as to direct the swinging limb forward for the
next step.
Figure 2.3 Mean Hip Rotation Pattern (rotation angle against percent of walking cycle)
The pattern for normal knee rotation is more complex than that for the hip
rotation. It shows two phases of flexion and two phases of extension. At the start of
the walk cycle the knee of the outstretched limb has already begun to go into
flexion. The maximum flexion occurs when the trunk moves forward over the
supporting leg. As the trunk moves ahead of the supporting limb, the knee begins its
first phase of extension. The knee begins to flex when the contralateral foot makes
contact with the ground at the midpoint of the walking cycle. The angular velocity
of the knee increases quite rapidly, characterizing the swing phase by large rapid
excursions into flexion and then into extension. Later, we will see approaches to
model this motion, and observe how it can be extracted from a sequence of images.
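One common device for modeling such periodic joint rotations, sketched here under our own assumptions rather than taken from any approach in this book, is a truncated Fourier series at the gait frequency: a single dominant harmonic resembles the hip's one flexion/extension pair, while a strong second harmonic produces the knee's two flexion and two extension phases. All coefficients below are illustrative:

```python
import math

def joint_angle(t, period, harmonics):
    """Periodic joint-rotation pattern as a truncated Fourier series.

    harmonics: list of (amplitude_deg, phase_rad) for harmonics 1, 2, ...
    of the gait frequency; t and period are in seconds.
    """
    omega = 2 * math.pi / period
    return sum(a * math.cos(k * omega * t + p)
               for k, (a, p) in enumerate(harmonics, start=1))

# illustrative coefficients only (degrees):
hip = [(20.0, 0.0)]                   # one flexion/extension pair per cycle
knee = [(25.0, 0.0), (15.0, 1.0)]     # two flexion and two extension phases
angle = joint_angle(0.25, 1.0, knee)  # knee angle a quarter-cycle in
```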
Gait has a property known as bilateral symmetry, which means that when one
walks or runs the left arm and right leg interchange direction of swing with the right
arm and left leg, and vice versa, with half a period phase shift. This is illustrated in
Fig. 2.4, which also shows which foot is in contact with the ground. The second half
of the gait cycle is a reflection (about the midpoint of the cycle) of the first half for
both the cases of walking and running. The notion of symmetry in gait is still of
interest [8], and we shall later see how concepts of symmetry can be used to
recognize people by their gait.
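The half-period phase shift can be checked numerically: shifting one side's signal by half a cycle should, for ideal bilateral symmetry, reproduce the contralateral side. A sketch on synthetic sinusoidal swing signals (the function and data are illustrative assumptions, not from the studies cited):

```python
import math

def symmetry_error(left, right):
    """Mean absolute difference between the left-side signal and the
    right-side signal circularly shifted by half a period; bilateral
    symmetry predicts this is small. Both lists must cover a whole
    number of cycles, sampled uniformly."""
    half = len(right) // 2
    shifted = right[half:] + right[:half]    # circular half-period shift
    return sum(abs(l - s) for l, s in zip(left, shifted)) / len(left)

# synthetic swing signals over one cycle, half a period out of phase:
n = 100
left = [math.sin(2 * math.pi * i / n) for i in range(n)]
right = [math.sin(2 * math.pi * i / n + math.pi) for i in range(n)]
err = symmetry_error(left, right)    # close to zero for ideal symmetry
```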
There is a considerable literature on human gait and its many aspects. These
include notions of balance [9], kinematics [10, 11] and their relation to stride and
frequency [12]. Naturally, the relationship between stride length and frequency is of
fundamental concern to many studies of walking [13, 14] (as we shall find later,
even this has been used for recognition). A study of frequency domain properties of
gait suggests that the fundamental concern is with the lower frequencies [15].
Though we shall later find that many of the model-based approaches tend not to
focus on the arms since a subject could be carrying something, there have been
studies on arm movement in gait [16, 17].
Figure 2.4 Symmetry, Stance and Float in Walking and Running [17]
differently; one can often recognize an acquaintance by his manner of walking even
when seen at a distance" [21]. The majority of the medical approaches so far have
used marker-based systems. There is also an approach known as observational gait
analysis [23] which concerns the clinical use of observations made of people
walking unhindered, both physically and psychologically, by the marker-based
systems. More recently this has progressed to using video recordings, and this
allows for comparative analysis but appreciation of performance is impeded by the
small numbers of subjects involved.
There have been (medical/biomechanics) studies on the effects of treadmills on
gait which is of concern not in deployment of gait as a biometric, but more for data
acquisition since a treadmill can be used to obtain long walking sequences
conveniently, and is the only practicable means to obtain video data of running
subjects. One study suggests that walking on an 'ideal' treadmill, when the
supporting belt moves with a constant speed, does not differ mechanically from
walking over ground, except for wind resistance, which is negligibly small during
walking [12]. The only difference between the two conditions is perceptual: the
environment is stationary when a treadmill is used [24]. However, modern treadmills
require selection not only of speed, but also of inclination. Murray found
that during treadmill walking, subjects tend to use a faster cadence and shorter stride
length than during floor walking. However, in general, treadmill walking was not
found to differ markedly from floor walking in kinematic measurements [25].
Whether a treadmill will affect one's gait will also depend on the habituation of the
subjects to treadmill walking [26]. Given that one aim of deploying computer vision
in gait will be to produce marker-less video-based systems, it will be interesting to
see whether the convenience associated with automated computer vision techniques
can help to resolve these matters, not only in respect of the number of subjects
involved, but also in terms of deployment.
As biometrics concern humans, there are naturally many potential variations in
the data. Since these are usually not fundamental to the measured property, but are
the result of human action, they are known as covariate data. In a behavioral
biometric, such as gait, these are likely to be exacerbated. In face recognition the
subject can smile or make other facial expressions; an environment concern is that
the face is unlikely to be centered and that the illumination is likely to vary. These
factors are likely to obtain in gait: mood is likely to affect gait and people move
throughout an image sequence and thus will interact with any fixed illumination.
One of the purposes of biometric approaches is to determine identity invariant to
imaging conditions and to a subject's disposition. For face recognition we seek a
signature that is invariant to expression and illumination. Similarly in gait we seek a
unique biomechanical signature that is the same for the subject whatever their mood
or walking environment. As biometrics is relatively new, there is as yet little data to
investigate these notions. It is evident that studies of the effects of age on face
recognition are unlikely to have data older than the subject itself. In fact, we shall
find later that approaches to gait as a biometric have not only learnt from the
established biometric approaches, and the most recent databases do now include
covariate data, but the research is contemporaneous with other biometrics in that
subjects report outdoor enrolment (finding the subject in "real-world" images) and
determination of potent factors for recognition, to be described later.
We shall review in outline some of these effects, especially since they provide
pointers to areas in which data should be, and indeed has been, collected. Evidently,
the corpus of subjects for whom the data is collected is usually small, since medical
studies await the convenience of markerless gait analysis. As such, there is certainly
high variance attached to any measures made and this is one area where computer
vision-based analysis can contribute since the data collection is much easier and the
corpus of subjects can be made much larger with ease. Certainly, load affects gait
[27, 28] suggesting the need to acquire imagery of subjects carrying luggage of
different weight and of different shape. Naturally, footwear can affect walking
[29-31] as can alcohol [32, 33] (we anticipate no shortage of student volunteers for a
new data study here!). Tight clothing will affect gait; loose clothing will affect the
perception of gait by video. Intuitively, gait will change with age as do most
biometrics, except ears. However, most medical studies concern disorder with gross
(short-term) change [34-40] - essentially the detection of abnormal gait; one study
[35] suggested that only cadence changed with time but the study used floor based
measures based on footprints and these are unlikely to be sufficiently sensitive for
measuring the smaller changes likely to occur with aging. These changes can be due
to compound changes in physiology, neurology and/or illness. There are illnesses
which are known to have particular effects on gait, such as Parkinson's disease or a
Trendelenburg gait where body weight is transferred to the affected side when hip
abductors are unable to stabilize the hip. Without rapid and convenient analysis it is
unlikely that study of effect of aging will progress much further and this is one area
where automated gait analysis via computer vision can make contributions beyond
those associated with biometric issues. Finally, mood can affect gait, as can music
[41, 42] for which reason the Southampton indoor database (Section 3.2.3) was
recorded with a talk-only radio playing in the laboratory to reflect comfort and with
intentional absence of music.
The sources of variation in gait might seem dispiriting at first: how can one even
hope to recognize subjects with this volume of covariate factors? First, this is not
unique to gait, and gait has a unique advantage in terms of recognition at a
distance. In terms of recognition, the intention in biometrics (in pattern recognition
even) is to derive a set of measurements for one subject for which the variation for
that subject (the intra-class variation) is less than the variation between subjects (the
inter-class variation). For visualization, consider a 3-dimensional feature space
(where subjects' identities are represented by three measurements): if each subject's
measurements are contained in a small sphere, then recognition can be achieved
when all the spheres are spread apart. The problem then becomes one of appropriate
measurement and as we shall find, that is where the research is. Essentially we seek
a unique biomechanical invariant; of note, we shall find that shoe type affected
recognition little in one study (the only shoe type that affected gait considerably was
flip-flops). As such, we seek to understand these covariate factors in recognition,
with potential to reinforce medical assessment given a considerably large number of
subjects consistent with the ease of analysis of a computer-vision based system.
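The sphere picture can be made concrete: for each subject, measure the scatter of their feature vectors about their own mean (intra-class), and compare it with the distances between subject means (inter-class). A toy sketch with invented 3-D feature values; none of the numbers or names come from the book:

```python
import math

def mean(vectors):
    """Componentwise mean of a list of 3-D feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(3))

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_stats(classes):
    """classes: {subject: [3-D feature vectors]}. Returns the largest
    intra-class radius and the smallest distance between class means."""
    means = {s: mean(vs) for s, vs in classes.items()}
    intra = max(dist(v, means[s]) for s, vs in classes.items() for v in vs)
    subjects = list(means)
    inter = min(dist(means[a], means[b])
                for i, a in enumerate(subjects) for b in subjects[i + 1:])
    return intra, inter

# toy data: two subjects, two 3-D measurements each (illustrative values)
classes = {"A": [(1.0, 1.0, 1.0), (1.1, 0.9, 1.0)],
           "B": [(3.0, 3.0, 3.0), (2.9, 3.1, 3.0)]}
intra, inter = class_stats(classes)
# the "spheres" are well separated when inter-class distance dominates
```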
2.4 Psychology
In the earliest psychology studies of gait perception [43] participants were presented
with images produced from points of light attached to body joints. When the points
were viewed in static images they were not perceived to be in human form, rather
that they formed a picture - of a Christmas tree even (the illustration in Fig. 2.5 does
not appear to be like a Christmas tree but can similarly be perceived to be without
human form). When the points were animated, they were immediately perceived as
representing a human in motion. Later work showed how by point light displays a
human could be rapidly extracted and that different types of motion could be
discriminated, including jumping and dancing [44]. Later, Bingham [45] showed that
point light displays are sufficient for the discrimination of different types of object
motion and that discrete movements of parts of the body can be perceived. As such,
human vision appears adept at perceiving human motion, even when viewing a
display of light points. Indeed, the redundancy involved in the light point display
might provide an advantage for motion perception [46] and could even offer
improved performance over video images.
Figure 2.5 Marker-based Gait Analysis [160]
still be perceived. Studies on the ability to perceive gender from motion are ongoing
[56]. Like Shakespeare's observations, like medicine and like biomechanics, these
studies encourage the view that gait can indeed be used as a biometric.
selection of good body models is important to efficiently recognize human shapes
from images and properly analyze human motion. Stick figure models and
volumetric models are commonly used for three-dimensional tracking, and the
ribbon model and blob model are also used but are not so popular. Stick figure
models connect sticks at joints to represent the human body. Akita [65] proposed a
model consisting of six segments: two arms, two legs, torso and head. Lee and Chen's
model [69] uses 14 joints and 17 segments. Guo et al. [Guo94] represent the human
body structure in the silhouette by a stick figure model which has ten sticks
articulated with six joints.
On the other hand, volumetric models are used for a better representation of the
human body. One model [71] consists of 24 segments and 25 joints and those
segments and joints are linked together into a tree-structured skeleton. The "flesh"
of each segment is defined by a collection of spheres located at fixed positions
within the segment's co-ordinate system. At the same time, angle limits and collision
detection are incorporated in the motion restrictions of the human model. Among
the different volumetric models, generalized cones are the most commonly used
ones. A generalized cone [70] is the surface swept out by moving a cross-section of
constant shape but smoothly varying size along an axis. Generalized cylinders are
the simplified case of generalized cones that have a cross-section of constant shape
and size.
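As a sketch of this sweeping construction (our own illustration, not code from any system described here), surface points of such a model can be generated by sweeping a circular cross-section along a straight axis, with the radius varied by a supplied function; a constant radius recovers the generalized-cylinder special case:

```python
import math

def generalized_cone_points(length, radius_fn, n_axial=5, n_around=8):
    """Sample surface points of a generalized cone: a circular
    cross-section swept along a straight vertical axis, its radius
    varying smoothly along the axis via radius_fn(z). A constant
    radius_fn gives a generalized cylinder."""
    points = []
    for i in range(n_axial):
        z = length * i / (n_axial - 1)          # position along the axis
        r = radius_fn(z)                        # cross-section size there
        for j in range(n_around):
            theta = 2 * math.pi * j / n_around  # angle around the axis
            points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

# a thigh-like segment: radius tapering from 8 cm to 5 cm over 40 cm
thigh = generalized_cone_points(40.0, lambda z: 8.0 - 3.0 * z / 40.0)
# a cylinder-like segment: constant cross-section
cyl = generalized_cone_points(40.0, lambda z: 5.0)
```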
research, human motion can be defined by the different gestures of body motion,
different athletic sports (tennis, ballet) or human walking or running. The analysis
varies according to different motions. There are two main methods to model human
motion. The first is model-based: after the human body model is selected, the 3-D
structure of the model is recovered from image sequences with [73, 69] or without
moving light displays [65, 66, 74]. The second emphasizes determining features of
motion fields without structural reconstruction [72, 87, 90].
Ideas from human motion studies [6] can be used for modeling the movement of
human walking. Hogg [66] and Rohr [74] use flexion/extension curves for the hip,
knee, shoulder and elbow joints in their walking models. A different approach for
the modeling of motion was taken by Akita [65], who used a sequence of stick
figures, called key frame sequence, to model rough movements of the body. In his
key frame sequence of stick figures, each figure represents a different phase of body
posture from the point of view of occlusion. The key frame sequence is determined in
advance and referred to in the prediction process. To search the interpretation
tree of the human body and reduce its computational complexity, Chen and
Lee [69] applied general walking-model constraints, derived from knowledge of
walking motion, to eliminate unfeasible solutions.
Other approaches consider the properties of the spatiotemporal pattern
as a whole. These are the model-free approaches, of which
we shall find versions in gait-biometrics approaches. Polana and Nelson [72]
defined temporal textures to be the motion patterns of indeterminate spatial and
temporal extent, activities to be motion patterns which are temporally periodic but
are limited in spatial extent, and motion events to be isolated simple motions that do
not exhibit any temporal or spatial repetition. Little and Boyd's approach [87] is
similar to Polana and Nelson's idea, but they derive dense 2-D optical flow of the
person and derive a series of measures of the position of the person and the
distribution of the flow. The frequency and phase of these periodic signals are
determined and used as features of the motions. As already indicated, some of these
models have found application in systems aimed at tracking humans in image
sequences. Here, of course, we are verging on biometrics, since the notion that
people can be recognized by the way they walk grew out of the analysis of
human motion.
3 Gait Databases
3.1 Early Databases
Naturally, the success and evolution of a new application relies largely on the
dataset used for evaluation. The early gait databases were collected about 10
years before the time of writing. Then, computers had little power and memory
costs were comparatively high. Clearly, this was before digital video and acquisition
was based on analogue camcorder technology which resulted in frames being
digitized individually. Since techniques were in their infancy, as in face recognition,
early databases had only a few subjects. The idea then was to determine whether
recognition could be achieved at all - or not. At that stage we were not interested in
the ramifications of recognition. There were two early databases which were
developed independently: the UCSD data was recorded outdoors and the
Southampton data was indoors, with subjects wearing special trousers. The current
databases are considerably more advanced, but certainly benefited in their
development for the early approaches.
3.1.1 UCSD Gait Data
The first generally available database was from the Visual Computing Group of the
University of California San Diego. It originally comprised 5 subjects with 5
sequences each, and was later augmented to 6 subjects with 7 sequences each. The
gait data was gathered in two sets on two different dates: the first set was extended
by adding two more sequences for each of the 5 original subjects, plus one new
subject with 7 sequences. A Sony Hi8 video camera was used to
acquire these images. The video camera was opposite a concrete wall in an outdoor
courtyard. The use of outdoor conditions and a shaded scene was aimed to make the
lighting as diffuse as possible, though shadows are evident on the images. The
students walked in a circular path around the camera so that only one person at a
time was in the camera's field of view. The use of a circular path ensured a smooth
walking motion was maintained throughout acquisition. The image sequences were
recorded with the subject walking fronto-parallel to the camera: the direction of
walk was normal to the camera's plane of view. The subjects walked around the track
for around fifteen minutes and the first two passes in front of the camera were
discounted to handle camera awareness, though it is also more likely the subjects
would settle into a steady gait later. The original full-color images were 640x480
pixels and the sequence length was of the order of 100 frames. At 30 frames/sec
this constitutes around three periods of human walking (naturally dependent on
speed).
An example of the data is shown in Fig. 3.1(a). In terms of computers available
now, this seems to be a rather limited set of data. Clearly the data was augmented,
so conditions changed though only slightly. At the time though, processing a
database of around 4000 images was a considerable task so some of the early
studies cropped the data and used it in black and white format only [88, 92]. The
extraction shows that the data met its aims, that a single moving object could be
extracted from the image. Clearly the lighting was generally well chosen, though
there are shadows on the original image underneath the walking subject's legs, and
the shadow is present in the extraction too (though more advanced techniques,
aimed at reducing the effects of shadows, are available now).
3.1.2 Early Soton Gait Data
The other early gait data came from Southampton. In this, a CCD camera was used
to collect the data and its output was recorded on a video recorder and later
digitized. As in the UCSD data, the camera was sited with its plane normal to the
subject's path, but in an indoor location with controlled (level) illumination. Subjects
walked in front of a plain, static, cloth background. Given that the main aim of the
data was for analysis by a model based approach to recognition (to be described
later), problems occurred with the clarity of the moving legs: creases in the subjects'
trousers obscured detail, and the self-occlusion of gait led to merging of the legs in
the images. One solution was to make each subject wear a
special pair of white trousers that had a dark stripe down the middle of the outside
of each leg. In this way, the leg closest to the camera could be distinguished visually
from the other leg at all times. Fig. 3.2(a) shows an example image of a subject used
in this study.
Each subject walked past the camera ten times; the first three and the last three of
these sequences were discarded, leaving four sequences for each of ten subjects,
taken from the central part of the walking period. This was because the subject
would still be becoming familiar with the experiment in the early part of the
recording, whereas in the latter part their aim was to finish the experiment.
(Again, this data seems limited now, but it was recorded in 1996.) Given that it
was desirable that the
subjects achieved constant velocity, subjects were given room to accelerate to a
constant velocity before entering the field of view of the camera. This did not use
the circular path of the UCSD data, though a constant speed was achieved. One
problem with the data was that as the camera had no shutter, the lower leg appeared
blurred during the swing phase of the gait cycle. As the initial target was analysis by
the vision paradigm of edge detection followed by feature extraction, the edge image
in Fig. 3.2(b) shows sufficient contrast for this purpose, noting that there is sufficient
edge strength in the line marking the leg as well as in the front edge of the trousers.
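As an illustration of this edge-detection paradigm, the following is a minimal Sobel sketch (illustrative only; the studies used established implementations). A vertical intensity step, like the stripe on the trousers, produces a strong response along the boundary column.

```python
def sobel_edges(image):
    """Sobel edge magnitude for a 2-D grayscale image (list of lists)."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_k[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(gy_k[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge (like the stripe down the trousers) yields a
# strong response along the boundary, and none in flat regions.
img = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_edges(img)
```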
The UCSD and early Soton databases are still available, but are now largely
superseded, not just in the number of subjects recorded, which is now much larger,
but also in terms of covariate factors and application potential. The early databases
sufficed to show that people could be recognized by the way they walk, by
techniques described in the next chapter. Before that, we shall review the current
state of the art in gait database design.
3.2 Current Databases
It is encouraging to note the rich variety of data that has been collected, and to see
how research in gait has benefited from research in other biometrics: a range of
scenarios, covariate data and ground truth data is already available.
These current databases include: UMD's surveillance data [100]; NIST/USF's
outdoor data, imaging subjects at a distance [75]; GaTech's data combining marker-
based motion analysis with video imagery [106]; CMU's multi-view indoor data
[80]; CASIA's outdoor data [122]; and Southampton's data [83], which combines
ground truth indoor data (processed by broadcast techniques) with video of the
same subjects walking in an outdoor scenario (for computer vision analysis).
Examples of Maryland's outdoor surveillance view data, a silhouette derived from
CMU's treadmill data, and of Southampton's indoor and outdoor data are given in
Figs. 3.3(a)-(d), respectively. The NIST/USF data was explicitly collected for the
Human ID at a Distance program, for an evaluation known as the gait challenge,
which concerned the recognition capability of outdoor data with study of covariate
factors. These concern the potential for within-subject variation, which includes
footwear and apparel. Application factors concern deployment via computer vision,
though none of the early databases allowed facility for such consideration, save for
striped trousers in the early Southampton database (aiming to allow for assessment
of validity of a model-based approach), as shown in Fig. 3.2(a). The new databases
seek to include more subjects so as to allow for an estimate of inter-subject
variation, together with an estimate of intra-subject variation, thus allowing for
better assessment of the potential of gait as a biometric.
The data described here was developed especially for purposes of evaluation and
is usually freely available to researchers.
Table 3.1 Gait challenge data collection conditions (including shoe type)
Originally, the gait challenge concerned analysis of the data for which no briefcase
was carried; data was later added for subjects carrying a briefcase. The range of
data collected and the analyses possible is shown in Table 3.1, where the gallery
subset for the gait challenge analysis was G,A,R,NB (as highlighted) and the probe
data was the remaining sequences in which no briefcase was carried.
A later study extended the database usage by manual labeling [149]. To gain
insight into the relationship between recognition capability and silhouette quality,
silhouettes were created for one gait cycle for 71 subjects under 4 different
conditions (shoe-type, surface, and time). This gait cycle was, wherever possible,
selected from the same 3-D location in each sequence, excluding the portion that
included the high-contrast calibration box (Fig. 3.4(b)) to avoid errors in background
subtraction. Each pixel was also labeled according to body segment: head, torso,
left arm, right arm, left upper leg, left lower leg, right upper leg, and right lower
leg. Examples are shown in Fig. 3.5, which highlight how this
data can be used to analyze difficulty arising from the legs' self-occlusion. This
resource is available online [79] for use by the gait community to test and design
better silhouette detection algorithms. Further, the data allows for understanding of
the contribution not only of body labeling, but also of the segments, to test
recognition capability.
Overview
In order to provide an approximation to ground truth and to acquire imagery for
application analysis, the Southampton data procedure filmed subjects indoors and
outdoors. To investigate the potential for gait as a biometric, the database was aimed
to contain more than 100 subjects, henceforth referred to as the large-subject
database, to allow for estimation of between-subject variation. An overview of the
databases comprising the large-subject database is given in Table 3.2. This is
accompanied by the small-subject database which contained around 10 subjects in
differing scenarios to allow for within-subject variation, described in overview in
Table 3.3. The resource is available for research via ftp download [81] after
completing necessary formalities.
Camera  Scan Type         View Angle  #Subjects  Locality  Walk Surface
A       Progressive scan  Normal      116        Indoors   Track
D       Interlaced        Oblique     116        Indoors   Track
B       Progressive scan  Normal      116        Indoors   Treadmill
C       Interlaced        Oblique     116        Indoors   Treadmill
E       Progressive scan  Normal      116        Outdoors  Track
F       Interlaced        Oblique     116        Outdoors  Track
Table 3.2 Overview of Southampton's Large-Subject Gait Databases
Camera  Scan Type         View Angle           #Subjects  Locality  Walk Surface
...     Progressive scan  ...                  ...        ...       ...
BS      Interlaced        Oblique              12         Indoors   Track
GS      Progressive scan  Inclined and Normal  12         Indoors   Track
HS      Progressive scan  Frontal              12         Indoors   Track
Table 3.3 Overview of Southampton's Small-Subject Gait Databases
Indoors, treadmills are most convenient for acquisition but there is some debate
as to how they can affect gait. As described earlier, some studies suggest that
kinetics are affected rather than kinematics, but our experience with using untrained
subjects and their limitations on footwear (subjects wearing open-backed shoes
experienced particular difficulty) and clothing motivated us to consider the track as
the most suited for full analysis. As in Fig. 3.6(a), the track was prepared with the
chromakey background (bright green, as this is an unusual clothing color)
illuminated by photoflood lamps, viewed normally and at an oblique angle. The
track was shaped like a "dog's bone", as seen in Fig. 3.7, so that subjects walked
constantly and passed in front of the camera in both directions. The same camera
view and chromakey arrangements were used for the treadmill, but here subjects
were highlighted with diffuse spotlights, as in Fig. 3.6(b). The treadmill was set at
constant speed and inclination, aiming to mimic a conventional walking pattern.
A similar layout was used for the outdoor track, where the background contained a
selection of objects such as foliage, pedestrian and vehicular traffic, and buildings
(also usable for calibration), as well as occlusion by bicycles, cars and other
subjects. As such,
subjects' silhouettes can be extracted from both outdoor and indoor imagery,
Fig. 3.6(c), and their signatures compared. The imagery for the large database was completed
with a high resolution still image of each subject in frontal and profile view,
allowing for comparison with face recognition and for good estimates of body shape
and size. The track data was initially segmented into background and walking and
further labels were introduced for each heel strike and direction of walking. This
allowed for basic analysis including manually imposed gait cycle labels. The
treadmill and outside data was segmented into background and walk data only.
The available gait databases, Tables 3.2 and 3.3, will later be referred to by a
camera label. These databases are stored as sequences of DV for which a
reader/interface has been made available in C and Python that allows database users
to access frames in the DV directly.
Laboratory Layout
A plan of the layout of the gait laboratory is given in Fig . 3.7. The track floor was
painted so as to ensure good chromakey extraction of the feet and ankles as well as
the body . A slight difference in intensity means that this is achieved by a two-pass
procedure rather than a single pass for pure chromakey. The dimensions of the room
allowed deployment of the four camcorders, in the chosen configuration and to
illuminate subjects and backgrounds satisfactorily. The track was completed with two
circular segments to allow the subjects to turn around without ceasing to walk
normally; this gives the dog's-bone shape. This superseded an earlier circular track
where subjects entered at one end of the track and exited at the other, and then
walked through other rooms so as to enter at the same point as before. This was
superseded since subjects took a long time to walk round the track and pipelining
subjects was not possible without collision. Further, subjects would only have been
filmed walking in one direction. For these reasons the track was chosen in its final
form.
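The chromakey extraction described above can be sketched in a simplified form. This is a hypothetical illustration, not the actual processing used for the database: pixels are classed as background where green dominates, with a second, more permissive pass accounting for the darker painted floor (the thresholds are illustrative).

```python
def chroma_mask(image, g_min, dominance=1.3):
    """Classify pixels as chroma-key background when green dominates.
    `image` is rows of (r, g, b) tuples; returns True for background."""
    return [[(g >= g_min and g > dominance * max(r, b))
             for (r, g, b) in row] for row in image]

def two_pass_extract(image, wall_g_min=120, floor_g_min=60):
    """Two-pass keying: the painted floor is darker than the wall, so a
    second, more permissive pass picks it up (thresholds illustrative)."""
    wall = chroma_mask(image, wall_g_min)
    floor = chroma_mask(image, floor_g_min)
    h, w = len(image), len(image[0])
    # Foreground (the subject) = pixels rejected by both passes.
    return [[not (wall[y][x] or floor[y][x]) for x in range(w)]
            for y in range(h)]

# One row of a toy frame: bright-green wall, darker green floor, subject.
frame = [[(40, 200, 50), (60, 90, 55), (180, 120, 100)]]
fg = two_pass_extract(frame)
```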
(a) track from start (X in Fig. 3.7) (b) track from end (Y in Fig. 3.7)
(c) laboratory viewed from track end (d) camera setups (inc. surveillance)
Figure 3.8 Southampton Gait Laboratory
Two cameras were used to view subjects in each scenario, one normal to the
walking direction and the other at an oblique angle. Two viewpoints were chosen so
as to allow for later investigation of viewpoint-invariant gait signatures (which had
already been demonstrated for model-based techniques, for a small change in view
angle of about 20° [166] - to be discussed later). As it is not unlikely that security
video will use interlaced format, the cameras positioned at an oblique angle were set
in interlaced mode whereas those normal to the subject were progressive scan. At
the time of camera procurement, the most appropriate camera types were the Canon
MV30i (progressive scan) and Sony TRV900E (interlaced). An evaluation
suggested that the Sony camera's optics and color response were of higher quality
than those of the Canon. Unfortunately, the Sony could achieve progressive scan
only at a reduced frame rate, whereas the Canon could give video-rate progressive
scan capability.
The small-subject database was constructed with all subjects walking along the
indoor track, Figs. 3.8(a) and (d). The outdoor data primarily aims to investigate
performance of computer vision techniques in extracting people whereas the
treadmill aims to enable easy acquisition. Since neither factor was basic to gait as a
biometric, variational data was not recorded for those scenarios. As the recording
was all indoors, the camera settings differed from those used in the large database.
Also, two extra cameras were used, one of which (as can be seen in Fig. 3.8(d)) was
placed normal to the track but with increased elevation; this is one of the databases
in Table 3.3. The other gave a front view, showing images from the viewpoint of
Fig. 3.8(a); this is Database HS in Table 3.3.
The chromakey material was placed behind everywhere subjects appeared in the
field of view of the camera. The main difficulty in illuminating the chromakey
background was the lack of the ceiling height commonly used to ensure even
illumination. As an approximate solution, as shown in Figs. 3.8(b) and (c), 800W
Tre-D neon photoflood lights were reflected off the ceiling, with carefully
positioned screens (made from black wrap) to prevent light spilling into the other
areas of the laboratory. The subjects were illuminated on the treadmill using two
500W spotlights diffused through white umbrellas, seen in Fig. 3.8(c). A screen was
positioned between the treadmill and the track areas, to ensure that the lighting for
the scenarios did not interfere, seen in the far left in Fig. 3.8(b).
The eventual use of the database will be by computer vision techniques. The
usual methodology is for high-level feature extraction and description and/or
statistical recognition techniques to follow low-level feature extraction.
Accordingly, we decided to evaluate data quality by performance of low-level
feature extraction techniques. The techniques chosen were Canny/Sobel edge
detection and an established moving-object extraction technique. Simpler subject extraction
techniques (e.g. image subtraction with background obtained from temporal
median) were not chosen to evaluate the data as these have known performance
limitations and are unlikely to be deployed in any future recognition system. Many
of the quality effects are difficult to demonstrate in the static format given here,
though animated imagery is available. Naturally, there was particular concern with
lighting but a layout aiming for uniform lighting when recording a walking subject
is a paradoxical situation, given the interaction of the walking subject with their
illumination. In an iterative procedure, the lighting was optimized so as to obtain the
best subject extraction and quality of edge data for the walkway and treadmill data.
For (Canny) edge extraction, Fig. 3.9(b) shows much better definition of the
walking subject, with many fewer background edges (note that these are static, and
can be further reduced by thresholding). For subject extraction, with more and
better-positioned lighting the imagery of Fig. 3.9(d) was improved over that of
Fig. 3.9(c): the background is much reduced, as are the effects of
shadows near the feet (consistent largely with the painting of the floor's surface).
The subject extraction also highlights problems with matte surfaces, in the skin and
cloth, which can cause these areas to be perceived as background (in grayscale
imagery). Inside the silhouette these can easily be removed by infilling; the
problems did, however, motivate repositioning of the lighting systems to reduce
their occurrence.
The three-chip Sony sensor system was found, as expected, to deliver imagery of
much better quality than the single-chip (Canon) systems. Chromakey extraction
was also evaluated, first to check the paint used for the flooring and the drape of the
background cloth. Later, the same procedures were used to monitor wear in the
track's surface and the continuance of the laboratory set-up.
(a) evaluation data: edge extraction (b) recorded data: edge extraction
(c) evaluation data: subject extraction (d) recorded data: subject extraction
Figure 3.9 Data Analysis Procedures' Results
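For reference, the temporal-median baseline dismissed above is easily sketched (a minimal toy version, not the established extraction technique actually used). Because the subject occupies any one pixel for only a minority of frames, the per-pixel median recovers the static background, and differencing against it yields a foreground mask.

```python
from statistics import median

def median_background(frames):
    """Per-pixel temporal median over a sequence of grayscale frames."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(w)]
            for y in range(h)]

def subtract(frame, background, thresh=30):
    """Foreground mask: pixels differing from the background model."""
    return [[abs(p - b) > thresh for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

# Static scene of value 10; a "subject" of value 200 passes through one
# pixel in a minority of frames, so the median ignores it.
frames = [[[10, 10]], [[10, 200]], [[10, 10]]]
bg = median_background(frames)
mask = subtract([[10, 200]], bg)
```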
Outdoor Data Design Issues
The outdoor set-up was designed to allow for the same viewing geometry of a
subject walking with fronto-parallel and oblique view, again recorded by interlaced
and progressive scan camcorders. The outdoor data is designed to allow recognition
in real-world scenarios which suggests that real world issues should be
accommodated in the data. These included:
• occlusion by other subjects;
• interference from moving background objects;
• interference from static background objects;
• variation in illumination; and
• variation in shadows.
These are specified as zones of activity in Fig. 3.10. For illustrative purposes,
the zones are separate, but the tree could cast a shadow over much of the foreground
area and people other than the subject walked in planes normal or parallel to the
subjects, as well as along a passageway behind the cars parked behind the
foreground area. The recording took place in a zone giving access to several buildings,
ensuring that the walking subjects are sometimes obscured by moving foreground
objects, such as other people walking in a similar or different plane, or bicycles. There
was traffic on the road behind the subject and this was sometimes stationary and
sometimes moving; there were pedestrians occasionally walking directly behind the
subjects. There was a bush to the left of the camera view and parked cars to the right
of the camera view, giving man-made and natural (textured) static background
objects. In some sequences, other subjects from the database were recorded walking
in the far background. The English weather certainly helped to satisfy the last two
of the main requirements: the outdoor data was recorded over a month of late
spring, when the weather varies considerably in the UK. Recording did not take
place when it was raining, but there is variation in illumination in that for some
sessions it was overcast and in others there was bright sunshine, causing wide
variation in shadows. The tree was observed to move quite freely on windy days,
giving a region where there was much movement of a highly textured object.
Finally, there is also a building in the rear background offering some possibility of
calibration. Examples of some of the frames from the data are given in Fig. 3.11. In
the stored sequences, a subject walks from one side to the other - there are
background sequences for each subject which are the view just before the subject
walked.
trained on indoor data and then deployed on outdoor data, as has already occurred in
an approach aimed to determine the walking human figure [156].
Filming Issues
Human psychology plays a large part in collecting data of this type and magnitude.
Firstly, to avoid affecting the subjects' walking patterns, the treadmill training and
filming took place after each subject had first walked outdoors, and then inside on
the laboratory track. No other people were in the laboratory, as this could distract
the subjects. There is debate over the use of treadmills for gait analysis, concerning
their suitability, speed and inclination. The speed and inclination were set at
constant values derived by evaluation; however, it is worth noting that treadmills
allow for capture of long high resolution continuous gait sequences . Further issues
included not informing subjects when the cameras were filming (reducing shyness
by switching the cameras on before the subjects entered the laboratory), not
talking to subjects as they walked (as humans will invariably turn their head to
address the person), and using a talk-only radio station for background noise (to
reduce the human impulse to break the silence). Removing the need for a person
to control the cameras reduced the camera-shyness and talking issues further.
Recording Procedure
Each of the recording sessions lasted at most 1 hour, this being the time available
for a DV tape and the outside cameras' batteries. The same procedure was used for
every subject in the large database. The database should avoid any subject
conditioning, especially of those unfamiliar with walking on a treadmill. For this
reason, subjects were recorded first walking on the outside and inside tracks and
finally on the treadmill . This was not an issue during acquisition of the small-
subject database since all subjects were by then well familiar with walking on a
treadmill. Those recording each session were primed to instruct the subjects in the
same way (Appendix 9.1.4). Each subject was first filmed walking along the outside
track, ensuring that at least 8 good sequences were recorded walking in both
directions. As this was outside data, cars could enter the background, or other
people could walk within the cameras' field of view, and sufficient sequences were
recorded to ensure that the effect of these could be mitigated in later analysis . After
sufficient sequences had been judged to be recorded, each subject then stood in
front of the cameras displaying the session and their unique subject ID. Subjects
were then filmed indoors walking on the track, again for at least 8 sequences in
either direction. Later examination of the data showed the prudence in recording
more data than was needed. Presumably since he was unsupervised, one subject
actually left the track to inspect the laboratory and was recorded as such. Chairs
were provided to ensure that subjects not being filmed did not interfere with the
recording in progress. All subjects then spent at least 3 minutes walking on a
treadmill set to be the same as that used in the laboratory, but one with handles to
ease subject training . Subjects were then filmed walking along the laboratory
treadmill (with only a front handle so that the body was not obscured) for at least 3
minutes. The speed of the treadmill certainly caused concern; subjects preferred
to walk at different speeds. Evaluation suggested that it was necessary to set the
treadmill to a fixed speed for the whole of the larger database. This speed (4.1
km/h) was set to be the average speed selected by a group of 10 subjects at which
they found their walking to be most comfortable . The treadmill was inclined at 3° as
this was found to lead to a more natural walk. A mirror was placed in front of the
treadmill , seen in Fig. 3.8(b), again helping to improve balance for those unfamiliar
with such exercise and to prevent the subject from looking downwards at the
treadmill's controls (which were also obscured to lessen further the potential for
distraction).
After the completion of recording, each subject completed an information form
and a consent form, given in Appendices 9.1.5 and 9.1.6 respectively. The
information form aimed to record those factors known to influence gait and which
were not evident from the video information, including known injury or medication
and in one case fatigue. The consent form complies with current UK Data
Protection legislation. Intentionally, there was no linkage between the consent and
the information forms - the database is totally anonymous and a subject's identity
has never been linked with their data. After recording, the consent forms' order was
randomized to ensure this. After the acquisition process was completed, each
subject was given a book token in appreciation of their time and collaboration.
Ancillary Data
The ancillary data for each database already comprises the information derived from
the subject information forms together with the camera set up forms for each
session. The main difficulty associated with using treadmills to acquire natural walk
data for a large number of subjects is that many subjects are unfamiliar with
walking on a treadmill and that the inclination and speed need to be set so as to
enable natural walking, which would further lengthen any training time. As such,
the track database appears the most appropriate for evaluating the basic potential
of gait as a biometric; it was therefore given special consideration and is labeled
in depth.
The labeling format was XML, as specified within the Human ID program, and
an example XML fragment associated with data in Database A is given in Table 3.4.
The primary labels on the track Database A allow evaluation to use images where
the whole of the subject is visible. First, the filename for each DV sequence was
arranged to record the camera, the filming session number, the subject number, the
subject's sequence number and the direction of the subject's walk.
Labels were derived for the frame before which the subject starts to enter the scene,
for the frame when the subject is first wholly within the camera's view, for the
frame where the subject is last wholly visible, and for the frame where the subject
has totally left the camera's view. The images between the start of the sequence and
the subject's entrance in the field of view, and between the subject's exit and the
end, are background data. The track Database A was further labeled so as to enable
evaluation of single cycle data. The most evident event to be labeled is heel-strike
so each of the subject's heel-strikes was labeled together with the respective foot.
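The flavor of such labeling can be sketched as follows. The element and attribute names here are hypothetical (the actual Table 3.4 schema from the Human ID program is not reproduced), but the labels mirror those described above: sequence boundaries and per-foot heel-strikes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of the Table 3.4 labels; the names
# are illustrative, not the actual Human ID schema.
fragment = """
<Sequence camera="A" session="1" subject="009" walk="3" direction="left">
  <SubjectEnters frame="12"/>
  <SubjectWhollyVisible frame="25"/>
  <HeelStrike frame="31" foot="left"/>
  <HeelStrike frame="46" foot="right"/>
  <SubjectLastWhollyVisible frame="80"/>
  <SubjectLeaves frame="95"/>
</Sequence>
"""

root = ET.fromstring(fragment)
heel_strikes = [(int(e.get("frame")), e.get("foot"))
                for e in root.findall("HeelStrike")]
# Successive alternate-foot strikes span half a gait cycle; a full cycle
# would be cut between successive same-foot strikes.
half_cycle = heel_strikes[1][0] - heel_strikes[0][0]
```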
software to ensure that the correct labels had been derived. Finally, the heel-strikes
were extracted automatically and then checked manually for accuracy. As such,
sufficient labels were derived for automatic single-cycle gait data analysis. The use
of DV was a significant advantage here since, unlike analogue systems, time is
recorded with the image data in digital format. The potential for drift between the
different cameras' timing was checked prior to this analysis and found to be
negligible.
The labels derived for extracting the sequences, the initial segmentation stage,
were then used to extract sequences for the oblique camera, Database B.
Unfortunately, there was no explicit consistency between the labels derived from
the normal view (Database A) and the imagery in the oblique view. As the same
level of ground truth was not required for all databases, only Database A, this was
not investigated further. In this spirit, there is no heel strike data for the treadmill or
the outside data, all four databases being stored as sequences of subject data (from
when the subject is wholly within the camera's view) and background data (when
the subject was not within the camera's field of view). The only extra label in the
outside data concerns changes to the background, be it another human or vehicle
moving within the field of view.
As with the other databases, we shall later show how the Southampton databases
have been used not only for recognition in both the studio and the more demanding
outdoor data, but also for data analysis to determine those parts that are potent for
recognition purposes, and to guide the migration of new vision techniques to human
movement analysis.
In the CASIA database [82, 122], Panasonic NV-DX100EN digital cameras fixed
on tripods captured gait sequences at a rate of 25 frames per second on two
different days in an outdoor environment. Here it is assumed that a single subject
moves in the field of view without occlusion. All subjects walked along a straight-
line path at free cadences in three different views with respect to the image plane,
namely fronto-parallel (0°), oblique (45°), and frontal (90°). The images were
recovered offline from video stored on DV tape, via an IEEE 1394 interface into a
Microsoft AVI wrapper, and finally transcoded using the Sthvcd2000 decoder into
24-bit full-color BMP files at a resolution of 352x240. The resulting CASIA gait
database includes 20 different subjects and 4 sequences per view per subject, thus
a total of 240 (20x4x3) sequences. The length of each image sequence varies with
the pace of the walker, but the average is about 90 frames. A comparative
evaluation of several techniques on the CASIA database is available [121]. Some
sample images are shown in Fig. 3.13, where the arrowed line represents the
walking path.
background subtracted image sequences. Fig. 3.15 illustrates samples from the
second dataset.
Figure 3.15 (a) Sample image from the Second Maryland Dataset (b) T-shaped path
(c) Background subtracted image samples from each of the four segments
4 Early Recognition Approaches
covariance matrix of the full set of image data is not sensitive to class structure in
the data. In order to increase the discriminatory power of various facial features,
LDA/CA optimizes the class separability of different face classes and improves the
classification performance. The features are obtained by maximizing between-class
variation and minimizing within-class variation. Unfortunately, this approach has
high computational cost. Moreover, the within-class covariance matrix obtained via
CA alone may be singular. Combining EST with canonical space transformation (CST)
reduces the data dimensionality and optimizes the class separability of different gait
sequences simultaneously.
Given c training classes to be learnt, where each class represents a walking
sequence of a single subject, x'_{i,j} is the j-th image (of n pixels) in class i and N_i is
the number of images in the i-th class. The total number of training images is

    N_T = N_1 + N_2 + ... + N_c                                        (4.1)

and the training set is represented by [x'_{1,1}, ..., x'_{1,N_1}, x'_{2,1}, ..., x'_{c,N_c}]. First, the
brightness of each sample image is normalized by

    x_{i,j} = x'_{i,j} / || x'_{i,j} ||                                (4.2)

After normalization, the mean pixel value for the full image set is:

    m_x = (1/N_T) Σ_{i=1}^{c} Σ_{j=1}^{N_i} x_{i,j}                    (4.3)

Then we form an n×N_T matrix X where each column is formed from each of the x_{i,j} less
the mean as:

    X = [x_{1,1} − m_x, ..., x_{1,N_1} − m_x, ..., x_{c,N_c} − m_x]    (4.4)
EST uses the eigenvalues and eigenvectors, generated by the data covariance matrix
derived from the product XX^T, to rotate the original data co-ordinates along the
direction of maximum variance. Calculating the eigenvalues and eigenvectors of
the n×n matrix XX^T is computationally intractable for typical image sizes. Based on
singular value decomposition, we can compute the eigenvalues of X^T X, where the
matrix size is N_T×N_T and is much smaller. The eigenvectors of X^T X are used as an
orthogonal basis to span a new vector space. Each image can be projected to a
single point in this space. According to the theory of PCA, the image data can be
approximated by taking only the largest eigenvalues and their associated
eigenvectors. This partial set of k eigenvectors spans an eigenspace in which the
points y_{i,j} are the projections of the original images x_{i,j} by the eigenspace
transformation matrix, [e_1, ..., e_k], as

    y_{i,j} = [e_1, ..., e_k]^T x_{i,j}                                (4.5)

After this transformation, each original image can be approximated by the linear
combination of these eigenvectors.
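As a concrete illustration, the small-matrix trick can be sketched in numpy as follows. This is a minimal sketch on toy data; the function name, dimensions and tolerance are illustrative, not from the original.

```python
import numpy as np

def eigenspace_transform(X, k):
    """EST sketch: project n-pixel images (columns of X, mean image already
    subtracted) onto the k largest principal directions, using the small
    N_T x N_T matrix X^T X rather than the n x n product X X^T."""
    vals, V = np.linalg.eigh(X.T @ X)      # X^T X shares its non-zero
    order = np.argsort(vals)[::-1][:k]     # eigenvalues with X X^T
    vals, V = vals[order], V[:, order]
    # Recover the eigenvectors of X X^T: e_i = X v_i / sqrt(lambda_i)
    E = X @ V / np.sqrt(np.maximum(vals, 1e-12))
    return E, E.T @ X                      # basis [e_1..e_k], points y_ij

# Toy data: 6 'images' of 100 pixels each, with the mean image removed
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=1, keepdims=True)
E, Y = eigenspace_transform(X, k=3)
print(E.shape, Y.shape)
```

Only a 6×6 eigen-decomposition is needed, however many pixels each image contains.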
In CST, the classes of the transformed vectors resulting from the eigenspace
calculation are used to calculate a total scatter matrix S_t, a within-class matrix S_w and a
between-class matrix S_b, which reflect the dispersion, the variance and the variance
of the difference, respectively. The objective of CST is to minimize S_w and to
maximize S_b simultaneously. This is achieved by maximizing the generalized Fisher
linear discriminant function J, where
    J(W) = W^T S_b W / W^T S_w W                                       (4.6)

The ratio of the variances is maximized by the selection of the features W if

    ∂J/∂W = 0                                                          (4.7)

Supposing W* to be the optimal solution and w*_i to be its i-th column vector, which is
a generalized eigenvector corresponding to the i-th largest eigenvalue λ_i, then

    S_b w*_i = λ_i S_w w*_i                                            (4.8)
After the generalized eigenvalue equation is solved, we obtain a set of eigenvalues
and eigenvectors that span the canonical space, where the classes are much better
separated and the clusters are much smaller, as required for recognition purposes.
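The CST step can likewise be sketched by building the scatter matrices and solving the generalized eigenproblem of Eqn. 4.8. This is a hedged numpy sketch on synthetic two-class data; the function and variable names are illustrative.

```python
import numpy as np

def canonical_space(Y, labels, d):
    """CST sketch: solve S_b w = lambda S_w w on eigenspace-projected
    points Y (rows = samples) and keep the d most discriminative axes."""
    classes = np.unique(labels)
    m = Y.mean(axis=0)
    Sw = np.zeros((Y.shape[1], Y.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)             # within-class scatter
        Sb += len(Yc) * np.outer(mc - m, mc - m)  # between-class scatter
    # Generalized eigenproblem via eig(S_w^{-1} S_b)
    vals, W = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1][:d]
    return W[:, order].real

# Toy data: two well-separated classes in 3-D eigenspace
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
labels = np.array([0] * 20 + [1] * 20)
W = canonical_space(Y, labels, d=1)
proj = Y @ W
print(abs(proj[:20].mean() - proj[20:].mean()) > 1.0)
```

Along the canonical axis the two class clusters are widely separated, as the text describes for Fig. 4.3.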
Figure 4.2 Original Image, Derived Silhouettes and Eigenvalues [92]: (a) original
image; (b), (c) extracted silhouettes; (d)-(g) first four eigenvalues
In application, this analysis was applied to human silhouettes derived by
subtracting the background from the image and then thresholding the result [92].
Fig. 4.2(a) shows an image from the original sequence and Figs. 4.2(b) and (c) show
two of the extracted silhouettes. The eigenvalues were then extracted from the
sequence of silhouettes, the first four of which are shown in Fig. 4.2(d-g). The
trajectories in eigenspace overlap and their centroids are very close together. After
CST the trajectories are much better separated and with lower individual variance, as
in Fig. 4.3 (though shown for three dimensions; the separation in 3D is much clearer
when the view is rotated). Recognition from the canonical space was accomplished
using the distance from the accumulated center to each centroid. On five sequences of
five people from the early San Diego dataset an 85% classification rate was
achieved by CST alone, whereas 100% was achieved with combined EST and CST
(as evidenced by the cluster size and separation in Fig. 4.3).
Figure 4.3 Clusters for Five Subjects in Canonical Space (axes v1, v2; persons 1-5)
different subjects were then compared for distinctive, or unique, characteristics.
The video sequences were averaged to reduce high frequency noise and edge
images were produced by applying the Canny operator with hysteresis thresholding.
The Hough Transform (HT) was then applied to the edge image, resulting in an
accumulator space with several maxima, each corresponding to a line in the edge
image. A peak detection algorithm was applied to extract the parameters of each of
these lines (in x and y co-ordinates) using the standard foot-of-normal form

    s = x cos φ + y sin φ                                              (4.9)

where s and φ are the distance and angle to the foot of normal. There are several
methods for peak detection. In back-mapping, the peak in the accumulator at
(s_pk, φ_pk) is found. For each edge point in the image which lies on the line
represented by (s_pk, φ_pk), the points in the accumulator associated with that edge
point were decremented. This effectively removed the votes cast by the line
(s_pk, φ_pk), and so the peak was reduced. This process was repeated until the
parameters for all the lines had been found; the result of processing Fig. 4.2(a) is
shown in Fig. 4.2(b). Gaps in the data occur when the legs cross, where it is difficult
to discriminate between the legs, and there was also some high frequency noise on
the data. To fill in missing data, and to smooth noisy components, the thigh
angles given by the lines' inclinations were fitted to a high order polynomial by
least squares. An eighth-order polynomial describing the variation of the thigh
angle θ with time t is

    θ(t) = a_0 + a_1 t + a_2 t^2 + ... + a_8 t^8                       (4.10)
and for N points θ_i the least squares solution satisfies the normal equations

    [ N         Σ t_i      ...  Σ t_i^8  ] [ a_0 ]   [ Σ θ_i       ]
    [ Σ t_i     Σ t_i^2    ...  Σ t_i^9  ] [ a_1 ] = [ Σ θ_i t_i   ]   (4.11)
    [ ...       ...        ...  ...      ] [ ... ]   [ ...         ]
    [ Σ t_i^8   Σ t_i^9    ...  Σ t_i^16 ] [ a_8 ]   [ Σ θ_i t_i^8 ]

where all sums run over i = 1, ..., N.
An example least squares fit for four sequences of single cycles of a particular
subject is shown in Fig. 4.5. These fit nicely within Murray's data, Fig. 2.3.
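The fit of Eqns. 4.10 and 4.11 is exactly what a library least-squares polynomial fit computes; the sketch below uses numpy on a synthetic thigh-angle waveform, which is illustrative rather than measured data.

```python
import numpy as np

# Hypothetical thigh-angle samples over one gait cycle (degrees); in the
# original work these come from Hough-transform line fits to the thigh.
t = np.linspace(0.0, 1.0, 50)
theta = 20 * np.sin(2 * np.pi * t) + 5 * np.sin(4 * np.pi * t)

# Eighth-order least-squares fit, theta(t) = a0 + a1 t + ... + a8 t^8.
# np.polyfit solves the same least-squares problem as Eqn. 4.11.
coeffs = np.polyfit(t, theta, deg=8)
theta_fit = np.polyval(coeffs, t)

rms = np.sqrt(np.mean((theta - theta_fit) ** 2))
print(rms)
```

The residual is small relative to the roughly 20 degree swing, which is why an eighth-order polynomial suffices to smooth and infill the thigh-angle data.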
Figure 4.5 Least Squares Fits of Thigh Angle for Four Sequences of a Single Subject
spectra. The magnitude and phase spectrum for one walking cycle of subjects 2 and
5 are shown in Fig. 4.6. The magnitude spectrum drops to near-zero above the fifth
harmonic, again agreeing with earlier work [15]. The magnitude spectra for the two
subjects can be used to distinguish between them. However, the phase spectra
appear to differ much more, though some components carry little information
since their respective magnitude components are very small.
Figure 4.6 Phase and Magnitude Gait Spectra [96]: (a) magnitude spectrum for
subject 2; (b) phase spectrum for subject 2; (c) magnitude spectrum for subject 5;
(d) phase spectrum for subject 5
The k-nearest neighbor rule was then used to classify the transform data using
the 'leave one out' rule, for k = 3 and for k = 1. In the early Southampton data, four
video sequences were acquired for each of ten subjects. The correct classification
rates (CCR) are summarized in Table 4.1, which gives the analysis for classification by
magnitude spectra alone, and for multiplying the magnitude spectra by the phase,
both for differing values of k. Note that the magnitude component of the FT is time
shift invariant; it will retain its spectral envelope regardless of where in time the FT
is performed. The phase component does not share this characteristic, and a time
shift in the signal will change the shape of the phase envelope. Accordingly, the
rotation patterns were aligned to start at the same point, to allow phase comparison.
This is because the magnitude plots alone do not confer discriminatory ability whereas
the phase plots appear to do so. The multiplication appears reasonable, since gait is not
characterized by extent of flexion alone, but is controlled by musculature which in
turn controls the way the limbs move. Accordingly, there is physical constraint on
the way we move our limbs. However, we cannot use phase alone, since some of the
phase components occur at frequencies for which the magnitude component is too
low to be of consequence. By multiplication of the spectra, we retain the phase for
significant magnitude components. Clearly, in this analysis, using phase-weighted
magnitude spectra gives a much better classification rate (90%) than use of
magnitude spectra alone (40%), for k = 3. Selecting the nearest neighbor, as
opposed to the 3-nearest neighbor, reduced the classification capability, as expected.
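The phase-weighted magnitude feature can be sketched as follows, assuming rotation signals already aligned to a common starting point. The signals here are synthetic; only the multiplication of magnitude by phase follows the text.

```python
import numpy as np

def phase_weighted_magnitude(theta):
    """Multiply the magnitude spectrum by the phase, so phase is retained
    only where the magnitude component is significant."""
    F = np.fft.fft(theta)
    return np.abs(F) * np.angle(F)

t = np.linspace(0, 1, 64, endpoint=False)
subject_a = 20 * np.sin(2 * np.pi * t + 0.3)
subject_b = 20 * np.sin(2 * np.pi * t + 1.2)  # same extent, different timing

fa = phase_weighted_magnitude(subject_a)
fb = phase_weighted_magnitude(subject_b)
# Magnitude spectra alone are identical; the phase weighting separates them.
print(np.allclose(np.abs(np.fft.fft(subject_a)), np.abs(np.fft.fft(subject_b))))
print(np.allclose(fa, fb))
```

Two rotation patterns of identical extent but different timing are indistinguishable by magnitude alone, yet differ in the phase-weighted feature, mirroring the 40% versus 90% classification rates reported above.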
where V_y essentially reflects any slope in the walking surface. The horizontal
component is modeled as
predecessor non-automated approach in Table 4.1. The confusion matrix for the
recognition performance on ten subjects is shown in Fig. 4.8. Here, darkness
reflects closeness of matching and lightness represents disparity; the 90%
recognition performance is reflected in the dark diagonal, where the signature for
each subject matched well, and the light areas outside the diagonal show a poor
match to the signatures of the other subjects. Fig. 4.8(a) shows the match for the
Fourier magnitude alone and Fig. 4.8(b) shows the match for the phase-weighted
Fourier magnitude; this better performance by the phase-weighted Fourier
magnitude is reflected in the greater difference between its diagonal and the
remainder of the matrix.
Figure 4.8 Confusion Matrices for Ten Subjects: (a) Fourier magnitude;
(b) phase-weighted Fourier magnitude
We shall see later how this approach can be translated to model the whole leg, and
can also be used as a feature extraction stage to model running as well as walking.
5 Silhouette-Based Approaches
5.1 Overview
Of the current approaches, most are based on analysis of silhouettes. This is like
the earlier analysis, which concerned abstracting a walking subject from the
background and then deriving a set of measurements that describe the shape and
motion of that silhouette in a sequence of images. This is similar in approach to the
classic computer vision approach of recognizing objects by using shape descriptions
of objects derived, say, by optimal thresholding.

(Figure: sequences of images of walking subjects, analyzed by either model-free
or model-based analysis)
46
As will be considered later in the analysis of potency, this raises a feature selection
problem: by what can one, or should one, recognize gait?
Collins et al. from CMU used key frame analysis for sequence matching [110],
with innate viewpoint dependence. The key frames were compared to training
frames using normalized correlation, and subject classification was performed by
nearest neighbor matching among correlation scores. The approach implicitly
captures biometric shape cues such as body height, width, and body-part
proportions, as well as gait cues such as stride length and amount of arm swing. The
approach was evaluated on the Mobo dataset, and on early versions of the UMD,
Southampton and MIT databases. The technique showed excellent performance
across databases and across the gaits covered by the Mobo database. In another
approach from CMU, Liu et al. used "frieze patterns" [111] derived from image
sequences by compressing images into a concatenated pattern, with some similarity
to the earliest approach [86]. The approach also gave facility for viewpoint
correction and was deployed to good effect on the Mobo database. Later, Tolliver et
al. were to show [151] that people could be recognized by shape, with special
consideration of noisy silhouettes. The technique used a variance-weighted
similarity metric to induce clusters that cover different stages in the gait cycle.
Results were evaluated on the NIST/USF Gait Challenge data and suggested that
gait shape is effective when the training and test sets are acquired under
similar conditions.
The University of Southampton's newer approaches range from a baseline-type
approach in Foster et al.'s technique measuring area [112], to extension of
techniques for object description including symmetry by Hayfron-Acquah et al. [137]
(with some justification from psychology studies [53]) and Shutler et al.'s statistical
moments [114]. These will be considered in more detail later.
Lee et al. from the Massachusetts Institute of Technology (MIT) used ellipsoidal
fits to human silhouettes [119]. The gait appearance feature vector comprises
parameters of moment features in image regions derived from silhouettes of a
walking person, aggregated over time either by averaging or by spectral analysis. Each
silhouette is divided into seven parts after centroid determination: the head/shoulder
region; the front of the torso; the back of the torso; the front thigh; the back thigh;
the front calf and foot; and the back calf and foot. Evaluation was performed on the
MIT and the Mobo data, also with consideration of gender classification (which was
achieved) and of potency for gender classification (for which the thigh orientation
was ranked as most potent).
Curtin developed a modified form of a Point Distribution Model (PDM) [120]
which included time, to distinguish the movements of a walking and a running
subject. The gait was classified by the movement of points on the 2D shape.
Evaluation on a small number of subjects recorded walking and running on a
treadmill showed that the temporal PDM could be used for recognition and for
discrimination between walking and running, with greatest difficulty experienced in
discriminating between a fast walk and running.
The CAS Institute of Automation (CASIA) developed an eigenspace
transformation of an unwrapped human silhouette [121] and eigenspace
transformation of distance signals derived from sequences of silhouettes [122].
Again, these will be considered in more detail later in this Chapter.
Bhanu et al. from the University of California, Riverside, used kinematic and
stationary features [124], estimating 3D walking parameters by fitting a 3D
kinematic model to 2D silhouettes. Shape and structure were extracted separately and
then combined for recognition. The stationary features include body part length and
flexion, estimated from key frames. The combination was by formulations
including the sum and product rules, of which the former performed
best, evaluated on a proprietary database. Han et al. [125] also used the Gait
Energy Image, formed by averaging silhouettes, and then deployed PCA and multiple
discriminant analysis to learn features for fusion. A feature fusion strategy was
applied at the decision level to improve recognition performance, increasing
capability beyond that of the individual approaches. These were deployed to good effect
on the Gait Challenge data, with performance generally exceeding the baseline
algorithm.
Of the more recent approaches, Zhao et al. [126] used the mean amplitude of key
poses, evaluated on the NIST/USF gait challenge data, to achieve recognition. Lee
et al. separated gait into style and content by generating temporally aligned gait
sequences via local linear embedding, with separation by a bilinear model, and
achieved good performance on the gait challenge database [127]. Kobayashi et al.
used higher order correlation to extract motion correlation as a feature of human
movement, classifying running and walking as well as performing subject
identification [128]. As mentioned previously, Boyd's more recent approach
straddles model-based and model-free approaches, by synchronizing the oscillations
of pixel intensity with those of arrays of phase-locked loops [130], with patterns
analyzed by Procrustes analysis and directional statistics, and evaluated on the
Carnegie-Mellon Mobo and the Southampton databases.
As mentioned in Section 2.5, there is other work on identifying perambulatory
subjects, but few of these approaches use gait-directed analysis. On the analysis of
walking movements in the context of gait recognition, Hild [131] estimates the 3D
motion trajectory with further estimation of leg frequency from the area between the
legs. Davis used basic stride parameters with one view-based constraint to identify
basic walking movements from a selection of other motions [133].
These show promise for approaches that impose low computational and storage
cost, together with deployment and development of new computer vision techniques
for analysis of objects moving through image sequences. We adumbrate the basis
for these approaches and show how they can be used to recognize subjects by the
way they walk, together with their advantages (especially over earlier techniques).
Of particular interest are the emerging studies of potency, which determine which
parts of the human silhouette (or which measurements) are the most important for
recognition purposes. This parallels emergent research in other biometrics,
especially in face recognition.
The main limitation on using PCA/LDA to analyze a sequence of binary silhouettes
was that the order of the images in the sequence could be changed, but the same
result would ensue. This suggests that including order is essential in the gait
signature. One Southampton approach determined the change in area through the
sequence, and the change in area of selected parts of the sequence was used to
characterize each subject. Though this is computationally very attractive, the
approach does not separate body shape and body dynamics. This was achieved by
extending standard shape description techniques to include motion, here concerning
statistical moments and discrete symmetry. Moments are a suitable candidate for
extension as they are a standard approach to shape description in computer vision,
with a concise analytical framework and with access to the scale of a shape's
description. Symmetry was chosen as it has a direct link to psychological
observations of gait, namely that gait concerns a synchronous pattern of
symmetrical movement. These three approaches from the University of
Southampton have been applied to gait data from a variety of sources, though
naturally concentrating on the Southampton data, and each has demonstrated
capability for recognition.
The area masks derive a measure of the change in area of a selected region of a
silhouette [112, 134]. Examples of area masks are shown in Fig. 5.2, and this is
not unlike the current use of Haar wavelets in face image analysis. Here, a
horizontal line (Fig. 5.2(a)) isolates those parts in the region of the waist, whereas a
vertical one, Fig. 5.2(b), selects the thorax and those parts of the legs intersecting
with the vertical window. Many alternatives are possible, such as a combination of
the upper body and legs: Fig. 5.2(c) simply analyses the legs of the subject, while
Fig. 5.2(d) merely measures the entire area change of the image.
Figure 5.2 Sample Area Masks: a) horizontal line b) vertical line c) bottom half d) full
Each mask m_j represents an area mask as a binary vector, with 1 representing the
white parts of the image. For each mask, we obtain a signature S_j(t) by determining
the area that is congruent between the mask and the images x. An area history is
obtained by applying each mask to each image in the sequence to derive a time
history of the number of points that are congruent. Note that all the masks used are
left-right symmetric and therefore all information corresponding to the asymmetric
part of gait is lost. However, this also gives invariance to the direction of walk. That
is, any image within a sequence could be mirrored and still give the same gait
signature. Asymmetric masks might give additional information, but all walkers
would need to be imaged walking in the same direction.
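A minimal sketch of the area-mask signature, assuming binary masks and silhouettes; the silhouette sequence below is synthetic and purely illustrative.

```python
import numpy as np

def area_signature(mask, silhouettes):
    """Count, per frame, the pixels congruent between mask and silhouette."""
    return np.array([np.count_nonzero(mask & x) for x in silhouettes])

h, w = 64, 32
bottom_half = np.zeros((h, w), dtype=bool)
bottom_half[h // 2:, :] = True                   # selects the legs

# Hypothetical silhouette sequence: a torso plus a lower region whose width
# swells and shrinks as the 'legs' swing apart and together.
silhouettes = []
for t in range(30):
    x = np.zeros((h, w), dtype=bool)
    spread = int(6 + 5 * np.sin(2 * np.pi * t / 15))
    x[:h // 2, 12:20] = True                     # torso
    x[h // 2:, 16 - spread:16 + spread] = True   # legs
    silhouettes.append(x)

S = area_signature(bottom_half, silhouettes)
print(S.shape)
```

The resulting 30-sample history S_j(t) rises and falls with the leg spread, which is exactly the periodic structure visible in Fig. 5.3.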
Examples of the signatures S_j(t) for three different masks are shown in Fig. 5.3
(the full area mask gives a similar output, but is not shown as the details are harder
to see when plotted on a scale showing all four functions). As can be seen, the
dynamic signature is intimately related to gait. The peaks in the graph represent
when the selected area of the subject is at a maximum (i.e. when the legs are furthest
apart) and the dips represent when the selected area is at a minimum (i.e. when the
legs are closest together). The bottom half mask selects the lower portion of the
image so has a much greater number of points than the vertical line, but the shape is
naturally very similar. The horizontal line mask selects those points near the region
of the waist, which varies less with thorax motion than the associated movement of
the subject's arms. Clearly, the masks emphasize bilateral symmetry: one leg swings
whilst the other is in stance, after which the limbs swap in function. Limping is an
inherently asymmetrical gait, and this would be evidenced in a clear disparity in
measured area between the two halves of the gait cycle.
Figure 5.3 Sample Output of Different Area Masks from the Same Subject: area
S_j(t) against time for the vertical line, horizontal line and bottom half masks
The sequences are aligned for comparison since they start at different points in
the gait cycle. This is achieved by using minima as a starting point. To eliminate
variations due to the sampling time and the speed at which the subject walks, the part
of the sequence corresponding to the full walking cycle is resampled using cubic
splines to interpolate between the observations S_j(t). Thirty evenly spaced samples
are taken from the cubic spline, giving a 30-element vector for each area mask used.
Note that this does lose information about the subject's speed.
Each mask yields a vector describing the dynamics of area change within that
mask. Since there are many degrees of freedom in a walking subject, there could be
a considerable degree of independent information from different masks. Therefore,
the information from multiple masks was combined to provide a more complete
dynamics signature. The simplest way to achieve this is by concatenating the 30-
element vector from each area mask to form a single long vector V of size n × 30,
where n denotes the number of area masks used. A different vector is derived for
each sequence of images and these are in fact the feature vectors used for
recognition. Note that the vectors still contain some information about the static
shape of the silhouette. This information is contained in the average value of the
sequence S_j(t). It can be removed by subtracting the average of the
sequence, although this reduces the recognition performance.
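The resampling step might be sketched as below. The original work interpolates with cubic splines; this self-contained sketch uses plain linear interpolation instead, which is an explicit simplification.

```python
import numpy as np

def resample_cycle(S, n_samples=30):
    """Resample one gait cycle of an area signature S(t) to a fixed number
    of evenly spaced samples (linear interpolation here; the original work
    uses cubic splines)."""
    t_old = np.linspace(0.0, 1.0, len(S))
    t_new = np.linspace(0.0, 1.0, n_samples)
    return np.interp(t_new, t_old, S)

# Cycles captured at different frame counts map to the same 30-vector length
fast = resample_cycle(np.sin(np.linspace(0, 2 * np.pi, 22)))
slow = resample_cycle(np.sin(np.linspace(0, 2 * np.pi, 41)))
feature = np.concatenate([fast, slow])   # n masks concatenated, n = 2
print(fast.shape, slow.shape, feature.shape)
```

Whatever the walking pace, each mask contributes exactly 30 samples, so the concatenated feature vector has the fixed n × 30 length used for recognition.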
For each walking sequence, an n × 30 element vector V is obtained that provides
a dynamic signature describing the gait. Canonical analysis was used to select a low
dimensional subspace where the differences between classes are maximized and the
differences within classes are minimized. The centroids of each class were calculated
in this new space and a k-nearest neighbor classifier was used to decide which
subject a sequence of images belongs to.
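The decision step can be sketched as nearest-centroid assignment in the canonical space (the k = 1 centroid case; the data and names are illustrative):

```python
import numpy as np

def nearest_centroid(train, labels, test):
    """Assign a test vector to the class whose centroid, computed from the
    training projections, lies closest in the canonical space."""
    classes = np.unique(labels)
    centroids = np.array([train[labels == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(centroids - test, axis=1)
    return classes[np.argmin(d)]

# Two toy classes in a 2-D canonical space
train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(nearest_centroid(train, labels, np.array([0.9, 1.0])))
```

The test point lies near the second cluster's centroid and is assigned to class 1; a full k-nn rule replaces the centroid distance with votes among the k nearest training vectors.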
where P_i and P_j are pairs of points having P as their midpoint and C(P_i, P_j) as
their symmetry contribution. The symmetry relation, or contribution, C(P_i, P_j)
between two points P_i and P_j is:

    C(P_i, P_j) = D_{i,j} Ph_{i,j} I_i I_j                             (5.3)

where D_{i,j} and Ph_{i,j} (see Eqns. 5.4 and 5.6 respectively) are the distance and the
phase between the two points, and I_i and I_j are the logarithmic intensities at the two
points. The symmetry distance weighting function gives the distance between two
different points P_i and P_j, and is calculated as:

    D_{i,j} = (1/(√(2π)σ)) exp(−||P_i − P_j|| / (2σ)), ∀ i ≠ j         (5.4)

where σ controls the scope of the function. The logarithmic intensity function, I_i, of
the edge magnitude M at point (x, y) is

    I_i = log(1 + M_i)                                                 (5.5)
points. Alternatively, a small value of σ implies local operation and local symmetry.
Increasing the value of σ increases the weighting given to the more distant points
without decreasing the influence of the close points. A comparison with the
Gaussian-like functional showed that the mean of the distribution locates the
function on the mean value of the sample. A focus, μ, was therefore introduced into
the distance weighting function to control its focusing capability, hence further
improving the scaling possibilities of the symmetry distance function. The resulting
function is called the focus weighting function, FWF. This replaces Eqn. 5.4 as
follows:
    FWF_{i,j} = (1/(√(2π)σ)) exp(−(||P_i − P_j|| − μ) / (2σ)), ∀ i ≠ j (5.7)
The addition of the focus into the distance weighting function moves the attention
of the symmetry operator from points close together to a selected distance.
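The two weighting functions can be sketched directly from Eqns. 5.4 and 5.7; this is a literal transcription of the equations as printed, so any implementation detail beyond them (such as squaring the distance term) is not assumed here.

```python
import numpy as np

def distance_weight(d, sigma):
    """Symmetry distance weighting (Eqn. 5.4) for inter-point distance d."""
    return np.exp(-d / (2 * sigma)) / (np.sqrt(2 * np.pi) * sigma)

def focus_weight(d, sigma, mu):
    """Focus weighting function FWF (Eqn. 5.7): the focus mu shifts the
    operator's attention from zero separation to a selected distance."""
    return np.exp(-(d - mu) / (2 * sigma)) / (np.sqrt(2 * np.pi) * sigma)

sigma, mu = 2.0, 5.0
# At separation mu, the FWF takes the value the plain weighting gives at zero
print(np.isclose(focus_weight(mu, sigma, mu), distance_weight(0.0, sigma)))
```

This makes the effect of μ explicit: point pairs at the chosen separation receive the weight that, without a focus, only coincident points would receive.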
The symmetry contribution obtained is then plotted at the midpoint of the two
points. The symmetry transform as discussed here detects reflectional symmetry that
is invariant under 2D rotation and translation transformations and under change of
scale, and as such has potential advantage in automatic gait recognition.
The gait signature for a subject was derived from a full image sequence. Each
sequence of image frames consists of one gait cycle taken between successive heel
strikes of the same foot, thus normalizing for speed. The following gives an
overview of the steps involved in extracting symmetry from silhouette information.
First, the image background was computed from the median of five image frames
and subtracted from the original image, Fig. 5.4(a), to obtain the silhouette, Fig.
5.4(b). This was possible because the camera used to capture the image sequences
was static and there is no translational motion; moreover, the subjects were walking
at a constant pace. The Sobel operator was then applied to the image in Fig. 5.4(b)
to derive its edge-map, Fig. 5.4(c). The edge-map was thresholded so as to set all
points beneath a chosen threshold to zero, to reduce noise and remove weak edges.
These processes reduce the amount of computation in the symmetry calculation.
(Note that since optical flow also gives magnitude and direction, it can replace the
edge operation in the symmetry calculation. This does not take advantage of data
reduction unless the output of an edge operator is used to mask the optic flow data
inside the silhouette.) The symmetry operator is then applied to give the symmetry
map, Fig. 5.4(d). For each image sequence, the gait signature, GS, is obtained by
averaging all the symmetry maps, that is

    GS = (1/N_F) Σ_{i=1}^{N_F} SM_i                                    (5.8)

where SM_i is the symmetry map of frame i and N_F is the number of frames in the
sequence. The components retained after a low-pass filter (of selected radius) was
applied to a Fourier transform of GS form the feature vector for each subject.
(a) original (b) silhouette (c) edges (d) symmetry map
Figure 5.4 Applying the Symmetry Operator
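The preprocessing steps (median background, subtraction, Sobel edges, thresholding) can be sketched as follows; the frame sizes, noise levels and thresholds are invented for illustration.

```python
import numpy as np

def sobel_edges(img):
    """Edge magnitude via the two Sobel kernels, with no library calls."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            patch = img[dy:h - 2 + dy, dx:w - 2 + dx]
            gx[1:-1, 1:-1] += kx[dy, dx] * patch
            gy[1:-1, 1:-1] += ky[dy, dx] * patch
    return np.hypot(gx, gy)

# Hypothetical 5-frame sequence: static background plus a moving bright block
rng = np.random.default_rng(3)
frames = np.tile(rng.integers(0, 50, (40, 40)).astype(float), (5, 1, 1))
for i in range(5):
    frames[i, 10:30, 3 + 7 * i:9 + 7 * i] = 255.0    # walking 'subject'

background = np.median(frames, axis=0)               # median of five frames
silhouette = np.abs(frames[2] - background) > 60     # background subtraction
edges = sobel_edges(silhouette.astype(float))
edges[edges < 1.0] = 0.0                             # threshold weak edges
print(silhouette.sum())  # 120 subject pixels recovered for frame 2
```

Because the subject occupies each pixel in at most one of the five frames, the per-pixel median recovers the background exactly, and the moving block survives the subtraction as a clean silhouette.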
for image i, and the moments are of order m, n, α, γ. The term from standard spatial
moments is denoted by S(i, m, n) and the motion, or velocity, is introduced through
U(i, α, γ), which is calculated from the differences between consecutive COMs
(centres of mass) in the image sequence.

    U(i, α, γ) = (x̄_i − x̄_{i−1})^α (ȳ_i − ȳ_{i−1})^γ                 (5.10)

x̄_i is the current COM in the x direction, while x̄_{i−1} is the previous COM in the x
direction; ȳ_i and ȳ_{i−1} are the equivalent values for the y direction. In this way
moving shapes are characterized by their shape and motion concurrently (but
separated shape and motion descriptions are available through vm_{m,n,0,0} and
vm_{0,0,α,γ}, respectively).
The shape's structure contributes through each pixel P_{x,y} and the weighting
function S, which is either a centralized Cartesian or a Zernike polynomial. Cartesian
monomials were first studied due to their simplicity and ease of computation.
Further to this, the orthogonal Zernike moments are a well established and proven
standard technique (in both image noise and pattern recognition), providing an ideal
platform to enable the analysis of the new framework on an orthogonal set. In
Cartesian velocity moments

    S(i, m, n) = (x − x̄_i)^m (y − ȳ_i)^n                              (5.11)

The Zernike formulation essentially replaces the shape structure with a Zernike
polynomial computed over the unit circle. The moment set is computed from the
sequence of silhouettes within a gait cycle; as with symmetry, the silhouettes can
either be binary or grayscale derived from the magnitude of the optical flow. The
moment set was reduced by ANOVA analysis and the remaining moments serve as a
feature vector describing different subjects. As expected, the resulting moments
described not only shape, but also motion. The motion used was horizontal, as to be
expected.
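A Cartesian velocity moment built from Eqns. 5.10 and 5.11 can be sketched for a synthetic translating shape (the function and variable names are illustrative):

```python
import numpy as np

def velocity_moment(frames, m, n, alpha, gamma):
    """Sum, over consecutive frames, the centralized spatial moment
    S(i,m,n) weighted by the COM-displacement term U(i,alpha,gamma)."""
    total = 0.0
    prev_cx = prev_cy = None
    for P in frames:                      # P: binary silhouette
        ys, xs = np.nonzero(P)
        cx, cy = xs.mean(), ys.mean()     # centre of mass of this frame
        if prev_cx is not None:
            U = (cx - prev_cx) ** alpha * (cy - prev_cy) ** gamma
            S = np.sum((xs - cx) ** m * (ys - cy) ** n)
            total += U * S
        prev_cx, prev_cy = cx, cy
    return total

# Hypothetical sequence: a 10x10 square translating 4 pixels per frame
frames = []
for i in range(5):
    P = np.zeros((20, 30), dtype=int)
    P[5:15, 2 + 4 * i:12 + 4 * i] = 1
    frames.append(P)

vm_motion = velocity_moment(frames, m=0, n=0, alpha=1, gamma=0)
print(vm_motion)  # 1600.0 = 4 transitions x (4-pixel COM shift) x (area 100)
```

With m = n = 0 the shape term reduces to the area, so the result isolates the horizontal motion, as in the separated description vm_{0,0,α,γ} mentioned above.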
5.2.4 Results
The main criterion for assessing performance is naturally the recognition capability,
the correct classification rate (CCR). This is derived by allocating class according to
proximity, in the chosen feature space, to labeled class examples. The k-nearest
neighbor (k-nn) rule is well known for its robustness and performance, with the
added advantage that it can be easily replicated. Given that recognition rates are
usually high when biometric techniques reach publication status, it is common to
include measures of uncertainty in that performance. This allows for some
assessment of recognition capability in larger populations. Further, it is usual to
assess in some way practical capabilities in respect of potential application scenarios.
This includes performance in noisy conditions, as surveillance video tapes are often
re-used leading to undesirable video artifacts, and when a subject might be obscured
or part of their body occluded, say when walking behind a street lamp. Finally, a
unique advantage for gait is capability at low resolution, or at a distance, and this too
should be explored.
The statistical recognition techniques from Southampton have been analyzed
primarily on the Southampton dataset A but also on a selection of other databases.
The performance has been analyzed by CCR and a selection of measures of
uncertainty are also given. Performance factors have also been investigated and a
selection is described alongside analysis of each technique's recognition capability.
Recognition by Area Masks
Table 5.1 shows the recognition rates derived by application of individual area
masks [47] on the Southampton database A. Six samples of each subject (3 left to
right, 3 right to left) were used to train the database and two samples (1 left to right,
1 right to left) were used as the test data.
The number of correct recognitions M follows a binomial distribution, where p is the
true recognition rate. The mean and variance of M are Np and Np(1 − p),
respectively. The natural unbiased estimate for the recognition rate is M/N and the
expected error in the unbiased estimate is √(p(1 − p)/N). The observed recognition
rate can be used as an approximation to p. For our database of 114 subjects with
2 test samples of each, N = 228, so the expected error in the mean is around 3.0%,
as shown in Table 5.1.
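As a quick check of the quoted uncertainty figures, the binomial expected error √(p(1 − p)/N) can be evaluated directly. A small illustrative script, not part of the original analysis:

```python
import math

def expected_error(p, N):
    """Standard error of the unbiased binomial estimate M/N:
    sqrt(p(1-p)/N), with p the true recognition rate."""
    return math.sqrt(p * (1.0 - p) / N)

# 114 subjects x 2 test samples => N = 228
print(round(expected_error(0.5, 228), 4))    # 0.0331, near the ~3% quoted
```

Evaluating it at the combined-mask rates reproduces the expected-error column of Table 5.2 (e.g. p = 0.754 gives about 2.85%).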
The advantage of the area mask approach is that the execution speed is very fast,
especially compared with other extant approaches. Since these results are
substantially higher than chance, we must assume that there is clear potential for
area masks to derive discriminatory information.
By combining the results from area masks as discussed previously and using
canonical analysis in the same manner as before, the recognition rate increases
markedly. Results are combined by simply concatenating vectors together to
produce a vector of size n × 30, where n denotes the number of area masks used. By
combining masks we greatly increase the information available for recognition and
thus the results improve, as shown in Table 5.2. The achieved recognition rate of over
75% is encouraging and significantly greater than chance alone.
Area Masks Combined                Recognition Results    Expected Error
Horizontal + Vertical Line Mask    49.6%                  3.31%
Bottom Half + Full Mask            50.0%                  3.31%
All Above Masks                    75.4%                  2.85%
Table 5.2 Performance by Combining Multiple Masks
As can be seen from Table 5.3, the results are similar in levels of performance to
those in Table 5.1. However, we can see that the recognition rate using the
horizontal line mask has dropped by almost 20% to less than 10%. As expected,
this shows that the horizontal mask contains very little temporal information which
can be used for recognition. By combining the masks in the manner described
previously, it is possible to increase this recognition rate. Combining the four
masks produced a recognition rate of 52.7%. This suggests that approximately half
of the recognition rate comes from the temporal components of gait, with the
remainder coming from the shape of the silhouettes.
The psychologists' view is that gait is a symmetrical pattern of motion [53, 8].
We have assumed this to be true, and not taken into account the foot on which the
subject starts, or their direction of travel. Does taking this information into account
result in a significant difference in recognition rate? The recognition rates for each
combination of starting leg and walk direction by four area masks (vertical line,
bottom half, horizontal line and full) and leave one out cross validation produced
approximately constant recognition rates for each of the combinations. This was to
be expected, as all subjects are able-bodied and none showed a distinct impediment
in their walk visible to the human eye.
Analysing the results in more detail indicates that if knowledge of the starting
leg is used, then performance can be improved. Table 5.4 shows the recognition
rates of individual masks on a data set where all the subjects are walking right to left
and starting on the left foot.
If we compare the results in Table 5.4 to those in Table 5.1 we can see there is a
marked increase in the recognition rate when the starting leg and direction of travel
is taken into account. Table 5.5 shows the results on the same data set when the
masks are combined.
From the results in Tables 5.4 and 5.5 we can see that whilst knowledge of the starting leg
and direction of walk can increase performance when single masks are used, it does
not provide significant advantages when masks are combined. Further research is
necessary to determine if knowing the starting leg is a substantial advantage to the
potency of gait as a biometric. Essentially this poses a question concerning the
latent symmetry of gait which is yet to be determined by automatic analysis.
The Southampton database A consists of more than 100 subjects, of which 20
are female. Gender discrimination could be a practical first stage in gait
classification if the number of subjects is large, as it would reduce the number of
subjects needed to be searched. To avoid bias in the test sample, 20 people of each
sex were used, giving a total database size of 40. 15 people of each sex were used
for training, with the remainder as test subjects. In this case, four sequences were
used for each subject. This gave 120 training sequences and 40 test sequences.
By using canonical analysis, the maximal gender discrimination rate was 64%
which is better than chance, but not by much. Whilst results on gender classification
are disappointing, they are justifiable. With the subject walking normal to the
camera's plane of view there is very little gender discrimination information
available . In mitigation, the clothing of males is far more uniform than that of
females, suggesting that diversity of apparel could increase the potential recognition
rate for women. Previous medical work has actually indicated that male and female
walkers differ in terms of lateral body sway, with males tending to swing their
shoulders from side to side more than their hips, and females tending to swing their
hips more than their shoulders [50, 47]. This sort of information might be better
discerned from a fronto-normal or an overhead view. Further research is needed to
determine whether area masks, using different views of the subject, will be able to
fulfil the task of gender discrimination.
In this way, the area masks have shown capability to discriminate people by
their gait. Further, fusion approaches have been shown to improve performance.
Compared with chance and expected error rates, these early results are of statistical
significance and show that this baseline technique can indeed be deployed to
recognise people. Naturally, the simplicity sacrifices discriminatory ability, but
people can be recognized just by change in area over time. Later approaches will
offer much better performance, but usually at the expense of complexity. The
measures have been shown to use shape and motion independently, again a feature
of later approaches. Finally, there is in principle a consideration of the potency of
the measurements. This is a topic of current interest in all of biometrics, and we
shall describe gait's contribution to this growing debate later.
Recognition by Symmetry
After our earlier work [135, 137], the same method was applied to a much larger
database of 28 subjects, part of the Southampton database A. At the time, this new
database equaled in size the largest (published) contemporaneous gait database,
though new and larger databases will emerge in the near future. Each subject has at
least four image sequences giving in total 112 image sequences. With this part of
the database, only the silhouette information is used and the recognition rates
obtained are also shown in Table 5.6. Clearly, using a larger database still gave the
same good recognition rates and that is very encouraging.
For the low pass filter, all possible values of radius were used to investigate the
number of components that can be used (covering 0.1 to 100% of the Fourier data).
Though the results of Table 5.6 were achieved for all radii greater than 3 (using the
early Southampton database), selecting fewer Fourier components might affect the
recognition rates on a larger database of subjects, and this needs to be investigated
in future. Fig. 5.5 shows the general trend of the recognition rate against the
different cut-off frequencies. The figure shows that selecting about 1.1% of the
Fourier descriptors is enough to give the best recognition rates for both k = 1 and
k = 3, but this is only on a small (early) part of the database.
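The cut-off selection can be illustrated by applying a circular low-pass mask to a 2D Fourier spectrum and counting the fraction of components retained. This is an illustrative sketch only; the symmetry maps and exact radii of the actual experiments are not reproduced here:

```python
import numpy as np

def lowpass_descriptors(image, radius):
    """Keep only the Fourier components within `radius` of the spectrum
    centre (after fftshift); returns the retained complex values."""
    F = np.fft.fftshift(np.fft.fft2(image))
    h, w = F.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return F[mask]

img = np.zeros((64, 64))
img[20:40, 25:35] = 1.0                 # a crude silhouette-like block
kept = lowpass_descriptors(img, 3)
print(kept.size / img.size)             # radius 3 keeps ~0.7% of the data
```

Small radii retain only a fraction of a percent of the Fourier data, consistent with the observation that about 1% of the descriptors suffices.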
[Fig. 5.5: Effect of low-pass filtering using different radii - recognition rate (%) plotted against filter radius (1 to 45).]
Performance was evaluated with respect to missing spatial data, missing frames and
noise using the early Southampton database. Out of the 16 image sequences in the
database, one (from subject 4) was used as the test subject with the remainder for
training. The evaluation, aimed to simulate time lapse, was done by omitting a
consecutive number of frames. For a range of percentages of omitted frames, Fig.
5.6, the results showed no effect on the recognition rates for k = 1 or k = 3. This is
due to the averaging associated with the symmetry operator. Fig. 5.6 shows the
general trend of deviation of the best match of each subject to the test subject. The
increase in the similarity difference measures as more frames are omitted appears to
be due to the apparent increase in high frequency components when averaging with
fewer frames.
[Fig. 5.6: Similarity difference measure of the best match to the test subject, plotted against the percentage of omitted frames.]
The evaluation was done by masking with a rectangular bar of different widths: 5,
10 and 15 pixels in each image frame of the test subject and at the same position.
The area masked was on average 13.2%, 26.3% and 39.5% of the image silhouettes,
respectively. The bar either had the same color as the image silhouette or as the
background color, as shown in Fig. 5.7, simulating omission and addition of spatial
data, respectively. In both cases, recognition rates of 100% were obtained for a bar
size of 5 pixels for both k = 1 and k = 3. For a bar width of 10 pixels, Fig. 5.7c
failed but Fig. 5.7a gave the correct recognition for k = 3 but not for k = 1. Fig.
5.7c failed as the black bar changes effective body shape, and recognition uses body
shape and body dynamics. For bar sizes of 15 (i.e., Figs. 5.7b and d) and above, the
test subject could not be recognized as the subject is completely covered in most of the
image frames. This suggests that recognition is unlikely to be adversely affected
when a subject walks behind a vertically placed object, such as a lamp post - when
it is thinner than the human shape.
To investigate the effects of noise (to simulate poor quality video data - usual in
many surveillance applications), synthetic noise was added to each image frame of a
test subject and the resulting signature was compared with those of the other subjects in
the database. Figs. 5.7e and f show samples of the noise levels used. The evaluation
was carried out under two conditions. First, by using the same values of σ and μ
(Eqn. 5.7) as earlier, for a noise level of 5%, the recognition rates for both k = 1 and
k = 3 were 100%. For 10% added noise, the test subject could still be recognized
correctly for k = 1 but not for k = 3. This suggests that the recognition is not greatly
affected by limited amounts of failure in background extraction, though those errors
are less likely to be uncorrelated. With added noise levels of 20% and above, the
test subject could not be recognized for k = 1 or k = 3. In order to reduce the
contribution of distant (noisy) points, the values of σ and μ were then made
relatively small. Now the recognition rates (100%) were not affected for both k = 1
and k = 3 for added noise levels even exceeding 60%. Fig. 5.6b shows how in this
case the best match of each subject deviated from that for the test subject, as more
noise was added. Here, the high recognition rate is consistent with the same order
being achieved despite the addition of noise. As such, the symmetry operator can be
arranged aiming to include noise tolerance, but the ramifications of this await
further investigation.
Recognition by Velocity Moments
The velocity moments were analyzed first on the CMU data [138], and later on
part of the Southampton data. The CMU data is silhouette data only, without the
original images, of subjects walking on a treadmill. This restricted deployment of
the velocity moments, but they still showed good performance with a recognition
rate exceeding 90%.
The part of the Southampton database A used for evaluation of the velocity
moments consisted of 50 subjects, with 4 sequences of each subject (~600 images).
The sequences studied contained the subjects walking from left to right for one and
a half gait cycles (three consecutive heel strikes). Moments were computed from
silhouettes and optical flow. Due to the increased resolution of the images and the
distance over which the subjects walked, the optical flow was computed within a
moving window, moving at the subject's average velocity, a method already used in
gait recognition by Little [88]. The average velocity was calculated using the Center
Of Mass (COM) information from the silhouette moments. As before, the moments
of flow are (effectively) windowed data, so the average velocity is placed back into
the velocity moment calculation (as done for the Southampton and UCSD
databases). However, due to the increased resolution and size of the dataset, the
Zernike moment scaling was switched off to avoid problems through scaling large,
non-binary images (i.e. the mapping could scale the subject causing it to exceed the
unit disc's area). The images were instead scaled to appear visually central to the
unit disc, i.e. the thresholded image's COM was used in the mapping (in place of the
actual grayscale COM). 234 Zernike velocity moments were computed on the
silhouettes and windowed optical flow.
The moments were investigated for discriminatory capability and only moments with a
large F statistic are used; it is worth noting that using just two velocity moments achieves nearly 40%
discrimination capability. The classification rates for the five selected velocity
moments computed from optical flow were relatively low in comparison to those
calculated from the silhouettes . The selected velocity moments based on flow
actually favor those holding solely spatial information. A similar result was found
upon analysis of the UCSD data (Section 3.1.1), supporting the hypothesis that
using optical flow gives detailed information about a subject's limb motion (which
may not vary enough between subjects on its own to allow for good subject
separation), while the silhouettes hold global shape/motion differences . The k = 1
classification results exceed those for k = 3 which suggests that the feature space is
closely packed (with respect to subject clusters). There are two obvious solutions to
overcome this problem. The first is to increase the dimensionality, using more
features to increase cluster separation; the second is to use a more sophisticated
classifier. Though this effect may be caused by the normalization of the moment
values, the normalization is actually used to stop biasing of the k-nn classifier by
moments which naturally produce larger values. However, if one subject produces
significantly different feature values to the rest of the database, the remaining
subjects within the database (and the differences between them) will be compressed
into a small area of the feature space. This highlights the possible need for an
alternative classifier when analyzing larger datasets, or in situations where the
feature's order of magnitude may vary. Alternatively, we can combine the two
results by silhouette and optical flow which results in higher classification rates of
96% and 93% for k = 1 and k = 3, respectively. The proximity of these results
suggests that the feature space is less packed, with respect to between-subject
differences, than it was when using just silhouettes or optical flow alone. Finally, it
is worth noting that the probability of correctly classifying by chance this part of the
Southampton database A (for four independent samples of each subject) is of the
order of 10^-6.
Some performance factors have already been addressed in other analysis. One
potent facet of gait as a biometric is the capability for recognition at a distance. This
can be simulated by reducing the resolution of the silhouette. In fact, symmetry
retains 90% recognition capability even when the silhouette is reduced to size
16 × 16. A similar analysis was conducted more comprehensively on the velocity
moments. The analysis was applied to the Southampton silhouettes as the re-scaling
is relatively simple, due to their binary nature. For each sequence, the Zernike
velocity moments used for classification were re-calculated for each different image
resolution. The normalized mean squared error (NMSE) is then calculated between
the original velocity moment values (Oi) - the full resolution versions - and the new
'altered' values (Wi), at each incremental step. The NMSE was defined as
NMSE = Σi (Oi − Wi)² / Σi Oi²   (5.13)
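Assuming the common energy-normalised definition suggested by the Oi/Wi notation, the NMSE computation can be sketched as follows (`nmse` is a hypothetical helper name):

```python
import numpy as np

def nmse(original, altered):
    """Normalised mean squared error between the full-resolution
    velocity-moment values O_i and the re-computed values W_i:
    the sum of squared differences over the energy of the originals."""
    o = np.asarray(original, dtype=float)
    w = np.asarray(altered, dtype=float)
    return np.sum((o - w) ** 2) / np.sum(o ** 2)
```

A value of 0 means the reduced-resolution moments are unchanged; values near 1 mean the error is comparable to the signal itself.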
compactness and separation. Additionally, the subject clusters may all shift in the
feature space relative to each other, representing no change in the classification rate,
even though the features themselves have changed. If this reduction in resolution is
performed before the mapping function then the results are dependent both on the
mapping calculation and the Zernike moment calculation. However, the mapping
process will have a positive effect on the handling of lower resolution images. It
effectively ensures that there is no loss in the accuracy of the Zernike polynomial
calculations, by mapping the reduced resolution image to the same grid size as the
original resolution calculation, thus making the two results directly comparable. If
the reduction in resolution is applied after the mapping process, theory suggests that
the errors will rapidly increase due to loss of both image and calculation precision.
Therefore, the effects of applying the reduction in resolution before the mapping
process were studied. Assuming that the original image is the highest resolution
available, the images were progressively re-sampled to reduce their resolution.
Sub-pixel estimation is allowed, enabling any re-sampling size to be achieved. Eleven
different resolutions were analyzed from 50% of the original resolution, through to
2%. In recognition, the errors began to diverge as the pixel size increases beyond
10. However, the NMSE errors are still low. Degrading the resolution is effectively
adding noise to the perimeter of the silhouette, up to the point where each image
loses its overall shape.
Euclidean distance and the nearest neighbor rule were used. A more sophisticated
classifier was not used since the important factor at this stage was only the relative
reduction or increase in recognition rate. It is worth noting that, to be able to display
all results in an easily understandable way, the initial feature sets were extracted
from 64 × 64 images, including zeros where there is no silhouette. However, to
simplify the statistical analysis, the following mask of each database was constructed:
features which contain only zeros throughout all considered databases were removed
and their locations in the original set were recorded for later display purposes.
However, there are feature vectors which still contain zeros for some subjects.
Therefore PCA was run on the covariance matrix rather than the correlation matrix.
[Fig. 5.8: Shared features - 84% of variance explained. Top: locations of the shared features on the silhouette. Bottom: recognition rate plotted against the number of features.]
Two sets of important features, which are the same in all three databases, were
considered. First, the features which explain 100% of the variance in each data set, i.e.
236, 1001 and 217 features from the three databases. These features contain 115
important features shared among the three datasets. Fig. 5.8 shows the location of the
shared 115 features on the silhouette in the top left picture. The shared features cover
the contours of the head, body, some of the legs and some features of the arm. To find
out which role these 115 features play in recognition, database A was considered as a
gallery and database AS was considered as a probe, and recognition rates were calculated
both for all 4096 features and for the shared 115 features. The bottom left picture in Fig.
5.8 shows how the recognition rate changes as additional features are added. Again here
the solid line describes the dependency of significant features versus recognition rate
(46.3% for 115 features), while the dashed line corresponds to the recognition rate
when all features are considered (56.4%). In this case 17.9% of the recognition rate was
lost. A further reduction was tried. From each dataset the 150 features obtained by
PCA earlier were compared and 79 shared features were selected. It was found
that these 79 features explain approximately 84% of the variance in each database. These
features were projected onto the silhouette and are presented in Fig. 5.8, top right. In this
case the most important shared features are the contours of the head and body. The
recognition rate versus the shared features is presented in the bottom right picture of
Fig. 5.8. In this case the recognition rate for 79 features was 41.3% in comparison to
56.4% for all 4096 features, i.e. a reduction of 26.8%. Practically this means that it is
not enough for a differential silhouette to include only the static component of gait, in
spite of the fact that static components of gait account for 84% of the explained
variance. Legs play an important role in a differential silhouette; however a
practically negligible number of features describing the legs is shared through time, i.e.
through all three databases. This then suggests that the motion estimation is crude
and should be improved in future analysis.
Human gait is usually determined by the walker's weight, limb length, habitual
posture and so on. It includes both the body appearance and the dynamics of human
walking motion. In theory, joint angle changes are sufficient for recognition by gait.
However, their recovery from a video of a walking person is difficult for current
vision techniques. The particular difficulties of joint angle computation from
monocular video sequences are self-occlusions of limbs and joint angle
singularities. Empirically, recognizing humans by gait can be achieved by applying
statistical analysis to the temporal patterns of individual subjects, which has been
well demonstrated in gait recognition. These techniques remain statistical in
essence, describing human motion by a compact representation of motion or
structural statistics of a sequence of area distributions rather than an attempt to
match the data to a model. Intuitively, recognizing people through gait depends
greatly on how the silhouette shape of an individual changes over time. Therefore,
we may consider gait as being composed of a set of static poses and their temporal
variations can be analyzed to obtain distinguishable signatures. Based upon the
above consideration, here we present a model-free automatic gait recognition
algorithm using the Procrustes shape analysis method [214]. The algorithm was
developed at the CAS Institute of Automation.
Fig. 5.9 gives an overview of the proposed method [121]. For each input
sequence, an improved background subtraction procedure is first used to extract the
spatial silhouettes of walking figures from the background. Pose changes of these
segmented silhouettes over time are then represented as an associated sequence of
complex configurations in a 2D shape space and are further analyzed by the
Procrustes shape analysis method to obtain an eigenshape as gait signature.
Standard pattern classification techniques such as the k-nearest neighbor classifier
and the nearest exemplar classifier based on the full Procrustes distance measure are
respectively adopted for recognition. Like many previous works, this approach also
does not directly analyze gait dynamics. It includes the appearance as part of the
gait recognition features. It is in essence holistic because gait is implicitly
characterized by the structural statistics of the spatiotemporal patterns generated by
the silhouette of the walking person in image sequences.
[Fig. 5.9: Overview of the method, from input gait sequence to motion blob sequence.]
Silhouette Extraction
Human detection and tracking is the first step in gait analysis. To extract and
track moving silhouettes of a walking figure from the background image in each
frame, change detection and tracking based on background subtraction is adopted.
The main assumption made here is that the camera is static, and the
only moving object in the video sequences is the walker. Although this integrated
method basically performs well on the CASIA dataset, it should be noted that robust
motion detection in unconstrained environments is an unsolved problem for current
vision techniques because it concerns a number of difficult issues such as shadows
and motion clutter.
Background subtraction has been widely used in foreground detection, where a
fixed camera is usually used to observe dynamic scenes. How to reliably generate
the background image from video sequences is critical. Here the LMedS (Least
Median of Squares) method [215] is used to construct the background from a small
portion of the image sequence, even one including moving objects. Its advantages are that it
is especially efficient for 1D data processing, returning the correct result even when
many outliers are present. Let I represent a sequence including N images. The
resulting background b_xy can be computed by

b_xy = arg min_p med_t ( I_xy(t) − p )²   (5.14)

where p is the background brightness value to be determined for the pixel location
(x, y), med represents the median value, and t represents the frame index ranging
from 1 to N. It is found that N over 60 is sufficient for the CASIA dataset to generate
a reliable background.
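For a single pixel, the LMedS estimate of Eqn. 5.14 has a simple closed form: the midpoint of the shortest interval containing half of the sorted samples. A vectorised sketch under that assumption (illustrative, not the CASIA implementation):

```python
import numpy as np

def lmeds_background(frames):
    """Per-pixel Least-Median-of-Squares background estimate.
    For each pixel, the brightness p minimising med_t (I_t - p)^2 is the
    midpoint of the shortest interval containing half of the samples."""
    stack = np.sort(np.asarray(frames, dtype=float), axis=0)   # (N, H, W)
    n = stack.shape[0]
    h = n // 2 + 1                                  # samples per half-window
    widths = stack[h - 1:] - stack[:n - h + 1]      # widths of candidate windows
    lo = np.argmin(widths, axis=0)                  # start of shortest window
    idx = np.indices(lo.shape)
    return (stack[lo, idx[0], idx[1]] + stack[lo + h - 1, idx[0], idx[1]]) / 2.0
```

Because the estimate ignores the widest half of the samples, a pixel occluded by a walker in a minority of frames still recovers the true background brightness.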
The brightness change is often obtained through differencing between the
background and the current image. However, the selection of a suitable threshold for
binarization is very difficult, especially in the case of low contrast images, as most of
the moving objects may be missed since the brightness change is too low to
distinguish regions of moving objects from noise. To solve this problem, we use the
following extraction function to indirectly perform differencing [215]

f(a, b) = 1 − [2√((a+1)(b+1)) / ((a+1) + (b+1))] · [2√((256−a)(256−b)) / ((256−a) + (256−b))]   (5.15)

where a(x, y) and b(x, y) are the brightness of the current image and the background at
the pixel position (x, y) respectively, 0 ≤ a(x, y), b(x, y) ≤ 255, 0 ≤ f(a, b) < 1. This
function adapts its sensitivity to the difference value according to the
brightness level of each pixel in the background image.
For each image I_xy, the distribution of the above extraction function f(a(x, y),
b(x, y)) over x and y can be easily obtained. Then, the moving pixels can be
extracted by comparing such a distribution against a threshold value decided by the
conventional histogram method.
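The extraction function of Eqn. 5.15 can be sketched directly from its definition (`extraction_f` is a hypothetical name; the thresholding step is omitted):

```python
import numpy as np

def extraction_f(a, b):
    """Brightness-change function of Eqn. 5.15: 0 for identical
    brightness, approaching 1 for a large change, with sensitivity
    adapted to the brightness level (a, b in 0..255)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    low  = 2.0 * np.sqrt((a + 1) * (b + 1)) / ((a + 1) + (b + 1))
    high = 2.0 * np.sqrt((256 - a) * (256 - b)) / ((256 - a) + (256 - b))
    return 1.0 - low * high
```

Each factor is a ratio of geometric to arithmetic mean, which equals 1 when a = b, so f(a, a) = 0 and f grows as the brightness values diverge.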
It should be noted that the above process is performed independently for each
component R, G and B in an image. For a given pixel, if one of the three
components determines it as the changing point, then it will be set to the
foreground . This produces a mask that is considered as a region of interest for
further processing.
No change detection algorithm is perfect. Hence, it is imperative to remove as
much noise and distortion as possible from the segmented foreground.
Morphological operators such as erosion and dilation are first used to further filter
spurious pixels, and small holes inside the extracted silhouettes are then filled. A
binary connected component analysis is finally applied to extract a single highly
compact connected region with the largest size. An example of gait detection is
shown in Fig. 5.10.
Representation of Silhouette Shapes
An important cue in determining underlying motion of a walking figure is the
temporal changes in the walker's silhouette shape. To make the proposed method
insensitive to changes of color and texture of clothing, we ignore the color of the
foreground objects and only use the binary silhouette. Further, for the sake of
reducing redundant information, we use spatial edge contours to approximate
temporal patterns of gaits (see Fig. 5.11).
Once the spatial silhouette of a walking subject is extracted, its boundary can be
easily obtained using a border following algorithm based on connectivity. Then, we
can compute its shape centroid (x_c, y_c) by

x_c = (1/N_b) Σi x_i,   y_c = (1/N_b) Σi y_i   (5.16)

where N_b is the total number of boundary pixels, and (x_i, y_i) is a pixel on the
boundary. Let the centroid be the origin of the 2D shape space. We can then unwrap
each shape anticlockwise into a set of boundary pixel points sampled along its
outer-contour in a common complex coordinate system. That is, each shape can be
described as a vector of ordered complex numbers with N_b elements

z = [z_1, z_2, ..., z_Nb]^T   (5.17)
where z_i = x_i + j·y_i. The extraction and representation process of the silhouette's
boundary is illustrated in Fig. 5.11, where the black dot indicates the shape centroid,
and the two axes Re and Im represent the real and imaginary parts of a complex
number respectively. Therefore, each gait sequence will be accordingly converted
into an associated sequence of such 2D shape configurations.
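The construction of the complex shape vector can be sketched as follows, including a simple interpolation-based resampling so that configurations from different frames have the same length. This is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def shape_configuration(boundary_xy, n_points=360):
    """Represent an ordered silhouette boundary as a centred complex
    vector: subtract the shape centroid (Eqn. 5.16) and resample the
    contour to a fixed number of points for comparability."""
    b = np.asarray(boundary_xy, dtype=float)          # (Nb, 2) ordered pixels
    centroid = b.mean(axis=0)
    z = (b[:, 0] - centroid[0]) + 1j * (b[:, 1] - centroid[1])
    # resample by linear interpolation along the ordered contour
    t_old = np.linspace(0.0, 1.0, len(z), endpoint=False)
    t_new = np.linspace(0.0, 1.0, n_points, endpoint=False)
    return np.interp(t_new, t_old, z.real) + 1j * np.interp(t_new, t_old, z.imag)
```

Centring at the centroid gives translational invariance for free; scale and rotation are handled later by the Procrustes distance.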
We need a method that allows us to compare a set of static pose shapes in a gait
pattern and is robust to changes of position, scale and rotation. A mathematically
elegant way of aligning point sets in a common coordinate system is Procrustes
shape analysis [214], so it is expected that it can be easily adapted to handle spatial
patterns of gait motion. In the following section, we will give a brief introduction to
the Procrustes shape analysis method and show its application in gait signature
extraction and classification.
5.3.3 Procrustes Gait Feature Extraction and Classification
The full Procrustes distance d_F between two shape configurations u_1 and u_2 is
obtained by minimizing the alignment error over scale and rotation, giving

d_F(u_1, u_2) = [1 − |u_1* u_2|² / ((u_1* u_1)(u_2* u_2))]^(1/2)   (5.20)

Note that the superscript * represents the complex conjugate transpose and
0 ≤ d_F ≤ 1. The Procrustes distance allows us to compare two shapes independent
of position, scale and rotation.
Given a set of n shapes, we can find their mean by finding u that minimizes the
objective function [89]

û = arg min_u Σi d_F²(z_i, u)   (5.21)

The Procrustes mean shape is the dominant eigenvector of S_u, i.e., the eigenvector
that corresponds to the greatest eigenvalue of S_u [89].
we call this gait signature the "Eigenshape". The following summarizes the major
steps in determining the Procrustes mean shape for a sequence of shapes from n
frames, e.g., a gait pattern.
1. Select a set of k points from the boundary to represent a 2D shape as a vector
configuration z_i in the manner discussed in Section 5.3.2. We tackle the point
correspondence problem through interpolation of boundary pixels so that the point
set is the same size for each image;
2. Set the centered configuration. When we represent the silhouette shape, we
have used the shape centroid as the origin of 2D space to move all shapes to a common
center to handle translational invariance. So, we can directly set u_i = z_i, i = 1, 2,
..., n;
3. Compute the matrix S_u using Eqn. (5.22). Then, compute the eigenvalues and
the associated eigenvectors of S_u;
4. Set the Procrustes mean shape û as the eigenvector that corresponds to the
maximum eigenvalue, and use this mean shape as the gait signature.
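The steps above can be sketched as follows, assuming the standard form S_u = Σi z_i z_i* / (z_i* z_i) for the matrix of Eqn. (5.22), which is not reproduced in the text; `procrustes_mean_shape` is a hypothetical name:

```python
import numpy as np

def procrustes_mean_shape(shapes):
    """Eigenshape: dominant eigenvector of S_u = sum_i z_i z_i* / (z_i* z_i),
    where each z_i is a centred complex configuration of equal length."""
    k = shapes[0].size
    S = np.zeros((k, k), dtype=complex)
    for z in shapes:
        z = z.reshape(-1, 1)
        S += (z @ z.conj().T) / (z.conj().T @ z)      # unit-norm outer product
    w, v = np.linalg.eigh(S)                           # S is Hermitian
    return v[:, np.argmax(w)]                          # eigenvector of max eigenvalue
```

For a sequence of identical shapes S is rank one, so the eigenshape coincides (up to a complex phase) with the shape itself.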
The smaller the above distance measure is, the more similar the two gaits are.
Gait recognition is a traditional pattern classification problem which can be
solved by measuring similarities among gait sequences . We try three different
simple classification methods, namely the nearest neighbor classifier (NN), the k-
nearest-neighbor classifier (kNN) , and the nearest neighbor classifier with class
exemplar (ENN) .
Let T represent a test sequence and R_i represent the ith reference sequence; we
may classify the test sequence into the class c that minimizes the similarity
distance between the test sequence and all reference patterns by

c = arg min_i d(T, R_i)   (5.24)
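This rule can be sketched with the full Procrustes distance, assuming the standard form d_F(u1, u2) = [1 − |u1* u2|² / ((u1* u1)(u2* u2))]^(1/2); both helper names are hypothetical:

```python
import numpy as np

def full_procrustes_distance(u1, u2):
    """d_F in [0, 1]: 0 for shapes identical up to scale and rotation."""
    num = abs(np.vdot(u1, u2)) ** 2
    den = np.vdot(u1, u1).real * np.vdot(u2, u2).real
    return np.sqrt(max(0.0, 1.0 - num / den))

def classify_nn(test_sig, reference_sigs, labels):
    """Nearest-neighbor rule of Eqn. 5.24 over gait signatures."""
    d = [full_procrustes_distance(test_sig, r) for r in reference_sigs]
    return labels[int(np.argmin(d))]
```

Because d_F ignores scale and rotation, a signature that is a rotated, rescaled copy of a reference still classifies to that reference with zero distance.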
This study at the CAS Institute of Automation aimed to establish an automatic gait
recognition method based upon spatiotemporal silhouette analysis measured during
walking. Gait includes both the body appearance and the dynamics of human
walking motion. Intuitively, recognizing people by gait depends greatly on how the
silhouette shape of an individual changes over time in an image sequence. So, we
may consider gait motion as being composed of a sequence of static body poses and
expect that some distinguishable signatures with respect to those static body poses
can be extracted and used for recognition by considering temporal variations of
those observations. Also, eigenspace transformation based on PCA has been
demonstrated to be a potent technique in face recognition (i.e., eigenfaces) and gait
analysis. Based on these observations, we proposed a silhouette-analysis-based gait
recognition algorithm using the traditional PCA. The algorithm implicitly captures
the structural and transitional characteristics of gait, especially the shape cues of
body biometrics. Although it is very simple in essence, the experimental results are
surprisingly promising.
[Figure: Overview of the proposed system: human detection and tracking, feature
extraction, projection into the eigenspace, and eigenspace computation for the
training and classification/recognition phases.]
correspondence method. The second module is used to extract the binary silhouette
from each frame and map the 2D silhouette image into a 1D normalized distance
signal by contour unwrapping with respect to the silhouette centroid. Accordingly,
the shape changes of these silhouettes over time are transformed into a sequence of
1D distance signals that approximate the temporal changes of the gait pattern. The
third module either applies PCA to those time-varying distance signals to compute
the predominant components of gait signatures (training phase) or determines the
person's identity using standard non-parametric pattern classification techniques
in the lower-dimensional eigenspace (classification phase).
Spatiotemporal Feature Extraction
Before training and recognition, each image sequence including a walking figure is
converted into an associated temporal sequence of distance signals at the
preprocessing stage.
[Figure: Unwrapping the silhouette contour into a normalized distance signal.]
centroid as a reference origin, we unwrap the outer contour anti-clockwise to turn it
into a distance signal S = {d_1, d_2, ..., d_i, ..., d_Nb} that is composed of all
distances d_i between each boundary pixel (x_i, y_i) and the centroid (x_c, y_c)

    d_i = sqrt((x_i - x_c)^2 + (y_i - y_c)^2)                           (5.25)
This signal indirectly represents the original 2D silhouette shape in the 1D space.
To eliminate the influence of spatial scale and signal length, we normalize these
distance signals with respect to magnitude and size. First, we normalize the signal
magnitude through its L1-norm. Then, equally spaced re-sampling is used to normalize
its size to a fixed length (360 in our experiments). Additionally, we regularize the
walking direction of sequences taken from the same view based upon the symmetry
of gait motion during shape representation (e.g., from left to right for all sequences
with the lateral view). By converting such a sequence of silhouette images into an
associated sequence of 1D signal patterns that can be further analyzed using robust
signal analysis techniques, we no longer need to cope directly with the likely noisy
silhouette data.
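The unwrapping described above can be sketched as follows, assuming a clean binary silhouette; the 4-neighbour boundary test and linear re-sampling are illustrative stand-ins for whatever contour following is used in practice.

```python
import numpy as np

def distance_signal(silhouette, length=360):
    """Unwrap a binary silhouette into a fixed-length 1D distance signal.

    silhouette: 2D boolean array. Boundary pixels are found with a simple
    4-neighbour erosion test (a stand-in for proper contour following).
    """
    ys, xs = np.nonzero(silhouette)
    cy, cx = ys.mean(), xs.mean()                 # silhouette centroid
    # boundary = foreground pixels with at least one background 4-neighbour
    padded = np.pad(silhouette, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    by, bx = np.nonzero(silhouette & ~interior)
    # order boundary pixels by angle about the centroid (anti-clockwise)
    ang = np.arctan2(by - cy, bx - cx)
    order = np.argsort(ang)
    d = np.hypot(bx[order] - cx, by[order] - cy)  # Eqn. (5.25)
    d /= d.sum()                                  # L1 magnitude normalization
    # equally spaced re-sampling to the fixed length
    t = np.linspace(0, 1, d.size, endpoint=False)
    tq = np.linspace(0, 1, length, endpoint=False)
    return np.interp(tq, t, d)
```

For a circular silhouette the resulting signal is nearly flat, which is a quick sanity check of the unwrapping.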
Feature Extraction and Classification
Training and Feature Projection
The purpose of PCA training is to obtain several principal components that re-
represent the original gait features, mapping them from a high-dimensional
measurement space to a low-dimensional eigenspace. The training process is as
follows.
Given s classes for training, each class represents a sequence of distance
signals of one subject's gait. Multiple sequences of each person can be freely added
for training without altering the following method. Let D_{i,j} be the
jth distance signal in class i and N_i the number of such distance signals in the ith
class. The total number of training samples is N_t = N_1 + N_2 + ... + N_s, and the
whole training set can be represented by [D_{1,1}, D_{1,2}, ..., D_{1,N_1}, D_{2,1}, ..., D_{s,N_s}].
obtain the mean m_d and the global covariance matrix Σ of such a data set by

    m_d = (1/N_t) sum_{i=1..s} sum_{j=1..N_i} D_{i,j}                   (5.26)

    Σ = (1/N_t) sum_{i=1..s} sum_{j=1..N_i} (D_{i,j} - m_d)(D_{i,j} - m_d)^T   (5.27)
If the rank of the matrix Σ is N, then we can compute N nonzero eigenvalues λ_1, λ_2,
..., λ_N and the associated eigenvectors e_1, e_2, ..., e_N based on SVD (Singular Value
Decomposition).
Generally speaking, the first few eigenvectors correspond to large changes in
training patterns, and the higher-order eigenvectors represent small changes.
Therefore, for the sake of memory efficiency in practical applications, we may
ignore the small eigenvalues and their corresponding eigenvectors using a
threshold value T_s

    w_k = (sum_{i=1..k} λ_i) / (sum_{i=1..N} λ_i) >= T_s                (5.28)
where w_k is the accumulated variance of the first k largest eigenvalues with respect
to all eigenvalues. In our experiments, T_s is chosen as 0.95 to obtain steady
results.
Taking only the k < N largest eigenvalues and their associated eigenvectors, the
transform matrix E = [e_1, e_2, ..., e_k] can be constructed to project an original
distance signal D_{i,j} into a point P_{i,j} in the k-dimensional eigenspace spanned by
this partial set of k eigenvectors

    P_{i,j} = [e_1 e_2 ... e_k]^T D_{i,j}                               (5.29)
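A sketch of the training and projection steps (Eqns. 5.26-5.29); `pca_train` and `project` are hypothetical helper names, and the T_s = 0.95 threshold follows the text.

```python
import numpy as np

def pca_train(signals, threshold=0.95):
    """Train the eigenspace from training distance signals.

    signals: array (N_t, d), one distance signal per row. Returns the mean
    signal and the transform matrix E = [e_1 ... e_k], keeping the smallest k
    whose eigenvalues accumulate `threshold` of the variance (Eqn. 5.28).
    """
    md = signals.mean(axis=0)                       # Eqn. (5.26)
    X = signals - md
    cov = X.T @ X / len(signals)                    # Eqn. (5.27)
    # SVD of the symmetric covariance gives eigenpairs sorted descending
    U, lam, _ = np.linalg.svd(cov)
    w = np.cumsum(lam) / lam.sum()
    k = int(np.searchsorted(w, threshold)) + 1
    return md, U[:, :k]

def project(md, E, signal):
    """Project one distance signal into the k-dim eigenspace (Eqn. 5.29)."""
    return E.T @ (signal - md)
```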
where P_2(at+b) is a dynamic time warping version of P_2(t) with respect to time
stretching and shifting, used to approximate the temporal alignment between the
two sequences. The selection of the parameters a and b depends on the relative
stride frequency and the phase difference within a stride (two steps), respectively.
Let f_1 and f_2 denote the frequencies of the two gait sequences; then a = f_2/f_1. By
cropping a sub-sequence of length f_2 from the second sequence vector repeatedly
and expanding or contracting it by a, we may obtain its correlation with P_1(t). The
average minimum of all prominent valleys of the correlation results determines their
similarity.
Figure 5.14 Gait Period Analysis: (a) input sequences, (b) aspect ratio signals of
moving silhouettes, (c) signals after removing the background, (d) autocorrelation
signals, (e) first-order derivative signals of autocorrelations, and (f) the positions
of peaks.
Gait period analysis has been explored in previous work [102, 110]; it serves
to determine the frequency and phase of each observed sequence so that dynamic
time warping can align sequences before matching. Note that a step is the
motion between successive heel strikes of opposite feet and that a complete gait
period comprises two steps. In [102], the width-time signal of the bounding box of
the moving silhouette derived from an image sequence is used to analyze the gait
period. In [110], either the width-time signal or the height-time signal is used,
because the silhouette width for frontal views is less informative, while the
silhouette height as a function of time plays an analogous role in periodicity.
Unlike these methods, here we choose the aspect ratio of the bounding box of the
moving silhouette as a function of time, so as to cope effectively with both lateral
and frontal views.
The process of period analysis of each gait sequence is shown in Fig. 5.14. For an
input sequence (Fig. 5.14(a)), once the person has been tracked for a certain number
of frames, its spatiotemporal gait parameters such as the aspect ratio signal of the
moving silhouette can be estimated (Fig. 5.14(b)). We may remove its background
component by subtracting its mean and dividing by its standard deviation, and then
smooth it with a symmetric average filter (Fig. 5.14(c)). Further, we compute its
autocorrelation to find peaks occurring at a fundamental frequency (Fig. 5.14(d)).
Finally, we compute its first-order derivative (Fig. 5.14(e)) to find peak positions by
seeking the positive-to-negative zero-crossing points (Fig. 5.14(f)). Due to the
bilateral symmetry of human gait, the autocorrelation will sometimes have minor
peaks halfway between each pair of major peaks. The strength of these minor peaks
diminishes from nearly that of the major peaks to zero as the camera viewpoint
deviates from fronto-parallel to perpendicular with respect to the image plane (see
each column from left to right in Fig. 5.14). Hence, we estimate the real period as
the average distance between each pair of consecutive major peaks. This process has
been demonstrated to be computationally feasible with respect to our background
subtraction results.
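The steps of Fig. 5.14 can be sketched as follows; this is a simplified stand-in in which the major/minor-peak distinction is reduced to a single amplitude threshold.

```python
import numpy as np

def gait_period(aspect_ratio, smooth=5):
    """Estimate the gait period (in frames) from the aspect-ratio signal of
    the silhouette bounding box, following the steps of Fig. 5.14.
    """
    s = np.asarray(aspect_ratio, dtype=float)
    s = (s - s.mean()) / s.std()                  # remove background component
    kernel = np.ones(smooth) / smooth             # symmetric average filter
    s = np.convolve(s, kernel, mode='same')
    ac = np.correlate(s, s, mode='full')[s.size - 1:]   # autocorrelation lags >= 0
    d = np.diff(ac)                               # first-order derivative
    # peak positions: positive-to-negative zero crossings of the derivative
    peaks = np.nonzero((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    peaks = peaks[ac[peaks] > 0.1 * ac[0]]        # keep only major peaks
    if len(peaks) < 1:
        return None
    lags = np.diff(np.concatenate(([0], peaks)))
    return float(lags.mean())                     # average major-peak spacing
```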
Note that the computational cost increases quickly if the comparison is
performed in the spatiotemporal domain, especially when time stretching and
shifting are taken into account. Here, we turn to the NED (Normalized Euclidean
Distance) between the projection centroids of two gait sequences as the similarity
measure, which eliminates the matching problems caused by velocity change and
phase shift in spatiotemporal correlation.
Assuming that the trajectories of any two sequences in the eigenspace are P_1(t)
and P_2(t) respectively, we can easily obtain their associated projection centroids C_1
and C_2. Each projection centroid implicitly represents a principal structural shape
of a certain subject in the eigenspace. The normalized Euclidean distance between
the two projection centroids can be defined by
    (5.32)
Furthermore, for multiple sequences of the same subject, we may also obtain an
exemplar projection centroid by averaging the projection centroids of the
single sequences as a reference template for that class. This exemplar centroid is
also used for gait classification in our experiments.
The classification process is carried out via two simple classification
methods, namely the nearest neighbor classifier (NN) and the nearest neighbor
classifier with respect to class exemplars (ENN), the latter derived from the mean
projection centroid of the training sequences for a given subject.
Let T represent a test sequence and R_i represent the ith reference sequence. We
may classify this test sequence into the class c that minimizes the similarity
distance between the test sequence and all reference patterns by

    c = arg min_i d(T, R_i)                                             (5.33)

where d is the similarity measure. Note that d must be the NED if ENN is used.
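The centroid-based classification can be sketched as below. The normalization inside `ned` (by the sum of the centroid norms) is only one plausible reading of Eqn. (5.32), not necessarily the book's exact definition, and the dictionary-based gallery layout is illustrative.

```python
import numpy as np

def centroid(trajectory):
    """Projection centroid of a gait trajectory in the eigenspace
    (trajectory: array (T, k) of projected frames)."""
    return trajectory.mean(axis=0)

def ned(c1, c2):
    """Normalized Euclidean distance between two projection centroids
    (illustrative normalization, one reading of Eqn. 5.32)."""
    return np.linalg.norm(c1 - c2) / (np.linalg.norm(c1) + np.linalg.norm(c2))

def classify_nn(test_traj, references):
    """NN rule of Eqn. (5.33): references maps class label -> list of
    reference trajectories."""
    c = centroid(test_traj)
    best, best_d = None, np.inf
    for label, trajs in references.items():
        for r in trajs:
            d = ned(c, centroid(r))
            if d < best_d:
                best, best_d = label, d
    return best

def classify_enn(test_traj, references):
    """ENN rule: the same NN rule applied to one exemplar centroid per
    class (the mean of that class's projection centroids)."""
    c = centroid(test_traj)
    exemplars = {lab: np.mean([centroid(r) for r in trajs], axis=0)
                 for lab, trajs in references.items()}
    return min(exemplars, key=lambda lab: ned(c, exemplars[lab]))
```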
Figure 5.15 Temporal Changes of the Moving Silhouette in a Gait Pattern (Frames 28-35)
5.3.5 Experimental Results and Analysis
Figure 5.16 Mean Shapes and the Exemplars for Different Views: (a) mean shapes
and their exemplar for four sequences of the same subject; (b) exemplars of five
different subjects
Each sequence is accordingly converted into a sequence of shape representations
with the associated configurations in 2D space (the vector dimensionality is set to
360 here). Then, we can obtain their associated mean shapes. Note that the walking
direction is pre-normalized here to avoid any effect on recognition performance,
e.g., all sequences with the lateral view are flipped from right to left. Further, we
use the class average of mean shapes derived from the same-view sequences of a
subject as an exemplar for that class, which avoids selecting a single, random
reference sample. Fig. 5.16 shows plots of the mean shapes and their exemplar for
four sequences of the same subject, and plots of the exemplars of five different
subjects (note that Exemplars 1-5 correspond to Subjects 5, 8, 14, 16 and 20,
respectively), from which we can see that the intra-subject changes in eigenshapes
are very small, while the inter-subject changes are more significant. This result
implies that the mean shapes have considerable discriminating power for identifying
individuals.
Table 5.8 CCRs of Different Classifiers under Different Viewing Angles

    Classifier    0°        45°       90°
    NN (k=1)      71.25%    72.50%    81.25%
    3NN (k=3)     72.50%    73.75%    80.00%
    ENN           88.75%    87.50%    90.00%
Figure 5.17 Identification Performance Results in Terms of the FERET
Protocol's CMS Curves (panel (b): ENN)
[Figure: ROC curves for the 0°, 45° and 90° views, with the EER indicated.]
Evaluation
The performance of the algorithm is further evaluated with respect to the length
of the training sequence and the vector dimensionality of shape representation on
the CASIA database with the lateral view.
To evaluate the effect of the length of training samples, we conducted five tests
which respectively make use of the first 15, 30, 45, 60 and 75 frames, corresponding
approximately to one, two, three, four and five walking cycles from each gait
sequence captured at a rate of 25 frames per second (note that 1 stride period = 2
cycles, and an average cycle is typically 15 frames at a frame rate of 25 fps
according to studies in biomechanics, though cadence differs a little between
people). The comparisons of recognition performance are shown in Fig. 5.20. The
results reveal that the best performance is achieved by using just over four walking
cycles of training samples from each subject (i.e., 60 frames). Furthermore, the
recognition performance improves as the number of training samples increases. The
results thus appear to confirm recognition sensitivity to the sequence length and
imply that in a more extended analysis, care must be taken to include sufficient
samples in the training database.
[Figure 5.20: Correct classification rate versus training sequence length (15, 30,
45, 60 and 75 frames).]
Comparisons
Identification of people by gait is a challenging problem and has attracted
growing interest in the computer vision community. However, there is no baseline
algorithm or standard database for measuring and determining what factors affect
performance. The early unavailability of an accredited common database (e.g.,
something like the FERET database in face recognition) of a reasonable size,
together with an evaluation methodology, was a limitation in the development of
gait recognition algorithms. A large number of early papers reported good
recognition results, usually on a small database, but few of them made informed
comparisons among various algorithms due to the lack of a standard test protocol.
To examine the performance of the proposed algorithm, here we provide some basic
comparative experiments.
This comparative experiment compared the performance of Procrustes shape
analysis with those of five recent methods from Maryland [101, 102], CMU [110],
MIT [119] and USF [75], which to some extent reflect the best work of these
research groups in gait recognition. The results are summarized in Table 5.9, where
Procrustes shape analysis compares favorably with the others, with performance
very similar to [75]. The gait feature vector of [119] is composed of parameters of
moment features in image regions containing the walking person, aggregated over
time. Intuitively, the mean features describe the average-looking ellipses for each of
the regions of the body; taken together, the 7 ellipses describe the average shape of
the body, which is in essence similar to the idea here. Procrustes shape analysis
outperforms the methods described in [101, 102, 110, 75]. From the experiments it
was also found that the computational cost of [110] and [75] was relatively higher
than that of [101, 102, 119] and Procrustes shape analysis.
Table 5.9 Comparison of Several Recent Algorithms on the CASIA database (0°)
The above provides only preliminary comparative results and cannot be
generalized to say that a certain algorithm is always better than the others.
Algorithm performance depends on the gallery and probe sets, so further evaluation
and comparisons on a larger and more realistic database are needed in future work.
Extensive experiments were also carried out to verify the effectiveness of the
spatiotemporal silhouette analysis based method. The following describes the details
of the experiments. For data, the CASIA database was again used.
manifold trajectory in the eigenspace. The projection trajectories of three trained
sequences with respect to the lateral, oblique and frontal views are shown in
Fig. 5.22, where only a three-dimensional eigenspace is used for visualization.

Figure 5.21 First Three Eigenvectors for each Viewing Angle obtained by PCA
Training

(a) lateral view (b) oblique view (c) frontal view
Figure 5.22 The Projection Trajectories of Three Training Gait Sequences
(only the three-dimensional eigenspace is used here for clarity)
uses the NED similarity measure with respect to projection centroids (solid line) and
exemplar projection centroids (dotted line) respectively. Note that the correct
classification rate is equivalent to P(1) (i.e., rank = 1). That is, for the lateral,
oblique and frontal views, the correct classification rates are respectively 65%,
63.75% and 77.5% with NN and STC, 65%, 66.25% and 85% with NN and NED,
and 75%, 81.25% and 93.75% with ENN and NED.
Figure 5.23 Identification Performance based on the Cumulative Match Scores:
(a) and (b) classifiers based on NED with respect to the single projection centroid
(solid line) and the exemplar projection centroid (dotted line), respectively.
Figure 5.24 ROC Curves of the Gait Classifier based on NED with respect to
Three Viewing Angles
b) Verification Mode
For completeness, we also estimate the FAR (False Acceptance Rate) and FRR
(False Rejection Rate) via the leave-one-out rule in verification mode. That is, we
leave one example out, train the classifier using the remainder, and then verify the
left-out sample against all 20 classes. Note that in each of these 80 iterations for
each viewing angle, there is one genuine attempt and 19 impostors, since the left-out
sample is known a priori to belong to one of the 20 classes. Fig. 5.24 shows the
ROC (Receiver Operating Characteristic) curves using the NED similarity measure
with exemplar projection centroids, from which we can see that the EERs are about
20%, 13% and 9% for the 0°, 45° and 90° views respectively. Here, the verification
performance of the frontal view is also better than those of the other views.
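The FAR/FRR estimation can be sketched as a threshold sweep over the genuine and impostor distance scores produced by the leave-one-out runs; `far_frr_eer` is an illustrative helper, not the exact protocol.

```python
import numpy as np

def far_frr_eer(genuine, impostor):
    """FAR/FRR over a sweep of acceptance thresholds on distance scores,
    plus the equal error rate where the two curves meet.

    genuine/impostor: 1D arrays of distances for true and false claims;
    a claim is accepted when its distance is <= the threshold.
    """
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine > t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    eer = (far[i] + frr[i]) / 2
    return far, frr, eer
```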
parameters such as its centroid, width and height. To extract the other body points,
the vertical positions of the chest and ankles for a body height H are estimated from
an anatomical study [47] to be 0.720H and 0.039H respectively. Further, we can
obtain the chest width and the distance between the left foot and the right foot by
calculating the horizontal coordinates of two border points (note that for the ankles,
we choose the leftmost and the rightmost points along the horizontal axis). Once the
person has been tracked for a certain number of frames, the spatiotemporal
trajectories of these gait parameters can be derived. We select the following
additional features to describe aspects of pace, stride and build. The first parameter,
the gait period T, can be obtained as described in Section 5.3. From the CASIA
database, we find that the gait period usually ranges from 26 to 37 frames. The
second parameter is the stride δL, which is measured only at the point of maximal
separation of the feet during the double support phase of a gait cycle. The last two
parameters, namely the body height H and the ratio R between the chest width and
the body height, are used to reflect build information (tall vs. short and thin vs. fat).
Because gait is a periodic and non-rigid motion, we obtain these parameters by
averaging the measurements over particular time instances, namely the instances
with the minimal separation between the two feet (when the subject is at his
maximal height) and the instances with the maximal separation between the two
feet (when the subject has the maximal stride during the walking cycle). These
parameters are finally concatenated to form a four-dimensional vector <T, δL, H, R>
for each sequence. Fig. 5.25(b) shows a distribution of such physical features in 3D
space for clarity, where the same marker represents the results of different
sequences of the same subject. From Fig. 5.25(b), we can see that an efficient
combination of these physical properties will bring considerable discriminatory
power to gait classification.
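Assembling the four-dimensional vector <T, δL, H, R> might look like the sketch below. The quantile-based selection of double-support and feet-together instants, and the half-bounding-box chest-width proxy, are illustrative assumptions rather than the book's actual measurements.

```python
import numpy as np

def physical_features(boxes, foot_sep, period):
    """Assemble the four-dimensional feature <T, dL, H, R>.

    boxes: array (F, 4) of per-frame bounding boxes (x, y, w, h);
    foot_sep: per-frame horizontal distance between the feet (pixels);
    period: gait period T in frames, e.g. from the autocorrelation analysis.
    Stride dL is read at the frames of maximal foot separation; height H and
    the chest-width ratio R are averaged at the frames of minimal separation,
    when the subject stands at maximal height.
    """
    boxes = np.asarray(boxes, float)
    foot_sep = np.asarray(foot_sep, float)
    hi = foot_sep >= np.quantile(foot_sep, 0.9)    # double-support instants
    lo = foot_sep <= np.quantile(foot_sep, 0.1)    # feet-together instants
    dL = foot_sep[hi].mean()                       # stride length
    H = boxes[lo, 3].mean()                        # body height
    R = (0.5 * boxes[lo, 2]).mean() / H            # crude chest-width proxy
    return np.array([period, dL, H, R])
```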
[Figure 5.25(b): Distribution of the physical features in 3D space.]
The effect of the above validation procedure on the CCR (Correct Classification
Rate) is shown in Table 5.10, from which we can see that the recognition
performance after the inclusion of the above additional features is indeed improved.
If we can construct a depth conversion factor as a function of the depth of the
subject from the camera, as described in [106], or use camera calibration to convert
distances measured in the image from pixels to world units, we may extend the
validation step to handle the effects of foreshortening for other views. Although the
results are very encouraging, further experiments on a larger database need to be
carried out in order to be more conclusive.
d) Comparisons
Here, we first compare the performance of the proposed algorithm with that of a
closely related method described in [101]. This approach, named Eigengait, is based
on PCA, and its major difference from our method is that it uses image self-
similarity plots as the original measurements. This algorithm was evaluated on a
dataset of Little and Boyd [88] and achieved recognition rates of 80%, 82.5% and
90% with respect to k = 1, 3 and 5 using the k-nearest-neighbor classifier. We re-
implemented this method using the CASIA database with the lateral viewing angle.
The best recognition rate is 72.5% (see Table 5.23), which is a little lower than that
of our method even with no validation (75.00%). Because the database used in [88]
is unavailable, we are unable to test the proposed algorithm on that data set.
The performance of the proposed algorithm was also compared with those of a
few recent silhouette-based methods described in [110], [119] and [75]. Here, we
re-implemented these methods using the same silhouette data from the CASIA
database with the lateral view. Based on the FERET protocol with ranks of 1, 5
and 10, the best results of all algorithms are summarized in Table 5.11, from which
we can see that our method compares favorably with the others, and outperforms
Phillips et al. [75] and Collins et al. [110]. We also found that the computational
cost of [110] and [75] was much higher than that of [119] and our method. Here,
the listed computational cost is the approximate average time consumed for each
test sequence using Matlab 6.1 on a PIII processor running at 733 MHz with
256 MB DRAM (note that this process only includes feature extraction and
matching, and excludes gait segmentation and the training phase).
Table 5.11
The effectiveness of the extracted gait features was evaluated with respect to
different variations on the "gait challenge" dataset described by Phillips et al. [75],
as this database is one of the largest available to date in terms of number of people,
number of video sequences, and the variety of conditions under which a person's
gait is collected. Some samples are given in Fig. 5.26. So far, this dataset available
for gait analysis consists of 452 sequences from 74 subjects walking in elliptical
paths on 2 surface types (Concrete and Grass), from 2 camera viewpoints (Left and
Right), and with 2 shoe types (A and B). Thus we have 8 possible different
conditions for each person (refer to Table 5.12).
[Figure 5.26: Sample frames from the gait challenge dataset.]
Table 5.12 Basic Results on the Gait Challenge database using Gallery Set (G, A, R)
View
Shoe
Shoe, View
Surface
Surface, Shoe
Surface, View
Surface, Shoe, View
Table 5.12 lists basic performance indicators of the proposed algorithm in the 7
challenge experiments, namely the identification rate (P_I) for ranks of 1 and 5,
where the number of subjects in each subset is in square brackets, and the optimized
performance stated by [75] (an extended version of [76]) is in parentheses for
comparison. From Table 5.12, we can draw the same conclusions as [75]. That is,
among the three variations, the viewpoint seems to have the least impact and the
surface type the most. As a whole, our method is only slightly inferior to [75] in
identification rate on the USF database, but far superior in computational cost (see
Table 5.11). As for the identification rate, we think that noisy segmentation results
have a great impact on feature training and extraction in our method, and that an
elliptical walking path poses a challenge for our view-based algorithm (i.e., we set a
nearly linear walking path in data acquisition, which is consistent with most past
databases and probably more realistic). Most existing algorithms operate on
sequences with a linear walking path and a side view, so these two aspects seriously
affect both their implementation (some may not work effectively at all) and the
resulting performance. As for the computational cost, the baseline algorithm
proposed by Phillips et al. [75] essentially amounts to an unconstrained
spatiotemporal correlation process. Unlike other previous work, it performs inter-
sequence correlation repeatedly using the segmented silhouette images to measure
similarity between any two sequences, because it has no explicit training procedure
for extracting a genuinely compact feature vector for each sequence. As stated
above, these are just basic comparative results. More detailed statistical studies on
larger and more realistic datasets are desirable to further quantify algorithm
performance.
The following are three basic problems that need to be addressed before using an
HMM-based model for real-world applications. The first problem involves
computing the probability of an observation sequence given the model parameters,
i.e., P(O_1, O_2, ..., O_T | λ). The second problem deals with computing the optimal
state sequence that best explains the set of observations given the model parameters.
The third problem deals with adjusting the model parameters λ to maximize
P(O_1, O_2, ..., O_T | λ). For the first problem, the forward procedure, which involves
an induction approach, is used to efficiently compute the probability P(O | λ). The
forward variable α_t(i) = P(O_1, O_2, ..., O_t, q_t = s_i | λ) is computed at every time
instant in an inductive fashion, and the probability is computed as
P(O | λ) = sum_{i=1..N} α_T(i). The second problem essentially involves maximizing the
(a) (b)
Figure 5.27 (a) Forward algorithm (b) Lattice structure and Viterbi decoded state
sequence
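The forward procedure for problem 1 can be sketched with the usual scaled recursion; here the per-state observation likelihoods b_i(O_t) are assumed to be pre-evaluated into a T x N array.

```python
import numpy as np

def forward_log_likelihood(pi, A, B):
    """Forward procedure for problem 1: log P(O_1..O_T | lambda).

    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (T, N) with B[t, i] = P(O_t | state i), already evaluated on the
    observation sequence. Scaling keeps the recursion numerically stable.
    """
    T, N = B.shape
    loglik = 0.0
    alpha = pi * B[0]                      # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(T):
        if t > 0:
            alpha = (alpha @ A) * B[t]     # induction step
        c = alpha.sum()                    # scaling factor at time t
        loglik += np.log(c)
        alpha = alpha / c
    return loglik
```

The product of the scaling factors equals P(O | λ), so summing their logs recovers the log-likelihood without underflow.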
adjacent stances, and one can identify N stances that best characterize the stances in
each cluster. Modeling human gait using an HMM essentially involves identifying
the N characteristic stances of an individual and modeling the dynamics between
these N stances. Analogous to the HMM framework, the N characteristic stances (or
exemplars) correspond to the N states of the system; the transition characteristics
between the N stances define the transition probability matrix; and the probabilistic
dependence of the stances from the N clusters that comprise a gait cycle on the N
characteristic stances constitutes the observation probability matrix. In the
following subsections, we discuss an HMM-based human gait recognition system
that was developed at the University of Maryland [99, 100].
Direct Approach
From the video sequences of P individuals walking along a pre-designated path, we
extract the background-subtracted silhouettes from each frame of the video
sequence. The selection of feature vectors plays a key role in an HMM-based
recognition framework. In the first approach, we use the silhouette image in its
entirety as the feature vector. Given a sequence of images of the jth individual,
X_j = [x_j(1), x_j(2), ..., x_j(T)], we identify the N exemplar stances that best
characterize the gait sequence and compute the HMM parameters
λ_j = (π_j, A_j, B_j). A plot of the sum of foreground pixels in each frame of the video
sequence illustrates the quasi-periodic nature of human gait, as shown in Fig. 5.28.
The boundaries of each gait cycle are identified by means of adaptive filtering, and
each gait cycle is subdivided into N clusters of temporally adjacent stances. The
initial estimate of the exemplars E^(0) = [e_1, e_2, ..., e_N] is derived from the stances
belonging to each of the N clusters. The initial estimate of the transition probability
matrix A^(0) = [A_(i,j)] is such that the only transitions allowed are from a state to
itself or to its adjacent state, i.e., A_(i,i) = 0.5 and A_(i, (i mod N)+1) = 0.5. The initial state
probabilities π are set equal to 1/N. The observation probability matrix B is
defined as

    B_j(x) = α exp(−β D(x, e_j))                                        (5.35)

where α and β are constants and D is a distance metric. The motivation behind
using an exemplar-based model is that recognition can be based on the distance
between the observed feature vector and the exemplars.
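Assuming the exemplar-distance form b_j(x) = α exp(−β D(x, e_j)) for Eqn. (5.35), the observation probabilities of one frame can be sketched as follows; the Euclidean choice of D and the per-frame normalization (which fixes α) are illustrative.

```python
import numpy as np

def observation_probs(frame, exemplars, beta=1.0):
    """Observation probabilities in the form of Eqn. (5.35).

    frame: feature vector x; exemplars: array (N, d) of exemplar stances.
    D is taken as the Euclidean distance, and alpha is chosen so that the
    probabilities over the N states sum to one for this frame.
    """
    d = np.linalg.norm(exemplars - frame, axis=1)  # D(x, e_j) for each state
    b = np.exp(-beta * d)
    return b / b.sum()
```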
[Figure 5.28: Sum of foreground pixels per frame, showing the quasi-periodic
nature of gait (original, adaptively filtered, median filtered, and differentially
smoothed signals; x-axis: frame index).]
    e_j^{(i+1)} = arg max_e prod_{t in T_j} P(x_t | e, λ^{(i)})         (5.36)

[Figure 5.29: Exemplars E1-E6 estimated for a gait sequence.]
Fig. 5.29 displays the exemplars estimated for a particular sequence. Using the
above formulation, the jth exemplar is updated as e_j^{(i+1)} = sum_{t in T_j} x̄(t),
where x̄ corresponds to the normalized vector x. The Baum-Welch algorithm [200]
is used to compute λ^{(i+1)} using the refined set of exemplars E^{(i+1)} and λ^{(i)}. At
every stage of the iteration we re-initialize π^{(i+1)} = 1/N. Thus, the HMM
parameters are iteratively refined. Under the HMM framework, this phase
corresponds to the training phase.
Following the training phase, every individual j in the gallery has a set of
exemplars E^{(j)} = [e_1, e_2, ..., e_N] and the set of HMM parameters λ_j = (π_j, A_j, B_j).
Given a video sequence v of an unknown individual, the background-subtracted
silhouettes of the individual are extracted in the same way adopted for the video
sequences that comprised the gallery. Using the forward algorithm, we compute the
likelihood that the observed sequence v was produced by the jth individual. The
identity of the unknown individual in the video sequence is then established as the j
that maximizes this likelihood.
Indirect Approach
The problem of the high dimensionality of feature vectors in the direct approach is overcome in the indirect approach. In cases where the quality of the silhouettes extracted through background subtraction is good, the outer contour of the silhouette captures almost all the information contained in the silhouette image. Thus, the row-wise width of the outer contour could potentially serve as a good feature vector for recognition purposes. Fig. 5.30 shows the plot of width profiles of two different individuals over several gait cycles. The brightness variations that are evident from the width profiles are due to the arm swings and leg swings that characterize one's gait.
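A minimal sketch of extracting such a width feature from a binary silhouette (illustrative code, not the authors' implementation):

```python
import numpy as np

def width_vector(silhouette):
    """Row-wise width of a binary silhouette (2-D 0/1 array).

    For each image row, the width is the span between the leftmost and
    rightmost foreground pixels; rows with no foreground get width 0.
    """
    widths = np.zeros(silhouette.shape[0], dtype=int)
    for r, row in enumerate(silhouette):
        cols = np.flatnonzero(row)
        if cols.size:
            widths[r] = cols[-1] - cols[0] + 1
    return widths
```

Stacking these vectors frame by frame produces exactly the kind of width profile image shown in Fig. 5.30.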
Figure 5.30 Width Profiles of Two Different Individuals (a), (b)
For every individual j from the gallery, the exemplar width vectors E^(j) = [e_1^j, e_2^j, ..., e_N^j] are selected as prescribed earlier. Let X_i = [x_i(1), x_i(2), ..., x_i(T)] correspond to the width vectors across T frames. We define a quantity, the frame-to-exemplar distance (FED), as the observed signal of the process. The FED vector for the jth individual is defined as
(5.39)
[Figure: plot against frames]
spoken word [201]. In applications such as data mining, gesture recognition, robotics and medicine, DTW is used to compute the distance between two time series. Fig. 5.32 shows the differences in computing the distance between two time series before and after optimal alignment, and thereby explains the significance of the time-warping operation.
Figure 5.32 Comparison of Two Time Series (a) before alignment (b) after alignment with DTW
Given two time series X = [x_1, x_2, ..., x_m] and Y = [y_1, y_2, ..., y_n], the DTW algorithm computes the warping function W that optimally aligns the elements of X and Y. A cost matrix D of size m × n is constructed, where the (i,j)th element of the matrix corresponds to the distance between the samples x_i and y_j from X and Y respectively. A warping path W is a contiguous set of matrix elements that defines a mapping between X and Y, denoted as W = [w_1, w_2, ..., w_K]. The kth element of W is defined as w_k = (i,j)_k, and the length K of the warping path satisfies max(m,n) ≤ K < m + n. The warping path is typically subject to the following constraints:
• Endpoint Constraints: The endpoints of the warping path W are fixed at w_1 = (1,1) and w_K = (m,n). The warping path starts and finishes at the diagonally opposite corner cells of the cost matrix.
• Monotonicity Constraints: The warping path W monotonically increases in time, i.e. given w_k = (i_1, j_1) and w_{k-1} = (i_2, j_2), then (i_1 - i_2) ≥ 0 and (j_1 - j_2) ≥ 0 for all k ∈ [1:K].
• Continuity Constraints: Continuity constraints are imposed to restrict the allowable steps in the warping path to adjacent or nearly adjacent cells, i.e. if w_k = (i_1, j_1) and w_{k-1} = (i_2, j_2), then (i_1 - i_2) ≤ 1 and (j_1 - j_2) ≤ 1.
The optimal warping path W* that minimizes the cumulative cost is computed using dynamic programming. We construct a cumulative cost matrix C, the (i,j)th element of which corresponds to the cost of the minimum-distance warp path that can be constructed from the two partial time series X_i = [x_1, ..., x_i] and Y_j = [y_1, ..., y_j]. Due to the continuity and monotonicity constraints that are imposed on the warping path, C(i,j) is computed as
C(i,j) = D(i,j) + min(C(i-1,j-1), C(i-1,j), C(i,j-1))    (5.40)
The cumulative cost matrix C is first filled column-wise and then row-wise, with C(1,1) initialized to D(1,1). Upon computation of the cumulative cost matrix C, the optimal path W* is easily obtained by backtracking from the (m,n)th element of C, keeping in mind the warping path constraints as defined above. Since DTW is an O(N²) operation, a few global constraints are imposed on the warping path to limit the computational cost. The Sakoe-Chiba band and the Itakura parallelogram are examples of constraints imposed on the number of cells that are to be evaluated in the cost matrix. Figure 5.33 gives an example of the optimal warping path derived by using dynamic time warping to align two time series X and Y and illustrates the constraints imposed on the cells of the cost matrix.
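The recursion of Eq. (5.40), with an optional Sakoe-Chiba band, can be sketched as follows. This is illustrative code: the local distance D(i,j) is taken as the absolute difference for 1-D series, and the band parameter is an assumption for demonstration.

```python
import numpy as np

def dtw_distance(x, y, band=None):
    """DTW distance between 1-D series x and y via Eq. (5.40).

    C(i,j) = D(i,j) + min(C(i-1,j-1), C(i-1,j), C(i,j-1)), with
    D(i,j) = |x_i - y_j|.  `band` optionally enforces a Sakoe-Chiba
    band |i - j| <= band on the warping path.
    """
    m, n = len(x), len(y)
    C = np.full((m + 1, n + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell excluded by the global constraint
            d = abs(x[i - 1] - y[j - 1])
            C[i, j] = d + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[m, n]
```

Note how a pair of series that differ only by a small time shift, such as [0, 0, 1] and [0, 1, 1], obtain zero DTW distance: the warping absorbs the shift, which is precisely why DTW suits gait sequences of differing pace.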
Figure 5.33 (a) optimal warping path obtained by DTW on time series X and Y (b) optimal alignment of X and Y (c) (i) Sakoe-Chiba band (ii) Itakura parallelogram
where M is the number of columns in the image. Fig. 5.34 illustrates the
computation of the feature vectors discussed above.
Figure 5.34 Illustration of the Generation of (a) left projection vector (b) right projection vector (c) width vector
As discussed in previous sections, the shape of the human silhouette and the transition characteristics from one stance to another act as significant cues for gait-based human recognition. This section deals with understanding the relative significance of shape and kinematics in gait-based human identification approaches [6, 148]. In addition to their implications for the recognition framework, we also study the relative significance of shape and kinematics in human activity classification. In the UMD approach, we represent shape using Kendall's shape theory and represent dynamics using ARMA models.
Shape Analysis
Dryden describes the shape of an object as the geometric information that remains after removing translational, rotational and scale information from an object's representation [206]. Kendall's representation of shape describes the shape configuration of k landmark points in an m-dimensional space as a k × m matrix containing the co-ordinates of the landmarks. In our approach, we extract k landmark points from the outer contour of the human silhouette in each frame and represent them as pre-shape vectors, i.e., a representation where translation and scale have been filtered out. The centered pre-shape is given by
Z_c = CX / ||CX||,  C = I_k - (1/k) 1_k 1_k^T    (5.42)
where I_k is a k × k identity matrix and 1_k is a vector of k ones. Since the pre-shape vectors thus obtained lie on a (k-1)-dimensional complex hyper-sphere of unit radius, the distance between two pre-shape vectors is non-Euclidean in nature.
Consider two complex configurations X and Y with corresponding pre-shapes α and β. The full Procrustes fit between X and Y is chosen so as to minimize
d(Y,X) = ||β - αse^{jθ} - (a+jb)1_k||    (5.43)
where s is the scale, θ is the rotation and (a+jb) is the translation. The full Procrustes distance d_F(Y,X) and the partial Procrustes distance d_P(X,Y) are defined as
d_F(Y,X) = inf_{s,θ,a,b} d(Y,X)    (5.44)
d_P(X,Y) = inf_{Γ∈SO(m)} ||β - αΓ||    (5.45)
While the translation that minimizes the full Procrustes fit is given by (a+jb) = 0, the scale s = |α*β| is close to unity. The rotation angle θ that minimizes the full Procrustes fit is given by θ = arg(α*β). The partial Procrustes distance between configurations X and Y is obtained by matching their respective pre-shapes α and β over different rotation parameters. The Procrustes distance ρ(X,Y) is the closest great-circle distance between α and β on the pre-shape sphere. The Procrustes distance, the full Procrustes distance and the partial Procrustes distance are trigonometrically related to one another.
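A minimal numerical sketch of these quantities, using the standard trigonometric relations from shape analysis (d_F = sin ρ, d_P = 2 sin(ρ/2), with ρ = arccos|α*β| for unit pre-shapes); the function names are illustrative, and the code works with landmarks as complex numbers:

```python
import numpy as np

def preshape(landmarks):
    """Map k complex landmarks to a pre-shape: centre, then scale to unit norm."""
    z = np.asarray(landmarks, dtype=complex)
    z = z - z.mean()
    return z / np.linalg.norm(z)

def procrustes_distances(x, y):
    """Procrustes, full Procrustes and partial Procrustes distances between
    two landmark configurations (complex vectors)."""
    a, b = preshape(x), preshape(y)
    c = abs(np.vdot(a, b))               # |alpha* beta|: match over scale/rotation
    c = min(c, 1.0)                      # guard against rounding above 1
    rho = np.arccos(c)                   # Procrustes (great-circle) distance
    d_full = np.sin(rho)                 # full Procrustes distance
    d_partial = 2.0 * np.sin(rho / 2.0)  # partial Procrustes distance
    return rho, d_full, d_partial
```

Translating, scaling or rotating one configuration leaves all three distances unchanged, which is the defining property of a shape distance.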
Dynamical Models
We use two time-series models, autoregressive (AR) and autoregressive moving average (ARMA), to study the role of kinematics in human gait recognition.
In the stance-based AR model, a video sequence of an individual walking is divided into N distinct stances. Within each stance, we study the dynamics of the shape vector. We extract the pre-shape vector from each stance and project them onto the tangent space. The time series of the tangent-space projections is modeled as a Gauss-Markov process,
a_j(t) = A_j a_j(t-1) + w(t)    (5.47)
where w is a zero-mean white Gaussian noise process and A_j is the transition matrix corresponding to the jth stance. For convenience and simplicity, A_j is assumed to be a diagonal matrix. A_j is computed for each j ∈ {1, 2, ..., N}.
Given the transition matrices of the gallery and the probe sequence, the distances between the corresponding transition matrices are added to obtain a measure of the distance between their kinematic models. If A_j and B_j (for j = 1, 2, ..., N) represent the transition matrices for two sequences, then the distance between the models is defined as
D(A,B) = Σ_{j=1}^{N} ||A_j - B_j||_F    (5.48)
where ||·||_F denotes the Frobenius norm. The model in the gallery that is closest to the model of the given probe decides the identity of the person.
In the ARMA model we learn a dynamical model for human gait and perform recognition by comparing the learnt dynamical models. The dynamical model is a continuous-state, discrete-time model, the parameters of which lie in a non-Euclidean space. Let us assume that the time series of shapes is given by a(t), t = 1, 2, ..., τ. Then an ARMA model is defined as
x(t+1) = Ax(t) + w(t),  a(t) = Cx(t) + v(t)
where w and v are zero-mean Gaussian noise processes with covariances Q and R respectively. Also, let the cross-correlation between w and v be given by S. The parameters of the model are given by the transition matrix A and the state matrix C. We note that the choice of the matrices A, C, R, Q, S is not unique. However, we can transform this model to the "innovation representation" [207], which is unique.
The algorithm is described in [207, 208]. Given observations a(1), a(2), ..., a(τ), the singular value decomposition of the observation matrix, [a(1), a(2), ..., a(τ)] = UΣV^T, yields the closed-form parameter estimates
Ĉ(τ) = U    (5.51)
Â(τ) = ΣV^T D_1 V (V^T D_2 V)^{-1} Σ^{-1}    (5.52)
where D_1 and D_2 are shift matrices that pair successive columns of V.
models [210]. The subspace angles between two ARMA models [A_1, C_1, K_1] and [A_2, C_2, K_2] can be computed by the method described in [210]. Using these subspace angles θ_i, i = 1, 2, ..., n, three distances, namely the Martin distance (d_M), the gap distance (d_g) and the Frobenius distance (d_F), between the ARMA models are defined as follows:
d_M² = ln ∏_{i=1}^{n} 1 / cos²(θ_i)    (5.53)
The results obtained using the different distance measures were comparable. We later report results arrived at using the Frobenius distance (d_F).
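A sketch of these distances given the subspace angles θ_i. Only the Martin distance of Eq. (5.53) appears explicitly in the text; the gap and Frobenius forms below (sin of the largest angle, and 2Σsin²θ_i) follow the subspace-angle literature and are assumptions here:

```python
import numpy as np

def martin_distance(thetas):
    """Martin distance from subspace angles, Eq. (5.53):
    d_M^2 = ln prod_i 1/cos^2(theta_i) = -2 * sum_i ln cos(theta_i)."""
    thetas = np.asarray(thetas, dtype=float)
    return np.sqrt(-2.0 * np.log(np.cos(thetas)).sum())

def gap_distance(thetas):
    """Gap distance: sine of the largest subspace angle (assumed form)."""
    return np.sin(np.max(np.asarray(thetas, dtype=float)))

def frobenius_distance(thetas):
    """Frobenius distance: sqrt(2 * sum_i sin^2(theta_i)) (assumed form)."""
    thetas = np.asarray(thetas, dtype=float)
    return np.sqrt(2.0 * (np.sin(thetas) ** 2).sum())
```

All three vanish when the two models span identical subspaces (all θ_i = 0) and grow as the subspaces diverge.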
5.4.4 Results
[Figure 5.35: cumulative match scores on the CMU database; train on fast walk, test on fast walk; train on slow walk, test on slow walk; train on fast walk, test on slow walk; train on slow walk, test on fast walk; train on ball walk, test on ball walk (8 cycles each)]
CMU Database
The CMU database comprises video sequences of 25 subjects walking on a treadmill under three conditions: (a) slow pace, (b) fast pace, (c) carrying a ball. We report results on the following experiments performed on this dataset: (1) slow walk vs. slow walk, (2) fast walk vs. fast walk, (3) slow walk vs. fast walk, (4) fast walk vs. slow walk, (5) carrying a ball vs. carrying a ball. In the experiments where the gallery and the probes are identical, training is performed using the first half of the walking sequence and testing is performed using the other half. The recognition results are shown in Fig. 5.35.
Figure 5.36 HMM Recognition Results on UMD Database (a) cumulative match characteristic (b) recognition confidence
UMD Database
The recognition results on the UMD dataset are plotted in Fig. 5.36(a). We assessed the confidence in the recognition score by adopting a leave-one-out strategy. Fig. 5.36(b) is a plot of the variance in recognition score at each rank.
[Figure: cumulative match characteristics for Probes A–G]
USF Database
The USF database comprises walking sequences of 122 individuals, with variations in viewing direction, walking surface, shoe type, etc. The database also includes walking sequences in which the individual carries a briefcase, and walking sequences taken months apart. The nomenclature adopted to represent the covariates is as follows: the surface variations are indicated as G (grass) and C (concrete); A and B indicate the different shoe types; the camera positions are indicated as L (left) and R (right); B refers to briefcase sequences; and t2 refers to sequences captured 6 months after the initial data collection. We report results on 12 experiments: (G,A,R) sequences comprised the gallery; the 12 probes were: A(G,A,L), B(G,B,R), C(G,B,L), D(C,A,R), E(C,B,R), F(C,A,L), G(C,B,L), H(G,A,R,B), I(G,B,R,B), J(G,A,L,B), K(G,A,R,t2), L(C,A,R,t2). Table 5.13 reports the recognition results on an earlier version of the USF database and draws comparisons with the baseline algorithm [76]. Fig. 5.37 displays the cumulative match scores for each of the 12 experiments on the USF database, using the direct approach.
Probes A–G: 100, 90, 90, 35, 35, 60, 50
Table 5.13 HMM Recognition Results on USF Database (version 1)
Fast vs. Fast: 92, 88, 92
Slow vs. Slow: 92, 97, 100
Slow vs. Fast: 50, 50, 50
Fast vs. Slow: 68, 55, 59
Table 5.14 DTW-based recognition results (rank 1) on CMU database
102
GlIJr- - - - ----;===========;_,
• l en \ ectu r
• R ll:hl vecto r
-lA1
• Fu, l"n
311
211
III
II
One can extend the recognition approach discussed above to frontal gait sequences as well. Though arm swings and leg swings are less apparent in frontal gait sequences, the outer contour of the silhouette does contain the signature of the individual. We extract the width vectors from the silhouette after appropriate normalization, taking into account the change in height as the individual walks towards or away from the camera. We employ an approach similar to the one discussed above to perform recognition on frontal gait sequences. On datasets where both the frontal and the side view of one's gait are available, we combine the recognition results obtained from each of the two views [202]. Results are tabulated in Tables 5.15 and 5.16.
Figure 5.39 Rank 1 Recognition Scores from AR, ARMA, Baseline, Stance Correlation, DTW and HMM
Frontal view: 91
Side view: 93
Fusion of frontal and side: 96
Table 5.15 Effect of Fusion of Frontal and Side View of Gait on CMU Dataset
Slow walk: 48
Fast walk: 28
Walk with ball: 12
Inclined lane: 92
Table 5.17 Identification Rates on the CMU Dataset using Stance Correlation Method (Figures in Braces denote HMM identification rates)
6 Model-Based Approaches
6.1 Overview
The model-based approaches aim to derive the movement of the torso and/or the legs. The distinction of a structural approach is one which uses static parameters, illustrated in Fig. 6.1(a), whereas a model can be the (relative) motion of the angles (α, β and γ) between the limbs, shown in Fig. 6.1(b). As earlier, these angles can also be measured relative to the vertical.
[Figure 6.1: (a) static gait parameters, e.g. height and stride (b) angles between the limbs]
BenAbdelkader et al.'s approach using self-similarity and structural stride parameters (stride and cadence) [102] is a prime example of a model-based approach which uses structural measures. Cadence was estimated via periodicity; stride length was estimated as the ratio of the distance traveled (given calibration) to the number of steps taken. By analysis on the UMD data, the variation in stride length with cadence was found to be linear and unique for different people, and was used not just for recognition, but also for verification.
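The two structural measures can be sketched directly from the definitions above. This is illustrative code: the calibrated distance, step count and elapsed time are assumed to be available from the tracker, and the function name is not from the paper.

```python
def stride_parameters(distance_m, n_steps, duration_s):
    """Structural gait parameters as in the self-similarity approach:
    cadence (steps per minute, from periodicity) and stride length
    (here, as in the text, the distance traveled divided by the number
    of steps taken, given a calibrated camera)."""
    cadence = 60.0 * n_steps / duration_s   # steps per minute
    stride_length = distance_m / n_steps    # metres per step taken
    return cadence, stride_length
```

The recognition cue is then the (cadence, stride length) pair, whose linear relation differs from person to person.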
Bobick et al. from GaTech used structural human stride parameters [106], which is the other example of a structural model-based approach. The method used the action of walking to derive relative body parameters which described the subject's body and stride. The within-class and between-class variation were analyzed to determine potency, and on motion-capture data the relative body parameters appeared to have greater discriminatory power than the stride parameters. The approach also included a measure of confusion, which evaluated how much the identification probability is reduced following a measurement, as well as a cross-condition mapping that allowed application in conditions which varied from the original analysis; this was an early approach to capitalize on the viewpoint invariance associated with model-based recognition approaches. Another structural approach, by Tanawongsuwan et al., used joint-angle trajectories derived from markers placed on joint positions in the legs and on the thorax [161]. A simple method was used to estimate the planar offsets between the marker positions and the underlying skeleton, and the variation in joint angles (such as the orientation of the femur relative to the back) with time was then derived. A variance-compensated time warping was used to compensate for temporal variations. Evaluation was conducted on a small database and showed that recognition could be achieved. Given the small size of the database, a confusion metric was derived, aimed to show likely performance on a larger database. This theme was continued by Johnson et al. [143], who showed how performance could be predicted for a much larger database from the same data, estimating performance capability on a database five times larger.
Yam et al. from Southampton extended the earlier model-based system to describe both legs and to handle walking as well as running [115]. An alternative model-based system, developed by Wagg et al., uses evidence gathering as an initial step, followed by model-based analysis driven by anatomical constraints and data; it was evaluated on the Southampton dataset with an analysis of feature potency [116]. Both will be described in more detail in the next section.
In the study from the CAS Institute of Automation [117], a model-based approach derived the dynamic information of gait by using a Condensation framework to track the walker and to recover the joint-angle trajectories of the lower limbs. Again, these will be considered in more detail next.
Zhang et al.'s approach [118] concerned the change in orientation of human limbs. In fact, the extraction is model-based and the description is structural, making this a blend of the two model-based approaches described so far. The lower limbs were represented by trapezoids and the upper body was planar, without the arms. Given distances normalized by the height of the thorax, the human body posture was represented by a set of distance measurements and inclinations of its constituent parts. The gait features were extracted from gait sequences by the Metropolis-Hastings method to match body parts to the image data. The sequence fit was achieved by minimizing an energy functional which described: the difference of the body from the silhouette derived from the image; the difference between moving silhouettes; and the difference between the modeled appearance and the silhouette. This allows for the derivation of elevation angles, which describe the dynamics of gait, and trajectories of joint positions, which describe the spatiotemporal history. The approach thus centered on capturing temporal differences by extracting the elevation of the knee and ankle and the width at the knees and ankles. As these are periodic, they were described by Fourier analysis and then classified via an HMM. The procedure was evaluated on the CMU Mobo and on the NIST databases and shown to have discrimination capability, with better results on the Mobo database. Clearly it enjoys the advantages of model-based techniques in that the data used for classification is intimately linked to gait itself.
Again, there are emergent studies of the potency of the various model-based measures, which is important for camera placement in application and for the development of new recognition techniques, as well as studies of viewpoint invariance, which reflects one of the major advantages of modeling, namely that invariant properties can be achieved.
6.2 Planar Human Modeling
The University of Southampton continued its earlier use of pendular models. The extensions aimed to use model-based approaches that could achieve recognition whether the subject was walking or running. These modeled the thigh and the leg as coupled pendula. The process again aimed for a frequency-based description but, rather than the earlier direct extraction of the frequency components, the extraction now started with extraction of the front of the limbs, whose motion was then described by its frequency content. As such, the new model-based approach provided direction for the extraction of the front of the limbs, which was then refined by analysis of the image. The model for the thigh angle was the same pendular model as in the previous approach. The thigh was assumed to drive the motion of the freely-swinging leg.
The evidence gathering technique comprised two stages: i) global/temporal template matching across the whole sequence and ii) local template matching in each image. The aim of the first stage was to search for the best motion model that can describe the leg motion well over a gait cycle, i.e. the gross motion of a complete gait cycle. This matched a line, which moves according either to the structural or to the motion model, to the edge maps of a whole sequence of images to find the desired moving object. This gave estimates of the inclination of the thigh and of the lower leg, which were refined by a local matching stage in each separate image. The first stage determined values for the parameters that maximize the match of the moving line to the edge data, evaluated across the whole sequence, as those parameters
A, B, C, D, ω_K, ω_T = max( Σ_{t∈T} Σ_{x∈Image} Σ_{y∈Image} (P_{x,y}(t) = MT_{x,y}(t)) )    (6.1)
where P is the image and MT is the motion template with dynamics derived from either the structural model or the motion model, which is based on two parts. The thigh movement θ_T is
θ_T = A cos(ω_T t) + B sin(ω_T t)    (6.2)
where A and B are constants and t is a time index. The motion of the knee, θ_K, is given by
(6.3)
where C and D are constants. m_t is actually the mass of the thigh; its inclusion is motivated by the differential solution to the motion of the knee, and it is set to unity.
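Eq. (6.2), together with a least-squares recovery of A and B for a known frequency, can be sketched as below. This is a simplification: the evidence-gathering stage of [115] searches over the frequencies and the full parameter set against edge data, whereas here the frequency is assumed given and the fit is to sampled angle values.

```python
import numpy as np

def thigh_angle(t, A, B, omega_T):
    """Pendular thigh model of Eq. (6.2): theta_T(t) = A cos(w_T t) + B sin(w_T t)."""
    return A * np.cos(omega_T * t) + B * np.sin(omega_T * t)

def fit_pendular_model(times, angles, omega_T):
    """Least-squares estimate of A and B for a known frequency omega_T,
    by regressing the observed angles on the two sinusoidal basis functions."""
    M = np.column_stack([np.cos(omega_T * times), np.sin(omega_T * times)])
    (A, B), *_ = np.linalg.lstsq(M, angles, rcond=None)
    return A, B
```
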
Having found the best set of parameters, the estimated thigh and lower-leg inclination for each frame was then generated. These angles formed the basis of a local search for the best-fit line to the data in each single image, as
(6.4)
where A_T is the line resulting from application of the motion template, with variation of up to ±5° in inclination and ±5 pixels in vertical and horizontal translation. This aimed to ensure that the best-fit angle and position were found in each frame. The estimated angles from the first stage vs. their manually derived estimates are shown in Fig. 6.2, and these look encouragingly close. The second-stage match to a sequence of image data is shown in Fig. 6.3; again, high fidelity can be observed.
Figure 6.2 Extended Motion Model Superimposed on Manually Labeled Data [115] (a) thigh motion for walking (b) leg rotation for running
110
100 100
_ Clean I _ Clean
90 D 90
1D
25% Noise
Low Resolution
o 25'.. Noise
80 80 D Low Resolution
70 70 r-c
-
60 60
... 50 ... 50
40 40
30 30
20 20
10 10
0 '- - o' - = -'-
3 3
-4-+
1tr-
More recently, the model-based approaches have been extended further at the University of Southampton [116], with investigation of discriminatory potential. The basic paradigm was similar: to model the motion of the legs over the whole sequence, but in this case the body shape was explicitly included. Essentially, an estimate of human shape is derived by shifting and accumulating the edge images according to
(6.5)
where A_v is the accumulation for velocity v (in pixels per frame), E_t is the edge-strength image at frame t, i and j are coordinate indices, N is the number of frames in the gait sequence and dy_t is the y-displacement of the subject from their center of oscillation at frame t. Each moving object in the scene will appear as a peak in a plot of maximal accumulation intensity against velocity. If the subject is the most significant moving object in the scene (in terms of edge strength and visibility), their velocity can be inferred by selecting the highest peak in this plot.
(a) accumulation (b) period and phase (c) global models (d) local model
Figure 6.6 Overview of Extended Accumulation Process [116]
The subject is then extracted by matching a coarse person-shaped template, Fig. 6.6(a). This template is constructed from mean anatomical data [20], scaled to the subject's height. It employs a trapezium enclosing the motion of the legs, and rectangular sections define the expected positions of the thorax and the head. A more accurate model of the subject's bulk shape uses an ellipse for the torso and for the head. Four line segments are used to model each leg and a rectangle for each foot; parameters describing leg segment lengths and radii are initially set to fixed proportions of the subject's height (where radii are measurements of leg width at chosen points). The parameters describing the head and torso are determined separately by template matching within the locality of the initial segmentation, constrained by mean anatomical proportions. Although all shape dynamics are lost in the temporal accumulation process, it is still possible to estimate the amplitude of hip rotation, which may be used to aid articulated motion estimation. Gait frequency is determined by finding the frequency and phase that minimize the error function given by Eqn. 6.6. This minimization can be performed very quickly for the typical range of frequency and phase expected for a walking person.
(6.6)
(a fixed ratio of the signal mean magnitude), ω_i and φ_j are the proposed gait frequency and phase (the offset phase was determined empirically and holds for the baseline approximation on this large number of subjects). Note that the dominant signal frequency is twice that of the gait frequency, and the constant offset phase shift is required to align the two sinusoids. Data collected from clinical gait studies [20, 22] was used to build prototypical models for hip, knee, ankle and pelvis rotation. The use of mean gait models allows extraction of approximate joint positions for the subject, but this is not sufficient for recognition purposes; the estimation process assumes average gait motion, i.e. no individuality. To capture individual variation, adaptation of the mean leg-motion models (Fig. 6.6(c)) is required. However, before matching the leg shape models to image data, it is necessary to improve the estimate of the radius of the subject's leg at the hip, knee and ankle. The initial estimates are computed as fixed proportions of the subject's height, which may not be appropriate for certain types of clothing (baggy trousers, shorts or skirts, for example). An improved estimate is obtained by computing a line Hough transform for each frame within the upper and lower leg regions. Within each Hough space we find the peak accumulation satisfying constraints on the expected rotation of the leg and the distance between the two lines (the width of the leg). An overall estimate of leg width is computed as a mean of the best line parameters from each frame, weighted by accumulation intensity.
[Figure 6.7: cumulative match characteristic (match probability vs. rank) for the indoor and outdoor datasets]
The recognition measures were analyzed using ANOVA for the performance on the Southampton indoor and outdoor datasets: database AS (Table 3.3) and the outdoor database E (Table 3.2) [116]. A cumulative match characteristic is shown in Fig. 6.7, for which the correct classification rate (CCR) is given by the CMC at a rank of one, showing a CCR of approximately 84% on the indoor dataset and 64% on the outdoor dataset, reflecting the complexity of outdoor data. This allows for an analysis of potency, shown in Table 6.1, with the highest F-statistic giving the greatest discriminatory capability and hence the highest rank. This is then similar to the analyses of potency of silhouette measures. This analysis suggests that the majority of the system's discriminatory capability is derived from gait frequency (cadence) and from some static shape parameters. Of course, these shape parameters will be highly dependent on clothing, which may limit the utility of performing recognition solely on the basis of these parameters. These results may in part explain why some approaches using primarily static parameters [106] or cadence [102] achieve good recognition capability from few parameters. There is a significant reduction in discriminatory capability on the outdoor dataset compared to the indoor dataset, resulting from the lower extraction accuracy, but there is still a strong case for recognition potential using this data. A further analysis [157] has improved extraction by using snakes to evolve from the evidence-gathering derived approximations, as shown in Fig. 6.8, and this shows improved recognition capability.
We present an effective approach to tracking a walking human based on both a body model and a motion model in a Condensation framework [219]. This approach was developed at the CAS Institute of Automation. The Condensation framework is very attractive because it can handle clutter and the fusion of information. Fig. 6.9 gives the framework of our tracking approach. In tracking, we maintain, at successive time-steps, a sample set of poses that are 12-dimensional vectors. The sample set is derived either from the tracking result of the previous frame or, for the first frame, from a specific initialization process. For each new frame, the sample set is subjected to the predictive steps. First, samples undergo drift according to the previous pose, the motion model and the motion constraints. The second predictive step, diffusion, is random, and the drifted samples may split. Our dynamic model directs the predictive steps. After prediction, our pose evaluation function (PEF) measures the similarities between the image data and the projected human body model with the diffused poses. The posterior mean pose of the tracked person can then be generated from the sample set by weighting with the similarities. The tracking results will finally be used for gait recognition. Other tracking approaches are also available, specifically aimed at tracking the human body in monocular video data. In these, a shape-encoded particle filtering technique is used to track body parts [107, 108]. Later, angle motions were estimated via multiple cameras [109] with a kinematic chain to model the human motion. In this, the 2D motion of pixels is related to the 3D motion of body parts via a projection model, a rigid model of the human torso and the kinematic chain modeling the human body parts.
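The predict (drift, diffuse) and weight (PEF) cycle can be sketched as a generic Condensation iteration. The drift and pose-evaluation functions below are placeholders, not the book's motion model or PEF, and the resampling scheme is one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(samples, weights, drift, diffusion_std, pose_eval):
    """One Condensation iteration over a set of 12-D pose samples.

    1. resample poses according to their weights,
    2. drift them with the motion model,
    3. diffuse them with random noise,
    4. re-weight them with the pose evaluation function (PEF),
    returning the new sample set plus the weighted-mean pose estimate.
    """
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    moved = np.array([drift(s) for s in samples[idx]])
    diffused = moved + rng.normal(0.0, diffusion_std, size=moved.shape)
    w = np.array([pose_eval(s) for s in diffused])
    w = w / w.sum()
    mean_pose = (w[:, None] * diffused).sum(axis=0)
    return diffused, w, mean_pose
```

Iterating this step pulls the sample cloud toward poses that score highly under the PEF, which is how the tracker locks onto the walker's configuration frame by frame.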
[Figure 6.9: tracking framework: Initialization; Dynamic Model (Motion Model & Motion Constraints); Pose Evaluation Function (Body Model); Gait Recognition]
Figure 6.10 Human Body Model Projected into the Image Plane from 5 Viewing
Angles
The above human body model in its general form has 34 DOFs: 2 DOFs for each of
the 14 body parts (14 × 2 = 28), 3 DOFs for the global position (translation), and 3
DOFs for the global orientation (rotation). Searching quickly in a 34-dimensional
state space is extremely difficult. However, in the case of gait recognition, people
are usually captured walking along a line when the camera is installed in a suitable
configuration (for convenience, we assume that people walk parallel to the image
plane; with little modification, our approach can also be applied to other fixed
directions), and the movements of the head, neck and lower torso relative to the
upper torso are very small. The state space can therefore be naturally reduced with
such constraints. Here, we assume that only the arms and legs have relative
movements when the upper torso moves along a line, and furthermore that each
joint has only one DOF. Accordingly, the dimensionality of the state space is
reduced to 12: 1 DOF for each of the 10 joints mentioned above plus 2 DOFs for the
global position.
We represent the position and posture by a 12-dimensional state vector
P = {x, y, θ1, θ2, …, θ10}, where (x, y) is the global position of the human body
and θi is the ith joint angle. This state vector describes the relative positions of
the different body parts.
In model-based tracking, we need to synthesize and project the body model into
the image plane given the camera parameters and the state vector
P = {x, y, θ1, θ2, …, θ10}. In other words, we need to calculate the camera
coordinates of each point in the body model and transform them to image
coordinates. To locate the positions of model parts in the camera coordinate frame,
each part is defined in a local coordinate frame with the origin at the base of the
truncated cone (or at the center of the sphere). Each origin corresponds to the center
of rotation (the joint position). We represent the human body model as a kinematic
tree, with the torso at its root, to order the transformations between the local
coordinate frames of different parts. The camera coordinate of each part is therefore
formulated as the product of the transformation matrices of all the local coordinates
on the path from the root of the kinematic tree to that part. The geometrical optics is
modeled as a pinhole camera with a transformation matrix T such that x_i = T · X_c,
where x_i and X_c are the image and camera coordinates of a point on the human
body respectively (see [181] for more information).
116
Learning Motion Model and Motion Constraints
A motion model, encoding the dynamics of the human body, can be used in tracking
to greatly reduce the computational cost while achieving better results. As a highly
constrained activity, the gait patterns of human walking are symmetric, periodic,
and show little variation across a wide range of people. It is therefore relatively easy
to learn a compact and effective motion model for human gait from limited training
data. Here, our motion model for human gait (hereafter referred to as the motion
model) is learnt from semi-automatically acquired training examples and formulated
as Gaussian distributions. The dependency between joint angles is also analyzed to
explore the motion constraints that, together with the motion model, are integrated
into the dynamical model to focus on the heavily weighted samples in the
Condensation framework.
where A · B returns the matrix whose elements are the products of the
corresponding elements in A and B, and Ta(i) removes the first i − 1 rows of B and
adds i − 1 rows of zeros to the end of B. When A and B have different numbers of
rows, we append enough rows of zeros to the end of the matrix with fewer rows to
make their row counts equal.
We form a matrix A′_i for each training example i, with the row index indicating the
time step and the column index indicating the motion parameters, where i = 1…m.
The period t_i of the ith example is then computed from the autocorrelation
corr(A′_i, A′_i); the interval between two dominant peaks is chosen as the period.
To rescale the walking cycles to the same length T (here T = 100), the B-spline
interpolation algorithm is applied to the example A′_i with the scalar α_i = T / t_i.
Given that A′_i is rescaled to A_i, a specific one among the A_i (i = 1…m), e.g. A_1,
is selected as the reference. The phase b_i of each example A_i relative to the
reference example A_1 is then indicated by the predominant peak in the
cross-correlation corr(A_i, A_1). In all,

B_i = Ta(b_i)(A_i),  i = 1…m (6.8)

are the normalized examples with the same period and phase. The segments
B_i(1:T), B_i(T+1:2T), …, i = 1…m, renamed as W_j with j = 1…n, are exactly all of
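The normalization pipeline above (period from the autocorrelation, rescaling to a common length, phase alignment against a reference) can be sketched for a single joint-angle signal as follows. This is a simplified illustration: it uses linear rather than B-spline interpolation, and circular shifting in place of the Ta(·) row-shift operator.

```python
import numpy as np

def estimate_period(signal):
    """Period as the lag of the dominant non-zero peak of the
    autocorrelation (corr(A_i, A_i) in the text)."""
    s = signal - signal.mean()
    ac = np.correlate(s, s, mode='full')[len(s) - 1:]   # lags 0 .. len-1
    # skip lag 0; keep local maxima, take the strongest one
    peaks = [k for k in range(1, len(ac) - 1)
             if ac[k] >= ac[k - 1] and ac[k] >= ac[k + 1]]
    return max(peaks, key=lambda k: ac[k])

def rescale_cycle(signal, T=100):
    """Resample one cycle to a common length T (linear interpolation
    here, where the text uses B-splines)."""
    t_old = np.linspace(0.0, 1.0, len(signal))
    t_new = np.linspace(0.0, 1.0, T)
    return np.interp(t_new, t_old, signal)

def align_phase(signal, reference):
    """Circularly shift `signal` so its phase matches `reference`,
    using the peak of their cross-correlation."""
    n = len(reference)
    xc = [np.dot(np.roll(signal, -b), reference) for b in range(n)]
    return np.roll(signal, -int(np.argmax(xc)))
```

In the chapter the same three steps are applied jointly to all columns of each example matrix rather than to one signal at a time.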
117
the normalized walking cycles. Our motion model is then empirically represented as
Gaussian distributions G_{k,t}(u_{k,t}, σ_{k,t}) for each joint angle k (k = 1…10) at
any phase t (t = 1…T) in the walking cycle, with

u_{k,t} = (1/n) Σ_{j=1…n} W_j(t, k),  k = 1…10, t = 1…T (6.9)

σ_{k,t}² = (1/n) Σ_{j=1…n} (W_j(t, k) − u_{k,t})²,  k = 1…10, t = 1…T (6.10)
Figs. 6.11(b) and (c) show the temporal models of the joint angles of the left thigh
and left knee. The learning and representation of our motion model are compact,
and the model shows great effectiveness in estimating the prior distribution of the
initial pose and in predicting the new pose for the next frame.
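Equations 6.9 and 6.10 amount to a per-phase sample mean and standard deviation over the normalized cycles. A minimal sketch (the array-shape convention is mine, not the book's):

```python
import numpy as np

def learn_motion_model(cycles):
    """Per-phase Gaussian motion model (Eqs. 6.9 and 6.10).
    `cycles` has shape (n, T, K): n normalized walking cycles,
    T phases per cycle, K joint angles.  Returns the mean u[t, k]
    and standard deviation sigma[t, k] over the n cycles."""
    W = np.asarray(cycles, dtype=float)
    u = W.mean(axis=0)                             # Eq. 6.9
    sigma = np.sqrt(((W - u) ** 2).mean(axis=0))   # Eq. 6.10
    return u, sigma
```

The pair (u, sigma) fully parameterizes the G_{k,t} distributions used later for initialization and prediction.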
Figure 6.11 Motion model and motion constraints: (a) joint angles of the left knee of
4 different people walking with various periods and phases; (b), (c) temporal models
of the joint angles of the left thigh and left knee during a walking cycle; (d) the dark
lines and the shaded areas indicate the mean and standard deviation of the
corresponding distribution.
Motion Constraints
Although the motion model describes the basic pattern of walking, it does not
contain all information about walking. We therefore derive motion constraints from
the training data by further exploring the dependency between neighboring joints:
shoulder and elbow, thigh and knee, knee and ankle. Obviously, in walking, the
movements of the lower arm and the upper arm are correlated and regular, so the
shoulder joint and the elbow joint are not independent. We assume that the lower
arm is driven by the upper arm, and accordingly that the elbow joint is determined
by the shoulder joint up to some noise. The motion constraint of the elbow joint can
thus be approximated by the conditional distribution p(θe | θs), where θe and θs are
the joint angles of the elbow and the shoulder respectively. Using the training data
of the previous subsection, the distribution can easily be computed by the following
procedure. From each walking cycle W_i (i = 1…n), a series of pairs of shoulder
and elbow joint angles (θs_i(t), θe_i(t)) is formed as the time t varies from 1 to T.
We classify all pairs according to their first element, i.e., pairs having an identical
first element are assigned to the same class. Then for any shoulder joint angle θs,
provided that the class (θs, ·) includes K pairs (θs, θe_k), k = 1…K, the conditional
distribution p(θe | θs) is represented by a Gaussian distribution G(u, σ²), where u
and σ are the mean and standard deviation of θe_k, k = 1…K. Fig. 6.11(d) gives the
motion constraint for the elbow joint. The motion constraints for the knee and ankle
joints are learnt in the same way. The Gaussian representation is assumed here for
simplicity; it seems to work well, and its further analysis remains future work.
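The class-and-fit procedure for p(θe | θs) can be sketched as below. Binning the shoulder angle stands in for the text's "identical first element" classes, since real-valued angles need discretization; the bin width is my own choice, not the book's.

```python
import numpy as np

def learn_joint_constraint(theta_s, theta_e, bin_width=1.0):
    """Motion constraint p(theta_e | theta_s) as one Gaussian per
    shoulder-angle bin: pairs whose shoulder angle falls in the same
    bin form one class, and that class's mean and standard deviation
    parameterize G(u, sigma) for the elbow angle."""
    theta_s = np.asarray(theta_s, dtype=float)
    theta_e = np.asarray(theta_e, dtype=float)
    keys = np.floor(theta_s / bin_width).astype(int)
    model = {}
    for k in np.unique(keys):
        vals = theta_e[keys == k]
        model[k] = (vals.mean(), vals.std())
    return model
```

At tracking time, a query shoulder angle is mapped to its bin and the stored (mean, std) pair gives the constraint Gaussian for the elbow.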
We also derive an interval of valid values for each motion parameter from the
training data by specifying its maximum and minimum values. All generated
samples are constrained to their associated intervals by clipping out-of-range values
to the interval's minimum or maximum.
Tracking
The main task here is to relate the image data to the pose vector
P = {x, y, θ1, θ2, …, θ10}. Since the articulated human body model is naturally
formulated as a tree-like structure, a hierarchical estimation, i.e., locating the global
position and then tracking each limb separately, is suitable here, especially when the
total number of parameters is large. This decomposition of the parameter space is
also strongly supported by the following reasons. Firstly, the global position (x, y)
is much more significant than the joint angles with respect to the PEF, so it can be
estimated separately with the other motion parameters fixed. Secondly, the joint
parameters depend greatly on the global position (x, y): a slight deviation of the
global position often causes the joint parameters to deviate drastically from their
real values when maximizing the PEF, with the result that a large weight is assigned
to an actually unimportant sample. The global position should therefore be located
before sampling the joint angles. Thirdly, locating the optimal pose in a
high-dimensional state space, e.g. 12 DOFs here, is intrinsically difficult, and
decomposition effectively simplifies this problem. Finally, since the upper limbs can
sometimes hardly be segmented from the torso, tracking the upper limbs is more
difficult than tracking the lower limbs and accordingly needs a larger sample set.
Given the above considerations, we first predict the global position from the
centroid of the detected moving human and then refine it by searching the
neighborhood of the predicted position. Each limb is tracked under the
Condensation framework [219]. A popular method in visual tracking, the
Condensation algorithm uses learnt dynamical models, together with visual
observations, to propagate a random sample set over time. Instead of computing a
single most likely value, it evaluates the posterior distribution by factored sampling
and can thus represent simultaneous alternative hypotheses. It is therefore more
robust than the Kalman filter, a Gaussian-based and unimodal method. Another
advantage of the Condensation framework is that it can easily handle the fusion of
information, especially temporal fusion, in a principled manner. We will see below
that the observations, the prior knowledge of the motion model and the motion
constraints are all straightforwardly fused by the density propagation rule to derive
the posterior distribution.
The rule of state density propagation over time is [219]:

p(x_t | Z_t) = k_t · p(z_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | Z_{t−1}) dx_{t−1} (6.11)

where x_t are the motion parameters at time t, Z_t = (z_1, z_2, …, z_t) is the image
sequence up to time t, and k_t is a normalization constant independent of x_t.
According to this rule, the posterior distribution p(x_t | Z_t) can be derived from the
posterior p(x_{t−1} | Z_{t−1}) at the previous time step and three other components:
the prior distribution p(x_0) at time step 0, i.e., the initialization; the dynamical
model p(x_t | x_{t−1}), which predicts the motion parameters x_t by drifting and
diffusing x_{t−1}; and the observation density p(z_t | x_t) computed from the PEF.
These are detailed in the following subsections.
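In practice Eq. 6.11 is implemented by factored sampling. The following toy sketch (a 1-D state with hand-written Gaussian densities, not the chapter's actual tracker) shows one propagation step, resampling from the previous posterior, drifting and diffusing through the dynamical model, and reweighting by the observation density, plus the posterior mean used as the tracking output:

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(samples, weights, dynamics, observation_density):
    """One step of the density propagation rule (Eq. 6.11):
    factored sampling from p(x_{t-1} | Z_{t-1}), prediction through
    the dynamical model p(x_t | x_{t-1}), and reweighting by the
    observation density p(z_t | x_t) (the PEF in this chapter)."""
    n = len(samples)
    # factored sampling from the previous weighted posterior
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    # drift and diffuse each chosen sample
    predicted = np.array([dynamics(samples[i]) for i in idx])
    # reweight with the observation density
    w = np.array([observation_density(x) for x in predicted])
    return predicted, w / w.sum()

def posterior_mean(samples, weights):
    """Posterior mean pose used as the tracking output."""
    return np.average(samples, axis=0, weights=weights)
```

With a drift-plus-noise dynamics and an observation density peaked at the true state, the weighted sample set concentrates around that state within a few steps.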
1) Initialization
Initialization concerns the initial pose of a subject when capturing human motion.
Most previous approaches handled initialization by manually adjusting the human
body model to approximate the real pose, or by arbitrarily assuming that the initial
pose follows a uniform distribution. Unlike previous work on initialization, which
attempts to roughly estimate the pose from a single frame, we accomplish it using
the spatiotemporal information of the first N frames. Our approach is thus more
robust and, most importantly, achieves real-time speed by avoiding evaluation of
the cost function. In what follows we describe the initialization procedure, which
includes a learning process and an estimation process.
In the learning process, the moving human in each frame of the training data is
detected by subtracting the background image, and edges are extracted using the
Sobel operator. The moving area is then clipped and normalized to the same size.
As in the preprocessing for learning the motion model, the normalized examples are
adjusted to the same phase and their periods are rescaled to the same length (here
the length is T_awc = 30). Again, n normalized walking cycles V_j, j = 1…n, are
segmented from the m preprocessed examples. We use the average walking cycle

V = (1/n) Σ_{j=1…n} V_j (6.12)
The N frames v (N < T_awc) are then located in the reference cycle by searching for
the major peak in the cross-correlation corr(V, v). Referring the location (assumed
to be t) to the motion model, the pose of the last frame in v is roughly estimated as
the 10-dimensional vector (u_{3,t}, u_{4,t}, …, u_{12,t}). Accordingly, the prior
distribution for tracking is the Gaussian distribution
G((u_{3,t}, u_{4,t}, …, u_{12,t}), σ_t² I_{10}), where I_{10} is a 10 × 10 identity
matrix and σ_t² = (σ_{3,t}², σ_{4,t}², …, σ_{12,t}²).
Figure 6.12 Example of initialization (for frame 19 in sequence mp2):
(a) cross-correlation between the short sequence (frames 15-19 in sequence mp2)
and two concatenated walking cycles; (b) projection of the human body model with
the initial pose.
Fig. 6.12 illustrates the estimation process. To estimate the initial pose for frame 19
in sequence mp2, the short sequence (frames 15 to 19) is used to compute the
cross-correlation at all displacements between the short sequence and the average
walking cycle. Here two average walking cycles are concatenated to make the result
more accurate. Assuming the predominant peak over all shifts is located at t (t = 21,
see Fig. 6.12(a)), we can derive the phase in the motion model for frame 19:

p = t × T / T_awc

The motion model at phase p then indicates the initial pose. In Fig. 6.12(b), the
human body model with the initial pose is projected onto the real image data. The
result shows that, although there are some errors at the left arm and the left leg, the
initialization as a whole is very close to the true pose, demonstrating the
effectiveness of our automatic initialization procedure.
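The phase-location step can be sketched as follows with toy 1-D signals. The function names are illustrative, not from the text, and normalized cross-correlation is used so that the peak is insensitive to the varying amplitude of short segments:

```python
import numpy as np

def estimate_initial_phase(short_seq, avg_cycle):
    """Locate a short observation (N frames) within the average walking
    cycle: the peak of the normalized cross-correlation with two
    concatenated copies of the cycle gives the phase of the last frame."""
    doubled = np.concatenate([avg_cycle, avg_cycle])
    N = len(short_seq)
    s_hat = short_seq / np.linalg.norm(short_seq)
    scores = []
    for s in range(len(avg_cycle)):
        w = doubled[s:s + N]
        scores.append(np.dot(w / np.linalg.norm(w), s_hat))
    t = int(np.argmax(scores)) + N - 1   # phase of the last observed frame
    return t % len(avg_cycle)

def initial_pose(motion_mean, t, T_model, T_awc):
    """Map the located phase into the motion model, p = t * T / T_awc,
    and read off the mean pose (u_{3,p}, ..., u_{12,p}) at that phase."""
    p = int(round(t * T_model / T_awc)) % T_model
    return motion_mean[p]
```

The returned mean pose, together with the per-phase standard deviations, parameterizes the Gaussian prior used to start tracking.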
This initialization method can also be used to recover from severe tracking failures
due to occlusion, accumulated error, or image noise. When a severe failure occurs
(i.e., when the PEF reaches a predefined threshold), the tracker stops for N frames
and reinitializes using the spatiotemporal information derived from those N frames
to estimate the current pose. It is worth mentioning, however, that the real-time
speed and robustness of the initialization and bootstrap come at the expense of the
first N − 1 frames, in which tracking is stopped.
space containing most of the information about the posterior. The desired effect is to
avoid, as far as possible, generating samples that have low weights, since they
contribute little to the posterior. Here, the learnt motion model, serving as a prior, is
integrated into the dynamic model for efficient sampling. Under the assumption that
the Gaussian distributions at different phases in the motion model are independent,
at any time instant t each motion parameter θ_{i,t} satisfies a dynamic model of the
following form (shown here for the elbow joint):

p(θ_{e,t} | θ_{e,t−1}, θ_{s,t}) = q · G(α u_{e,t} + β u_{e,t−1} + γ θ_{e,t−1},
  λ((α σ_{e,t})² + (β σ_{e,t−1})²)) + (1 − q) · p(θ_{e,t} | θ_{s,t}) (6.14)

where α, β, γ, λ are defined as above. Equations similar to (6.14) can also be
written for the knee and ankle joints.
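Sampling from the mixture in Eq. 6.14 can be sketched as follows. All parameter values in the signature are illustrative placeholders, not the values used in the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_elbow(theta_e_prev, u_e_t, u_e_prev, sigma_e_t, sigma_e_prev,
                 constraint_mean, constraint_std,
                 q=0.8, alpha=0.5, beta=-0.5, gamma=1.0, lam=1.0):
    """Draw the elbow angle from the mixture dynamic model of Eq. 6.14:
    with probability q, predict from the motion-model drift term
    (a Gaussian around a combination of the model means and the
    previous angle); otherwise fall back on the motion constraint
    p(theta_e | theta_s), itself a learnt Gaussian."""
    if rng.random() < q:
        mean = alpha * u_e_t + beta * u_e_prev + gamma * theta_e_prev
        var = lam * ((alpha * sigma_e_t) ** 2 + (beta * sigma_e_prev) ** 2)
        return rng.normal(mean, np.sqrt(var))
    return rng.normal(constraint_mean, constraint_std)
```

The constraint branch keeps some probability mass on poses consistent with the shoulder angle even when the motion-model prediction drifts, which is what lets the tracker recover from atypical gaits.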
gradient direction at pixel p_i. In other words, the pixel nearest to p_i along that
direction is desired. Given that q_i is the corresponding pixel and that F_i stands for
the vector from p_i to q_i, the matching error of pixel p_i with respect to q_i can be
measured as the norm ||F_i||. The average of the matching errors over all N pixels
on the boundary of the projected human model is then defined as the boundary
matching error

E_b = (1/N) Σ_{i=1…N} ||F_i|| (6.15)
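Eq. 6.15 can be sketched as below. For simplicity this version matches each model-boundary pixel to its nearest edge pixel in any direction, whereas the text searches along the gradient direction:

```python
import numpy as np

def boundary_matching_error(model_boundary, edge_points):
    """Boundary matching error (Eq. 6.15): for each pixel p_i on the
    projected model boundary, F_i is the vector to the nearest edge
    pixel q_i, and E_b is the mean of ||F_i|| over the boundary."""
    P = np.asarray(model_boundary, dtype=float)   # shape (N, 2)
    Q = np.asarray(edge_points, dtype=float)      # shape (M, 2)
    # pairwise distances from model pixels to edge pixels
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

A well-fitted limb projection yields small ||F_i|| everywhere and hence a small E_b.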
In general, the boundary matching error can properly measure the similarity
between the human model and the image data, but it is insufficient under certain
circumstances. A typical example is given in Fig. 6.14(a), where a model part falls
into the gap between two body parts in the edge image: although it is obviously
badly fitted, the model part may have a small boundary matching error. To avoid
such ambiguities, region information is further considered in our approach. Fig.
6.14(b) illustrates the region matching. Here the region of the projected human
model that is fitted to the image data is divided into two parts: P1 is the region
overlapping the image data and P2 is the rest. The matching error with respect to
the region information is then defined by
E_r = |P2| / (|P1| + |P2|) (6.16)

where |P_i| (i = 1, 2) is the area, i.e., the number of pixels, of the corresponding
region.
Both the boundary and region matching errors are combined into the PEF, which is
modeled in terms of a robust radial term ρ(s, σ) = e^{−s²/σ²} [221]:
where P = {x, y, θ1, θ2, …, θ10} is the pose vector, and a is a scalar that adjusts the
weights of E_b and E_r. Apart from its robustness, the radial term can improve the
efficiency of factored sampling because it assigns heavier weights to important
samples and reduces the weights of insignificant ones. A smaller a magnifies this
effect, but it also makes the curve of the PEF peakier, which leads to a lower
survival rate of samples and hence increases the number of samples needed.
Therefore, a must be carefully selected. A bigger value of a is preferred for the
upper limbs, to diminish the influence of the region matching error: the upper limbs
and the torso often have clothes of the same texture and frequently occlude each
other, so the region information is of relatively little importance there.
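A minimal sketch of the evaluation, combining the two matching errors under a robust radial term, is given below. The additive combination e_b + a·e_r and the exponential form of the radial term are assumptions consistent with the description above, not the book's exact formula:

```python
import numpy as np

def region_matching_error(p1_area, p2_area):
    """Region matching error in the assumed form of Eq. 6.16: the
    fraction of the projected model region lying outside the image
    data, from the overlap areas |P1| and |P2|."""
    return p2_area / float(p1_area + p2_area)

def pose_evaluation_function(e_b, e_r, a=1.0, sigma=10.0):
    """PEF as a robust radial term rho(s, sigma) = exp(-(s/sigma)^2)
    applied to the combined matching error s = e_b + a * e_r
    (combination and constants are illustrative assumptions)."""
    s = e_b + a * e_r
    return np.exp(-(s / sigma) ** 2)
```

A perfect fit (both errors zero) evaluates to 1, and the score decays smoothly as either error grows; a smaller `sigma` reproduces the "peakier PEF" behavior discussed above.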
Figure 6.14 Illustrating the necessity of simultaneous boundary and region
matching: (a) a typical ambiguity, in which a model part falls into the gap between
two body parts; (b) measuring the region matching error.
Fig. 6.15 shows the effectiveness of the proposed PEF. Its curve is basically smooth
and has no local maxima in the neighborhood of the global maximum; these two
properties are very useful for optimization. Furthermore, according to the contour of
the PEF and other experiments, we can conclude that the global position (x, y) is
more significant than the joint angles with respect to the PEF. This is one of the
reasons why the global position can be determined first, with the other parameters
fixed.
Figure 6.15 The curve of the PEF with the global position x and the joint angle of the
left thigh changing smoothly and other parameters remaining constant. Also shown
is the contour of the function.
Our PEF is insensitive to noise to some extent. It is true that the PEF depends on
boundary and motion detection, which are sensitive to noise. But in tracking, each
limb is considered as a whole: even if part of the limb is affected by noise, the PEF
can still realistically reveal the pose when the whole limb is considered. In other
words, the PEF utilizes the prior knowledge of the body model to reduce the
influence of noise. For example, in Fig. 6.16 the left upper arm and right leg are
missing from the edge image (b) due to noise, but the failure was recovered in
tracking, where each limb is tracked in its entirety. However, when the whole
boundary of a limb cannot be detected, the PEF may fail, resulting in incorrect
tracking (see frame 32 in Fig. 6.21, where the left arm is missed due to significant
motion blur).
Figure 6.16 Noise sensitivity of the PEF: (a) original image; (b) edge image with
missing data; (c) left upper arm and right leg correctly tracked.
examples are all selected from this database. The outdoor sequences are from the
CASIA database.
constraints, only the human areas clipped from the original image sequences are
given. Some sequences include challenging configurations in which the two legs
and thighs occlude each other severely (e.g. frame 27 in Fig. 6.17(a)), so that most
of one leg or thigh is unseen. These difficult data verify the effectiveness of our
approach. Other challenges include shadows under the feet, the arm and the torso
having the same color, various colors and styles of clothes, different shapes of the
tracked people, and the low quality of the image sequences. It is worth mentioning
that the arm on the far side from the camera was sometimes lost in these sequences
due to severe occlusion by the torso (see frame 19 in Fig. 6.17(c)). However, its
motion parameters can usually be properly estimated using the motion model (see
frame 27 in Fig. 6.18(b)) or using the symmetric value of the other arm.
Further experiments were also carried out to analyze our approach in more depth.
As mentioned, tracking new sequences requires many more state samples than
tracking sequences from the training data. The chief reason is that our motion
model is learnt only from limited training data and cannot accurately represent the
variations of all gait sequences, especially abnormal ones. When a novel instance is
encountered, the deficiency of the motion model reduces the accuracy of the
prediction of the dynamic model. Fortunately, increasing the number of state
samples compensates for this, as demonstrated by an experiment. In Fig. 6.19(a),
when the sequence is included in the training data, the prediction is very close to the
refined
results. The good prediction, which also proves the effectiveness of the dynamic
model, requires only a small set of samples. In contrast, in Fig. 6.19(b), when the
same sequence is intentionally removed from the training data, the prediction is less
accurate but the larger sample set offsets the inaccuracy.

A question then arises: what role does the motion model play in tracking? We
track a sequence without sampling, so that the motion parameters are estimated
entirely from the motion model. The tracking results illustrated in Fig. 6.20 reveal
that, although roughly correct, the results are wrong at some body parts, in contrast
to those in Fig. 6.17(c). Therefore the motion model does not unduly affect the
tracking, and the sample set can offset the prediction errors.
The tracked sequences in Fig. 6.17 and Fig. 6.18 are all selected from the early
Southampton and the CASIA gait databases. In these two databases, the
backgrounds are basically clean, so we further test our approach on some additional
sequences in more complex real-world outdoor scenes. These sequences were
captured on different days and have significant motion blur and a changing
background due to wind. Two samples of such sequences are shown in Fig. 6.21. It
can be seen that the tracking results are fairly accurate, except for some errors on
the upper limbs superimposed on the torso (e.g. frame 32 in Fig. 6.21(a) and frame
38 in Fig. 6.21(b)). These errors are mainly caused by the noticeable motion blur:
the edges of most of the limbs can hardly be found when the limbs are
superimposed on other body parts, and in these frames the arms are pulled to the
torso edges. For the specific case where the arm is bare, a skin color model could be
used to segment out the arms, but the more generally applicable remedy is to use
image restoration to remove the motion blur, or to reduce the exposure time.
Finally, it is worth mentioning the applicability of our motion model. The motion
model is learnt from the early Southampton gait database, where the subjects are
mainly male and European, but it is still applicable to the two walkers shown in
Fig. 6.21. It often fails, however, when applied to recover the poses of the arm on
the far side from the camera. This means that we should extend the motion model to
cover more individuals.
Figure 6.19 Estimation results: predicted results are shown as thin lines with
markers, and results refined by factored sampling as bold lines. (a) Angle of the left
knee of dhl when the sequence is included in the training data; (b) angle of the left
shoulder joint of the same sequence when it is removed from the training data.
Figure 6.20 Analyzing the Significance of the Motion Model. Tracking results of
frames 19, 23, 33 and 43 in Sequence 2 of Subject 3 (vin2) Without Sampling
smoothed curves are the results after median filtering. It is the variation in these
joint signals that we wish to consider as dynamic information of body biometrics,
i.e., gait dynamics, for recognition.
Figure 6.22 Joint-angle Trajectories of Lower Limbs
[Figure: left thigh rotation after time normalization and alignment, and thigh
rotation from four different subjects, plotted against the gait cycle (%).]
Recognition Results
We give recognition results in both identification and verification modes (see Fig.
6.24), using normalized joint-angle trajectories from the left and right knee and
thigh joints as the gait features. Here, we collected 80 sequences from 20 subjects,
with 4 sequences per subject. From Fig. 6.24, we can see that there is indeed
identity information in such dynamic features derived from walking video that can
be exploited for the recognition task.
6.4.1 Structure by Body Parameters
A structural approach [106, 142], developed at the Georgia Institute of Technology,
aimed to identify people by static body parameters determined from a subject
walking across multiple views. These measurements are specific to the periodic gait
cycle. Further, an adjustment procedure was defined which could accommodate
subjects walking at different angles to the viewing plane of the camera; as such it
was an early viewpoint-invariant technique, one not restricted to the side view
alone. This depth correction was achieved by measuring a subject's height when the
feet were closest together, the point of maximum subject height in the walking
cycle. This gave a conversion factor to correct any distance measure, given a known
optical configuration.
The measures used were four-dimensional: the height of the bounding box
surrounding a subject; the distance between the head and the pelvis; the maximum
of the distances from the pelvis to the left and the right foot; and the distance
between the two feet. The mean of these measures was used to form a description
vector for each walk sequence. The walk vector from a motion capture system was
then used to normalize the data for recognition.
The approach was also formulated to determine a confusion measure, which
allowed prediction of how the features used could filter identity over a large
population, as an alternative to recognition performance. This measure can be
formulated in a number of ways, and here it was achieved in a way analogous to
comparing the (Gaussian) distributions that characterize the group variation versus
the individual variation. This is similar to assessing the area of intersection of the
distributions of feature vectors for different subjects: a large intersection (the
within-class variation is large compared with the group variation) implies poor
recognition capability, which is reflected in high confusion. The confusion rates
achieved were around 6%, equivalent to a CCR exceeding 90% on their database.
Although there is deployment difficulty with this approach, it certainly shows that
human identification by gait can be achieved with structural parameters. A later
study [143] showed how the performance could be predicted for a much larger
database from the same data, estimating performance capability on a database five
times larger.
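The intersection-of-distributions idea can be illustrated for the simplest case of two equal-variance Gaussians, where the overlap area has a closed form. This toy calculation is mine, not the actual measure used in [106, 142]:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def confusion(mu1, mu2, sigma):
    """Overlap area of two equal-variance Gaussians N(mu1, sigma) and
    N(mu2, sigma): a simple proxy for the confusion measure.  Large
    within-class spread relative to the between-class separation gives
    a large overlap and hence high confusion."""
    d = abs(mu1 - mu2)
    return 2.0 * normal_cdf(-d / (2.0 * sigma))
```

Identical feature distributions give an overlap of 1 (complete confusion), while well-separated subjects drive the overlap, and hence the confusion, toward 0.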
The team from Rutgers University will feature later for their work on animating
humans. By inference, that work led to an interest in automatic recognition by gait.
Like the refined Southampton approach, Zhang et al.'s approach [118] concerned
the change in orientation of the human limbs, Fig. 5.2(b). In fact, the extraction is
model-based and the description is structural, making this a blend of the two
model-based approaches described so far. The lower limbs were represented by
trapezoids and the upper body was planar, without the arms. Given distances
normalized by the height of the thorax, the human body posture was represented by
a set of distance measurements and the inclinations of its constituent parts. The gait
features were extracted from gait sequences using the Metropolis-Hastings method
to match body parts to the image data. The sequence fit was achieved by
minimizing an energy functional which described: the difference of the body from
the silhouette derived from
the image; the difference between moving silhouettes; and the difference between
the modeled appearance and the silhouette. This allows derivation of the elevation
angles, which describe the dynamics of gait, and of the trajectories of the joint
positions, which describe the spatiotemporal history.

Unlike Southampton's approach, based essentially on the elevation angles, this
approach centered on capturing temporal differences by extracting the elevation of
the knee and ankle and the width at the knees and ankles. As these are periodic, they
were described by Fourier analysis and then classified via an HMM. The procedure
was evaluated on the CMU Mobo and on the NIST HiD databases and shown to
have discrimination capability, with better results on the Mobo database. Clearly it
enjoys the advantage of model-based techniques in that the data used for
classification are intimately linked to gait itself.
7 Further Gait Developments
Human gait characteristics are best analyzed in the canonical (side) view. The
dynamics behind an individual's gait, the quasi-periodicity of the gait, and the other
properties that contribute significantly to establishing identity are most evident
when the canonical view of the gait is presented. Thus, the availability of such a
canonical view is crucial to the success of many gait recognition systems. We
observe that in surveillance applications a non-canonical view of an individual's
gait is more often captured, and this necessitates developing gait recognition
algorithms that are view invariant. When individuals walk at an angle oblique to the
camera, the performance of gait recognition systems suffers due to the gradual
change in the individual's height and stride length as captured by the camera.
One approach could be to build a 3D model of an individual and extract
view-invariant features that best characterize the individual's gait. But such
approaches, typically based on Structure from Motion (SfM) and stereo
reconstruction techniques, suffer from many shortcomings. Shakhnarovich et al.
[182] compute a visual hull from a set of monocular views and render virtual
canonical views for tracking and recognition; they extract moment-based image
features from the silhouettes of the synthesized video to perform gait recognition.
Naturally, a model of walking is implicitly invariant to viewpoint for a small
change in viewing angle. On the structural side, Bobick et al. [106] developed a gait
recognition technique that recovers static body and stride parameters of subjects
from their walking sequences. The parameters are evaluated by means of a
confusion metric which predicts the effectiveness of the feature vector in identifying
an individual. They define a mapping function across different views and use it to
perform gait recognition. The stride parameters extracted for each subject were
shown to be more resilient to different viewing directions. Naturally,
BenAbdelkader's stride and cadence structural approach equally has
viewpoint-invariant properties [173]. On the pure modeling side, Carter et al.
showed [166] that variations in gait data could be corrected after the data had been
gathered, given that the trajectory is known. This was achieved by considerations of
geometry and the use of a simple pendulum model of the thigh, with correction by
rotation of the thigh swing axis. A Fourier analysis showed that viewpoint
correction had been achieved. The approach was then reformulated to provide a
pose-invariant biometric signature which did not require knowledge of the subject's
trajectory. Spencer et al. extended these notions [167] and developed a geometric
correction to the measurement of the hip rotation angle, based on a known
orientation to the camera. Results on synthesized data showed that simple pose
correction for geometric targets generalizes well for objects on the optical axis.
Later, these techniques were refined for analysis of walking subjects, and showed
that the approach can work well, given that the target features can be extracted and
tracked successfully [168].
Here, we present a gait recognition algorithm developed at UMD in which, prior to
recognition, a canonical view of an individual's gait is synthesized from his/her non-
canonical view, without the explicit computation of 3D depth information [169,
170]. Figure 7.1 presents a framework for this approach.
Video of unknown subject at arbitrary angle → Tracking and background
subtraction → Estimation of 3-D walking direction → View synthesis → Gait
recognition → Identity (a reliability analysis of the track feeds the view synthesis
stage)
Figure 7.1 Framework for View Invariant Gait Recognition
Consider the imaging geometry presented in Fig. 7.2. The world coordinate frame is
such that its origin is at the center of perspective projection. The Z-axis is
perpendicular to the image plane. A translational velocity of the subject of the form
V = [vx, 0, 0]^T implies that the subject is walking along AB parallel to the image
plane and thereby presents a canonical view to the camera. A translational velocity
of the form V = [vx, 0, vz]^T implies that the subject is walking along AC at an angle θ
to the image plane and thereby presents a non-canonical view to the camera. θ,
referred to as the azimuth angle, is the angle of rotation about the vertical axis.
In our notation, [X, Y, Z] denotes the coordinates of a point in 3D and [x, y] denotes its
projection on the image plane. In our formulation we assume that the subject is far
from the camera and hence can be approximated by a planar object. Given a
video of a person walking at an angle θ to the image plane, we propose the
following two approaches to estimate θ.
(Figure 7.2: the imaging geometry, showing the X and Z axes, the image plane, and
the projection of a world point onto it)
7.1.2 Optical flow based SfM approach
For the constant-velocity model, vz(t) = vz and vx(t) = vx, so cot(θ) = vx/vz. Thus,
given the initial position of the tracked point (x1, y1) and the focal length f (computed
by means of calibration), θ can be computed from the estimated velocity components.
Let [X0, Y0, Z0]^T denote the coordinates of a point on the person walking at an angle
θ ≠ 0 to the plane parallel to the image plane and passing through the point
[X1, Y1, Z1]^T. The 3-D coordinates of the synthesized point are computed by means
of a rotation operation:

\begin{bmatrix} X_0' \\ Y_0' \\ Z_0' \end{bmatrix} = R(\theta)\left(\begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \end{bmatrix} - \begin{bmatrix} X_{ref} \\ Y_{ref} \\ Z_{ref} \end{bmatrix}\right) + \begin{bmatrix} X_{ref} \\ Y_{ref} \\ Z_{ref} \end{bmatrix}  (7.4)

where

R(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}
Denoting the image-plane coordinates of the original point and of the synthesized
point (for θ = 0) as [x0, y0]^T and [x̂0, ŷ0]^T respectively, and using the perspective
transformation, we obtain the equations for [x̂0, ŷ0]^T as [169]

\hat{x}_0 = \frac{f\left[x_0\cos\theta + x_{ref}(1-\cos\theta)\right]}{-\sin\theta\,(x_0 - x_{ref}) + f}, \qquad \hat{y}_0 = \frac{f\,y_0}{-\sin\theta\,(x_0 - x_{ref}) + f}  (7.5)

where x = fX/Z and y = fY/Z.
Thus the estimated azimuth angle θ is used to synthesize canonical views by
means of a direct transformation of the 2D image-plane coordinates in the non-
canonical view; in homogeneous coordinates, Eq. (7.5) is the planar projective map

\begin{bmatrix} \hat{x}_0 \\ \hat{y}_0 \\ 1 \end{bmatrix} \propto \begin{bmatrix} f\cos\theta & 0 & f x_{ref}(1-\cos\theta) \\ 0 & f & 0 \\ -\sin\theta & 0 & x_{ref}\sin\theta + f \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \\ 1 \end{bmatrix}
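The per-point synthesis of Eq. (7.5) amounts to a few lines of array arithmetic. The sketch below is an illustration, not the book's code: the function name and the assumption of a known azimuth θ, reference abscissa x_ref and focal length f are ours.

```python
import numpy as np

def synthesize_canonical(points, theta, x_ref, f):
    """Map image-plane points of a non-canonical view (azimuth theta) to the
    synthesized canonical view, following Eq. (7.5).

    points : (N, 2) array of [x, y] image coordinates (non-canonical view)
    theta  : estimated azimuth angle, in radians
    x_ref  : x-coordinate of the reference (tracked) point
    f      : focal length, in pixels
    """
    x, y = points[:, 0], points[:, 1]
    denom = -np.sin(theta) * (x - x_ref) + f
    x_syn = f * (x * np.cos(theta) + x_ref * (1.0 - np.cos(theta))) / denom
    y_syn = f * y / denom
    return np.column_stack([x_syn, y_syn])
```

Two sanity checks follow directly from the formula: θ = 0 leaves every point unchanged, and the reference abscissa x_ref is a fixed point of the map for any θ.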
The NIST database consists of 30 people walking along a Σ-shaped walking pattern
displayed in Fig. 7.3. The gait sequences were captured by two cameras. The
camera that was farther away from the subjects was chosen for our experiments as
the planar approximation of an individual is valid in such cases. The probe and the
gallery segments are displayed in Fig. 7.3 and the probe segment is at an angle of
33° from the gallery segment. A few images from the NIST database are shown in
Fig. 7.4. The implicit SfM approach discussed above was used to synthesize
canonical views on the NIST database. Some of the results of the synthesis are
shown in Fig. 7.4. Since it has been shown that the dynamics of the lower half of a
silhouette contributes significantly towards gait recognition, while performing gait
recognition we adopted the fusion of leg dynamics and height. The gait recognition
result is shown in Fig. 7.5. The performance is significantly better than that without
the synthesis of canonical views.
Figure 7.4 Sample Images from the NIST Database: (a) Gallery images of a person
walking parallel to the camera (b) Space-unnormalized images of a person walking
at 33° to the camera (c) Space-synthesized image for (b)
Figure 7.5 (cumulative match scores, in percent, plotted against rank for gait
recognition on the NIST database)
In the CMU MoBo database there are six synchronized cameras evenly distributed
around the subject walking on a treadmill. For testing our algorithm we
considered views of the slow-walk sequences from cameras 3 and 13. The views
captured by camera 3 and camera 13 are displayed in Fig. 7.6. The sequence from
camera 3 was used as the gallery while the sequence from camera 13 was used as
the probe. Since the subjects walk on a treadmill, the apparent translation was
minimal, and hence the homography approach discussed above was adopted to
synthesize canonical views from the views captured by camera 13. We considered
several points S = {(x1, y1), …, (xn, yn)} on the side of the treadmill, which is a planar
rectangular surface. We then constructed the view of this rectangular patch as it
would appear in the canonical view. A set of points S′ = {(x1′, y1′), …, (xn′, yn′)} is
considered on this hypothesized patch. The homography [81] between the two
views was estimated using the sets S and S′ [213]. Gait recognition performance
using the unsynthesized and synthesized images is shown in Fig. 7.7. Again we
found that canonical view synthesis results in better gait recognition performance.
Though the validity of the planar assumption in this case was questionable, the
results proved to be satisfactory.
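Estimating a homography from the point sets S and S′ can be sketched with the standard direct linear transform (DLT). This is a generic illustration of the technique, not the exact routine used in [213]; function names are ours.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H * src via the DLT.

    src, dst : (N, 2) arrays of corresponding points, N >= 4.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Two rows per correspondence, from u*(h3.p) = h1.p and v*(h3.p) = h2.p
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)       # null vector of A, reshaped
    return H / H[2, 2]

def apply_homography(H, pts):
    """Apply H to (N, 2) points and de-homogenize the result."""
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]
```

With four or more exact correspondences on the planar patch, the smallest singular vector of A recovers H up to scale.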
Figure 7.6 (a) View from Camera 3 (b) View from Camera 13
Figure 7.7 (cumulative match scores, in percent, plotted against rank for gait
recognition on the CMU MoBo database)
other biometrics due to the sheer absence of information pertaining to other
biometrics. But in such surveillance applications, the human face contributes
significantly towards recognition when the individual is physically closer to the
surveillance system. Optimal recognition systems must be designed so that they use
the maximum available cues for human identification and combine the recognition
results in meaningful ways. Information may be fused in two ways: the available
data could be fused and decisions arrived at based on the fused data (data fusion);
or decisions could be based on the fusion of many decisions made by analyzing
each signal/feature individually.
(Segment A: gait gallery; segment C: face probe)
Figure 7.8 Arrangement of the Sigma-Shaped Path adopted in the NIST Dataset
As discussed earlier, human gait recognition systems perform better when presented
with a canonical view of an individual's gait. On the contrary, face recognition systems
perform better when presented with a frontal view of the individual. Shakhnarovich
et al. [182, 171] extensively studied the fusion of face and gait cues. We develop an
approach [184] that combines the view-invariant gait recognition system [169]
described earlier and the probabilistic approach to face recognition from video
[172]. Experiments at UMD were based on the NIST database, which consists of 30
individuals walking along an inverted Σ-shaped path. Fig. 7.8 shows the path that
was adopted. Walking sequences along segment A are used as the gallery for gait
recognition, while those from segment B, which is at an angle of 33° to segment A,
are used as a probe. The gait recognition results obtained using the view-invariant
approach discussed earlier are shown in Figs. 7.9(a) and (d). The last part of the
sequence along segment C was used as the probe for the video face recognition
system, whose gallery consists of still face images of the 30 subjects. The
results for face recognition are shown in Figs. 7.9(b) and (e).
(Panels (a) and (d): gait recognition; (b) and (e): face recognition; (c) and (f):
fusion results)
Figure 7.9 Face and Gait Fusion Results on NIST Database
different classifiers can be combined directly using simple rules like SUM,
PRODUCT etc. To effectively combine the scores of the face and gait recognition
systems, it is necessary to make the scores comparable. We use an exponential
transformation to convert the scores obtained from gait recognition, i.e.,
given that the match scores for a probe X against the N gallery gaits are S1, …, SN,
we obtain the transformed scores exp(−S1), …, exp(−SN). Finally we normalize the
transformed scores to sum to unity. Next we discuss two forms of fusion:
hierarchical and holistic.
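A minimal sketch of this score transformation (the function name is ours, for illustration):

```python
import numpy as np

def gait_scores_to_similarities(distances):
    """Convert gait match distances S_1..S_N into comparable similarities:
    exponentiate the negated distances, then normalize to sum to unity."""
    sims = np.exp(-np.asarray(distances, dtype=float))
    return sims / sims.sum()
```

After this transformation a small distance becomes a large similarity, and both modalities produce scores on the same [0, 1] scale.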
Hierarchical Fusion
Given a normalized similarity matrix obtained from the gait recognition system, we
choose, based on the histogram distribution of its diagonal and off-diagonal
elements, a threshold that determines which individuals are to be screened by the
face recognition system. The threshold for the NIST database was chosen such that
the top six matches from the gait recognition system are passed on to the face
recognition system. The CMC plot for the resulting hierarchical fusion is shown in
Fig. 7.9(c). Note that the top-match performance has gone up from 93% to 97%.
This approach also reduces the number of computations significantly.
Holistic Fusion
If the requirement of the fusion is accuracy rather than computational speed,
alternative fusion strategies can be employed. Assuming that gait and face can
be considered independent cues, a simple way of combining the scores is to
use the SUM or PRODUCT rule [202]. The CMC curve for either case is shown
in Figure 7.9(f).
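Once both modalities are expressed as normalized similarity vectors over the same gallery, the SUM and PRODUCT rules reduce to elementwise operations. A sketch (function name ours):

```python
import numpy as np

def fuse_scores(face_sims, gait_sims, rule="sum"):
    """Holistic fusion of two normalized similarity vectors computed over
    the same gallery, treating face and gait as independent cues."""
    face = np.asarray(face_sims, dtype=float)
    gait = np.asarray(gait_sims, dtype=float)
    if rule == "sum":
        fused = face + gait              # SUM rule
    elif rule == "product":
        fused = face * gait              # PRODUCT rule
    else:
        raise ValueError("rule must be 'sum' or 'product'")
    return int(np.argmax(fused)), fused  # best gallery index, fused scores
```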
walker. Temporal pose changes of these silhouettes are represented as an associated
sequence of complex vector configurations in a common coordinate frame, and are
then analyzed using the Procrustes shape analysis method to obtain an eigen-shape
reflecting the body appearance, i.e., static information. Also, a model-based
approach under a Condensation framework, together with a human body model, a
motion model and motion constraints, is presented to track the walker in image
sequences. From the tracking results, we can easily calculate joint-angle trajectories
of the main lower limbs, i.e., the dynamics of gait as discussed in previous sections.
Both static and dynamic information may be used independently for recognition
using the nearest-exemplar pattern classifier. They may also be combined
effectively at the decision level to improve recognition performance. This method is
in essence a combination of model-based and motion-based approaches: it not only
analyzes the spatiotemporal motion pattern of gait dynamics but also derives a
compact statistical appearance description of gait as a continuum. It thus implicitly
captures both structural (appearance) and transitional (dynamic) characteristics of gait.
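The eigen-shape can be illustrated with the full Procrustes mean of complex boundary configurations: after removing translation, the mean shape is the dominant eigenvector of the sum of scaled outer products, which is invariant to the scale and rotation of each input configuration. This is a sketch under our own assumptions (boundary sampling and point correspondence are taken as handled upstream):

```python
import numpy as np

def procrustes_mean_shape(configs):
    """Full Procrustes mean ('eigen-shape') of complex boundary vectors.

    configs : iterable of length-k complex vectors, one per silhouette.
    """
    k = len(configs[0])
    S = np.zeros((k, k), dtype=complex)
    for z in configs:
        z = np.asarray(z, dtype=complex)
        z = z - z.mean()                      # remove translation
        S += np.outer(z, z.conj()) / (z.conj() @ z)  # scale/rotation invariant
    vals, vecs = np.linalg.eigh(S)            # S is Hermitian
    return vecs[:, np.argmax(vals)]           # dominant eigenvector
```

Because each term z z^H / (z^H z) is unchanged when z is scaled or rotated, configurations that are similarity transforms of one shape yield exactly that shape (up to a phase) as the mean.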
(Figure: schematic of the static-feature pipeline, from silhouette shape
representation to Procrustes shape analysis)
combination classifier. Let r(n, Ri) be the rank of the class with name n in the
ranking Ri; this rule is defined as argmin_{n_k} Σ_i r(n_k, Ri), i.e., the class with the
lowest rank sum will be the final choice. If the score functions are directly
comparable, score-based strategies are a good way to perform decision fusion.
The simplest way to combine classifiers using the score is to compute the sum of the
score functions. Let s(n, Si) be the score of the class with name n in Si; this rule
is defined as argmin_{n_k} Σ_i s(n_k, Si), i.e., the class with the lowest score sum
will be the final choice.
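Both summation rules are one-liners once the per-classifier ranks or scores are arranged as arrays; a sketch, with class indices standing in for the class names n_k:

```python
import numpy as np

def rank_sum_rule(rankings):
    """rankings[i][k] = rank of class k under classifier i (1 = best match).
    Returns the index of the class with the lowest rank sum."""
    return int(np.argmin(np.sum(rankings, axis=0)))

def score_sum_rule(scores):
    """scores[i][k] = distance-like score of class k under classifier i.
    Returns the index of the class with the lowest score sum."""
    return int(np.argmin(np.sum(scores, axis=0)))
```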
Figure 7.11 Empirical probability distributions of match scores using static features
(left) and dynamic features (right)
For each image sequence, we first extract static features in the manner described
in Section 5.3. In addition, we perform the model-based tracking and recover
dynamic features in the manner described in Section 6.3. It should be noted that
self-occlusion of body parts, shadows under the feet, the arm and the torso
having the same color, and the low quality of the image sequences all pose
challenges to our tracking method. For the small portion of sequences on which
tracking failed, we obtained the motion parameters manually, as the focus here is
not on tracking per se but on gait recognition using the tracking data as dynamic
features.
Owing to the small number of examples, we compute an unbiased estimate of
the true recognition rate using a leave-one-out cross-validation method. That is, we
leave one example out, train on the remainder, and then classify or verify the
omitted example according to its differences with respect to the remaining examples.
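A sketch of the leave-one-out estimate with a nearest-exemplar classifier (the Euclidean default distance and the function name are assumptions of this illustration):

```python
import numpy as np

def leave_one_out_ccr(features, labels, distance=None):
    """Leave-one-out estimate of the correct classification rate (CCR)
    using a nearest-exemplar classifier."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    if distance is None:
        distance = lambda a, b: np.linalg.norm(a - b)
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]   # hold example i out
        d = [distance(X[i], X[j]) for j in others]
        correct += y[others[int(np.argmin(d))]] == y[i]  # nearest exemplar
    return correct / len(X)
```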
First, we use static and dynamic features separately for recognition. In
identification mode, the classifier determines which class a given measurement
belongs to. One useful measure of classification performance that is more general
than classification error is the CMS (Cumulative Match Score). Here, we use it to
report the results of identification. For completeness, we also use ROC
(Receiver Operating Characteristic) curves to report verification results. Figs.
7.12(a) and (b) respectively show the performance of identification (for ranks up to 20)
and verification using a single modality. It should be mentioned that the CCR
(Correct Classification Rate) is equivalent to P(1) (i.e., rank = 1).
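Given a probe-by-gallery similarity matrix, the cumulative match scores can be computed as follows. This is a sketch under the simplifying assumption that probe i belongs to gallery class i; note that CCR = P(1) falls out as the first entry.

```python
import numpy as np

def cumulative_match_scores(similarity, max_rank):
    """similarity[i, j] = similarity of probe i to gallery class j.
    Returns P(1..max_rank): the fraction of probes whose true class
    appears among the top-k gallery matches, for k = 1..max_rank."""
    n = similarity.shape[0]
    order = np.argsort(-similarity, axis=1)   # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    return np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
```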
Based on the combination rules described above, we examine the results after
fusing the static and dynamic features. Figs. 7.13(a) and (b) show the
results of identification and verification using the rank-summation-based and score-
summation-based combination rules respectively, and Figs. 7.14(a) and (b) give the
fusion results using the product, sum, max and min combination rules respectively.
For comparison, we also plot the results using a single modality in Fig. 7.13 and
Fig. 7.14.
(CMC curves for identification; ROC curves of FAR against FRR, with the EER
marked, for verification using the static and dynamic features separately)
(a) identification (b) verification
Figure 7.12 Results using a Single Modality
(CMC and ROC curves comparing the static feature, the dynamic feature, fusion
based on rank summation and fusion based on score summation; the EER is marked
on the ROC curves)
(a) identification (b) verification
Figure 7.13 Results of Rank and Score Based Summation Rules
Figure 7.14 Results of the Product, Sum, Max and Min Combination Rules
From Fig. 7.11, we can see that there is indeed identity information in both the static
and dynamic features derived from walking video that can be exploited for the
recognition task. The results using dynamic information are somewhat better than
those using static information. This is likely because the dynamics reflect more of
the essential information of gait motion. As we know, Tanawongsuwan and
Bobick [161] also used dynamic features (joint trajectories) for gait recognition.
But their work is quite different from ours: they used a motion capture system to
obtain motion data, whereas here the motion parameters were recovered
automatically. They achieved a recognition rate of 73% on a database including 18
subjects, while our recognition rate is 87.5% on a database of 20 subjects.
Figs. 7.13 and 7.14 demonstrate that the fusion step improves both identification
and verification performance relative to any single modality. A summary of CCRs
and EERs is given in Table 7.1 for clarity. Another observation from the
comparative results is that the score-summation-based rule outperforms the other
combination schemes as a whole. Of the last four statistical combination rules, the
sum rule is the best for identification; this has also been shown in [202], where a
sensitivity analysis demonstrated that the sum rule is the most resilient to
estimation errors. However, the product rule is best for verification. The main
reason for the poor performance of the min rule is probably that it suffers more
from noise in score assignment than the relatively robust mean and product rules.
Also, it is believed that better results could be obtained if there were sufficient data
to model the probability distributions of the scores for the two pattern classifiers
more precisely. In all, these studies highlight the importance of a careful choice of
the whole combination strategy.
Although the results are very encouraging, experiments on a larger and more
realistic database are needed before they can be considered conclusive.
Accordingly, much remains to be done: 1) To establish a larger and more realistic
database. Unlike face recognition, gait recognition currently lacks a common
evaluation dataset. Researchers have often established their own gait databases and
then reported a recognition rate; directly comparing these reported rates means
little. We have compared some recent algorithms using static features on our
dataset, but this comparison should be treated with reservation. We therefore
strongly hope that a common dataset will be established so that everyone can make
a reasonable comparison with other work. 2) To develop more robust segmentation
algorithms and to improve 3D tracking, which is critical to the accurate and
automatic extraction of gait features. 3) To design more sophisticated classifiers
and combination rules. 4) To use a dynamic silhouette description in future work,
so as to obtain a better description of spatiotemporal silhouette changes in a gait
pattern than the static silhouette description used here. 5) To further analyze the
correlation of the two types of features. As a general rule, the higher the correlation
between two features, the smaller the improvement obtained by fusing them. In this
study, human parameters were divided into two categories, static and dynamic, and
our two features respectively reveal the two categories of parameters to some
extent. Since the static parameters of the body are basically uncorrelated with the
dynamic ones, the silhouette features should be largely uncorrelated with the
trajectories. This is perhaps why the fusion performs well. Nevertheless, a thorough
and deep analysis of the correlation is a piece of future work.
8 Future Challenges
Table 8.1 Summary of Typical Assumptions in Previous Work and Real Possible
Affecting Factors

Typical assumptions:
• No camera motion
• Only one person in the view field
• No occlusion
• No carried objects
• Normal walking motion
• Moving on a flat ground plane
• Constrained walking path
• Plain background or environment
• Specific viewing angle
• Data recorded over one time span

Real possible affecting factors:
• Changes of clothing style
• Distance between the camera and the walker
• Background or environment
• Carried objects such as a briefcase
• Abnormal walking style
• Variations in the camera viewing angles
• Walking surface such as flat ground, grass or slope
• Mood
• Walking speed
• Change with time
estimate only some parameters rather than attempt to recover a full volumetric
model. If gait is observed from a great distance and the captured video images are of
low resolution, 3D model-based tracking is inapplicable; in such cases, 2D tracking
with low computational cost is preferable. This is why silhouette-based gait
recognition methods are currently popular. Generally, the processing speed of a
biometric system such as gait recognition deserves more attention, especially for the
purpose of surveillance. Researchers are looking forward to new tracking
techniques that improve performance while decreasing computational cost.
It is believed that, driven by the needs of automatic gait recognition systems, vision
techniques for the automatic extraction of gait dynamics will develop quickly.
have already tried to address this problem by setting up a standard data set to
measure or determine what factors affect performance.
• Evaluating the key factors that affect performance
To develop reliable recognition techniques for uncontrolled environments, the key
factors that may affect identification performance need to be examined with
scientific methods. As shown in Table 8.1, and as discussed in Section 2.3, gait and
its recognition are affected by many factors, e.g., viewing angles, environments,
distances, clothing, age [164, 165], walking speed [163], etc., a few of which are
discussed as follows. A detailed study of the potential effects of these factors on
recognition awaits future investigation.
Distance: The goal here is to determine whether the decrease in image resolution
associated with viewing distance has a great adverse effect on recognition
performance. Gait recognition aims to develop recognition systems that can
function at a great distance. It is thus necessary to translate performance on
low-resolution images into recognition performance as a function of viewing
distance. Some researchers are currently setting up experiments to explore this
mapping.
Viewing angles: So far, images in all databases are two-dimensional and depend
greatly on the viewing angle. When a system tries to compare two shots of the same
person taken from different angles, it is far less effective. The obvious way to
generalize the algorithm is to store training sequences taken from multiple
viewpoints, and to classify both the subject and the viewpoint simultaneously [110].
Another interesting advance is that the view-dependent constraint can be removed
by synthesizing a virtual walking sequence in a canonical view, as in [171, 169].
Moreover, 3D body modeling, tracking and recognition [123, 180, 178, 153] may be
of significance for extracting view-invariant features such as motion trajectories
and physical parameters.
Clothing: The dependence of gait features on the appearance of a subject is
difficult to assess, unless the joint movements of a walker can be detected or
appearance-insensitive features are obtained. However, much previous work has in
fact demonstrated that the appearance of a walking subject contains information
about identity. To allow accurate recognition of a person despite changes of
clothing style, multiple appearance representations of a person with respect to
different clothes are required.
Weather: An all-day biometric recognition system operating at a distance must
contend with bad weather such as fog, snow and rain. How to reliably extract
information about moving humans from dynamic scenes in poor viewing conditions
is critical. Currently, some researchers are trying to improve human detection and
recognition in bad weather to enhance the robustness of gait recognition
techniques. Other advanced sensors such as infrared, hyper-spectral imaging and
radar are also being investigated because, unlike video, they can be used at night
and in other low-visibility conditions.
In short, advanced evaluation methodologies will be able to a) determine the
principal limits of gait biometrics; b) observe the effects of datasets of different
sizes and quality on performance; c) establish standard protocols for collecting
datasets and evaluating algorithms; and d) scientifically identify the critical
affecting factors.
The performance of any image-based gait analysis method is inherently view-
dependent, since cameras inevitably capture only a planar projection of gait
dynamics. Thus, for the best gait recognition performance, it is intuitively ideal to
use a camera that is nearly fronto-parallel to the walking person, so as to capture
the apparent gait dynamics dominated by arm and leg movements. This
view-selection capability can be provided by a distributed multi-camera system or
by synthesizing a virtual canonical view [171, 169]. Certainly, determining the
sensitivity of the extracted features to changes in viewing angle will in practice
enable a multi-camera tracking system to select an optimal view for recognition.
Moreover, it is possible to identify the direction in which an individual is walking
and use that information to select the appropriate set of exemplars corresponding to
the estimated direction. However, for multi-camera tracking systems, one needs to
decide which camera to use at each time instant. That is, the coordination of, and
information fusion between, cameras are significant problems.
As gait is fundamental to human motion, it is not unlikely that gait could find
deployment in many other areas. Here we concentrate on deployment in surveillance
and in animation, as two likely contenders. In surveillance, gait has yet to find use.
This is partly because the technique is still developing: it is only recently that gait
has been demonstrated to be able to recognize people on large databases of outdoor
data. Even then, its use in surveillance video analysis for forensic purposes demands
the ability to perform 3D analysis from images derived from a single camera. There
are viewpoint-invariant approaches, Section 7.3, and the model-based approaches do
have limited viewpoint invariance, but these are insufficiently generalized for
forensic analysis. There is a concern that the likelihood of error in analyzing single
frames derived from low-resolution surveillance video is likely to be high enough to
preclude forensic use. This error will be reduced by analyzing sequences of video
data, though experimentation will be required to determine by how much. As such,
we await developments enabling forensic deployment, though one early and very
recent study reported that "Surveillance images from a bank robbery were analyzed
and compared with images of a suspect. Based on general bodily features, gait and
anthropometric measurements, we were able to conclude that one of the
perpetrators showed strong resemblance to the suspect." [188]. There is of course
concern at such developments: DARPA's HumanID at a Distance program was
nominated as "Privacy Villain of the Week" in 2002.
It is much more likely that the use of gait in surveillance video will be to signal
events likely to be of interest. By way of example, the ability to discriminate
between litter blowing into a perimeter fence and a person walking near it or
climbing it would let an operator focus on relevant data; as long as the false
alarm rate is sufficiently low, this will improve the overall response of a
surveillance system. In this respect some of the approaches to gait, such as the
averaged silhouette or the use of area, are already simple enough to make
video-rate analysis possible. Certainly, such frameworks require viewpoint
invariance and the ability to disambiguate articulated motion, but this is a much
simpler target than full recognition by gait. These technologies are largely ready for
such deployment now.
The use of gait in animation is likely to be further away. Computer vision
researchers have been synthesizing human faces for some time [189, 192], and this
has been used for face recognition [190] (as a vector to achieve viewpoint
invariance). Further, approaches have moved to realistic depiction of the human
shape [191, 196]. This can reduce cost in film production, and reduce risk in special
effects (or even make effects possible). There have been other vision approaches
with this aim [193, 192, 195], but none have used gait modeling to improve the
depiction of animated humans. One of them has already used elevation angles [197]
as a basis for modeling, the very angles that were shown earlier to have
discriminatory capability in model-based approaches to human identification by gait.
As such, we believe there is a considerable future for gait, not just in biometrics but
also in other application domains. We look forward with great interest to the
continuing developments in this field.
References
Literature
Covariate factors
[33] E. C. Jansen, H. H. Thyssen, J. Brynskov, Gait Analysis after Intake of
Increasing Amounts of Alcohol, Zeitschrift für Rechtsmedizin (Journal of Legal
Medicine), 2, pp 103-107, 1985
[34] S. Monsell and F. Tennant, Walking Problems in Young Children, Hospital
Medicine, 65(1), pp 34-38, 2004
[35] P. C. Grabiner, S. T. Biswas, and M. D. Grabiner, Age-Related Changes in
Spatial and Temporal Gait Variables, Archives of Physical Medicine and
Rehabilitation, 82(1), pp 31-35, 2001
[36] M. M. Samson, A. Crowe, P. L. de Vreede, J. A. G. Dessens, S. A. Duursma and
H. J. Verhaar, Differences in Gait Parameters at a Preferred Walking Speed in
Healthy Subjects due to Age, Height and Body Weight, Aging-Clinical and
Experimental Research, 13(1), pp 16-21, 2001
[37] J. Quinn, J. Kaye, The Neurology of Aging, Neurologist, 7(2), pp 98-112, 2001
[38] F. A. Rubino, Gait Disorders, Neurologist, 8(4), pp 254-262, 2002
[39] C. A. McGibbon, Toward a Better Understanding of Gait Changes with Age
and Disablement: Neuromuscular Adaptation, Exercise and Sport Sciences
Reviews, 31(2), pp 102-108, 2003
[40] M. B. van Iersel, W. Hoefsloot, M. Munneke, B. R. Bloem, M. G. M. O.
Rikkert, Systematic Review of Quantitative Clinical Gait Analysis in Patients
with Dementia, Zeitschrift für Gerontologie und Geriatrie, 37(1), pp 27-32,
2004
[41] L. Sloman, M. Pierrynowski, M. Berridge, S. Tupling, J. Flowers, Mood,
Depressive-Illness and Gait Patterns, Canadian Journal of Psychiatry - Revue
Canadienne de Psychiatrie, 32(3), pp 190-193, 1987
[42] N. Becker, C. Chambliss, C. Marsh and R. Montemayor, Effects of Mellow
and Frenetic Music and Stimulating and Relaxing Scents on Walking by
Seniors, Perceptual and Motor Skills, 80(2), pp 411-415, 1995
Psychology
[43] G. Johansson, Visual Perception of Biological Motion and a Model for its
Analysis, Perception and Psychophysics, 14, pp 201-211, 1973
[44] W. H. Dittrich, Action Categories and the Perception of Biological Motion,
Perception, 22, pp. 15-22, 1993
[45] G. P. Bingham, R. C. Schmidt and L. D. Rosenblum, Dynamics and the
Orientation of Kinematic Forms for Visual Event Recognition, Journal of
Experimental Psychology: Human Perception and Performance, 21(6), pp.
1473-1493, 1995
[46] G. L. Pellechia and G. E. Garrett, Assessing lumbar stabilisation from point
light and normal video displays of lumbar lifting, Perceptual and Motor Skills,
85(3), pp. 931-937, 1997
[47] L. T. Kozlowski and J. E. Cutting, Recognizing the Sex of a Walker from a
Dynamic Point Light Display, Perception and Psychophysics, 21, pp. 575-580,
1977
[48] S. Runeson and G. Frykholm, Kinematic Specification of Dynamics as an
Informational Basis for Person-and-Action Perception: Expectation, Gender
Recognition and Deceptive Intention, Journal of Experimental Psychology:
General, 112, pp. 585-615, 1983
[49] J. E. Cutting and D. R. Proffitt, Gait Perception as an Example of How we
Perceive Events, In R. D. Walk and H. L. Pick, Eds., Intersensory Perception
and Sensory Integration, Plenum Press, London, UK, 1981
[50] G. Mather and L. Murdoch, Gender discrimination in biological motion
displays based on dynamic cues, Proceedings of the Royal Society London,
B:258, pp. 273-279, 1994
[51] J. E. Cutting and L. T. Kozlowski, Recognising friends by their walk, Bulletin
of the Psychonomic Society, 9(5), pp. 353-356, 1977
[52] D. D. Hoffman and B. E. Flinchbaugh, The Interpretation of Biological
Motion, Biological Cybernetics, 42, pp. 195-204, 1982
[53] J. E. Cutting, D. R. Proffitt and L. T. Kozlowski, A biomechanical invariant for
gait perception, Journal of Experimental Psychology: Human Perception and
Performance, 4, pp. 357-372, 1978
[54] R. F. Rashid, Towards a system for the interpretation of moving light displays,
IEEE Trans. on PAMI, 2(6), pp. 574-581, 1980
[55] S. V. Stevenage, M. S. Nixon and K. Vince, Visual Analysis of Gait as a Cue
to Identity, Applied Cognitive Psychology, 13(6), 513-526, 1999
[56] F. E. Pollick, J. Kay, K. Heim et al., A review of gender recognition from gait,
Perception, 31, p. 118, Suppl. S, 2002
[65] K. Akita, Image Sequence Analysis of Real World Human Motion, Pattern
Recognition, 17(1), pp. 73-83, 1984
[66] D. Hogg, Model-based vision - a program to see a walking person, Image and
Vision Computing, 1(1), pp. 5-20, 1983
[67] R. J. Kauth and A. P. Pentland and G. S. Thomas, Blob: an unsupervised
clustering approach to spatial pre-processing of MSS imagery, 11th Int.
Symposium on Remote Sensing of the Environment, April, Ann Arbor, MI, USA,
1977
[68] S. Kurakake and R. Nevatia, Description and tracking of moving articulated
objects, Systems and Computers in Japan, 25(8), pp. 16-26, 1994
[69] H-J Lee and Z. Chen, Determination of 3D human body postures from a single
view, Computer Vision, Graphics, and Image Processing, 30, pp. 148-168, 1985
[70] D. Marr and H. K. Nishihara, Representation and recognition of the spatial
organization of three-dimensional shapes, Proc. Royal Society of London B.,
200, pp. 269-294, 1978
[71] J. O'Rourke and N. Badler, Model-based image analysis of human motion
using constraint propagation, IEEE Trans. Pattern Analysis and Machine
Intelligence, 2(6), pp. 522-536, 1980
[72] R. Polana and R. Nelson, Detecting activities, Proc. Conf. on Computer Vision
and Pattern Recognition, New York, USA, pp. 2-7, June 1993
[73] R. F. Rashid, Towards a system for the interpretation of moving light displays,
IEEE Trans. Pattern Analysis and Machine Intelligence, 2(6), pp. 574-581, 1980
[74] K. Rohr, Towards model-based recognition of human movements in image
sequences, Computer Vision, Graphics, and Image Processing, 59(1), pp. 94-
115, 1994
Databases
[82] http://www.sinobiometrics.com/resources.htm
[83] J. D. Shutler, M. G. Grant, M. S. Nixon, and J. N. Carter, On a Large Sequence-
Based Human Gait Database, Special Session on Biometrics, Proceedings of the
4th International Conference on Recent Advances in Soft Computing,
Nottingham (UK), 2002
[84] http://www.gait.ecs.soton.ac.uk/data.php3
Early work
[97] D. Cunado, J. M. Nash, M. S. Nixon and J. N. Carter, Gait Extraction and
Description by Evidence-Gathering, Proceedings of the Second International
Conference on Audio- and Video-Based Biometric Person Authentication
AVBPA99, Washington D.C., pp 43-48, 1999
[98] D. Cunado, M. S. Nixon and J. N. Carter, Automatic Extraction and
Description of Human Gait Models for Recognition Purposes, Computer Vision
and Image Understanding, 90(1), pp 1-41, 2003
Current approaches
[99] A. Sundaresan, A. RoyChowdhury, and R. Chellappa, A Hidden Markov
Model Based Framework for Recognition of Humans from Gait Sequences,
Proceedings IEEE International Conference on Image Processing, pp. 143-50,
2003
[100] A. Kale, A. N. Rajagopalan, A. Sundaresan, N. Cuntoor, A. RoyChowdhury,
V. Kruger, and R. Chellappa, Identification of Humans using Gait, IEEE
Transactions on Image Processing, pp. 1163-1173, Sept. 2004
[101] C. BenAbdelkader, R. Cutler, H. Nanda and L. Davis, EigenGait: Motion-
Based Recognition Using Image Self-Similarity, Lecture Notes in Computer
Science (Proceedings of the Third International Conference on Audio Visual
Biometric Person Authentication) 2091, pp 289-294, 2001
[102] C. BenAbdelkader, R. Cutler and L. Davis, Stride and Cadence as a Biometric
in Automatic Person Identification and Verification, Proceedings IEEE Face
and Gesture Recognition, pp. 372-377, 2002
[103] C. BenAbdelkader, R. Cutler and L. Davis, Person identification using
automatic height and stride estimation, Proc. of Intl. Conf. on Pattern
Recognition, Quebec, Canada, 2002
[104] C. BenAbdelkader and L. Davis, Detection of load-carrying people for gait
and activity recognition, in Proc. of Intl. Conf. on Automatic Face and Gesture
Recognition, Washington, DC, USA, 2002
[105] Z. Liu and S. Sarkar, Simplest Representation Yet for Gait Recognition:
Averaged Silhouette, Proceedings 17th International Conference on Pattern
Recognition, 2004
[106] A. F. Bobick and A. Y. Johnson, Gait Recognition Using Static, Activity-
Specific Parameters, Proceedings IEEE Computer Vision and Pattern
Recognition 2001, 1, pp 423-430, 2001
[107] H. Moon, R. Chellappa, and A. Rosenfeld, 3D Object Tracking Using Shape-
Encoded Particle Propagation, Proceedings International Conference on
Computer Vision, II, pp 307-314, 2001.
[108] T. Yamamoto and R. Chellappa, Shape and Motion Driven Particle Filtering
for Human Body Tracking, Proc. Intl. Conf. on Multimedia and Expo, 3, pp. 61-
64, July 2003.
[109] A. Sundaresan, A. Roy Chowdhury and R. Chellappa, Multiple View
Tracking of Human Motion Modelled by Kinematic Chains, Proc. Int. Conf. on
Image Processing, October 2004
[110] R. Collins, R. Gross and J. Shi, Silhouette-based Human Identification from
Body Shape and Gait, Proceedings of the IEEE International Conference Face
and Gesture Recognition '02, pp 366-371, 2002
[111] Y. Liu, R. Collins, and Y. Tsin, Gait Sequence Analysis using Frieze Patterns,
Proc. European Conference on Computer Vision, May 2002
[112] J. P. Foster, M. S. Nixon, and A. Prugel-Bennett, Automatic Gait Recognition
using Area-Based Metrics, Pattern Recognition Letters, 24, pp 2489-2497, 2003
[113] J. B. Hayfron-Acquah, M. S. Nixon and J. N. Carter, Human Identification by
Spatiotemporal Symmetry, Proceedings 16th International Conference on
Pattern Recognition, 1, pp 632-635, 2002
[114] J. D. Shutler, and M. S. Nixon, Zernike Velocity Moments for Description
and Recognition of Moving Shapes, Proceedings of the British Machine Vision
Conference 2001, pp 705-714, 2001
[115] C-Y. Yam, M. S. Nixon and J. N. Carter, Automated Person Recognition by
Walking and Running via Model-Based Approaches, Pattern Recognition,
37(5), pp 1057-1072, 2004
[116] D. K. Wagg, and M. S. Nixon, On Automated Model-Based Extraction and
Analysis of Gait, Proceedings of the IEEE International Conference Face and
Gesture Recognition '04, 2004
[117] L. Wang, T. Tan, H. Ning, and W. Hu, Fusion of Static and Dynamic Body
Biometrics for Gait Recognition, IEEE Transactions on Circuits and Systems for
Video Technology Special Issue on Image- and Video-Based Biometrics, 14(2),
pp. 149-158, 2004
[118] R. Zhang, C. Vogler, D. Metaxas, Human Gait Recognition, Proceedings
IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[119] L. Lee and W. E. L. Grimson, Gait Analysis for Recognition and
Classification, Proceedings of the IEEE International Conference Face and
Gesture Recognition '02, pp 155-162, 2002
[120] E. Tassone, G. West and S. Venkatesh, Temporal PDMs for Gait
Classification, Proceedings 16th International Conference on Pattern
Recognition, pp 1065-1069, 2002
[121] L. Wang, T. N. Tan, W. M. Hu, and H. Z. Ning, Automatic Gait Recognition
Based on Statistical Shape Analysis, IEEE Transactions on Image Processing,
12(9), pp 1120-1131, 2003
[122] L. Wang, T. Tan, H. Z. Ning, and W. M. Hu, Silhouette Analysis-Based Gait
Recognition for Human Identification, IEEE Transactions Pattern Analysis and
Machine Intelligence, 25(12), pp 1505-1518, 2003
[123] B. Bhanu and J. Han, Individual Recognition by Kinematic-Based Gait
Analysis, Proceedings 16th International Conference on Pattern Recognition, 3,
pp 343-346, 2002
[124] B. Bhanu and J. Han, Human Recognition on Combining Kinematic and
Stationary Features, Lecture Notes in Computer Science (Proceedings of the
Fourth International Conference on Audio Visual Biometric Person
Authentication) 2688, pp 600-608, 2003
[125] J. Han and B. Bhanu, Statistical Feature Fusion for Gait-based Human
Recognition, Proc. IEEE Computer Vision and Pattern Recognition 2004,
Washington, 2004
[126] G. Zhao, R. Chen, G. Liu, L. Hua, Amplitude Spectrum-based Gait
Recognition, Proceedings of the IEEE International Conference Face and
Gesture Recognition '04, pp 23-28, 2004
[127] C-S. Lee, A. Elgammal, Gait Style and Gait Content: Bilinear Models for Gait
Recognition using Gait Re-Sampling. Proceedings of the IEEE International
Conference Face and Gesture Recognition '04, pp 147-152, 2004
[128] T. Kobayashi, and N. Otsu, Action and Simultaneous Multiple-Person
Identification using Cubic Higher-order Local Auto-Correlation, Proceedings
17th International Conference on Pattern Recognition, 2004
[129] I. R. Vega, and S. Sarkar, Statistical Motion Model Based on the Change of
Feature Relationships: Human Gait-Based Recognition, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 25(10), pp. 1323-1328, 2003
[130] J. E. Boyd, Synchronization of Oscillations for Machine Perception of Gaits,
Computer Vision and Image Understanding, 96, 35-59, 2004
[131] M. Hild, Estimation of 3D Motion Trajectory and Velocity from Monocular
Image Sequences in the Context of Human Gait Recognition, Proceedings 17th
International Conference on Pattern Recognition, 2004
[132] J. W. Davis, Visual categorization of children and adult walking styles, in
Proc. of Intl. Conf. on Audio- and Video-based Biometric Person
Authentication, 2001, pp. 295-300
[133] J. W. Davis, S. R. Taylor, Analysis and Recognition of Walking Movements,
Proceedings 16th International Conference on Pattern Recognition, 1, pp 315-
319, 2002
[134] J. P. Foster, M. S. Nixon, and A. Prugel-Bennett, Recognising Movement and
Gait by Masking Functions, Lecture Notes in Computer Science (Proceedings of
the Third International Conference on Audio Visual Biometric Person
Authentication) 2091, pp 278-283, 2001
[135] J. B. Hayfron-Acquah, M. S. Nixon, J. N. Carter, Automatic Gait Recognition
by Symmetry Analysis, Lecture Notes in Computer Science (Proceedings of the
Third International Conference on Audio Visual Biometric Person
Authentication) 2091, pp 312-317, 2001
[136] J. B. Hayfron-Acquah, M. S. Nixon, and J. N. Carter, Recognising Human
and Animal Movement by Symmetry, Proceedings of the IEEE International
Conference on Image Processing ICIP '01, Thessaloniki, pp 290-293, 2001
[137] J. B. Hayfron-Acquah, M. S. Nixon and J. N. Carter, Automatic Gait
Recognition by Symmetry Analysis, Pattern Recognition Letters, 24(13), pp
2175-2183, 2003
[138] J. D. Shutler, Zernike Velocity Moments for Holistic Shape Description of
Moving Features, PhD Thesis, University of Southampton, 2002
[139] C. Y. Yam, M. S. Nixon, and J. N. Carter, Extended Model-Based Automatic
Gait Recognition of Walking and Running, Lecture Notes in Computer Science
(Proceedings of the Third International Conference on Audio Visual Biometric
Person Authentication), 2091 , pp 272-277, 2001
[140] C-Y. Yam, M. S. Nixon and J. N. Carter, Gait Recognition by Walking and
Running: a Model-Based Approach, Proceedings of the Asian Conference on
Computer Vision ACCV 2002, pp 1-6, 2002
[141] C-Y. Yam, M. S. Nixon and J. N. Carter, On the Relationship of Human
Walking and Running: Automatic Person Identification by Gait, Proceedings
16th International Conference on Pattern Recognition, 1, pp 287-290, 2002
[142] A. Y. Johnson and A. F. Bobick, A Multi-View Method for Gait Recognition
Using Static Body Parameters, Lecture Notes in Computer Science (Proceedings
of the Third International Conference on Audio Visual Biometric Person
Authentication), 2091, pp 301-311, 2001
[143] A. Y. Johnson, J. Sun and A. F. Bobick, Predicting Large Population Data
Cumulative Match Characteristic Performance from Small Population Data,
Lecture Notes in Computer Science (Proceedings of the Fourth International
Conference on Audio Visual Biometric Person Authentication), 2688, pp 821-
829, 2003
[144] A. Kale, N. Cuntoor, B. Yegnanarayana, A. Rajagopalan, and R. Chellappa,
Gait Analysis for Human Identification, Lecture Notes in Computer Science
(Proceedings of the Fourth International Conference on Audio Visual Biometric
Person Authentication) 2688, pp 706-714, 2003
[145] A. Kale, A. N. Rajagopalan, N. Cuntoor, and V. Kruger, Identification of
Humans using Gait, Proceedings of the IEEE International Conference Face
and Gesture Recognition '02, pp 366-371, 2002
Further Analysis
[146] J. E. Boyd and J. Little, Biometric Gait Recognition, Proc. Summer School on
Biometrics, forthcoming as Lecture Notes in Computer Science, 2003
[147] G. Veres, J. N. Carter and M. S. Nixon, What image information is important
in silhouette-based gait recognition, Proceedings IEEE Computer Vision and
Pattern Recognition 2004, Washington, July 2004
[148] A. Veeraraghavan, R. Chellappa and A. Roy Chowdhury, Role of Shape and
Kinematics in Human Movement Analysis, Proceedings IEEE Computer Vision
and Pattern Recognition 2004, Washington, July 2004
[149] Z. Liu, L. Malave, A. Osuntogun, P. Sudhakar, and S. Sarkar, Towards
Understanding the Limits of Gait Recognition, Proc. SPIE International
Symposium on Defense and Security: Biometric Technology for Human
Identification, pp 195-205, April 2004
[150] Z. Liu, L. Malave, and S. Sarkar, Studies on Silhouette Quality and Gait
Recognition, Proceedings IEEE Computer Vision and Pattern Recognition
2004, Washington, July 2004
[151] D. Tolliver and R. Collins, Gait shape estimation for identification, Lecture
Notes in Computer Science (Proceedings of the Fourth International Conference
on Audio Visual Biometric Person Authentication) 2688, pp 734-742, 2003
[152] M. S. Nixon, J. N. Carter, J. D. Shutler and M. G. Grant, Automatic
Recognition by Gait: Progress and Prospects, Sensor Review, 23(4), pp 323-331,
2003
[153] L. Lee, G. Dalley and K. Tieu, Learning pedestrian models for silhouette
refinement, Proceedings 9th International Conference on Computer Vision, pp
663-670, 2003
[154] S. D. Mowbray and M. S. Nixon, Automatic Gait Recognition via Fourier
Descriptors of Deformable Objects, Lecture Notes in Computer Science
(Proceedings of the Fourth International Conference on Audio Visual Biometric
Person Authentication), 2688, pp 566-573, June 2003
[155] S. D. Mowbray and M. S. Nixon, Extraction and Recognition of Periodically
Deforming Objects by Continuous, Spatio-temporal Shape Description, Proc.
IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[156] J. Zhang, R. Collins, and Y. Liu, Representation and Matching of Articulated
Shapes, Proc. IEEE Computer Vision and Pattern Recognition 2004,
Washington, July 2004
[157] D. K. Wagg, and M. S. Nixon, Automated Markerless Extraction of Walking
People Using Deformable Contour Models, Computer Animation and Virtual
Worlds 15(3-4), pp. 399-406, 2004
[158] R. Urtasun and P. Fua, 3D Tracking for Gait Characterization and
Recognition, Proceedings of the IEEE International Conference Face and
Gesture Recognition '04, 2004
[159] J-H. Yoo, M. S. Nixon and C. J. Harris, Model-Driven Statistical Analysis of
Human Gait Motion, Proceedings IEEE International Conference on Image
Processing, pp 285-288, 2002
[160] V. J. Laxmi, J. N. Carter and R. I. Damper, Support Vector Machines and
Human Gait Classification, Proceedings IEEE Workshop Automatic
Identification Advanced Technologies (AutoID '02), pp 17-22, 2002
[161] R. Tanawongsuwan and A. Bobick, Gait Recognition from Time-normalized
Joint-Angle Trajectories in the Walking Plane, Proceedings IEEE Computer
Vision and Pattern Recognition 2001, II, pp 726-731, 2001
[162] R. Tanawongsuwan, and A. F. Bobick, Performance Analysis of Time-
Distance Gait Parameters under Different Speeds, Lecture Notes in Computer
Science (Proceedings of the Fourth International Conference on Audio Visual
Biometric Person Authentication) 2688, pp 715-724, 2003
[163] R. Tanawongsuwan and A. Bobick, Modeling the Effects of Walking Speed
on Appearance-based Gait Recognition, Proceedings IEEE Computer Vision
and Pattern Recognition 2004, Washington, July 2004
[164] J. W. Davis, Visual categorization of children and adult walking styles, in
Proc. of Intl. Conf. on Audio- and Video-based Biometric Person
Authentication, 2001, pp. 295-300
[165] G. V. Veres, M. S. Nixon, and J. N. Carter, Modelling the time-variant
covariates for gait recognition, Proc. AVBPA 2005, Springer, 2005
[166] J. N. Carter and M. S. Nixon, On Measuring Gait Signatures which are
Invariant to their Trajectory, Measurement and Control, 32(9), pp 265-269,
1999
[167] N. Spencer and J. N. Carter, Viewpoint Invariance in Automatic Gait
Recognition, Proceedings IEEE Workshop Automatic Identification Advanced
Technologies (AutoID '02), pp 1-6, 2002
[168] N. Spencer and J. N. Carter, Pose Invariant Gait Reconstruction, Proc. ICIP
2005
[169] A. Kale, A. K. R. Chowdhury, and R. Chellappa, Towards a View Invariant
Gait Recognition Algorithm, Proceedings Advanced Video and Signal Based
Surveillance, pp 143-50, 2003
[170] A. Kale, A. Roy-Chowdhury, and R. Chellappa, Gait-based human
identification from a monocular video sequence, in Handbook on Pattern
Recognition and Computer Vision (3rd Edition), C. H. Chen and P. S. P. Wang,
Eds., World Scientific Publishing Company Pvt. Ltd., in press
[171] G. Shakhnarovich, L. Lee, and T. Darrell, Integrated face and gait recognition
from multiple views, in Proc. Conf. on Computer Vision and Pattern
Recognition, 1, pp. 439-446, 2001
[172] S. Zhou and R. Chellappa, Probabilistic human recognition from video, Proc.
of European Conference on Computer Vision, 2002
[173] C. BenAbdelkader, R. Cutler, and L. Davis, View-invariant Estimation of
Height and Stride for Gait Recognition, Lecture Notes in Computer Science,
2359, pp 155-159, 2002
[174] K. J. Sharman, M. S. Nixon and J. N. Carter, Extraction and Description of
3D (Articulated) Moving Objects, Proc. 3D Data Processing Visualization and
Transmission, pp 664-667, 2002
[175] S. P. Prismall, M. S. Nixon and J. N. Carter, On Moving Object Reconstruction
by Moments, Proceedings of the British Machine Vision Conference 2002,
pp 73-82, 2002
[176] S. P. Prismall, M. S. Nixon, and J. N. Carter, Novel Temporal Views of
Moving Objects for Gait Biometrics, Lecture Notes in Computer Science
(Proceedings of the Fourth International Conference on Audio Visual Biometric
Person Authentication) 2688, pp 725-733, Guildford (UK), 2003
[177] B. Bhanu, and J. Han, Kinematic-based human motion analysis in infrared
sequences, Proceedings of the Sixth IEEE Workshop on Applications of
Computer Vision, pp 208-212, Orlando (USA), 2002
[178] B. Bhanu and J. Han, Bayesian based performance prediction for gait
recognition, in Proc. of Workshop on Motion and Video Computing
(MOTION'02), Orlando, Florida, 2002
[179] Q. Jiang, C. Daniell, Recognition of Human and Animal Movement Using
Infrared Video Streams, Proceedings IEEE International Conference on Image
Processing 2004, 2004
[180] L. Wang, T. Tan, H. Ning, and W. Hu, Fusion of Static and Dynamic Body
Biometrics for Gait Recognition, IEEE Transactions on Circuits and Systems
for Video Technology Special Issue on Image- and Video-Based Biometrics, 14(2),
pp 149-158, 2004
[181] H. Ning, T. Tan, L. Wang and W. Hu, People tracking based on motion model
and motion constraints with automatic initialization, Pattern Recognition
(accepted)
[182] G. Shakhnarovich, and T. Darrell, On Probabilistic Combination of Face and
Gait Cues for Identification, Proceedings of the IEEE International Conference
Face and Gesture Recognition '02, pp 176-181, Washington (USA), 2002
[183] P. C. Cattin, D. Zlatnik, R. Borer, Sensor Fusion for a Biometric System
using Gait, Proc. International Conference on Multisensor Fusion and
Integration for Intelligent Systems, pp 233-238, 2001
[184] N. Cuntoor, A. Kale, and R. Chellappa, Combining Multiple Evidences for
Gait Recognition, Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, 3, pp 6-10, 2003.
[185] D. Wagg, and M. S. Nixon, Model-Based Gait Enrolment in Real-World
Imagery, Proceedings 2003 Workshop on Multimodal User Authentication
MMUA, pp 189-195, Santa Barbara (USA), 2003
[186] C-Y. Yam, M. S. Nixon and J. N. Carter, Automated Markerless Analysis of
Human Walking and Running by Computer Vision, Proceedings of the World
Congress on Biomechanics, Calgary (Canada) 2002
[187] J-H. Yoo and M. S. Nixon, Markerless Human Gait Analysis via Image
Sequences, Proceedings International Society of Biomechanics XIXth Congress,
Dunedin NZ, 2003
Other Related Work
General
[198] L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models,
IEEE ASSP Magazine, January, pp. 4-16, 1986
[199] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall,
1993
[200] L. R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Proceedings of the IEEE, 77(2), pp. 257-285, February
1989
[201] B. H. Juang, On the Hidden Markov Model and Dynamic Time Warping for
Speech Recognition - a Unified View, AT&T Bell Laboratories Technical
Journal, vol. 63, pp. 1213-1243, 1984
[202] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, On combining classifiers,
IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 226-239, March
1998
[203] R. Brunelli and D. Falavigna, Person identification using multiple cues, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 17(10): 955-966, 1995
[204] L. Hong and A. Jain, Integrating faces and fingerprints for personal
identification, IEEE Trans. on Pattern Analysis and Machine Intelligence,
20(12): 1295-1307, 1998
[205] B. Achermann and H. Bunke, Combination of classifiers on the decision level
for face recognition, Technical Report IAM-96-002, University Bern, 1996
[206] I. L. Dryden and K. V. Mardia, Statistical Shape Analysis, John Wiley and
Sons, 1998
[207] P. V. Overschee and B. D. Moor, Subspace algorithms for the stochastic
identification problem, Automatica, vol. 29, pp. 649-660, 1993
[208] S. Soatto, G. Doretto, and Y. N. Wu, Dynamic textures, Proc. of International
Conf. on Computer Vision, 2, pp. 439-446, 2001
[209] G. Golub and C. V. Loan, Matrix Computations, The Johns Hopkins
University Press, Baltimore, 1989.
[210] K. De Cock and B. De Moor, Subspace angles and distances between ARMA
models, Proc. of the Intl. Symp. of Math. Theory of Networks and Systems, 2000
[211] M. Isard and A. Blake, Contour tracking by stochastic propagation of
conditional density, Proc. of European Conference on Computer Vision, 1, pp.
343-356, 1996
[212] V. Nalwa, A Guided Tour of Computer Vision, Addison-Wesley, 1993
[213] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision,
Cambridge University Press, 2000
[214] J. Kent, New directions in shape analysis, The art of statistical science: a
tribute to G. S. Watson, pp. 115-127, Wiley, Chichester, 1992
[215] Y. Yang and M. Levine, The background primal sketch: an approach for
tracking moving objects, Machine Vision and Applications, 5: 17-34, 1992
[216] Y. Kuno, T. Watanabe, Y. Shimosakoda and S. Nakagawa, Automated
detection of human for visual surveillance system, in Proc. of Intl. Conf. on
Pattern Recognition, pp. 865-869, 1996
[217] I. Haritaoglu, D. Harwood and L. Davis, W4: real-time surveillance of people
and their activities, IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(8): 809-830, 2000
[218] J. Phillips, H. Moon, S. Rizvi and P. Rauss, The FERET evaluation
methodology for face recognition algorithms, IEEE Trans. on Pattern Analysis
and Machine Intelligence, 22(10): 1090-1104, 2000
[219] M. Isard and A. Blake, Condensation - conditional density propagation for
visual tracking, International Journal of Computer Vision, 29(1): 5-28, 1998
[220] S. Wachter and H. Nagel, Tracking persons in monocular image sequences,
Computer Vision and Image Understanding, 74(3): 174-192, 1999
[221] C. Sminchisescu and B. Triggs, Covariance scaled sampling for monocular 3D body
tracking, Proc. of Intl. Conf. on Computer Vision and Pattern Recognition, 2001
9 Appendices
Session Number ________
Checklist and general points for each 1 hour data grab session
Please read carefully, then tick each corresponding box once the check has been
completed. The session comments form (located at the end of this checklist)
needs to be filled in at the end of each session. All the completed forms then
need to be stored for future reference.
• Switch the cameras on at the beginning of the session, leave all of them
recording until the end of the session.
• Try not to talk to the subjects as they walk, as they will look in your
direction.
• When subjects use the treadmill, make sure they use the safety clip.
• When subjects are using the track, don't have anyone in the main lab,
as this encourages them to talk, and makes them feel more
self-conscious.
• Don't let the subjects use the treadmill until they have walked outside
and inside on the track, as it may temporarily affect their walk.
• Ensure that no subjects are left unattended while they are in the lab, or
around any of the equipment.
• Ensure that someone is with the equipment outside at all times
(cameras Emily and Felix).
• Subjects need to write the session number and subject number on the
back of their questionnaire (in a thick black marker), which they then
need to hold up in front of the camera for each batch of walking
(outside, inside track and treadmill).
• If any of the cameras auto-turn off, or get knocked once they have
been set up, they must be re-checked (all settings), especially
exposure, focus and zoom.
• There is a first aid kit in the main lab, on the lowest shelf by the
cameras, in a green pouch.
Prior to a session beginning
□ Start this checklist (at least) 45 minutes before the subjects are due to
arrive, it may take some time.
□ Cone off the area behind the cameras (outside) to allow
pedestrians to walk behind the cameras. (Do this the night before?)
□ Switch on all lights, starting with the light switch on the right as
you enter the lab (wall lighting for the main track backdrop). There is
no need to use the ceiling fluorescent lights.
□ There is no need to use the light switches on the
lights/treadmill/camera, use the wall plugs for everything. All wall
plugs that are needed are outlined in green tape, all should be
switched on 15 mins before the start of the session. In addition there
is a set behind the right hand side of the treadmill backdrop cloth and
a single light switch for the left hand side track backdrop light
(located on the wall on the left as you walk into the main lab).
□ Check for evidence of any light stands or camera tripods having
moved.
□ Plug in the laptops and cameras into the mains (each camera has a
specific position, labelled on the floor directly below each tripod).
□ Switch on the radio.
□ Go through each of the camera set-up sheets.
□ Load labelled tapes into each camera (camera id, session number
and date).
Refer to the Tape labelling scheme.
□ Confirm on the camera set-up sheets that the tapes have been
loaded.
□ Do you have a copy of the instructions to subjects?
Walking track
□ Clean the track floor with a brush, to remove anything that might get
ground into the carpet.
□ Ensure the canon camera (Amy) is normal to the track, not the
screen.
□ Turn off the air conditioning.
□ Ensure that the green material on the floor is flat with no ripples.
If you need to walk on the material to do this, then remove your shoes
before doing so as they will leave foot prints.
□ Re-tension the background cloth, if needed.
□ Ensure background & subject lighting don't interfere.
□ Walk along track twice looking for shadows on ground or background.
□ Ensure no light is leaking out of the edges of the lights bouncing
off of the ceiling. There should be pieces of card placed in the shutters
to stop this. You should not be able to see the lights themselves as you
walk along the track (both directions).
Treadmill
□ Lubricate both treadmills (training and filming).
□ Ensure the canon camera (Bettie) is normal to the treadmill.
□ Re-tension the background cloth, if needed.
□ Hide the treadmill display.
□ Ensure the mirror is in front of the treadmill.
□ Check for direct shadows on belt.
□ Ensure smooth join of background floor & vertical cloth.
Room
□ Ensure that the air conditioning is off - you may have to re-adjust the
backdrops if it is not.
□ Ensure that all of the fluorescent lights are off.
Indoor Tripods
□ Ensure that all the tripods are located correctly (triangular gaffer tape
markers on the floor).
Outdoor Tripods
□ Check that the spikes at the ends of the legs of both tripods are just
below the black feet, i.e. at the point where they disappear from view,
when viewing across the end of the black foot.
□ Min height on top leg extension.
□ Max height on bottom leg extension.
□ Central height extension at minimum.
□ All tripods located with feet and centres lined up with painted
markers on the tarmac.
□ Mounting heads attached to cameras with "lens Λ" marker aligned
along camera axis (i.e. pointing at lens!)
□ Level on both planes (check spirit levels on tripod head with the
cameras mounted).
□ Check both cameras are aimed at the centre point (marked with a
cross) of the walking area.
At the end of each session
□ Before the subjects leave, check that the correct number of subject
consent forms and questionnaires are present.
□ Remove tapes, and set to write protect; ensure they are labelled, if
not label them, see the tape labelling scheme.
□ Turn off all wall power points (lights, cameras, laptop etc).
□ Remove cameras and place back in the storage cupboard in the
office, or on charge (cameras Emily and Felix).
□ Fill in session comments form (see below), making a note of any
problems.
Appendix 9.1.2 Camera Set-up Forms
Camera set-ups
Session Number 0
Please tick each corresponding box once the check has been completed. Note:
Not all the cameras have the same settings!
□ Ensure that the mounting head is attached to the camera with "lens Λ"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Plug into the mains.
□ Lens cap off.
□ Put into progressive scan mode ('P Scan' mode).
□ Check PRO.SCAN appears on the screen.
□ Shutter speed 1/250sec (menu).
□ Digital zoom off (menu).
□ Image stabiliser off (menu).
□ White balance set to "indoor" (menu), appears as a light bulb on view
screen.
□ Exp lock (exp button on back of camera) set to +/-0.
□ Check view is fully zoomed out.
□ Auto focus off - button at back ("focus") near power switch. Press &
adjust focus using dial.
□ Set the manual focus. When setting the manual focus, it is easier to auto
focus onto someone (i.e. a set-up person) and then switch to manual mode.
Probably more accurate than by hand...
□ PRO.SCAN
□ 1/250
□ Light bulb (white balance)
□ M.Focus
□ SP
□ E.Lock +/- 0 (Exposure setting)
□ Ensure that the mounting head is attached to the camera with "lens Λ"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Plug into the mains.
□ Lens cap off.
□ Interlaced mode ('Camera' mode).
□ Check view is fully zoomed out.
□ Progressive scan off (menu).
□ Auto shutter off (menu).
□ Digital zoom off (menu).
□ Steady shot off (menu).
□ 0dB gain (menu).
□ 1/300sec shutter.
□ Aperture F2.8 (set on screen).
□ White balance "indoor" (hit white balance button on back of camera,
spin dial until a light bulb icon appears on view screen).
□ Set the manual focus. Auto focus switch off at front near lens. It is
easier to auto focus onto someone (i.e. a set-up person) and then switch to
manual mode. Probably more accurate than by hand...
□ 300
□ F2.8
□ 0dB gain
□ SP
□ Light bulb (white balance)
□ Hand with an 'F' over it (manual focus).
□ Hand with an 'OFF' over it (steady shot)
Outside Cameras: Note settings are different!
Camera E (Emily) - Canon
□ Ensure that the mounting head is attached to the camera with the "lens A"
marker aligned along the camera axis (i.e. pointing at the lens!). There should
be a marked line to help you check the alignment.
□ Lens cap off.
□ Check the battery is fully charged; if not, swap the battery with one from
the other cameras.
□ Put into progressive scan mode ('P Scan' mode).
□ Check PRO.SCAN appears on the screen.
□ Shutter speed 1/250 sec (menu).
□ Digital zoom off (menu).
□ Image stabiliser off (menu).
□ White balance set to "outdoor" (menu); appears as a sun on the view screen.
□ Auto exposure.
□ Check the view is fully zoomed out.
□ Auto focus off ("focus" button at the back, near the power switch). Press and
adjust the focus using the dial.
□ Set the manual focus. When setting the manual focus, it is easier to auto
focus onto someone (i.e. a set-up person) and then switch to manual mode.
This is probably more accurate than focusing by hand.
Check that the following appear on the view screen:
□ PRO.SCAN
□ 1/250
□ Sun (white balance)
□ M.Focus
□ SP
□ Auto (exposure, i.e. no E.Lock)
Camera F - Sony
□ Ensure that the mounting head is attached to the camera with the "lens A"
marker aligned along the camera axis (i.e. pointing at the lens!). There should
be a marked line to help you check the alignment.
□ Ensure you have the long-life battery.
□ Lens cap off.
□ Interlaced mode ('Camera' mode).
□ Check the view is fully zoomed out.
□ Progressive scan off (menu).
□ Auto shutter off (menu).
□ Digital zoom off (menu).
□ Steady shot off (menu).
□ 0 dB gain (menu).
□ 1/300 sec shutter.
□ Auto exposure.
□ White balance "outdoor" (hit the white balance button on the back of the
camera and spin the dial until a sun icon appears on the view screen).
□ Set the manual focus. The auto focus switch is at the front, near the lens;
switch it off. It is easier to auto focus onto someone (i.e. a set-up person)
and then switch to manual mode. This is probably more accurate than focusing
by hand.
Check that the following appear on the view screen:
□ 300
□ Auto exposure (i.e. no 'F')
□ SP
□ Sun (white balance)
□ Hand with an 'F' over it (manual focus)
□ Hand with an 'OFF' over it (steady shot)
Return to the checklist.
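The four checklists above boil down to two shared camera profiles (Canon and Sony) with indoor/outdoor variations. As a minimal sketch of how a set-up script might cross-check them, the settings can be encoded as plain data; the dictionary layout and key names below are our own illustration, not part of the protocol.

```python
# Hypothetical encoding of the four camera profiles from the checklists.
# Values are taken from the lists above; structure and names are ours.

CANON_COMMON = {
    "scan": "progressive",       # 'P Scan' mode; PRO.SCAN on screen
    "shutter": "1/250",
    "digital_zoom": "off",
    "image_stabiliser": "off",
    "focus": "manual",
}

SONY_COMMON = {
    "scan": "interlaced",        # 'Camera' mode; progressive scan off
    "shutter": "1/300",
    "gain": "0 dB",
    "digital_zoom": "off",
    "steady_shot": "off",
    "focus": "manual",
}

def profile(common, **overrides):
    """Combine shared settings with the per-location ones."""
    p = dict(common)
    p.update(overrides)
    return p

CAMERAS = {
    "canon inside":  profile(CANON_COMMON, white_balance="indoor",
                             exposure="E.Lock +/-0"),
    "canon outside": profile(CANON_COMMON, white_balance="outdoor",
                             exposure="auto"),
    "sony inside":   profile(SONY_COMMON, white_balance="indoor",
                             aperture="F2.8"),
    "sony outside":  profile(SONY_COMMON, white_balance="outdoor",
                             exposure="auto"),
}

# Inside and outside profiles for the same camera should differ only in
# white balance and exposure/aperture, as the checklists require.
for make in ("canon", "sony"):
    inside = CAMERAS[f"{make} inside"]
    outside = CAMERAS[f"{make} outside"]
    diff = {k for k in inside.keys() | outside.keys()
            if inside.get(k) != outside.get(k)}
    assert diff <= {"white_balance", "exposure", "aperture"}, diff
```

Keeping the common settings in one place means a change (e.g. a new shutter speed) only has to be made once for both the inside and outside checklists.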
□ Camera E shows the loaded tape symbol.
□ Camera F shows the loaded tape symbol.
• Keep an eye on the battery levels, especially the Sony camera. Use the
viewfinders, not the LCD displays.
Appendix 9.1.3 Session Coordinator's Instructions
Instructions to Subjects:
General
• Ask the subjects to only walk within the grey/green dotted lines.
• Ask them not to run anywhere.
• Then take them into the lab and explain what we would like them to
do.
• Don't let them use the treadmill until they have walked outside and
inside on the track, as it may temporarily affect their walk.
• Take them back into the coffee room/side office and ask them to fill
in the consent and questionnaire forms prior to the data grab, issuing each
subject with a subject and session number, both of which they need to
record on the questionnaire.
• Ask them to write the session number and their subject number in
large numbers on the back of their questionnaire (using a thick black
marker). This enables them to hold it up for the still shots .
• Then ask them, one at a time, to walk outside (weather permitting) and
on the track inside.
• Train and get them used to the training treadmill in the side
lab/storage room and then film them on the treadmill in the main lab.
• Encourage the subjects who are not walking to sit down on the chairs
provided.
• Remind the subjects to only walk within the grey/green dotted lines .
Treadmill
• Ask them to look in the mirror when walking.
• They can stop the treadmill by pulling out the safety tag, or by hitting
stop .
• Before they start, you will want to take side and frontal pictures using
the still camera.
• When subjects have finished on the treadmill, replace the yellow cord
and safety clip with the single red tab to enable background data
collection.
Inside
• They need to walk around the loop at least 8 times (8 times each
way).
• Walk on the green track and follow the dotted grey lines at either end
of the straight track .
• Keep count of the number of times they have walked past the
cameras.
Outside
• The subjects need to walk enough to give 4 good runs across the track with
no interference, so they may need to walk for longer, depending on
how busy it is. A rule of thumb is 8 times each way, 16 in total.
• Don't forget the background data.
• Keep count of the number of times they have walked past the
cameras.
Back to the main checklist (HTML link).
Appendix 9.1.4 Subject Information Form
For the purposes of quantifying the gait data we are collecting, we would like
to record personal data that might affect your gait. Please note that neither
your name nor your identity will ever be associated with any data we collect.
Please do not feel compelled to answer any question, especially if you consider
it to be too personal in nature.
A. Gait is influenced by gender: women swing their hips more, and men their
shoulders.
Are you taking any medication that might affect your gait? YES / NO
Appendix 9.1.5 Subject Consent Form
Signature: ______________________  Date: ____________
Witness:   ______________________  Date: ____________
Index
3D 47,114-131,134-142,151
Age 11,22,155
Alcohol 11
Alzheimer's 11
Analysis
    Covariate analysis 10,46,64-66
    Marker-based 6,10,20,108
Ancillary data 32-34
Animation 15,160
Anova 55,64,105
Area measures 49-51
Aristotle 6
ARMA model 98-100,104-105
Autoregressive model 94-98
Background subtraction 8,23,27,67,153
Bilateral symmetry 8,51
Biomechanics 9-10
Biometrics 1-3
Body parameters 132
Borelli 6
Cadence 10,33,46,107
California (Riverside) 47-48
California (San Diego) see UCSD
Canonical
    Transformation 36-38,51,55
    View 135-142,155
Carnegie Mellon University see CMU
CAS Institute of Automation (Beijing)
    Database 33,78-79,125-131
    Silhouette-based recognition 47,65-106
    Model-based recognition 108,115-131
Chinese Academy of Sciences CASIA see CAS Institute of Automation
Chromakey 23-27
Clothing 11,154
CMU 3,47
    Approach 47
    Database 3,20,61,81,100,104,134
Confusion matrix 43,105
Coupled oscillator model 109-110
Covariates 10-11,21
    Analysis 10,48,64-66,152
Cycle 7,50,52,54,110
DARPA 2
Database 17-34
    Laboratory 18,24-25
    Maryland (UMD) 34,105
    CAS Institute of Automation CASIA 33,78-79,125-131
    CMU 3,20,61,81,100,104,134
    NIST 21-23,47-49,89-90,101-105,134
    Outdoor 17,20,27-29,33-34
    South Florida (USF) 21-23,47-49,89-90,101-105,134
    Southampton 22,33,56,65,116
    UCSD 17-18
Dictionary definition 5
Disorders 11
Distance 1,21-23,61,154
Double support 7
Double float 9
DTW 96-98,102-104
Dynamical model 98-100,104-105
Dynamic Time Warping see DTW
Eigenspace transformation EST 36-37,47,71-82
Ellipsoidal fit 48
Energy image 49
Face recognition 1-2,10,142
Feature selection 46,64
    see Anova
Float 9
Fusion 48,58,102,120,141-147
Gait change
    Age 11,22,152
    Alcohol 11
    Clothes 11,154
    Luggage 11,151
    Mood 11
    Shoes 11
    Time 11,151
Gait challenge 20-22,48,87-88,145-147
Gait disorder 11
Gait model 7,13-15,18-20,36,107-108
    ARMA 98-100,104-105
    Autoregressive 94-98
    Coupled oscillator 109-110
    Dynamical 98-100,104-105
    Hip 39-43
    Human body 13-15,109-131
    Kinematics 114-132
    Running 5,9,46,109-111
    Structural 133
    Walking 5,8,39-42,111-114
Galileo 6
GaTech 3,133
Gender 12,21,47,57
Georgia Institute of Technology see GaTech
Heel strike 6,9,24,32,52
Hidden Markov Model see HMM
Hip model 39-43
HMM 46,89-93,100,108
Human body model 13-15,109-131
Human ID at a Distance 3,21
k nearest neighbor 62,70,78,87,110
Kinematics 49,97-98,114-132
Knee 9
Laboratory database 18,24-25
LDA 36-37,51,55
Least squares 40
Least median squares 67
Literature 5-6
Linear Discriminant Analysis see LDA
Load 11,151
Luggage 11,151
Manual labeling 22
Marker-based systems 6,10,12,20,108
Maryland see UMD
Medical studies 6-9
MIT 3,47
Model 7,13-15,18-20,36,107-108
    ARMA 98-100,104-105
    Autoregressive 94-98
    Coupled oscillator 109-110
    Dynamical 98-100,104-105
    Human body 7,13-15,107-108
    Running 109-111
    Structural 133
    Walking 39-42,111-114
Model-based recognition 109-131
Moments 36
    Zernike 54,61
    Velocity 53-54,61
Mood 11
Murray 6-9
Muybridge 6
National Institute of Standards in Technology see NIST
National Lab. of Pattern Recognition see CAS Institute of Automation
Nearest neighbor 62,70,78,87,110
NIST 3,47-48,83
    Database 21-23,47-49,89-90,101-105,134
Neurology 11
Notre Dame 3
Noise 53,61,64,85,100,123,126,150
Occlusion 14,19,25,34,61,66,123,129,147,153
Optical flow 15,37,53,59,62,137-138
Oscillators 39,109
Outdoor database 17,20,27-29,33-34
Overview of recognition approaches
    Model-based 107-108
    Silhouette-based 46-47
Parkinson's disease 11
PCA 36-37,47,71-82
Pelvis 7,9
Performance analysis 20-22,47,153
Phase 15,37,42-44,47,52,75,113,119
    Stance 6-9
    Swing 6-9
Phase locked loops 49
Phillips, Jonathon 3
Principal Components Analysis see PCA
Procrustes shape analysis 69-70,77-81
Podiatry 15
Point distribution model 48
Potency 47,57,63,107,113
Psychology 12-13
Recognition performance
    Noise 53,61,64,85,100,123,126,150
    Occlusion 14,19,25,34,61,66,123,129,147,153
    Size (distance) 1,21-23,61,154
Recognition approaches
    Model-based 107-108
    Silhouette-based 46-47
Relational statistics 47
Riverside (California) 47-48
Running 5,9,46,109-111
Rutgers University 107,133
Self similarity 47
Shakespeare 5
Shoes 11
Silhouette
    Generation 18,23,27,67,153
    Recognition 46-47
Single support 9
Similarity measure 71,75,77,85,99,103,156
Size (distance) 1,21-23,61,154
Soton see Southampton
South Florida see USF
Southampton 3
    Database 23-33
    Silhouette-based recognition 36-39,49-65
    Model-based recognition 40-45,110-115
Spatiotemporal
    Feature extraction 62,73
    Silhouette 46,83
Stance 6-9
Stride 10,82,108
    Length 47,49,86,136
    Frequency 9,75,113
Support (single- or double limb) 9
Swing 6,8,20,58,110,156
Structural model 133
Symmetry
    Bilateral 8,51
    Recognition 51,58-59
Three dimensional imaging see 3D
Time 11,155
Treadmill 10,30-31,110
Trendelenburg gait 11
UCSD
    Database 17-18,46,58
    Early work 36
USF 3,47-48,83
    Database 21-23,47-49,89-90,101-105,134
    Silhouette-based recognition 47-48
US-VISIT 2
UMD 3,46,107
    Database 34,105
    Silhouette-based recognition 89-106
Velocity moments 53-54,61
View
    Canonical 135-142,155
Viewpoint invariance 139-140
da Vinci 6
Walking 5,8,39-42,111-114
Weber 6
Zernike moments 54,61