Human Identification Based On Gait (International Series On Biometrics)

HUMAN

IDENTIFICATION
BASED ON GAIT
International Series on Biometrics
Consulting Editors

Professor David D. Zhang
Department of Computer Science
Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong
email: csdzhang@comp.polyu.edu.hk

Professor Anil K. Jain
Dept. of Computer Science & Eng.
Michigan State University
3115 Engineering Bldg.
East Lansing, MI 48824-1226, U.S.A.
Email: jain@cse.msu.edu

In our international and interconnected information society, there is an ever-growing need to authenticate and identify individuals. Biometrics-based authentication is emerging as the most reliable solution. Various biometric technologies and systems for authentication are currently either widely used or under development. The International Book Series on Biometrics will systematically introduce these related technologies and systems, presented by biometric experts who summarize their successful experience and explore how to design the corresponding systems with in-depth discussion.

In addition, this series aims to provide an international exchange for researchers, professionals, and industrial practitioners to share their knowledge of how to surf this tidal wave of information. The International Book Series on Biometrics will contain new material that describes, in a unified way, the basic concepts, theories and characteristic features of integrating and formulating the different facets of biometrics, together with its recent developments and significant applications. Different biometric experts, from the global community, are invited to write these books. Each volume will provide exhaustive information on the developments in that respective area. The International Book Series on Biometrics will provide a balanced mixture of technology, systems and applications. A comprehensive bibliography on the related subjects will also be appended for the convenience of our readers.

Additional titles in the series:


PALMPRINT AUTHENTICATION by David D. Zhang; ISBN: 1-4020-8096-4
FACIAL ANALYSIS FROM CONTINUOUS VIDEO WITH APPLICATIONS TO HUMAN-COMPUTER INTERFACE by Antonio J. Colmenarez, Ziyou Xiong and Thomas S. Huang; ISBN: 1-4020-7802-1
COMPUTATIONAL ALGORITHMS FOR FINGERPRINT RECOGNITION by Bir Bhanu and Xuejun Tan; ISBN: 1-4020-7651-7

Additional information about this series can be obtained from our website:
http://www.springeronline.com
HUMAN
IDENTIFICATION
BASED ON GAIT

by

Mark S. Nixon
University of Southampton, UK

Tieniu Tan
Chinese Academy of Sciences, Beijing, P. R. China

Rama Chellappa
University of Maryland, USA

Springer
Mark S. Nixon
School of Electronics & Computer Science
University of Southampton
UK

Tieniu Tan
Institute of Automation
Chinese Academy of Sciences
Beijing, P. R. China

Rama Chellappa
Dept. of Electrical & Computer Engineering
Center for Automation Research
University of Maryland
USA

Library of Congress Control Number: 2005934709

HUMAN IDENTIFICATION BASED ON GAIT

by Mark S. Nixon, Tieniu Tan and Rama Chellappa

ISBN-13: 978-0-387-24424-2
ISBN-10: 0-387-24424-7
e-ISBN-13: 978-0-387-29488-9
e-ISBN-10: 0-387-29488-0

Printed on acid-free paper.

© 2006 Springer Science+Business Media, Inc.


All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, Inc., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

9 8 7 6 5 4 3 2 1 SPIN 11379973, 11570790

springeronline.com
Contents

Preface vii

1 Introduction 1
1.1 Biometrics and Gait 1
1.2 Contexts 2
1.2.1 Immigration and Homeland Security 2
1.2.2 Surveillance 2
1.2.3 Human ID at a Distance (HiD) Program 3
1.3 Book Structure 3

2 Subjects Allied to Gait 5
2.1 Overview 5
2.2 Literature 5
2.3 Medicine and Biomechanics 6
2.3.1 Basic Gait Analysis 6
2.3.2 Variation in Gait Covariate Factors 10
2.4 Psychology 12
2.5 Computer Vision-Based Human Motion Analysis 13
2.6 Other Subjects Allied to Gait 15

3 Gait Databases 17
3.1 Early Databases 17
3.1.1 UCSD Gait Data 17
3.1.2 Early Soton Gait Data 18
3.2 Current Databases 20
3.2.1 Overall Design Considerations 20
3.2.2 NIST/USF Database 21
3.2.3 Soton Database 22
Overview 22
Laboratory Layout 24
Outdoor Data Design Issues 27
Acquisition Set-up Procedure 29
Filming Issues 29
Recording Procedure 30
Ancillary Data 31
3.2.4 CASIA Database 32
3.2.5 UMD Database 33

4 Early Recognition Approaches 35
4.1 Initial Objectives and Constraints 35
4.2 Silhouette Based 35
4.3 Model Based 39

5 Silhouette-Based Approaches 45
5.1 Overview 45
5.2 Extending Shape Description to Moving Shapes 48
5.2.1 Area Masks 49
5.2.2 Gait Symmetry 51
5.2.3 Velocity Moments 53
5.2.4 Results 54
Recognition by Area Masks 55
Recognition by Symmetry 58
Recognition by Velocity Moments 61
5.2.5 Potency of Measurements of Silhouette 63
5.3 Procrustes and Spatiotemporal Silhouette Analysis 65
5.3.1 Automatic Gait Recognition Based on Procrustes Shape Analysis 65
5.3.2 Silhouette Detection and Representation for Procrustes Analysis 66
Silhouette Extraction 66
Representation of Silhouette Shapes 68
5.3.3 Procrustes Gait Feature Extraction and Classification 69
Procrustes Shape Analysis 69
Gait Signature Extraction 69
Similarity Measure and Classifier 70
5.3.4 Spatiotemporal Silhouette Analysis Based Gait Recognition 70
Spatiotemporal Feature Extraction 72
Feature Extraction and Classification 73
5.3.5 Experimental Results and Analysis 77
Procrustes Shape Analysis 77
Spatiotemporal Silhouette Analysis 82
5.4 Modeling, Matching, Shape and Kinematics 89
5.4.1 HMM Based Gait Recognition 89
Gait Recognition Framework 90
Direct Approach 91
Indirect Approach 93
5.4.2 DTW Based Gait Recognition 94
Gait Recognition Framework 96
5.4.3 Shape and Kinematics 97
Shape Analysis 97
Dynamical Models 98
5.4.4 Results 100
HMM Based Gait Recognition 100
DTW Based Gait Recognition 102
Shape and Kinematics 104

6 Model-Based Approaches 107
6.1 Overview 107
6.2 Planar Human Modeling 109
6.2.1 Modeling Walking and Running 109
6.2.2 Model-Based Extraction and Description 111
6.3 Kinematics-based People Tracking and Recognition in 3D Space 114
6.3.1 Model-based People Tracking using Condensation 114
Human Body Model 115
Learning Motion Model and Motion Constraints 117
Experiments and Discussions 125
6.4 Other Approaches 131
6.4.1 Structure by Body Parameters 132
6.4.2 Structural Model-based Recognition 132

7 Further Gait Developments 135
7.1 View Invariant Gait Recognition 135
7.1.1 Overview of the Algorithm 136
7.1.2 Optical Flow Based SfM Approach 137
7.1.3 Homography Based Approach 138
7.1.4 Experimental Results 138
7.2 Gait Biometric Fusion 141
7.3 Fusion of Static and Dynamic Body Biometrics for Gait Recognition 144
7.3.1 Overview of Approach 144
7.3.2 Classifiers and Fusion Rules 145
7.3.3 Experimental Results and Analysis 146

8 Future Challenges 151

References 157
Literature 157
Medicine and Biomechanics 157
Covariate factors 158
Psychology 159
Computer Vision-Based Analysis of Human Motion 160
Databases 161
Early work 162
Current approaches 163
Further Analysis 166
Other Related Work 169
General 169

9 Appendices 171
Appendix 9.1 Southampton Data Acquisition Forms 171
Appendix 9.1.1 Laboratory Set-up Forms 171
Appendix 9.1.2 Camera Set-up Forms 175
Appendix 9.1.3 Session Coordinator's Instructions 180
Appendix 9.1.4 Subject Information Form 182

Index 185
Preface

It is a great honor to be associated with subjects at their inception. It is certainly early in the cycle for gait - as it is for biometrics. It is then a great honor to be part of the first ever series on biometrics, as it is to be amongst the first researchers in gait as a biometric. It has been great fun too - a challenge indeed since gait concerns not just recognizing objects, but moving objects at that, so we have had to develop new techniques before we saw the first results that people can indeed be recognized by the way they walk.
In terms of setting the scene, and the context of this book with others in the same series, it has been fascinating to see the rise in prominence of biometrics, from what was originally an academic interest, to one that is on the lips of leading politicians. This is because biometrics has the capability to solve current problems of international concern. These essentially center on verification of identity at speed and with assured performance, and biometrics has a unique capability here since we carry our own identity. As can be found elsewhere in the series, the earliest biometrics were palm prints - these suited the computational facilities available in the 1970s. Then, there has been interest in the more popular biometrics: the fingerprint, given its long forensic use; the face, given that it is non-invasive and can be captured without a subject's knowledge or interaction; and the iris. Iris recognition has proved quite an inspiration in biometrics, providing some of the largest biometric deployments and with some excellent performance. The fingerprint is now used in products such as mobile phones, computers and access control. Face recognition has a more checkered history, but it is the biometric favored by many in view of its practical advantages. These advantages of course make face recognition more difficult to deploy, as can be found in other volumes in the International Series on Biometrics. Visitors to the US now routinely find their fingerprints and faces recorded at portals of entry. Our context here is to set the scene, not to contrast merit and advantage - that comes later. One of the main reasons for the late entry of gait onto the biometrics stage was not just the idea, but also the technology. Recognition by gait requires processing sequences of images; this imposes a large computational burden, and only recent advances in speed and memory have made gait practicable as a biometric.
Rather than coordinate an edited book, we chose to author this text. We provide a snapshot of all the biometric work in human identification by gait, and all major centers for research are indicated in the text. To complete the picture, we have added studies from medicine, psychology and other areas wherein we will find not only justification for the use of gait as a biometric, but also pointers to techniques and to analysis. We have collocated the references at the end of the book, itemized by the area covered and cross-referenced to the text. There are of course many other references we could have included since gait is innate to human movement, so we have aimed here to provide a set of references which serve as a complete picture of current research in gait for identification, and as pointers to the richer literature to be found in this topic.
As academics, we know well that this book would not have been possible without the contributions of colleagues and students who have conducted research
1 Introduction

1.1 Biometrics and Gait

A unique advantage of gait as a biometric is that it offers potential for recognition at a distance or at low resolution, or when other biometrics might not be perceivable. Consider an image from a surveillance camera as in Fig. 1.1: the subject's face can be obscured, their hands are at too low a resolution for recognition by shape; it would be pointless even to attempt to recognize subjects by iris or fingerprint pattern. In much scene-of-crime data, the situation is exacerbated by poor quality video data or by poor illumination. In contrast, a subject's gait is often readily apparent in an image sequence. Identity can be concealed in a covert way quite easily: one does not assume that every customer entering a bank wearing a scarf over their face is about to rob it. Gait recognition can handle this, and might even answer the question as to whether the subject is actually a "him", or whether it is likely that the subject was in fact female.

Figure 1.1 Example Surveillance Video Images

Recognition by gait can be based on the (static) human shape as well as on movement, suggesting a richer recognition cue. It is actually one of the newest biometrics since its development is contemporaneous with new approaches in spatiotemporal image processing and computer vision. These new approaches only started when computer memory and processing speed became sufficient to process sequences of image data with reasonable performance.
Naturally, its development is complemented by that in other areas. These developments can be used for guidance: the medical analysis of gait can help to guide automated analysis of the human condition, or to monitor its effects on human gait; work in psychology has already motivated recognition approaches. These developments also offer evidence that supports the notion of gait as a biometric: there is considerable evidence in biomechanics, psychology and literature for the notion that people can be recognized by the way they walk. As such, we have written this book not just to show progress in gait as a biometric: the stock of techniques, the results achieved so far and the insight they provide. We also describe material from many different areas of potential use in furthering research in this unique and fascinating biometric.

1.2 Contexts

1.2.1 Immigration and Homeland Security

Biometrics has risen to prominence quickly, even with its short history. The current political agendas of many countries are permeated by questions that biometrics might answer, including security and immigration. Now, the U.S. Citizenship and Immigration Services requires applicants for immigration benefits to be fingerprinted for the purpose of conducting FBI criminal background checks; US-VISIT requires that most foreign visitors traveling to the U.S. on a visa have their two index fingers scanned and a digital face photograph taken to verify their identity at the port of entry. In the Enhanced Border Security and Visa Entry Reform Act of 2002, the U.S. Congress mandated the use of biometrics with U.S. visas. This law required that Embassies and Consulates abroad must issue to international visitors "only machine-readable, tamper-resistant visas and other travel and entry documents that use biometric identifiers," not later than October 26, 2004. From a topic that was largely on a University research agenda in 2002, biometrics have moved fast.
The move was largely due to performance: biometrics offer a combination of
speed and security, ideal in any mass transit scenario. Also, since they are part of
a human subject, they are in principle difficult to counterfeit. Not only this, but
they are amenable to electronic storage and checking, and devices with such
capability continue to proliferate. It is for these reasons that face, iris and
fingerprint have found evaluation in security and immigration. Other biometrics
have not enjoyed this. This is because some do not lend themselves well to that
application scenario, others - like gait - were simply too new to be considered at
that time.

1.2.2 Surveillance

In many developed countries, concern over security is manifest in surveillance systems. These systems are particularly advanced in the UK, where on-line face recognition is already in routine use to deter crime. In fact, the inspiration for Southampton's gait research was a high-profile case in the UK where a child was abducted and murdered and only the gait of the murderer could be determined from the surveillance data: as only gait could be perceived, was it a valid biometric? A primary aim of surveillance is naturally as a deterrent for criminal acts; much of it is video and it has been used as evidence in courts. The video data can suffer from adverse quality due to poor resolution, time-lapse imagery (images recorded at a frequency much lower than the video sampling rate to save on storage), tape re-use, as well as a subject concealing the more conventional biometrics. But it does offer data to which gait recognition technology could be, and is being, applied. Some of the difficulties inherent in recording gait sequences from an arbitrary viewpoint will be shown later. The ongoing trend is that deployment of surveillance systems will continue to increase, suggesting wider deployment of gait recognition techniques.

1.2.3 Human ID at a Distance (HiD) Program

The main single contributor to progress in automatic recognition by gait has been the Defense Advanced Research Projects Agency's (DARPA's) Human ID at a Distance research program, led by Dr. Jonathon Phillips from the National Institute of Standards and Technology (NIST). This program, initially aimed to improve security at US embassies following terrorist acts in 1998, embraced three main areas: face, gait and new technologies; in each area there were new techniques, new data and evaluation. The Human ID at a Distance program started in 2000 and finished in 2004 (ironically, privacy concerns in the US led to its closure). Gait is a natural contender for recognition at a distance, given its unique capabilities. The aim of the gait program was essentially to progress from laboratory-based studies on small populations to large-scale populations of real-world data. Of the current approaches to recognition by gait and data that can be used to analyze performance, those from MIT, Georgia Institute of Technology (GaTech), NIST and the Universities of Maryland (UMD), Southampton (Soton), Carnegie Mellon (CMU) and South Florida (USF) were originally associated with the Human ID at a Distance program. The program achieved many of its initial objectives: gait achieved a research capability concurrent in extent and depth with research in face recognition.

1.3 Book Structure


In the next Chapter we shall start by reviewing the evidence for the notion that gait is a biometric: amongst other areas, this arises in medicine, literature and psychology. Not only will we show how gait can be used to identify people, but we shall also derive insight to aid the development of automated recognition approaches and analysis. This insight derives from known variation in patterns of gait, including those due to illness and apparel. In biometrics (and pattern recognition in general), capability for recognition is usually evidenced by analysis of performance on specially constructed databases. This allows not only for investigation of the performance of a particular technique, but also for comparison of performance with that of other approaches. The selection of existing gait databases is described in Chapter 3, where "early databases" are those which existed prior to the Human ID at a Distance program and the current databases were developed during or after HiD research. We shall then describe the current approaches to gait recognition, focusing in particular on techniques and analyses conducted at the institutions of the authors of this text. In many applications of pattern recognition, approaches with recognition capability are usually based on a corpus of data which is treated either in a holistic manner or which is partitioned by application of prior knowledge. Accordingly we first describe silhouette-based (holistic) approaches, which derive recognition capability from the (binary) human silhouette, as described in Chapter 5. The alternative is to analyze the shape and dynamics of the moving human body, usually by deployment of a model; these approaches are described in Chapter 6. We then describe further application potential for the new biometric approaches before concluding with an analysis of the potential for this new, unique and intriguing biometric. You will find an extensive selection of references on human identification by gait, on gait analysis and on general factors relevant to this new technology. These have been grouped at the end of the book for convenience.

2 Subjects Allied to Gait

2.1 Overview

There is considerable support for the notion that each person's gait is unique. As
we shall see, it has been observed in literature that people can be recognized by
the way they walk. The same notion has been observed in medicine and
biomechanics, though not in the context of biometrics but more as an assertion of
individuality. Perhaps driven by these notions, though without reference to them,
there has been work in psychology on the human ability to recognize each other by
using gait. Those suffering myopia often state that they can use gait as a way of
recognizing people. There is other evidence too, which suggests that each person's
gait is unique. People have also studied walking from medical and biomechanical
perspectives, and this gives insight into how its properties can change which is of
general interest in any biometric deployment. We shall start with literature, with
definitions of meaning.

2.2 Literature

Perhaps the oldest gait analysis is due to Aristotle [1] though the word "gait" was only to arrive some time later. Its usual meaning is "manner of walking" [2] though this is sometimes given as a "manner of moving on foot" [3] since this can subsume running as well. It is variously given either as derived from gang, which means gait in German, or from the Middle English gate [3], meaning path or gait, as derived from the Old Norse gata, meaning path. In this respect it is interesting that one 'English' word for a double is doppelganger, which derives from "a double" and "goer", the latter given in this case as from Middle High German [3].
Shakespeare made several references to the individuality of gait, e.g. in The Tempest [Act 4 Scene 1], Ceres observes "High'st Queen of state, Great Juno comes; I know her by her gait"; even more, in Twelfth Night [Act 2 Scene 3] Maria observes of Malvolio "wherein, by the colour of his beard, the shape of his leg, the manner of his gait, the expressure of his eye, forehead, and complexion, he shall find himself most feelingly personated" and in Henry IV Part II [Act 2, Scene 3] "To seem like him: so that, in speech, in gait, in diet, in affections of delight, in military rules, humours of blood, he was the mark and glass, copy and book". Shakespeare's works actually preceded the first complete English dictionary, which was only to appear in 1755, so it is worth checking that Shakespeare's definition accords with our own understanding of the meaning of the word gait. In a curious - but rather expected - circular reference, in Johnson's English dictionary gait was defined [4] to be the manner of walking and Shakespeare was quoted as an exemplar of its meaning. Interestingly, Johnson also suggested it derived from gat in Dutch, but the current meaning of gat in Dutch concerns an aperture and not gait.
Similar anecdotes can be found in more contemporary literature, such as "I noticed this figure coming, and I realized it was John Eubanks from the way he walked" in the Band of Brothers [5], which is important since it describes parachutists in the Normandy landings, operating in twilight when few biometrics can be observed except gait, and in a critical scenario too.

2.3 Medicine and Biomechanics

2.3.1 Basic Gait Analysis

In terms of history, Aristotle was one of the earliest in this area (he was the son of a physician). Other notable names include Leonardo da Vinci, who studied force vectors, and Galileo, a pioneer in mechanics who translated those interests to biomechanics. Borelli (1608-1679) was an early pioneer in the study of human locomotion who was interested in the mechanical principles of locomotion, representing the starting point for the study of the biomechanics of locomotion. Later, the Weber brothers (1836) investigated human gait, both walking and running, with simple instrumentation, and suggested that the lower limbs act like a pendulum. However, these ideas awaited scientific justification: more advanced mathematical techniques and reliable instrumentation were necessary to probe the study of locomotion. Muybridge (1830-1904) was the first to employ photographic techniques extensively to record locomotion. Since those early times there has been much medical and biomechanical research, since gait is fundamental to human activities.
The aim of medical research has been to classify the components of gait for the treatment of pathologically abnormal patients. Murray et al. [6] produced standard movement patterns for pathologically normal people, which were used to compare with the gait patterns of pathologically abnormal patients [7]. These studies again suggested that gait appeared unique to each subject. The data collection system used required markers to be attached to the subject. This is typical of most of the data collection systems used in the medical field, and although practical in that domain, they are not suitable for identification purposes.
Fig. 2.1 illustrates the terms involved in a gait cycle. A gait cycle is the time interval between successive instances of initial foot-to-floor contact ('heel strike') for the same foot. Each leg has two distinct periods: a stance phase, when the foot is in contact with the floor, and a swing phase, when the foot is off the floor moving forward to the next step. The cycle begins with the heel strike of one foot, which marks the start of the stance phase. The ankle flexes to bring the foot flat on the floor and the body weight is transferred onto it. The other leg swings through in front as the heel lifts off the ground. As the body weight moves onto the other foot, the supporting knee flexes. The remainder of the foot, which is now behind, lifts off the ground, ending the stance phase.

Figure 2.1 The Walking Cycle (from [85]): right heel strike at 0% and 100% of the cycle and left heel strike at 50%; periods of single- and double-limb support; right stride length and right-left/left-right step lengths.
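As a minimal numerical sketch of these timing definitions (the event times below are invented for illustration and are not measured data), the stance/swing decomposition follows directly from heel-strike and toe-off instants:

# A gait cycle runs between successive heel strikes of the same foot and
# splits into stance (foot on floor) and swing (foot in flight).
# Event times are illustrative, in seconds.
right_heel_strikes = [0.00, 1.10, 2.21]   # successive right 'heel strike' events
right_toe_offs = [0.68, 1.79]             # right foot leaves the floor

for i, (hs, to) in enumerate(zip(right_heel_strikes[:-1], right_toe_offs)):
    cycle = right_heel_strikes[i + 1] - hs    # full gait cycle duration
    stance = to - hs                          # stance phase: foot in contact
    swing = cycle - stance                    # swing phase: foot moving forward
    print(f"cycle {i}: {cycle:.2f} s, stance {100 * stance / cycle:.0f}%, "
          f"swing {100 * swing / cycle:.0f}%")

With these invented times the split comes out near the commonly quoted 60% stance, 40% swing.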


From Murray et al.'s work, it can be concluded that if all gait movements were considered, gait is unique. In all there appear to be twenty distinct gait components, some of which can only be measured from an overhead view of the subject. Murray found "the pelvic and thorax rotations to be highly variable from one subject to another" [7]. These patterns would be difficult to measure even from an overhead view of the subject, which would not be suited to application in many practical situations. Murray also suggested that these rotation patterns were not found to be consistent for a given individual in repeated trials. In [6, 7] ankle rotation, pelvic tipping and spatial displacements were shown to possess individual consistency in repeated trials. Unfortunately, these components would be difficult to extract from real images. Figure 2.2 shows the rotation angles for the hip and knee, as measured by Murray [7]. Later, we will see how these angles have featured in model-based recognition systems.

Figure 2.2 Leg Rotation Angles: (a) hip and (b) knee (from [85])

The normal hip rotation pattern of the angle of the thigh illustrated in Fig. 2.2(a)
is characterized by one period of extension and one period of flexion in every gait
cycle. Fig. 2.3 gives the average rotation pattern as presented by [7]. The upper and
lower lines indicate the standard deviation from the mean. In the first half of the gait
cycle, the hip is in continuous extension as the trunk moves forward over the
supporting limb. In the second phase of the cycle, once the weight has been passed
onto the other limb, the hip begins to flex in preparation for the swing phase. This
flexing action accelerates the hip so as to direct the swinging limb forward for the
next step.

Figure 2.3 Mean Hip Rotation Pattern (hip rotation angle against percent of the walking cycle)

The pattern for normal knee rotation is more complex than that for the hip rotation. It shows two phases of flexion and two phases of extension. At the start of the walk cycle the knee of the outstretched limb has already begun to go into flexion. The maximum flexion occurs when the trunk moves forward over the supporting leg. As the trunk moves ahead of the supporting limb, the knee begins its first phase of extension. The knee begins to flex when the contralateral foot makes contact with the ground at the midpoint of the walking cycle. The angular velocity of the knee increases quite rapidly, characterizing the swing phase by large, rapid excursions into flexion and then into extension. Later, we will see approaches to model this motion, and observe how it can be extracted from a sequence of images.
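These descriptions suggest a simple harmonic caricature, sketched below with invented amplitudes and phases (an illustration only, not Murray's model): one harmonic per gait cycle captures the gross hip pattern, while the knee's two phases of flexion and two of extension call for a second harmonic.

import numpy as np

t = np.linspace(0.0, 1.0, 200, endpoint=False)  # one gait cycle, 0..100%
hip = 25 * np.cos(2 * np.pi * t)                # one flexion/extension period
knee = 30 * np.cos(2 * np.pi * t) + 20 * np.cos(4 * np.pi * t + 0.5)

def turning_points(x):
    # count interior changes of slope direction
    return int(np.sum(np.diff(np.sign(np.diff(x))) != 0))

# The knee curve shows more turning points per cycle than the hip,
# reflecting its additional flexion/extension phases.
print(turning_points(hip), turning_points(knee))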
Gait has a property known as bilateral symmetry, which means that when one
walks or runs the left arm and right leg interchange direction of swing with the right
arm and left leg, and vice versa, with half a period phase shift. This is illustrated in
Fig. 2.4, which also shows which foot is in contact with the ground. The second half
of the gait cycle is a reflection (about the midpoint of the cycle) of the first half for
both the cases of walking and running. The notion of symmetry in gait is still of
interest [8], and we shall later see how concepts of symmetry can be used to
recognize people by their gait.

There is a considerable literature on human gait and its many aspects. These include notions of balance [9], on kinematics [10, 11] and its relation to stride and frequency [12]. Naturally, the relationship between stride length and frequency is of fundamental concern to many studies of walking [13, 14] (as we shall find later, even this has been used for recognition). A study of frequency domain properties of gait suggests that the fundamental concern is with the lower frequencies [15]. Though we shall later find that many of the model-based approaches tend not to focus on the arms since a subject could be carrying something, there have been studies on arm movement in gait [16, 17].

Figure 2.4 Symmetry, Stance and Float in Walking and Running [17]: foot-to-ground contact patterns of the left and right feet over the cycle for walking and for running.

Running is a natural extension of walking, as illustrated in Fig. 2.4, with significant biomechanical differences [18, 19]. The running cycle is not solely discriminated from walking by velocity; you cannot just walk fast to claim that you are running. By biomechanics definitions, walking and running are distinguished firstly by the stride duration, stride length, velocities and the range of motion made by the limbs. That is, the kinematics of running differs from that of walking in that the joints' motion increases significantly as the velocity increases. A second difference concerns the existence of periods of double float, when neither foot is in contact with the ground - which does not occur in walking, where instead there are periods of double support with both feet on the ground.
The biomechanics literature makes similar observations concerning identity: "A given person will perform his or her walking pattern in a fairly repeatable and characteristic way, sufficiently unique that it is possible to recognize a person at a distance by their gait" [20]. Also, we find "All of us are aware that individuals walk differently; one can often recognize an acquaintance by his manner of walking even when seen at a distance" [21]. The majority of the medical approaches so far have used marker-based systems. There is also an approach known as observational gait analysis [23], which concerns the clinical use of observations made of people walking unhindered, both physically and psychologically, by the marker-based systems. More recently this has progressed to using video recordings, which allows for comparative analysis, but appreciation of performance is impeded by the small numbers of subjects involved.
There have been (medical/biomechanics) studies on the effects of treadmills on gait, which is of concern not in the deployment of gait as a biometric, but more for data acquisition, since a treadmill can be used to obtain long walking sequences conveniently and is the only practicable means to obtain video data of running subjects. One study suggests that walking on an 'ideal' treadmill, where the supporting belt moves with a constant speed, does not differ mechanically from walking over ground, except for wind resistance, which is negligibly small during walking [12]. The only difference between the two conditions is perceptual: the environment is stationary when a treadmill is used [24]. However, modern treadmills require selection not only of speed, but also of inclination. Murray found that during treadmill walking, subjects tend to use a faster cadence and shorter stride length than during floor walking. However, in general, treadmill walking was not found to differ markedly from floor walking in kinematics measurements [25]. Whether a treadmill will affect one's gait will also depend on the habituation of the subjects to treadmill walking [26]. Given that one aim of deploying computer vision in gait will be to produce marker-less video-based systems, it will be interesting to see whether the convenience associated with automated computer vision techniques can help to resolve these matters, not only in respect of the number of subjects involved, but also in terms of deployment.

2.3.2 Variation in Gait Covariate Factors

As biometrics concern humans, there are naturally many potential variations in the data. Since these are usually not fundamental to the measured property, but are the result of human action, they are known as covariate data. In a behavioral biometric, such as gait, these are likely to be exacerbated. In face recognition the subject can smile or make other facial expressions; an environment concern is that the face is unlikely to be centered and that the illumination is likely to vary. These factors are likely to obtain in gait: mood is likely to affect gait, and people move throughout an image sequence and thus will interact with any fixed illumination. One of the purposes of biometric approaches is to determine identity invariant to imaging conditions and to a subject's disposition. For face recognition we seek a signature that is invariant to expression and illumination. Similarly in gait we seek a unique biomechanical signature that is the same for the subject whatever their mood or walking environment. As biometrics is relatively new, there is as yet little data to investigate these notions. It is evident that studies of the effects of age on face recognition are unlikely to have data older than the subjects themselves. In fact, we shall find later that approaches to gait as a biometric have not only learnt from the established biometric approaches, and the most recent databases do now include covariate data, but the research is contemporaneous with other biometrics in that studies report outdoor enrolment (finding the subject in "real-world" images) and determination of potent factors for recognition, to be described later.
We shall review in outline some of these effects, especially since they provide pointers to areas in which data should be, and indeed has been, collected. Evidently, the corpus of subjects for whom the data is collected is usually small, since medical studies await the convenience of markerless gait analysis. As such, there is certainly high variance attached to any measures made, and this is one area where computer vision-based analysis can contribute, since the data collection is much easier and the corpus of subjects can be made much larger with ease. Certainly, load affects gait [27, 28], suggesting the need to acquire imagery of subjects carrying luggage of different weight and of different shape. Naturally, footwear can affect walking [29-31], as can alcohol [32, 33] (we anticipate no shortage of student volunteers for a new data study here!). Tight clothing will affect gait; loose clothing will affect the perception of gait by video. Intuitively, gait will change with age, as do most biometrics, except ears. However, most medical studies concern disorders with gross (short-term) change [34-40] - essentially the detection of abnormal gait; one study [35] suggested that only cadence changed with time, but the study used floor-based measures based on footprints and these are unlikely to be sufficiently sensitive for measuring the smaller changes likely to occur with aging. These changes can be due to compound changes in physiology, neurology and/or illness. There are illnesses which are known to have particular effects on gait, such as Parkinson's disease or a Trendelenburg gait, where body weight is transferred to the affected side when the hip abductors are unable to stabilize the hip. Without rapid and convenient analysis it is unlikely that study of the effect of aging will progress much further, and this is one area where automated gait analysis via computer vision can make contributions beyond those associated with biometric issues. Finally, mood can affect gait, as can music [41, 42], for which reason the Southampton indoor database (Section 3.2.3) was recorded with a talk-only radio playing in the laboratory, to reflect comfort and with intentional absence of music.
The sources of variation in gait might seem dispiriting at first: how can one even hope to recognize subjects with this volume of covariate factors? First, this is not unique to gait, and gait has the unique advantage of recognition at a distance. In terms of recognition, the intention in biometrics (in pattern recognition even) is to derive a set of measurements for one subject for which the variation for that subject (the intra-class variation) is less than the variation between subjects (the inter-class variation). For visualization, in a 3-dimensional feature space (where subjects' identities are represented by three measurements), if each subject's measurements are contained in a small sphere then recognition can be achieved when all the spheres are spread apart. The problem then becomes one of appropriate measurement and, as we shall find, that is where the research is. Essentially we seek a unique biomechanical invariant; of note, we shall find that shoe type affected recognition little in one study (the only shoe type that affected gait considerably was flip-flops). As such, we seek to understand these covariate factors in recognition, with the potential to reinforce medical assessment given the considerably larger number of subjects consistent with the ease of analysis of a computer vision-based system.
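This intuition can be made concrete with a small synthetic sketch (the numbers are invented and are not gait measurements): tight per-subject clusters in a 3-dimensional feature space give an intra-class variation far smaller than the inter-class variation between subject means.

import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(0, 10, size=(5, 3))              # 5 subjects, 3 features each
samples = {s: m + 0.3 * rng.standard_normal((20, 3))
           for s, m in enumerate(means)}             # 20 trials per subject

# intra-class: spread of each subject's own measurements
intra = np.mean([x.var(axis=0).sum() for x in samples.values()])
# inter-class: spread of the subject means about the grand mean
grand = np.concatenate(list(samples.values())).mean(axis=0)
inter = np.mean([((x.mean(axis=0) - grand) ** 2).sum()
                 for x in samples.values()])
print(f"intra-class {intra:.2f} << inter-class {inter:.2f}")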

2.4 Psychology
In the earliest psychology studies of gait perception [43], participants were presented with images produced from points of light attached to body joints. When the points were viewed in static images they were not perceived to be in human form, rather it was thought that they formed a picture - of a Christmas tree even (the illustration in Fig. 2.5 does not appear to be like a Christmas tree but can similarly be perceived to be without human form). When the points were animated, they were immediately perceived as representing a human in motion. Later work showed how, from point light displays, a human could be rapidly extracted and different types of motion could be discriminated, including jumping and dancing [44]. Later, Bingham [45] showed that point light displays are sufficient for the discrimination of different types of object motion and that discrete movements of parts of the body can be perceived. As such, human vision appears adept at perceiving human motion, even when viewing a display of light points. Indeed, the redundancy involved in the light point display might provide an advantage for motion perception [46] and could even offer improved performance over video images.

Figure 2.5 Marker-based Gait Analysis [160]

Naturally, studies in perception have also addressed gender as well as pure motion, again using point light displays. One early study [47] showed how gender could be perceived, and how accuracy was improved by inclusion of height information [48]. The ability to perceive gender has been attributed to anatomical differences which result in greater shoulder swing for men, and more hip swing for women. Indeed, a torso index (the hip-shoulder ratio) has been shown to discriminate gender [49], and the identification of gender by motion of the center of movement was also suggested [50]. Gender identification would appear to be less demanding than person identification. However, it has been shown how subjects could recognize themselves and their friends [51]; this was later explained by considering gait as a synchronous, symmetric pattern of movement from which identity can be perceived [53]. Essentially, research into the psychology of gait has not received much attention, especially using video, in contrast with the enormous attention paid to face recognition. One more recent study [55], using video rather than point light displays, has shown that humans can indeed recognize people by their gait, and learn gait for purposes of recognition. The study concentrated on determining whether illumination or length of exposure could impair the ability of gait perception. The study confirmed that, even under adverse conditions, gait could still be perceived. Studies on the ability to perceive gender from motion are ongoing [56]. Like Shakespeare's observations, like medicine and like biomechanics, these studies encourage the view that gait can indeed be used as a biometric.

2.5 Computer Vision-Based Human Motion Analysis


Many studies have considered human motion extraction and tracking, though not for biometric recognition purposes. There is quite a range of detailed surveys of research in this area. As with mainstream computer vision, the earliest [58] confined analyses to motion modeling and tracking. A more recent survey [59] considered the analysis of non-rigid motion, considering especially articulated and elastic motion. In the first of these, the motion of coherent objects, each of which has rigid motion, leads to the non-rigid motion of the whole. Naturally, human motion falls within this, and the reviewed approaches include those that are model-free and model-based. The suggested applications include athletic training, biomechanical evaluation, animation and machine interaction but not, as it was then at a very early stage, biometrics. At a similar time, another survey reviewed work on visual analysis of gestures and whole body movement [60] and distinguished 2D approaches, with or without shape models, and 3D approaches. The application focuses included gesture-based interaction, motion analysis and model-based image coding, but again not biometric use. Slightly later, a survey was conducted for human motion analysis alone [61] which included tracking human motion and recognizing activity, as well as motion analysis via human body parts. As we shall find later in biometric approaches, the body motion analyses were categorized as with or without a model. The analysis of human activities included walking, and included a precursor to one of the earlier biometric approaches, but considered activity recognition more as to purpose rather than any attempt to associate identity with the subject, as in biometrics. A later survey [62] concerned computer vision-based approaches to human motion capture and covered tracking (and initialization, naturally), pose estimation and recognition. Though the research covered was that prior to mid-2000, it is perhaps surprising that there is no inclusion of the notion of gait as a biometric, as there were by then several journal papers, but it was then in its infancy. This was redressed in a much more recent survey [63] covering papers from around 1997 to 2001, which now included gait as a biometric. The paper gave a hierarchical view of the computer-vision process from low- to high-level processing and gave a taxonomy based on detection, tracking and behavior analysis. It concluded with a survey of future difficulties and directions, including the potential of the relationship of gait as a biometric with human motion analysis. The most recent survey [64] concerned surveillance and concentrated on the detection, tracking and description of moving objects, with especial consideration of gait for identification purposes, given its unique capability for recognition at a distance.
These reviews cover much of the detail in human body and movement analysis. Given the absence of biometric notions until only recently, we shall consider the main bases of this research (the earliest approaches in these areas) and find later that it has indeed guided some of the biometric research. The human body is an extremely complex object, being highly articulated and capable of a variety of motions. Rotations and twists of each body part occur in nearly every movement, and various parts of the body continually move into and out of occlusion. The selection of good body models is important to efficiently recognize human shapes from images and properly analyze human motion. Stick figure models and volumetric models are commonly used for three-dimensional tracking, and the ribbon model and blob model are also used but are not so popular. Stick figure models connect sticks at joints to represent the human body. Akita [65] proposed a model consisting of six segments: two arms, two legs, torso and head. Lee and Chen's model [69] uses 14 joints and 17 segments. Guo et al. [Guo94] represent the human body structure in the silhouette by a stick figure model which has ten sticks articulated with six joints.
On the other hand, volumetric models are used for a better representation of the human body. One model [71] consists of 24 segments and 25 joints, linked together into a tree-structured skeleton. The "flesh" of each segment is defined by a collection of spheres located at fixed positions within the segment's co-ordinate system. At the same time, angle limits and collision detection are incorporated in the motion restrictions of the human model. Among the different volumetric models, generalized cones are the most commonly used. A generalized cone [70] is the surface swept out by moving a cross-section of constant shape but smoothly varying size along an axis. Generalized cylinders are the simplified case of generalized cones that have a cross-section of constant shape and size.

Figure 2.6 Human Body Models: (a) blob, (b) stick, (c) cylinder

The model proposed by Marr and Nishihara [70] consisted of a hierarchy of cylinders, starting with a single cylinder describing roughly the size and orientation of the body. Hogg [66] and Rohr [74] followed Marr and Nishihara's model: a set of 14 elliptical cylinders is used to represent the feet, legs, thighs, hands, arms, upper arms, head and torso. Kurakake and Nevatia [68] treated the human body as an articulated object having parts that can be considered as almost rigid and connected through articulations. They used the ribbon, which is the two-dimensional version of the generalized cylinder, to represent the parts.
The blob model was originally developed by Kauth et al. [67] for application to multi-spectral satellite (MSS) imagery and has been used in human motion tracking. The person is modeled as a connected set of blobs, each of which serves as one class. Each blob has a spatial and color Gaussian distribution, and a support map that indicates which pixels are members of the blob. Fig. 2.6 shows three different examples: a stick figure model, a cylinder model and a blob model.
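A hedged sketch of the blob idea follows (all values synthetic, not Kauth et al.'s formulation in detail): each blob is a Gaussian over joint spatial and color coordinates, and the support map assigns each pixel to its most likely blob.

import numpy as np

def log_likelihood(features, mean, cov):
    # unnormalized log of a multivariate Gaussian
    d = features - mean
    return -0.5 * np.einsum("...i,ij,...j", d, np.linalg.inv(cov), d)

h, w = 4, 4
ys, xs = np.mgrid[0:h, 0:w]
rgb = np.random.default_rng(1).random((h, w, 3))
feats = np.dstack([xs, ys, rgb]).reshape(-1, 5)      # (x, y, r, g, b) per pixel

blob_means = np.array([[1, 1, 0.2, 0.2, 0.2],        # darker blob, upper left
                       [3, 3, 0.8, 0.8, 0.8]])       # brighter blob, lower right
scores = np.stack([log_likelihood(feats, m, np.eye(5)) for m in blob_means])
support_map = scores.argmax(axis=0).reshape(h, w)    # per-pixel blob label
print(support_map)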
However, these structural models need to be modified according to different applications and are mainly used in human motion tracking. The alternative is to consider the property of the spatiotemporal pattern as a whole. Among current research, human motion can be defined by the different gestures of body motion, different athletic sports (tennis, ballet) or human walking or running. The analysis varies according to the different motions. There are two main methods to model human motion. The first is model-based: after the human body model is selected, the 3-D structure of the model is recovered from image sequences with [73, 69] or without moving light displays [65, 66, 74]. The second emphasizes determining features of motion fields without structural reconstruction [72, 87, 90].
Ideas from human motion studies [6] can be used for modeling the movement of human walking. Hogg [66] and Rohr [74] use flexion/extension curves for the hip, knee, shoulder and elbow joints in their walking models. A different approach to the modeling of motion was taken by Akita [65], who used a sequence of stick figures, called a key frame sequence, to model rough movements of the body. In his key frame sequence of stick figures, each figure represents a different phase of body posture from the point of view of occlusion. The key frame sequence is determined in advance and referred to in the prediction process. To search the interpretation tree of the human body and reduce its computational complexity, Chen and Lee [69] applied general walking-model constraints, derived from knowledge of walking motion, to eliminate infeasible solutions.
Other approaches, different from the above, consider the properties of the spatiotemporal pattern as a whole. These are the model-free approaches, of which we shall find versions in gait-biometrics approaches. Polana and Nelson [72] defined temporal textures to be motion patterns of indeterminate spatial and temporal extent, activities to be motion patterns which are temporally periodic but limited in spatial extent, and motion events to be isolated simple motions that do not exhibit any temporal or spatial repetition. Little and Boyd's approach [87] is similar to Polana and Nelson's idea, but they derive the dense 2-D optical flow of the person and compute a series of measures of the position of the person and the distribution of the flow. The frequency and phase of these periodic signals are determined and used as features of the motions. As already indicated, some of these models have found application in systems aimed at tracking humans in image sequences. Here of course we are verging onto biometrics, since some of the notions that people can be recognized by the way they walk are to be found as developments of the analysis of human motion.
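The frequency-and-phase idea can be illustrated with a short sketch (the signal below is synthetic and stands in for a flow-derived measure; this is not Little and Boyd's code):

import numpy as np

fps, seconds, gait_hz = 30, 4, 1.1
t = np.arange(fps * seconds) / fps
noise = 0.1 * np.random.default_rng(2).standard_normal(t.size)
signal = np.sin(2 * np.pi * gait_hz * t + 0.7) + noise   # periodic gait measure

spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(signal.size, d=1 / fps)
k = np.abs(spectrum).argmax()            # dominant (gait) frequency bin
print(f"frequency {freqs[k]:.2f} Hz, phase {np.angle(spectrum[k]):.2f} rad")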

2.6 Other Subjects Allied to Gait


There are other subjects allied to gait, though they contribute less to the formal analysis. First, there is podiatry, concerning the study of feet, and then there is forensic podiatry, which involves trying to determine whether the mark made by a foot, or its wear pattern, was made by a particular foot. More recently, these studies have started to use biomechanics to reinforce recognition procedures. There have also been mathematical studies on the manners of locomotion, but these give little but conceptual input to model-based approaches for recognition, which are based more on mechanics. Finally, there has been work on graphics, but this concerns rendering and depiction more than analysis. Some of the motion models used in animation are currently unrealistic, and this is one area where gait analysis in general can contribute to improvement.

3 Gait Databases

Naturally, the success and evolution of a new application relies largely on the dataset used for evaluation. The early gait databases were collected about 10 years before the time of writing. Then, computers had little power and memory costs were comparatively high. Clearly, this was before digital video, and acquisition was based on analogue camcorder technology, which resulted in frames being digitized individually. Since techniques were in their infancy, as in face recognition, early databases had only a few subjects. The idea then was to determine whether recognition could be achieved at all - or not. At that stage we were not interested in the ramifications of recognition. There were two early databases which were developed independently: the UCSD data was recorded outdoors and the Southampton data indoors, with subjects wearing special trousers. The current databases are considerably more advanced, but certainly benefited in their development from the early approaches.

Figure 3.1 UCSD Gait Data and extraction: (a) original image (b) extracted moving subject

3.1 Early Databases


3.1.1 UCSD Gait Data

The first generally available database was from the Visual Computing Group of the University of California San Diego; originally it comprised 5 subjects with 5 sequences each, and this was later augmented to 6 subjects with 7 sequences each. The gait data was obtained in two sets taken on two different dates: the first set was extended by adding two more sequences for each of the 5 original subjects and one new subject with 7 sequences. A Sony Hi8 video camera was used to acquire these images. The video camera was opposite a concrete wall in an outdoor courtyard. The use of outdoor conditions and a shaded scene was aimed at making the lighting as diffuse as possible, though shadows are evident in the images. The students walked in a circular path around the camera so that only one person at a time was in the camera's field of view. The use of a circular path ensured a smooth walking motion was maintained throughout acquisition. The image sequences were recorded with the subject walking fronto-parallel to the camera: the direction of walk was normal to the camera's plane of view. The subjects walked around the track for around fifteen minutes and the first two passes in front of the camera were discounted to handle camera awareness, though it is also more likely the subjects would settle into a steady gait later. The original full color images were of 640x480 pixels and the sequence length was of the order of 100 frames. At 30 frames/sec this constitutes around three periods of human walking (naturally dependent on speed).
An example of the data is shown in Fig. 3.1(a). In terms of the computers available now, this seems a rather limited set of data. Clearly the data was augmented, so conditions changed, though only slightly. At the time though, processing a database of around 4000 images was a considerable task, so some of the early studies cropped the data and used it in black and white format only [88, 92]. The extraction shows that the data met its aims: a single moving object could be extracted from the image. Clearly the lighting was generally chosen well, though there are shadows on the original image underneath the walking subject's legs, and the shadow is present in the extraction too (though more advanced techniques, aimed at reducing the effects of shadows, are available now).

3.1.2 Early Soton Gait Data

The other early gait data came from Southampton. In this, a CCD camera was used
to collect the data and its output was recorded on a video recorder and later
digitized. As in the UCSD data, the camera was sited with its plane normal to the
subject's path but in an indoor location with controlled (level) illumination. Subjects
walked in front of a plain, static, cloth background. Given that the main aim of the
data was for analysis by a model based approach to recognition (to be described
later), problems occurred with the data in terms of clarity of the moving legs. This
was mainly due to creases in the subjects' trousers, and the self-occlusion of gait led
to merging of the legs in the images. One solution was to make each subject wear a
special pair of white trousers that had a dark stripe down the middle of the outside
of each leg. In this way, the leg closest to the camera could be distinguished visually
from the other leg at all times. Fig. 3.2(a) shows an example image of a subject used
in this study.

(a) original image (b) after edge detection


Figure 3.2 Example Walking Subject [96]

Each subject walked past the camera ten times; the first three and last three of these
sequences were discarded, leaving four sequences for each of ten subjects taken
from the central part of the walking period. This was because the subject would
still be becoming familiar with the experiment in the early part of the recording,
whereas in the latter part their aim was to finish the experiment. Again, this data
seems limited now, but it was recorded in 1996. Given that it was desirable that the
subjects achieved constant velocity, subjects were given room to accelerate to a
constant velocity before entering the field of view of the camera. This did not use
the circular path of the UCSD data, though a constant speed was achieved. One
problem with the data was that as the camera had no shutter, the lower leg appeared
blurred during the swing phase of the gait cycle. As the initial target was analysis by
the vision paradigm of edge detection and then feature extraction, the edge image in
3.2(b) shows sufficient contrast for this purpose, noting that there is sufficient edge
strength in the line marking the leg as well as in the front edge of the trousers.

(a) UMD, Maryland (b) CMU silhouette

(c) CASIA (d) Soton outdoor


Figure 3.3 Recent Gait Data

The UCSD and early Soton databases are still available, but are now largely
superseded. This is not just in the number of subjects recorded, which is now much
larger, but also in terms of covariate factors and application potential. The early
databases sufficed to show that people could be recognized by the way they walk,
by techniques described in the next Chapter. Before that, we shall review the
current state of the art in gait database design.

3.2 Current Databases

3.2.1 Overall Design Considerations

It is encouraging to note the rich variety of data that has been collected, and to see
how research in gait has benefited from research in other biometrics: a range of
scenarios, covariate factors and ground-truth data is already available.


(a) successive clips of example subject



(b) general view


Figure 3.4 Subject from Gait Challenge Data

These current databases include: UMD's surveillance data [100]; NIST/USF's
outdoor data, imaging subjects at a distance [75]; GaTech's data combining marker-
based motion analysis with video imagery [106]; CMU's multi-view indoor data
[80]; CASIA's outdoor data [122]; and Southampton's data [83], which combines
ground-truth indoor data (processed by broadcast techniques) with video of the
same subjects walking in an outdoor scenario (for computer vision analysis).
Examples of Maryland's outdoor surveillance-view data, a silhouette derived from
CMU's treadmill data, and Southampton's indoor and outdoor data are given in

Figs. 3.3(a)-(d), respectively. The NIST/USF data was explicitly collected for the
Human ID at a Distance program, for an evaluation known as the gait challenge
which concerned recognition capability on outdoor data with study of covariate
factors. These concern the potential for within-subject variation, which includes
footwear and apparel. Application factors concern deployment via computer vision
though none of the early databases allowed facility for such consideration, save for
striped trousers in the early Southampton database (aiming to allow for assessment
of validity of a model-based approach), as shown in Fig. 3.2(a). The new databases
seek to include more subjects so as to allow for an estimate of inter-subject
variation, together with an estimate of intra-subject variation thus allowing for
better assessment of the potential for gait as a biometric.
The data described here was developed especially for purposes of evaluation and
is usually freely available to researchers.

:Shoe:iYP" ,; >'
" »

A C,A,L,NB G,A,L,NB C,A,L,WB G,A,L,WB


Left
B C,B,L,NB G,B,L,NB C,B,L,WB G,A,L,WB
A C,A,R,NB C,A,R,WB G,A,R,WB
Right
B C,B,R,NB G,B,R,NB C,B,R,WB G,B,R,WB
Concrete Grass Concrete Grass
Table 3.1 May and November Gait Challenge Data

3.2.2 NIST/USF Database


The database associated with the HumanID challenge problem consisted [75] of 452
sequences from 74 individuals, with video collected for each individual for two
camera views in differing surface conditions and shoe types. The data was collected
outdoors, reflecting the added complications of shadows from sunlight, motion in
the background, and moving shadows due to cloud cover. This database was the
largest then available in terms of number of people, number of video sequences, and
variety of conditions under which a person's gait was collected. Later, the database
was extended [149] to be 1870 sequences from 122 subjects.
Each subject walked around two similar sized elliptical courses, one on concrete
and the other on a grass lawn (which can be seen in Fig. 3.4(b)), with a major axis
of about 15m and a minor axis of about 5m. Each course was viewed by two
cameras. The cameras were located approximately 15 meters from each end of the
ellipse, with lines of sight adjusted to view the whole ellipse. Subjects were asked to
read, understand, and sign an IRB-approved consent form before participation.
Information recorded in addition to the video includes sex (75% male), age (19 to
54 yrs), height (1.47 m to 1.91 m), weight (43.1 kg to 122.6 kg), foot dominance
(mostly right), type of shoes (sneakers, sandal, etc.), and heel height. A little over
half of the subjects walked in two different shoe types. Thus, for each subject there
were up to eight video sequences: (grass (G) or concrete (C)) x (two cameras, L or
R) x (shoe A or shoe B). The dataset is quite demanding for other biometrics since
in some cases the only biometric that can be seen is gait, as in Fig. 3.4(a); the
imagery is wholly outdoors and the lighting is uncontrolled. Clearly, face
recognition, indeed any biometric analysis, on this data could be a taxing exercise.

Originally, the gait challenge concerned analysis of the data for which no briefcase
was carried and later data was added for subjects carrying a briefcase. The range of
data collected and analyses possible is shown in Table 3.1, where the gallery subset
for the gait challenge analysis was G,A,R,NB (as highlighted) and the probe data
was the remaining sequences in which no briefcase was carried.
A later study extended the database usage by manual labeling [149]. To gain
insight into the relationship between recognition capability and silhouette quality,
silhouettes were created for one gait cycle for 71 subjects under 4 different
conditions (shoe type, surface, and time). This gait cycle was aimed to be selected
from the same 3D location in each sequence, whenever possible, excluding the
portion that included the high-contrast calibration box (Fig. 3.4(b)) to avoid
errors in background subtraction. Each pixel was also labeled according to body
segment: head, torso, left arm, right arm, left upper leg, left lower leg, right upper
leg, and right lower leg. Examples are shown in Fig. 3.5 which highlight how this
data can be used to analyze difficulty arising from the legs' self-occlusion. This
resource is available online [79] for use by the gait community to test and design
better silhouette detection algorithms. Further, the data allows for understanding of
the contribution not only of body labeling, but also of the segments, to test
recognition capability.

Figure 3.5 Example Manually Labeled Gait Challenge Data

3.2.3 Soton Database

Overview
In order to provide an approximation to ground truth and to acquire imagery for
application analysis, the Southampton data procedure filmed subjects indoors and
outdoors. To investigate the potential for gait as a biometric, the database was aimed
to contain more than 100 subjects, henceforth referred to as the large-subject
database, to allow for estimation of between-subject variation. An overview of the
databases comprising the large-subject database is given in Table 3.2. This is
accompanied by the small-subject database which contained around 10 subjects in
differing scenarios to allow for within-subject variation, described in overview in
Table 3.3. The resource is available for research via ftp download [81] after
completing necessary formalities.

Camera   Scan Type         View Angle   #Subjects   Locality   Walk Surface
A        Progressive scan  Normal       116         Indoors    Track
D        Interlaced        Oblique      116         Indoors    Track
B        Progressive scan  Normal       116         Indoors    Treadmill
C        Interlaced        Oblique      116         Indoors    Treadmill
E        Progressive scan  Normal       116         Outdoors   Track
F        Interlaced        Oblique      116         Outdoors   Track
Table 3.2 Overview of Southampton's Large-Subject Gait Databases

(a) track data image (b) treadmill image (c) cropped chromakey extraction from track data

Figure 3.6 Southampton Indoor Gait Data

Camera   Scan Type         View Angle            #Subjects   Locality   Walk Surface
AS       Progressive scan  Normal                12          Indoors    Track
BS       Interlaced        Oblique               12          Indoors    Track
GS       Progressive scan  Inclined and Normal   12          Indoors    Track
HS       Progressive scan  Frontal               12          Indoors    Track
Table 3.3 Overview of Southampton's Small-Subject Gait Databases

Indoors, treadmills are most convenient for acquisition but there is some debate
as to how they can affect gait. As described earlier, some studies suggest that
kinetics are affected rather than kinematics, but our experience with using untrained
subjects and their limitations on footwear (subjects wearing open-backed shoes
experienced particular difficulty) and clothing motivated us to consider the track as
the most suited for full analysis. As in Fig. 3.6(a), the track was prepared with the
chromakey (bright green, an unusual color for clothing) background
illuminated by photoflood lamps, viewed normally and at an oblique angle. The
track was of the shape of a "dog's bone", as seen in Fig. 3.7, so that subjects walked
continuously and passed in front of the camera in both directions. The same camera
view and chromakey arrangements were used for the treadmill, but here subjects
were highlighted with diffuse spotlights, as in Fig. 3.6(b). The treadmill was set at
constant speed and inclination, aiming to mimic a conventional walk pattern.
A similar layout was used for the outdoor track, where the background contained a
selection of objects such as foliage, pedestrian and vehicular traffic, and buildings
(also useful for calibration), as well as occlusion by bicycles, cars and other subjects. As such,
subjects' silhouettes can be extracted from outdoor and indoor, Fig. 3.6(c), imagery
and their signatures compared. The imagery for the large database was completed
with a high resolution still image of each subject in frontal and profile view,
allowing for comparison with face recognition and for good estimates of body shape
and size. The track data was initially segmented into background and walking and
further labels were introduced for each heel strike and direction of walking. This
allowed for basic analysis including manually imposed gait cycle labels . The
treadmill and outside data was segmented into background and walk data only.
The available gait databases, Tables 3.2 and 3.3, will later be referred to by a
camera label. These databases are stored as sequences of DV, for which a
reader/interface has been made available in C and Python that allows database users
to access frames in the DV directly.



(Legend: chromakey background, treadmill, photoflood light, spotlight)
Figure 3.7 Floor Plan of Southampton Gait Laboratory

Laboratory Layout
A plan of the layout of the gait laboratory is given in Fig . 3.7. The track floor was
painted so as to ensure good chromakey extraction of the feet and ankles as well as
the body . A slight difference in intensity means that this is achieved by a two-pass
procedure rather than a single pass for pure chromakey. The dimensions of the room
allowed deployment of the four camcorders in the chosen configuration, and
allowed subjects and backgrounds to be illuminated satisfactorily. The track was
completed with two circular segments to allow the subjects to turn around without
ceasing to walk normally: this is the dog's-bone shape. This superseded an earlier circular track
where subjects entered at one end of the track and exited at the other, and then
walked through other rooms so as to enter at the same point as before. This was
superseded since subjects took a long time to walk round the track and pipelining
subjects was not possible without collision. Further, subjects would only have been
filmed walking in one direction. For these reasons the track was chosen in its final
form.

(a) track from start (X in Fig. 3.7) (b) track from end (Y in Fig. 3.7)

(c) laboratory viewed from track end (d) camera setups (inc. surveillance)
Figure 3.8 Southampton Gait Laboratory

Two cameras were used to view subjects in each scenario, one normal to the
walking direction and the other at an oblique angle. Two viewpoints were chosen so
as to allow for later investigation of viewpoint-invariant gait signatures (which had
already been demonstrated for model-based techniques, for a small change in view
angle of about 20° [166] - to be discussed later). As it is not unlikely that security
video will use interlaced format, the cameras positioned at an oblique angle were set
in interlaced mode whereas those normal to the subject were progressive scan. At
the time of camera procurement, the most appropriate camera types were the Canon

25
MV30i (progressive scan) and Sony TRV900E (interlaced). An evaluation
suggested that the Sony camera's optics and color response were at higher quality
than those of the Canon. Unfortunately, the Sony could achieve progressive scan at
a reduced frame rate whereas the Canon could give video rate progressive scan
capability.
The small-subject database was constructed with all subjects walking along the
indoor track, Figs. 3.8(a) and (d). The outdoor data primarily aims to investigate
performance of computer vision techniques in extracting people whereas the
treadmill aims to enable easy acquisition. Since neither factor was basic to gait as a
biometric, variational data was not recorded for those scenarios. Since the recording
was all inside, the camera settings differed from those used in the large database.
Also, two extra cameras were used, one of which (as can be seen in Fig. 3.8(d)) was
placed normal to the track but with increased elevation; this is Database GS in
Table 3.3. The other gave a front view, showing images from the viewpoint of Fig.
3.8(a); this is Database HS in Table 3.3.
The chromakey material was placed behind every location at which subjects
appeared in the field of view of the camera. The main difficulty in illuminating the
chromakey background was the lack of ceiling height commonly used to ensure
even illumination. In an approximate solution, as shown in Figs. 3.8(b) and (c), 800W
Tre-D neon photoflood lights were reflected off the ceiling with carefully-
positioned screens (made from black wrap) to prevent light spills into the other
areas of the laboratory. The subjects were illuminated on the treadmill using two
500W spotlights diffused through white umbrellas, seen in Fig. 3.8(c). A screen was
positioned between the treadmill and the track areas, to ensure that the lighting for
the scenarios did not interfere, seen in the far left in Fig. 3.8(b).
The eventual use of the database will be by computer vision techniques. The
usual methodology is for high-level feature extraction and description and/or
statistical recognition techniques to follow low-level feature extraction.
Accordingly, we decided to evaluate data quality by performance of low-level
feature extraction techniques. The techniques chosen were Canny/Sobel edge
detection and an established moving-object extraction technique. Simpler subject extraction
techniques (e.g. image subtraction with background obtained from temporal
median) were not chosen to evaluate the data as these have known performance
limitations and are unlikely to be deployed in any future recognition system. Much
of the quality effects are difficult to demonstrate in static format as given here,
though animated imagery is available. Naturally, there was particular concern with
lighting but a layout aiming for uniform lighting when recording a walking subject
is a paradoxical situation, given the interaction of the walking subject with their
illumination. In an iterative procedure, the lighting was optimized so as to obtain the
best subject extraction and quality of edge data for the walkway and treadmill data.
For (Canny) edge extraction, Fig. 3.9(b) shows much better definition of the
walking subject, with many fewer background edges (note that these are static, and
can be much further reduced by thresholding). For subject extraction, with more and
better-positioned lighting the imagery of Fig. 3.9(d) was improved over that of Fig. 3.9(c),
where it can be seen that the background is much reduced, as are the effects of
shadows near the feet (consistent largely with the painting of the floor's surface).
The subject extraction also highlights problems with matte surfaces, in the skin and
cloth, which can cause these areas to be perceived as background (in grayscale
imagery). Inside the silhouette these can easily be removed by infilling; however, the
problems did motivate repositioning of the lighting systems to reduce their occurrence.
The three-chip Sony sensor system was found, as expected, to deliver imagery of
much better quality than the single-chip (Canon) systems. Chromakey extraction
was also evaluated, first to check the paint used for the flooring and the drape of the
background cloth. Later, the same procedures were used to monitor wear in the
track's surface and the continuance of the laboratory set-up.
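For illustration, the following is a minimal Python/numpy sketch of the kind of chroma-key masking described above; the green-dominance test, the thresholds, and the separate looser pass for the darker painted floor region are all illustrative assumptions, not the laboratory's actual two-pass procedure.

import numpy as np

def chromakey_mask(frame, dominance=1.3, floor_rows=None, floor_dominance=1.1):
    """Return a boolean foreground mask for an RGB frame (H x W x 3, uint8).

    Pixels whose green channel exceeds both red and blue by `dominance`
    are treated as chroma-key background. A looser threshold is applied
    to the (darker, painted) floor rows -- a crude stand-in for the
    two-pass procedure described in the text.
    """
    f = frame.astype(np.float32) + 1.0          # avoid division by zero
    r, g, b = f[..., 0], f[..., 1], f[..., 2]
    background = (g > dominance * r) & (g > dominance * b)
    if floor_rows is not None:
        lo, hi = floor_rows
        floor_bg = (g[lo:hi] > floor_dominance * r[lo:hi]) & \
                   (g[lo:hi] > floor_dominance * b[lo:hi])
        background[lo:hi] |= floor_bg
    return ~background                           # foreground = not background

# Example: extract a subject from a synthetic 480x640 green-screen frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[..., 1] = 200                              # green background
frame[100:400, 300:340] = (90, 60, 50)           # a crude "subject"
mask = chromakey_mask(frame, floor_rows=(400, 480))
print(mask.sum(), "foreground pixels")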

(a) evaluation data: edge extraction (b) recorded data: edge extraction


(c) evaluation data: subject extraction (d) recorded data: subject extraction
Figure 3.9 Data Analysis Procedures' Results
Outdoor Data Design Issues
The outdoor set-up was designed to allow for the same viewing geometry of a
subject walking with fronto-parallel and oblique view, again recorded by interlaced
and progressive scan camcorders. The outdoor data is designed to allow recognition
in real-world scenarios which suggests that real world issues should be
accommodated in the data. These included:
• occlusion by other subjects;
• interference from moving background objects;
• interference from static background objects;
• variation in illumination; and
• variation in shadows.
These are specified as zones of activity in Fig. 3.10. For illustrative purposes,
the zones are separate, but the tree could cast a shadow over much of the foreground
area and people other than the subject walked in planes normal or parallel to the
subjects, as well as along a passageway behind the cars parked behind the
foreground area. The recording took place in a zone of access to several buildings,
ensuring that the walking subjects are sometimes obscured by moving foreground
objects such as other people walking in a similar or different plane, or bicycles. There
was traffic on the road behind the subject, and this was sometimes stationary and
sometimes moving; there were pedestrians occasionally walking directly behind the
subjects. There was a bush to the left of the camera view and parked cars to the right
of the camera view, giving man-made and natural (textured) static background
objects. In some sequences, other subjects from the database were recorded walking
in the far background. The English weather certainly helped to satisfy the latter of
the main two requirements: the outdoor data was recorded over a month of late
spring when the weather varies considerably in the UK. Recording did not take
place when it was raining, but there is variation in illumination in that for some
sessions it was overcast and in others there was bright sunshine, causing wide
variation in shadows. The tree was observed to move quite freely on windy days,
giving a region where there was much movement of a highly textured object.
Finally, there is also a building in the rear background offering some possibility of
calibration. Examples of some of the frames from the data are given in Fig. 3.11. In
the stored sequences, a subject walks from one side to the other - there are
background sequences for each subject which are the view just before the subject
walked.

Area of walking subject / moving foreground objects

Figure 3.10 Outdoor Data: Zones of Activity


One of the main advantages of having outdoor data which is similar to the
indoor data is that performance on real-world outdoor data can be compared
with that on indoor laboratory data [116], confirming accuracy in extraction
and recognition. In the NIST/USF data, this is achieved by manual labeling. A
further advantage of the indoor/outdoor arrangement is that techniques can be
trained on indoor data and then deployed on outdoor data, as has already occurred in
an approach aimed to determine the walking human figure [156].

Figure 3.11 Example Frames of Southampton Outdoor Data

Acquisition Set-up Procedure


The database was acquired over 30 recording sessions. Naturally, it was sought to
acquire data with a constant arrangement of equipment. To this end, a set-up
procedure was completed for each session for the laboratory, Appendix 9.1.1, and
for each camera, to ensure that each camera's settings were identical, noting any
inconsistency, given in Appendix 9.1.3. The cameras were securely mounted on
tripods (set at the same height and with the camera base plate leveled) at a distance
of 4.5 m from the center of the walkway and of the treadmill, as indicated in Fig.
3.7. Cameras A and B were set in progressive scan mode. Cameras C and D
provided interlaced data with the plane of view at an angle of 20° to the normal
view.
The cameras had the image stabilizer, auto focus, digital zoom and automatic
shutter all inhibited; the white balance was set to "indoor". Unfortunately, the same
range of shutter speeds was not available on the two camera types so the shutter
speed was set to 1/250 and 1/300 sec for the progressive scan and interlaced
cameras, respectively. These speeds appeared suited to imaging limb movement
without blur when walking, with the light available. The camera was manually
focused on a subject standing in the center of view: using auto focus would have
made the cameras change their focus during gait-data acquisition. The exposure
level was set to be automatic outside and was fixed inside.
From experience, the outside illumination was not sufficiently constant for a
specific exposure level to be maintained (it took around 10 minutes to collect the
outside data for 5 subjects - dependent on traffic!). As such, it was found that one
could not manually preset an exposure level appropriate for the entire
session, though this would have been preferred. The full list of camera
settings used to set up each camera is given in Appendix 9.1.3.

Filming Issues
Human psychology plays a large part in collecting data of this type and magnitude.
Firstly, to avoid affecting the subject's walking patterns, the treadmill training and
filming took place after each subject had first walked outdoors, and then inside on
the laboratory track. No other people were in the laboratory, as this could distract
the subjects. There is debate in the use of treadmills for gait analysis concerning
their suitability, speed and inclination. The speed and inclination were set at
constant values derived by evaluation; it is worth noting, however, that treadmills
allow for capture of long, high-resolution, continuous gait sequences. Further
measures included not informing subjects when the cameras were filming (reducing
shyness by switching the cameras on before the subjects entered the laboratory);
not talking to subjects as they walked (as humans will invariably turn their head to
address the speaker); and using a talk-only radio station for background noise (to
reduce the human impulse to break the silence by talking). Removing the need for a
person to control the cameras reduced the camera-shyness and talking issues further.

Recording Procedure
Each of the recording sessions lasted at most 1 hour, this being the time available
for a DV tape and the outside cameras' batteries. The same procedure was used for
every subject in the large database. The database should avoid any subject
conditioning, especially of those unfamiliar with walking on a treadmill. For this
reason, subjects were recorded first walking on the outside and inside tracks and
finally on the treadmill. This was not an issue during acquisition of the small-
subject database since all subjects were by then well familiar with walking on a
treadmill. Those recording each session were primed to instruct the subjects in the
same way, Appendix 9.1.4. Each subject was first filmed walking along the outside
track, ensuring that at least 8 good sequences were recorded walking in both
directions. As this was outside data, cars could enter the background, or other
people could walk within the cameras' field of view, and sufficient sequences were
recorded to ensure that the effect of these could be mitigated in later analysis. After
sufficient sequences had been judged to be recorded, each subject then stood in
front of the cameras displaying the session and their unique subject ID. Subjects
were then filmed indoors walking on the track, again for at least 8 sequences in
either direction. Later examination of the data showed the prudence of recording
more data than was needed: presumably because he was unsupervised, one subject
actually left the track to inspect the laboratory and was recorded as such. Chairs
were provided to ensure that subjects not being filmed did not interfere with the
recording in progress. All subjects then spent at least 3 minutes walking on a
treadmill set to be the same as that used in the laboratory, but one with handles to
ease subject training. Subjects were then filmed walking on the laboratory
treadmill (with only a front handle so that the body was not obscured) for at least 3
minutes. The speed of the treadmill certainly caused concern, as subjects preferred
to walk at different speeds. Evaluation suggested that it was necessary to set the
treadmill to a fixed speed for the whole of the large database. This speed (4.1
km/h) was set to be the average speed selected by a group of 10 subjects at which
they found their walking to be most comfortable . The treadmill was inclined at 3° as
this was found to lead to a more natural walk. A mirror was placed in front of the
treadmill , seen in Fig. 3.8(b), again helping to improve balance for those unfamiliar
with such exercise and to prevent the subject from looking downwards at the
treadmill's controls (which were also obscured to lessen further the potential for
distraction).
After the completion of recording, each subject completed an information form
and a consent form, given in Appendices 9.1.5 and 9.1.6 respectively. The
information form aimed to record those factors known to influence gait and which
were not evident from the video information, including known injury or medication
and, in one case, fatigue. The consent form complies with current UK Data
Protection legislation. Intentionally, there was no linkage between the consent and
the information forms - the database is totally anonymous and a subject's identity
has never been linked with their data. After recording, the consent forms' order was
randomized to ensure this. After the acquisition process was completed, each
subject was given a book token in appreciation of their time and collaboration.

Ancillary Data
The ancillary data for each database comprises the information derived from
the subject information forms together with the camera set-up forms for each
session. The main difficulty associated with using treadmills to acquire natural walk
data for a large number of subjects is that many subjects are unfamiliar with
walking on a treadmill and that the inclination and speed need to be set so as to
enable natural walking, which would further lengthen any training time. The track
database therefore appears the most appropriate for evaluating the basic potential of
gait as a biometric; as such, it was given special consideration and is labeled
in depth.
The labeling format was XML, as specified within the Human ID program; an
example XML fragment associated with data in Database A is given in Table 3.4.
The primary labels on the track Database A allow evaluation to
use images where the whole of the subject is visible. First, the filename for each DV
sequence was arranged to record the camera, the filming session number, the subject
number, the subject's sequence number and the direction of the subject's walk.
Labels were derived for the frame before which the subject starts to enter the scene,
for the frame when the subject is first wholly within the camera's view, for the
frame where the subject is last wholly visible, and for the frame where the subject
has totally left the camera's view. The images between the start of the sequence and
the subject's entrance in the field of view, and between the subject's exit and the
end, are background data. The track Database A was further labeled so as to enable
evaluation of single cycle data. The most evident event to be labeled is heel-strike
so each of the subject's heel-strikes was labeled together with the respective foot.

(a) source image (b) moving-object extraction


Figure 3.12 Southampton Outdoor Gait Data
The sequences were first segmented automatically from the DV for the whole
recording session. Then, the labels were derived, again automatically for the
subject's entrance and exit. These were checked manually using purpose-developed
software to ensure that the correct labels had been derived. Finally, the heel-strikes
were first extracted automatically and then again checked manually for accuracy. As
such, sufficient labels were derived for automatic single-cycle gait data analysis. The
use of DV was a significant advantage here since, unlike analogue
systems, time is recorded with the image data in digital format. The potential for
drift within the different cameras' timing was checked prior to this analysis and
found to be negligible.
The labels derived for extracting the sequences, the initial segmentation stage,
were then used to extract sequences for the oblique camera, Database B.
Unfortunately, there was no explicit consistency between the labels derived from
the normal view (Database A) and the imagery in the oblique view. As the same
level of ground truth was not required for all databases, only Database A, this was
not investigated further. In this spirit, there is no heel-strike data for the treadmill or
the outside data, all four databases being stored as sequences of subject data (from
when the subject is wholly within the camera's view) and background data (when
the subject was not within the camera's field of view). The only extra label in the
outside data concerns change to the background, be it another human or vehicle
moving within the field of view.

<?xml version="1.0" ?>
<data camera="a" date="23-05-01" direction="left"
      filename="008a013s00L.dv" sequence="00L" session="008"
      start_frame="2369" start_time="0:01:34:19"
      stop_frame="2546" stop_time="0:01:41:21" subject_id="013"
      xmlcreatedby="Time Code re-writer"
      xmlcreatedon="Mon Mar 25 11:51:16 2002">
  <comments>
    No comment
  </comments>
  <heelstrikes automatic="yes" checked="yes">
    <strike leg="left" time="62" />
    <strike leg="right" time="77" />

Table 3.4 Example Fragment of XML for the Normal-View Camera
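Label files in this style can be read with standard tools. Below is a minimal Python sketch using xml.etree.ElementTree, assuming element and attribute names as they appear in the fragment of Table 3.4; the real files may differ in detail.

import xml.etree.ElementTree as ET

# A fragment in the style of Table 3.4 (names assumed from the table).
fragment = """<?xml version="1.0"?>
<data camera="a" direction="left" subject_id="013"
      start_frame="2369" stop_frame="2546">
  <comments>No comment</comments>
  <heelstrikes automatic="yes" checked="yes">
    <strike leg="left" time="62"/>
    <strike leg="right" time="77"/>
  </heelstrikes>
</data>"""

root = ET.fromstring(fragment)
subject = root.get("subject_id")
frames = (int(root.get("start_frame")), int(root.get("stop_frame")))
# Heel-strike labels delimit single gait cycles for analysis.
strikes = [(s.get("leg"), int(s.get("time")))
           for s in root.iter("strike")]
print(subject, frames, strikes)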

As with the other databases, we shall later show how the Southampton databases
have been used not only for recognition on both the studio and the more demanding
outdoor data, but also for data analysis to determine those parts that are potent for
recognition purposes, and to guide the migration of new vision techniques to human
movement analysis.

3.2.4 CASIA Database

In the CASIA database [82, 122], Panasonic NV-DX100EN digital cameras fixed
on a tripod captured gait sequences at a rate of 25 frames per second on two
different days in an outdoor environment. Here we assume that a single subject
moves in the field of view without occlusion. All subjects walked along a straight-
line path at free cadences in three different views with respect to the image plane,
namely fronto-parallel (0°), oblique (45°), and frontal (90°). The images were
recovered offline from video stored on DV tapes into a Microsoft AVI wrapper via
an IEEE 1394 interface, and finally transcoded using the Sthvcd2000 decoder into
24-bit full-color BMP files with a resolution of 352×240. The resulting CASIA gait
database includes 20 different subjects and 4 sequences per view per subject. The
database thus includes a total of 240 (20×4×3) sequences. The length of each image
sequence varies with the pace of the walker, but the average is about 90 frames. A
comparative performance analysis of several techniques on the CASIA database is
available [121].
Some sample images are shown in Fig. 3.13, where the arrowed line represents the
walking path.

Figure 3.13 Example Images from the CASIA Database: Lateral Views

Figure 3.14 Four Views from the First Maryland Dataset

3.2.5 UMD Database

The gait database [81] developed at the University of Maryland comprises two
datasets. The first dataset consists of video sequences of 25 individuals captured
using a Phillips G3 EnviroDome camera between the months of February and
May 2001. The dataset consists of subjects walking along four different paths: (i)
frontal view/walking toward, (ii) frontal view/walking away, (iii) fronto-parallel
view/toward left, and (iv) fronto-parallel view/toward right. Fig. 3.14 illustrates sample
images from the first dataset. The second dataset consists of video sequences of 55
individuals walking on a T-shaped path, captured using 2 Phillips G3 EnviroDome
cameras positioned orthogonal to one another. The second dataset was collected
between the months of June and July 2001. The walking sequences of each subject
were recorded in two sessions. Each camera had a spatial resolution of 640×480 and
was operated at 25 frames/second. The cameras were located at a height of 1.5
meters above the ground and the bounding boxes used for background subtraction
were 170×98 pixels in size. Apart from the image sequences, each dataset also has the
background-subtracted image sequences. Fig. 3.15 illustrates samples from the
second dataset.

Figure 3.15 (a) Sample image from the Second Maryland Dataset (b) T-shaped path
(c) Background subtracted image samples from each of the four segments

4 Early Recognition Approaches

4.1 Initial Objectives and Constraints

The earliest approaches concerned recognition within small populations, with
the volume of data limited largely by the computational resources then available.
As illustrated by Fig. 4.1, many sought to derive a human silhouette
from an image and, as is common in pattern recognition, then to derive a
description which can be associated with the identity of the subject. This gives a
'statistical' approach to automatic gait recognition wherein the image sequence is
described as a whole, and neither by a model- nor by a motion-based approach, but
one which describes the motion content. Given that techniques to separate a moving
object from its background have been developed over many years (especially in
surveillance), there existed a selection of approaches to derive the silhouette of the
moving subject; an example silhouette extraction from the image of Fig. 4.1(a) is
shown in Fig. 4.1(b). What was then needed was a means to process the
sequence of silhouettes so as to derive a gait signature - a set of numbers to
represent a subject's identity. A selection of signatures is shown in Fig. 4.1(c) which
are displayed in a 3-dimensional space for visualization purposes only (the signature
usually has more dimensions than this). These signatures represent ten walking
subjects with four instances of each subject. In this case the signatures derived from
the four instances mostly cluster well, representing a good recognition rate.
Accordingly, the target of the early approaches was to process video data of the
form of Fig. 4.1(a) to obtain silhouettes of the form of Fig. 4.1(b) so as to derive
signatures for recognition performance analysis.
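As a concrete illustration of this statistical pipeline, here is a minimal Python sketch, assuming a static background frame is available: it thresholds the frame difference to form a silhouette and stacks a sequence of silhouettes into feature vectors for later projection and classification. The threshold and synthetic data are illustrative only.

import numpy as np

def silhouette(frame, background, threshold=30):
    """Binary silhouette from a grayscale frame/background difference."""
    return (np.abs(frame.astype(np.int16)
                   - background.astype(np.int16)) > threshold)

def sequence_signature(frames, background):
    """Stack a sequence of silhouettes into one matrix, one row per
    frame, ready for a statistical projection such as PCA."""
    sils = [silhouette(f, background).ravel() for f in frames]
    return np.array(sils, dtype=np.float32)

# Tiny synthetic example: a bright blob drifting across a dark scene.
bg = np.zeros((64, 64), dtype=np.uint8)
frames = []
for t in range(8):
    f = bg.copy()
    f[20:44, 10 + 4 * t:20 + 4 * t] = 200   # moving "subject"
    frames.append(f)
X = sequence_signature(frames, bg)
print(X.shape)   # (8, 4096): 8 frames, 64*64 pixels each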


(a) video data (b) silhouette (c) feature space

Figure 4.1 Gait Recognition by Silhouette Analysis

4.2 Silhouette Based


In what was perhaps the earliest approach to automatic recognition by gait, the gait
signature was derived from the spatiotemporal pattern of a walking person [86].
Here, in the XYT dimensions (translation and time), the motions of the head and of
the legs have different patterns. These patterns were processed to determine the
body motion's bounding contours and then a five stick model was fitted. The gait
signature was derived by normalizing the fitted model for velocity and then by using
linear interpolation to derive normalized gait vectors . This was then applied to a
database of 26 sequences of five different subjects, taken at different times during
the day. Depending on the values used for the weighting factors in a Euclidean
distance metric, the correct classification rate varied from nearly 60% to just over
80%, a promising start indeed.
Later, optical flow was used to derive a gait signature [87, 88]. This did not aim
to use a model of a human walking, but more to describe features of an optical flow
distribution . The optical flow was filtered to produce a set of moving points together
with their flow values. The geometry of the set of points was then derived by using
a set of basic measures and further information was derived from the flow
information . Then, the periodic structure of the sequence was analyzed to show
several irregularities in the phase differences; measures including the difference in
phase between the centroid's vertical component and the phase of the weighted
points were used to derive a gait signature. Experimentation on the UCSD database
(Section 3.1.1) showed how people could be discriminated with these measures,
appearing to classify all subjects correctly, again an encouraging start for this new
biometric. Later, Boyd developed the notion of using the oscillations in intensity
produced by gait for recognition purposes [89]. This was achieved by using an array
of phase-locked loops, one per pixel, to synchronize the oscillators to the frequency
and phase of internal oscillations. By this, an image is constructed wherein each
point indicates the magnitude and phase of the oscillations in the original image and
Procrustes Analysis was used to compare vectors of the phasors that represent
image oscillations . At the time, the approach could discriminate between four
presented sequences and was later used for gait recognition (Chapter 5).
Another approach was aimed more at generic object-motion characterization
[90], using gait as an exemplar of their approach. The approach was similar in
function to spatiotemporal image correlation, but used the parametric eigenspace
approach to reduce computational requirements and to increase robustness . The
approach first derived body silhouettes by subtracting adjacent images. Then, the
images were projected into eigenspace, and eigenvalue decomposition was
performed where the order of the eigenvectors corresponds to frequency content.
Recognition from a database of 10 sequences of seven subjects showed
classification rates of 100% for 16 eigenvectors and 88% for eight, compared with
100% for the (more computationally demanding) spatiotemporal correlation
approach . Further, the approach appears robust to noise in the input images.
This approach was later extended to include Canonical Analysis (CA) with
better discriminatory capability [91, 92], and extended to analyze flow rather than
just silhouettes - to better effect [93, 94]. Eigenspace transformation (EST) based
on Principal Component Analysis (PCA) or Karhunen-Loeve Transform has been
demonstrated to be a potent metric in automatic face recognition and gait analysis,
but at that time without using data analysis to increase classification capability. An
extended approach combined canonical space transformation based on Canonical
Analysis (CA) or Linear Discriminant Analysis (LDA), with the eigenspace
transformation, for gait analysis [92]. Though face image representations based on
PCA have been used successfully for recognition, PCA based on the global
covariance matrix of the full set of image data is not sensitive to class structure in
the data. In order to increase the discriminatory power of various facial features,
LDA/CA optimizes the class separability of different face classes and improves the
classification performance. The features are obtained by maximizing between-class
and minimizing within-class variations. Unfortunately, this approach has high
computational cost. Moreover, the within-class covariance matrix obtained via CA
alone may be singular. Combining EST with canonical space transformation (CST)
reduces the data dimensionality and optimizes the class separability of different gait
sequences simultaneously.
Given $c$ training classes to be learnt, where each class represents a walking
sequence of a single subject, $x_{i,j}$ is the $j$-th image (of $n$ pixels) in class $i$ and $N_i$ is the
number of images in the $i$-th class. The total number of training images is
$$N_T = N_1 + N_2 + \cdots + N_c \qquad (4.1)$$
and the training set is represented by $[x_{1,1}, \ldots, x_{1,N_1}, \ldots, x_{c,1}, \ldots, x_{c,N_c}]$. First, the brightness
of each sample image is normalized by its norm,
$$\hat{x}_{i,j} = x_{i,j} \,/\, \lVert x_{i,j} \rVert \qquad (4.2)$$
After normalization, the mean pixel value for the full image set is
$$m_x = \frac{1}{N_T} \sum_{i=1}^{c} \sum_{j=1}^{N_i} \hat{x}_{i,j} \qquad (4.3)$$
Then we form an $n \times N_T$ matrix $X$ where each column is one of the $\hat{x}_{i,j}$ less
the mean:
$$X = [\hat{x}_{1,1} - m_x, \ldots, \hat{x}_{1,N_1} - m_x, \ldots, \hat{x}_{c,N_c} - m_x] \qquad (4.4)$$
EST uses the eigenvalues and eigenvectors, generated by the data covariance matrix
derived from the product $XX^T$, to rotate the original data co-ordinates along the
direction of maximum variance. Calculating the eigenvalues and eigenvectors of
the $n \times n$ matrix $XX^T$ is computationally intractable for typical image sizes. Based on
singular value decomposition, we can instead compute the eigenvalues of $X^T X$, whose
size is $N_T \times N_T$ and thus much smaller. The eigenvectors of $X^T X$ are used as an
orthogonal basis to span a new vector space. Each image can be projected to a
single point in this space. According to the theory of PCA, the image data can be
approximated by taking only the largest eigenvalues and their associated
eigenvectors. This partial set of $k$ eigenvectors spans an eigenspace in which the
points $y_{i,j}$ are the projections of the original images $\hat{x}_{i,j}$ by the eigenspace
transformation matrix $[e_1, \ldots, e_k]$, as
$$y_{i,j} = [e_1, \ldots, e_k]^T \hat{x}_{i,j} \qquad (4.5)$$
After this transformation, each original image can be approximated by a linear
combination of these eigenvectors.
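A minimal numpy sketch of this eigenspace transformation, using the small-matrix trick just described (eigen-decomposing $X^T X$ rather than $XX^T$); the variable names follow the text, but the implementation details are illustrative:

import numpy as np

def eigenspace_transform(X, k):
    """X: n x N_T matrix of mean-subtracted, normalized image columns.
    Returns the k leading eigenvectors (n x k) and the projections
    y_ij = [e_1 ... e_k]^T x_ij (k x N_T), as in Eq. (4.5).
    """
    n, NT = X.shape
    # Eigen-decompose the small N_T x N_T matrix X^T X instead of the
    # intractable n x n matrix X X^T (singular-value decomposition trick).
    vals, V = np.linalg.eigh(X.T @ X)
    order = np.argsort(vals)[::-1][:k]             # k largest eigenvalues
    vals, V = vals[order], V[:, order]
    E = X @ V / np.sqrt(np.maximum(vals, 1e-12))   # eigenvectors of X X^T
    return E, E.T @ X

# 40 images of 4096 pixels, projected onto a 10-dimensional eigenspace.
rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 40))
X -= X.mean(axis=1, keepdims=True)
E, Y = eigenspace_transform(X, k=10)
print(E.shape, Y.shape)   # (4096, 10) (10, 40)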
In CST, the classes of the transformed vectors resulting from the eigenspace
calculation are used to calculate a total scatter matrix $S_t$, a within-class scatter
matrix $S_w$ and a between-class scatter matrix $S_b$, which reflect the dispersion, the
variance and the variance of the difference, respectively. The objective of CST is to
minimize $S_w$ and to maximize $S_b$ simultaneously. This is achieved by maximizing
the generalized Fisher linear discriminant function $J$, where
$$J(W) = \frac{W^T S_b W}{W^T S_w W} \qquad (4.6)$$
The ratio of the variances is maximized by the selection of the feature $W$ if
$$\frac{\partial J}{\partial W} = 0 \qquad (4.7)$$
Supposing $W^*$ to be the optimal solution and $w_i^*$ to be its $i$-th column vector, which is
a generalized eigenvector corresponding to the $i$-th largest eigenvalue $\lambda_i$, then
$$S_b w_i^* = \lambda_i S_w w_i^* \qquad (4.8)$$
After the generalized eigenvalue equation is solved, we obtain a set of eigenvalues
and eigenvectors that span the canonical space, in which the classes are much better
separated and the clusters are much smaller, as required for recognition purposes.
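The canonical space transformation then reduces to the generalized eigenproblem of Eq. (4.8). A sketch using scipy.linalg.eigh follows, with the scatter matrices built from the eigenspace projections; the small ridge term guards against the singular $S_w$ noted above, and the synthetic data is illustrative:

import numpy as np
from scipy.linalg import eigh

def canonical_space_transform(Y, labels, dims):
    """Y: k x N matrix of eigenspace projections; labels: class of each
    column. Solves S_b w = lambda * S_w w (Eq. 4.8) and returns the
    `dims` most discriminative directions as columns of W (k x dims)."""
    k, N = Y.shape
    m = Y.mean(axis=1, keepdims=True)
    Sw = np.zeros((k, k))
    Sb = np.zeros((k, k))
    for c in np.unique(labels):
        Yc = Y[:, labels == c]
        mc = Yc.mean(axis=1, keepdims=True)
        Sw += (Yc - mc) @ (Yc - mc).T                # within-class scatter
        Sb += Yc.shape[1] * (mc - m) @ (mc - m).T    # between-class scatter
    Sw += 1e-6 * np.eye(k)                           # guard against singular S_w
    vals, W = eigh(Sb, Sw)                           # generalized eigenproblem
    return W[:, np.argsort(vals)[::-1][:dims]]

# Five "subjects", eight eigenspace points each, in a 10-D eigenspace.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(5), 8)
Y = rng.standard_normal((10, 40)) + 3 * rng.standard_normal((10, 5))[:, labels]
W = canonical_space_transform(Y, labels, dims=4)
Z = W.T @ Y                                          # canonical-space points
print(Z.shape)                                       # (4, 40)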

(a) original (b) silhouette n (c) silhouette n+1

(d) 1st eigenvalue (e) 2nd eigenvalue (f) 3rd eigenvalue (g) 4th eigenvalue
Figure 4.2 Original Image, Derived Silhouettes and Eigenvalues [92]
In application, this analysis was applied to human silhouettes derived by
subtracting the background from the image and then thresholding the result [92].
Fig. 4.2(a) shows an image from the original sequence and Figs. 4.2(b) and (c) show
two of the extracted silhouettes. The eigenvalues were then extracted from the
sequence of silhouettes, the first four of which are shown in Fig. 4.2(d-g). The
trajectories in eigenspace overlap and their centroids are very close together. After
CST the trajectories are much better separated and with lower individual variance,
as in Fig. 4.3 (shown for three dimensions - the separation in 3D is much clearer
when the view is rotated). Recognition from the canonical space was accomplished
using the distance from the accumulated center to each centroid. On five sequences
of five people from the early San Diego dataset an 85% classification rate was
achieved by CST alone, whereas 100% was achieved with combined EST and CST
(as evidenced by the cluster size and separation in Fig. 4.3).


Figure 4.3 Canonical Space Trajectories for Five Subjects [92]

4.3 Model Based


An alternative approach to collecting the motion information in an image sequence
is to find feature(s) and collect their motion information. Here we show how gait
signatures were derived from spectra of the variation in inclination of the thigh, as
extracted by computer vision techniques [96]. The model of the thigh concerns the
movement of the human limb. This can be measured, as illustrated in Fig. 4.4, as the
relative angle between the thighs, or the inclination of a thigh with respect to the
vertical (in the form of variation shown in Fig. 2.3).

Figure 4.4 Modeling the Thigh's Motion


The bi-pendular model was used as the leg motion is periodic and each part of
the leg (upper and lower) appears to have pendulum-like motion. Fourier theory
allows periodic signals to be represented as a fundamental and harmonics - the gait
motion of the lower limbs can be described in such a way. The model of legs for
gait motion allows these rotation patterns to be treated as periodic signals so Fourier
Transform techniques can be used to obtain a frequency spectrum. The spectra of
different subjects were then compared for distinctive, or unique, characteristics.
The video sequences were averaged to reduce high frequency noise and edge
images were produced by applying the Canny operator with hysteresis thresholding.
The Hough Transform (HT) was then applied to the edge image resulting in an
accumulator space that had several maxima, each corresponding to a line in the edge
image. A peak detection algorithm was applied to extract the parameters of each of
these lines (in x and y co-ordinates) using the standard foot-of-normal form
$$s = x\cos\phi + y\sin\phi \qquad (4.9)$$
where $s$ and $\phi$ are the distance and angle to the foot of normal. There are several
methods for peak detection. In back-mapping, the peak in the accumulator at
$(s_{pk}, \phi_{pk})$ is found. For each edge point in the image which lies on the line
represented by $(s_{pk}, \phi_{pk})$, the points in the accumulator associated with that edge
point were decremented. This effectively removed the votes cast by the line
$(s_{pk}, \phi_{pk})$, and so the peak was reduced. This process was repeated until the
shown in Fig. 4.2(b). Gaps in the data occur when the legs cross where it is difficult
to discriminate between the legs and there was also some high frequency noise on
the data. To in fill for missing data, and to smooth noisy components, the thigh
angles given by the lines' inclinations were fitted to a high order polynomial by
least squares. An eighth-order polynomial describing the variation of the thigh
angle $\theta$ with time $t$ is
$$\theta(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_8 t^8 \qquad (4.10)$$
and for $N$ points $\theta_i$ the coefficients follow by least squares from the normal equations
$$\begin{bmatrix} N & \sum_{i=1}^{N} t_i & \cdots & \sum_{i=1}^{N} t_i^{8} \\ \sum_{i=1}^{N} t_i & \sum_{i=1}^{N} t_i^{2} & \cdots & \sum_{i=1}^{N} t_i^{9} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{N} t_i^{8} & \sum_{i=1}^{N} t_i^{9} & \cdots & \sum_{i=1}^{N} t_i^{16} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_8 \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N} \theta_i \\ \sum_{i=1}^{N} \theta_i t_i \\ \vdots \\ \sum_{i=1}^{N} \theta_i t_i^{8} \end{bmatrix} \qquad (4.11)$$

An example least squares fit for four sequences of single cycles of a particular
subject is shown in Fig. 4.5. These fit nicely within Murray's data, Fig. 2.3.

Figure 4.5 Least Squares Fitting [96]
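In practice the normal equations of Eq. (4.11) need not be assembled by hand: numpy's polyfit solves the same least-squares problem. A sketch with synthetic thigh-angle data, including a simulated legs-crossing gap (all values illustrative):

import numpy as np

# Synthetic thigh-angle samples over one gait cycle (degrees), with a
# gap where the legs cross and some high-frequency noise.
t = np.linspace(0.0, 1.0, 60)
theta = 20 * np.sin(2 * np.pi * t) + 5 * np.sin(4 * np.pi * t)
theta += np.random.default_rng(2).normal(0, 1.0, t.size)
keep = (t < 0.45) | (t > 0.55)                 # simulate legs-crossing gap

# Eighth-order least-squares fit, as in Eq. (4.10)/(4.11); evaluating
# the polynomial in-fills the missing samples and smooths the noise.
coeffs = np.polyfit(t[keep], theta[keep], deg=8)
theta_fit = np.polyval(coeffs, t)
print(np.round(theta_fit[28:32], 1))           # values across the gap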


These data were then analyzed via the DFT to provide phase and magnitude
spectra. The magnitude and phase spectrum for one walking cycle of subjects 2 and
5 are shown in Fig. 4.6. The magnitude spectrum drops to near-zero above the fifth
harmonic, again agreeing with earlier work [15]. The magnitude spectra for the two
subjects can be used to distinguish between them. The phase spectra appear to differ
rather more, though some components carry little information since their respective
magnitude component is very small.

(a) magnitude spectrum for subject 2 (b) phase spectrum for subject 2
(c) magnitude spectrum for subject 5 (d) phase spectrum for subject 5
Figure 4.6 Phase and Magnitude Gait Spectra [96]
The k-nearest neighbor rule was then used to classify the transform data using
the 'leave one out' rule, for k = 3 and for k = 1. In the early Southampton data, four
video sequences were acquired for each of ten subjects. The correct classification
rates (CCR) are summarized in Table 4.1 which gives analysis for classification by
magnitude spectra alone, and for multiplying the magnitude spectra by the phase,
both for differing values of k. Note that the magnitude component of the FT is time
shift invariant; it will retain its spectral envelope regardless of where in time the FT
is performed. The phase component does not share this characteristic, and a time
shift in the signal will change the shape of the phase envelope. Accordingly, the
rotation patterns were aligned to start at the same point, to allow phase comparison .
This is because the magnitude plots do not confer discriminatory ability whereas the
phase plots appear to do so. The multiplication appears reasonable, since gait is not
characterized by extent of flexion alone, but is controlled by musculature which in
turn controls the way the limbs move. Accordingly, there is physical constraint on
the way we move our limbs. However, we cannot use phase alone, since some of the
phase components occur at frequencies for which the magnitude component is too
low to be of consequence . By multiplication of the spectra, we retain the phase for
significant magnitude components. Clearly, in this analysis, using phase-weighted
magnitude spectra gives a much better classification rate (90%) than use of
magnitude spectra alone (40%), for k = 3. Selecting the nearest neighbor, as
opposed to the 3-nearest neighbor , reduced the classification capability, as expected.

        Magnitude spectra   Phase-weighted magnitude spectra
k = 3   40%                 90%

Table 4.1 Overall Classification Performance [96]
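A sketch of how such a phase-weighted magnitude signature and k-nearest-neighbor classification might be realized; the alignment of each cycle to a common start point is assumed to have been done already, as the text requires for phase comparison, and the synthetic cycles are illustrative:

import numpy as np

def gait_signature(theta, harmonics=5, phase_weighted=True):
    """Fourier signature of one aligned thigh-rotation cycle. Keeps the
    first few harmonics (the spectrum is near zero above the fifth);
    optionally weights magnitude by phase."""
    F = np.fft.rfft(theta)[1:harmonics + 1]
    mag, phase = np.abs(F), np.angle(F)
    return mag * phase if phase_weighted else mag

def knn_classify(probe, gallery, labels, k=3):
    """k-nearest-neighbor rule under Euclidean distance."""
    d = np.linalg.norm(gallery - probe, axis=1)
    votes = labels[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

# Two synthetic subjects, two cycles each, differing slightly in style.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
cycles = [20*np.sin(t) + 4*np.sin(2*t), 20*np.sin(t) + 4.2*np.sin(2*t),
          18*np.sin(t) + 7*np.sin(2*t), 18*np.sin(t) + 6.8*np.sin(2*t)]
gallery = np.array([gait_signature(c) for c in cycles[:3]])
labels = np.array([0, 0, 1])
print(knn_classify(gait_signature(cycles[3]), gallery, labels, k=1))  # 1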


Figure 4.7 Extraction Result by Evidence Gathering of Moving Line


As such, feature based metrics can be used to provide gait signatures in a way
which agrees with human insight, allowing the results to be validated visually. Later
the approach was extended to use a complete evidence gathering formulation [97,
98] so as to accumulate evidence of the gait signature directly from the (processed)
video data. The evidence was gathered essentially for the moving thigh which was
modeled as an oscillating line. The vertical displacement of the hip, $c_y(t)$, is
computed from the average vertical velocity $V_y$ at time $t$ as
$$c_y(t) = V_y\, t \qquad (4.12)$$

where $V_y$ essentially reflects any slope in the walking surface. The horizontal
component is modeled as
$$c_x(t) = V_x\, t + \frac{\alpha}{\omega_0}\sin(\omega_0 t) + \frac{\beta}{\omega_0}\cos(\omega_0 t) \qquad (4.13)$$
where $V_x$ describes the average velocity and $\alpha$, $\beta$ and $\omega_0$ describe the amplitude
and phase of the hip's oscillation. These are used to derive the points on a
perambulating line (which then models the thigh) from which the Fourier signature
was derived automatically. The extraction by these means is shown in Fig. 4.7. The
recognition performance was evaluated on the early Southampton data, this time for
ten subjects. The recognition performance was similar to that shown for its
predecessor, non-automated approach in Table 4.1. The confusion matrix for the
recognition performance on ten subjects is shown in Fig. 4.8. Here, darkness
reflects closeness of matching and lightness represents disparity and the 90%
recognition performance is reflected in the dark diagonal where the signature for
each subject matched well and the light areas outside of the diagonal show a poor
match to the signatures of the other subjects. Fig. 4.8(a) shows the match for the
Fourier magnitude alone and Fig. 4.8(b) shows the match for the phase-weighted
Fourier magnitude; this better performance by the phase-weighted Fourier
magnitude is reflected in the greater difference between its diagonal and the
remainder of the matrix.


(a) Fourier magnitude (b) phase-weighted Fourier magnitude


Figure 4.8 Confusion Matrices by Evidence Gathering of Moving Line [98]

We shall see later how this approach can be translated to model the whole leg, and
can also be used as a feature extraction stage to model running as well as walking.
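To make the motion model of Eqs. (4.12) and (4.13) concrete, a small sketch follows, generating the hip trajectory that seeds the perambulating-line model; the parameter values are illustrative, not those used in the published work:

import numpy as np

def hip_trajectory(t, Vx=1.4, Vy=0.0, alpha=0.02, beta=0.01, omega0=2*np.pi):
    """Hip center over time t (seconds), per Eqs. (4.12)-(4.13):
    linear drift at the average velocity plus a sinusoidal oscillation
    whose amplitude and phase are set by alpha, beta and omega0."""
    cy = Vy * t                                           # Eq. (4.12)
    cx = Vx * t + (alpha / omega0) * np.sin(omega0 * t) \
               + (beta / omega0) * np.cos(omega0 * t)     # Eq. (4.13)
    return cx, cy

t = np.linspace(0.0, 2.0, 100)        # two seconds of walking
cx, cy = hip_trajectory(t)
# Thigh-line endpoints then follow from the hip position and the
# inclination given by the Fourier rotation model of Section 4.3.
print(cx[:3], cy[:3])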

5 Silhouette-Based Approaches

5.1 Overview

Of the current approaches, most are based on analysis of silhouettes. This is like
the earlier analysis which concerned abstracting a walking subject from the
background and then deriving a set of measurements that describe the shape and
motion of that silhouette in a sequence of images. This is similar in approach to the
classic computer vision approach of recognizing objects by using shape descriptions
of objects derived, say, by optimal thresholding.

Sequences of images of walking subjects

Model-free analysis
  Moving shape: unwrapped silhouette [121]; silhouette similarity [77];
  relational statistics [78]; self similarity [101]; frieze patterns [111];
  key frame analysis [110]; area [112]; symmetry [137]; point distribution
  models [120]; key poses [126]
  Shape + motion: eigenspace sequences [122]; hidden Markov model [100, 99];
  average silhouette [105]; moments [114]; ellipsoidal fits [119]; kinematic
  features [124]; gait style and content [127]; higher order correlation [128];
  video oscillations [130]

Model-based analysis
  Structural: stride parameters [102]; human parameters [106]; joint
  trajectories [117]

Figure 5.1 Current Approaches to Automatic Gait Recognition


The current approaches are usually generic to moving object description and the
analysis, as shown in Fig. 5.1, can broadly be split into model-based approaches
which use information about the gait, determined either by known structure or by
modeling, and those which are model-free, which can be split into those based on a
moving shape and those which integrate shape and motion within the description.
One advantage of model-based methods is immunity to the effects of clothing and to
slight change in viewpoint, but usually at the cost of greater computational
complexity. This classification is not as hard as it may appear since a description of
a moving shape naturally lends itself to the inclusion of motion within the
description.
Prime examples of model-free approaches are two from UMD: Kale et al.'s and
Sundaresan et al.'s deployments of hidden Markov models (HMMs) [100, 99],
which consider two different image features: the width of the outer contour of a
binarized silhouette and the entire binary silhouette itself. These will be considered
in more detail, next.
BenAbdelkader et al.'s approach, also from UMD, uses self similarity and
structural stride parameters (stride and cadence) [101]; PCA was applied to self-
similarity plots derived by differencing. The self-similarity matrix encodes
the frequency and phase of gait and thus preserves the dynamics of gait.
Classification was performed by k-nearest neighbor and evaluated on the UCSD
data showing that recognition could be achieved. An extended analysis confirmed
performance on the Mobo and UMD data [102].
Sarkar et al.'s baseline approach from NIST/USF performs recognition by
temporal correlation of silhouettes [77]. The aim was to develop a technique against
which future performance could be established. This was achieved by semi-
automatic definition of a bounding box from which a silhouette was matched. Then,
the gait period was estimated for use in partitioning sequences for temporal
classification. The approach was then evaluated on the Mobo data and on the
NIST/USF data, in comparison with a selection of other approaches. This allowed
determination of the covariates with the greatest impact, which in one test turned
out to be the surface on which subjects walked and the time interval between data
recording. The influence of the surface might be that the probe was selected as grass
which is rather more unforgiving, and possibly more uneven, than the concrete
surface.
Vega et al. use the change in the relational statistics among the detected image
features (which can handle running too) [78] and which removes the need for object
models, perfect segmentation, or part-level tracking. The relational statistics are
modeled using the probability that a random group of features in an image would
exhibit a particular relation. These distributions are represented in a space of
probability functions, where the Euclidean distance is related to the Bhattacharya
distance between probability functions. Different motion types sweep out different
traces in this space. As with other approaches, this is generic to motion analysis and
is used to recognize by gait with evaluation on an early version of the NIST/USF
data with high confidence, especially with respect to change in viewpoint. Liu et al.
have also developed the average silhouette [105], which is perhaps the simplest
recognition feature; it had also been used by Veres et al. [147] in a study of
potency, and by Han et al. [125], who called it the gait energy image.
Recognition used the Euclidean distance between the averaged silhouette
representations and the technique was shown to be considerably faster than the
baseline algorithm. Experiments with portions of the average silhouette
representation showed that recognition power is not entirely derived from upper
body shape; rather, the dynamics of the legs contribute equally to recognition.

As will be considered later in the analysis of potency, this raises a feature selection
problem: by which features can one, or should one, recognize gait?
Collins et al. from CMU used key frame analysis for sequence matching [110]
with innate viewpoint dependence. The key frames were compared to training
frames using normalized correlation, and subject classification was performed by
nearest neighbor matching among correlation scores. The approach implicitly
captures biometric shape cues such as body height, width, and body-part
proportions, as well as gait cues such as stride length and amount of arm swing. The
approach was evaluated on the Mobo dataset, and on early versions of the UMD and
Southampton and MIT databases. The technique showed excellent performance
across databases and across the gaits covered by the Mobo databases. In another
approach from CMU, Liu et al. used "frieze patterns" [111] derived from image
sequences by compressing images into a concatenated pattern, with some similarity
to the earliest approach [86]. The approach also gave facility for viewpoint
correction and was deployed to good effect on the Mobo database. Later, Tolliver et
al. were to show [151] that people could be recognized by shape with especial
consideration of noisy silhouettes. The technique used a variance-weighted
similarity metric to induce clusters that cover different stages in the gait cycle.
Results were evaluated on the NIST/USF Gait Challenge data and suggested that
gait shape is effective when the subject and test sets are acquired under
similar conditions.
The University of Southampton's newer approaches range from a baseline-type
approach, Foster et al.'s technique measuring area [112], to extensions of object
description techniques, including symmetry by Hayfron-Acquah et al. [137]
(with some justification from psychology studies [53]) and Shutler et al.'s statistical
moments [114]. These will be considered in more detail later.
Lee et al. from the Massachusetts Institute of Technology (MIT) used ellipsoidal
fits to human silhouettes [119]. The gait appearance feature vector comprises
parameters of moment features in image regions derived from silhouettes of a
walking person aggregated over time either by averaging or spectral analysis. Each
silhouette is divided into seven parts after centroid determination: the head/shoulder
region; the front of the torso; the back of the torso; the front thigh; the back thigh;
the front calf and foot; and the back calf and foot. Evaluation was performed on the
MIT and the Mobo data, also with consideration of gender classification (which was
achieved) and of potency for gender classification (for which the thigh orientation
was ranked as most potent).
Curtin developed a modified form of a Point Distribution Model (PDM) [120]
which included time to distinguish the movements of a walking and a running
subject. The gait was classified by the movement of points on the 2D shape.
Evaluation on a small number of subjects recorded walking and running on a
treadmill showed that the temporal PDM could be used for recognition and for
discrimination between walking and running, with greatest difficulty experienced in
discriminating between a fast walk and running.
The CAS Institute of Automation (CASIA) developed an eigenspace
transformation of an unwrapped human silhouette [121] and eigenspace
transformation of distance signals derived from sequences of silhouettes [122].
Again, these will be considered in more detail later in this chapter.
Bhanu et al. from University of California, Riverside, used kinematic and
stationary features [124] by estimating 3D walking parameters by fitting a 3D

kinematic model to 2D silhouettes. Shape and structure were extracted separately and
then combined for recognition. The stationary features include body part length and
flexion, estimated from key frames. Combination was by formulations
including the sum and product rules, of which the former performed best,
evaluated on a proprietary database. Han et al. [125] were also to use the Gait
Energy Image formed by averaging silhouettes and then deployed PCA and multiple
discriminant analysis to learn features for fusion. A feature fusion strategy was
applied at the decision level to improve recognition performance, increasing
capability beyond that of individual approaches. These were deployed to good effect
on the Gait Challenge data, with performance generally exceeding the baseline
algorithm.
Of the more recent approaches, Zhao et al. [126] used the mean amplitude of key
poses, evaluated on the NIST/USF gait challenge data, to achieve recognition. Lee
et al. separate gait into style and content by generating temporally aligned gait
sequences via local linear embedding with separation by a bilinear model and
achieved good performance on the gait challenge database [127]. Kobayashi et al.
used higher order correlation to extract motion correlation as a feature of human
movement, classifying running and walking as well as subject identification [128].
As mentioned previously, Boyd's more recent approach straddles model-based
and model-free approaches by synchronizing the oscillations of pixel intensity with
those of arrays of phase-locked loops [130], with patterns analyzed by Procrustes
analysis and directional statistics, and evaluated on the Carnegie Mellon Mobo and
Southampton databases.
As mentioned in Section 2.5, there is other work on identifying perambulatory
subjects, but few of these approaches use gait-directed analysis. On the analysis of walking
movements in the context of gait recognition, Hild [131] estimates the 3D motion
trajectory with further estimation of leg frequency from the area between the legs.
Davis used basic stride parameters with one view-based constraint to identify basic
walking movements from a selection of other motions [133].
These show promise for approaches that impose low computational and storage
cost, together with deployment and development of new computer vision techniques
for analysis of objects moving through image sequences. We adumbrate the basis
for these approaches and show how they can be used to recognize subjects by the
way they walk, together with their advantages (especially over earlier techniques).
Of particular interest are the emerging studies of potency, which determine which
parts of the human silhouette (or which measurements) are the most important for
recognition purposes. This parallels emergent research in other biometrics,
especially in face recognition.

5.2 Extending Shape Description to Moving Shapes

The main limitation on using PCA/LDA to analyze a sequence of binary silhouettes
was that the order of the images in the sequence could be changed, but the same
result would ensue. This suggests that including order is essential in the gait
signature. One Southampton approach determined the change in area through the
sequence, and the change in area of selected parts of the sequence was used to
characterize each subject. Though this is computationally very attractive, the

approach does not separate body shape and body dynamics. This was achieved by
extending standard shape description techniques to include motion, here concerning
statistical moments and discrete symmetry. Moments are a suitable candidate for
extension as they are a standard approach to shape description in computer vision,
with a concise analytical framework and with access to the scale of a shape's
description. Symmetry was chosen as it has a direct link to psychological
observations of gait, namely that gait concerns a synchronous pattern of
symmetrical movement. These three approaches from the University of
Southampton have been applied to gait data from a variety of sources, though
naturally concentrating on the Southampton data, and each has demonstrated
capability for recognition.

5.2.1 Area Masks

The area masks derive a measure of the change in area of a selected region of a
silhouette [112, 134]. Examples of area masks are shown in Fig. 5.2, and this is
not unlike the current use of Haar wavelets in face image analysis. Here, a
horizontal line (Fig. 5.2(a)) isolates those parts in the region of the waist, whereas a
vertical one, Fig. 5.2(b), selects the thorax and those parts of the legs intersecting
with the vertical window. Many alternatives are possible: Fig. 5.2(c) simply
analyzes the legs of the subject (the bottom half of the image), while Fig. 5.2(d)
merely measures the entire area change of the image.

a) horizontal line b) vertical line c) bottom half d) full
Figure 5.2 Sample Area Masks
Each mask m_j represents an area mask as a binary vector, with 1 representing the
white parts of the image. For each mask, we obtain a signature S_j by determining
the area that is congruent between the mask and the images x, thus

S_j(t) = m_j^T x(t)    (5.1)

An area history is obtained by applying each mask to each image in the sequence,
to derive a time history of the number of points that are congruent. Note that all
the masks used are left-right symmetric, and therefore all information
corresponding to the asymmetric part of gait is lost. However, this also gives
invariance to the direction of walk: any image within a sequence could be
mirrored and still give the same gait signature. An asymmetric mask might
give additional information, but all walkers would need to be imaged walking in the
same direction.
Examples of the signatures S_j(t) for three different masks are shown in Fig. 5.3
(the full area mask gives a similar output, but is not shown as the details are harder
to see when plotted on a scale showing all four functions). As can be seen, the
dynamic signature is intimately related to gait. The peaks in the graph represent

when the selected area of the subject is at a maximum (i.e. when the legs are furthest
apart) and the dips represent when the selected area is at a minimum (i.e. when the
legs are closest together). The bottom half mask selects the lower portion of the
image so has a much greater number of points than the vertical line, but the shape is
naturally very similar. The horizontal line mask selects those points near the region
of the waist which varies less with thorax motion than the associated movement of
the subject's arms. Clearly, the masks emphasize bilateral symmetry: one leg swings
whilst the other is in stance after which the limbs swap in function . Limping is an
inherently asymmetrical gait, and this would be evidenced in a clear disparity in
measured area between the two halves of the gait cycle.

[Plot: area signatures S_j(t) against time (frame number) for the vertical line,
horizontal line and bottom half masks, from the same subject.]
Figure 5.3 Sample Output of Different Area Masks from the Same Subject
The sequences are aligned for comparison since they start at different points in
the gait cycle. This is achieved by using minima as a starting point. To eliminate
variations due to the sampling time and speed at which the subject walks, the part of
the sequence corresponding to the full walking cycle is resampled using cubic
splines to interpolate between the observations S_j(t). Thirty evenly spaced samples
are taken from the cubic spline, giving a 30 element vector for each area mask used.
Note that this does lose information about the subject's speed.
Each mask yields a vector describing the dynamics of area change within that
mask. Since there are many degrees of freedom in a walking subject, there could be
a considerable degree of independent information from different masks. Therefore,
the information from multiple masks was combined to provide a more complete
dynamics signature . The simplest way to achieve this is by concatenating the 30
element vector from each area mask to form a single long vector V of size n x 30,
where n denotes the number of area masks used. A different vector is derived for
each sequence of images, and these are in fact the feature vectors used for
recognition. Note that the vectors still contain some information about the static
shape of the silhouette. This information is contained in the average value of the
sequence S_j(t). This information can be removed by subtracting the average of the
sequence, although this reduces the recognition performance.
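As a concrete illustration, the following Python sketch computes the mask signature of Eqn. 5.1 and resamples one gait cycle to 30 evenly spaced points with a cubic spline; the function names and the use of SciPy are our own assumptions, not the original implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def mask_signature(silhouettes, mask):
    """Eqn. 5.1: S_j(t) = m_j^T x(t), i.e. the count of silhouette pixels
    that fall inside the binary area mask, for each frame of the sequence."""
    return np.array([np.count_nonzero(np.logical_and(frame, mask))
                     for frame in silhouettes])

def resample_cycle(signature, n_samples=30):
    """Resample one gait cycle of a signature to n_samples evenly spaced
    points using a cubic spline, removing dependence on walking speed."""
    t = np.arange(len(signature))
    spline = CubicSpline(t, signature.astype(float))
    return spline(np.linspace(t[0], t[-1], n_samples))

# Hypothetical usage: `silhouettes` is a list of binary HxW arrays and
# `masks` a list of binary HxW area masks; the per-mask 30-element vectors
# are concatenated into the n x 30 feature vector described above.
# feature_vector = np.concatenate(
#     [resample_cycle(mask_signature(silhouettes, m)) for m in masks])
```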

For each walking sequence, an n × 30 element vector V is obtained that provides
a dynamic signature describing the gait. Canonical analysis was used to select a low
dimensional subspace where the differences between classes are maximized and the
differences within classes are minimized. The centroids of each class were calculated
in this new space and a k-nearest neighbor classifier was used to decide which
subject a sequence of images belongs to.
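Canonical analysis is closely related to linear discriminant analysis, so a modern equivalent of this classification stage can be sketched with scikit-learn; this is a stand-in for the original implementation, not a reproduction of it.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# X: (n_sequences, n_masks * 30) concatenated area-mask vectors; y: labels.
def train_classifier(X, y, k=3):
    lda = LinearDiscriminantAnalysis()         # canonical (discriminant) subspace
    Z = lda.fit_transform(X, y)                # maximize between- over within-class scatter
    knn = KNeighborsClassifier(n_neighbors=k)  # k-nearest neighbor in the subspace
    knn.fit(Z, y)
    return lda, knn

def classify(lda, knn, X_test):
    return knn.predict(lda.transform(X_test))
```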

5.2.2 Gait Symmetry

The discrete symmetry operator is one of the most widely-deployed symmetry
operators in computer vision and uses edge maps of images from the sequences of
subject silhouettes to assign symmetry magnitude and orientation to image points,
accumulated at the midpoint of each analyzed pair of points. The total symmetry
magnitude (also known as isotropic symmetry), M(P), of each point P is the sum of
the contributions of image points that have P as their midpoint, that is

M(P) = Σ_{(P_i, P_j)} C(P_i, P_j)    (5.2)

where P_i and P_j are pairs of points having P as their midpoint and C(P_i, P_j) is
their symmetry contribution. The symmetry relation, or contribution, C(P_i, P_j)
between two points P_i and P_j is:

C(P_i, P_j) = D_{i,j} · Ph_{i,j} · I_i · I_j    (5.3)

where D_{i,j} and Ph_{i,j} (see Eqns. 5.4 and 5.6 respectively) are the distance and
phase weightings between the two points, and I_i and I_j are the logarithmic
intensities at the two points. The symmetry distance weighting function gives the
distance weighting between two different points P_i and P_j, and is calculated as:

D_{i,j} = (1/(√(2π)·σ)) · exp(−‖P_i − P_j‖ / (2σ)),  ∀ i ≠ j    (5.4)

where σ controls the scope of the function. The logarithmic intensity function, I_i, of
the edge magnitude M_i at point (x, y) is

I_i = log(1 + M_i)    (5.5)

where M_i = √(M_x²(x, y) + M_y²(x, y)), and M_x and M_y are derived by the
application of (Sobel) edge templates. The phase weighting function between two
points P_i and P_j is:

Ph_{i,j} = (1 − cos(θ_i + θ_j − 2α_{i,j})) · (1 − cos(θ_i − θ_j)),  ∀ i ≠ j    (5.6)

where α_{i,j} = atan((y_i − y_j)/(x_i − x_j)) is the angle between the line joining the two
points and the horizontal. From Eqn. 5.6, the phase weighting function has two factors
which make it possible to achieve the same measure under different object reflectance
and lighting conditions. In the distance weighting function, each value of σ implies a
different scale, thus making it suited to multi-resolution schemes. A large value of σ
implies large-scale symmetry that gives distant points similar weighting to close
points. Alternatively, a small value of σ implies local operation and local symmetry.
Increasing the value of σ increases the weighting given to the more distant points
without decreasing the influence of the close points. A comparison with the
Gaussian-like functional showed that the mean of the distribution locates the
function on the mean value of the sample. A focus, μ, was therefore introduced into
the distance weighting function to control its focusing capability, hence further
improving the scaling possibilities of the symmetry distance function. The resulting
function is called the focus weighting function, FWF. This replaces Eqn. 5.4 as
follows:
FWF_{i,j} = (1/(√(2π)·σ)) · exp(−(‖P_i − P_j‖ − μ) / (2σ)),  ∀ i ≠ j    (5.7)

The addition of the focus into the distance weighting function moves the attention
of the symmetry operator from points close together to a selected distance.
The symmetry contribution obtained is then plotted at the midpoint of the two
points. The symmetry transform as discussed here detects reflectional symmetry that
is invariant under 2D rotation and translation transformations and under change in
scale, and as such has potential advantage in automatic gait recognition.
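A minimal brute-force sketch of the operator defined by Eqns. 5.2 to 5.6 is given below, assuming Sobel edge maps; the threshold and σ values are illustrative, gradient directions stand in for edge directions, and the O(n²) pairwise loop is written for clarity rather than speed.

```python
import numpy as np
from scipy.ndimage import sobel

def symmetry_map(image, sigma=5.0, threshold=20.0):
    """A minimal sketch of the discrete symmetry operator (Eqns. 5.2-5.6).

    Each pair of strong edge points contributes
    C(Pi, Pj) = D * Ph * Ii * Ij at its midpoint (Eqns. 5.2-5.3).
    """
    gx = sobel(image.astype(float), axis=1)
    gy = sobel(image.astype(float), axis=0)
    mag = np.hypot(gx, gy)
    ys, xs = np.nonzero(mag > threshold)        # keep strong edges only
    theta = np.arctan2(gy[ys, xs], gx[ys, xs])  # edge (gradient) directions
    logI = np.log1p(mag[ys, xs])                # Eqn. 5.5

    out = np.zeros_like(mag)
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = xs[j] - xs[i], ys[j] - ys[i]
            dist = np.hypot(dx, dy)
            D = np.exp(-dist / (2.0 * sigma)) / (np.sqrt(2 * np.pi) * sigma)  # Eqn. 5.4
            alpha = np.arctan2(dy, dx)          # line joining the pair
            Ph = (1 - np.cos(theta[i] + theta[j] - 2 * alpha)) * \
                 (1 - np.cos(theta[i] - theta[j]))                            # Eqn. 5.6
            c = D * Ph * logI[i] * logI[j]      # Eqn. 5.3
            out[(ys[i] + ys[j]) // 2, (xs[i] + xs[j]) // 2] += c              # Eqn. 5.2
    return out
```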
The gait signature for a subject was derived from a full image sequence. Each
sequence of image frames consists of one gait cycle taken between successive heel
strikes of the same foot, thus normalizing for speed. The following gives an
overview of the steps involved in extracting symmetry from silhouette information.
First, the image background was computed from the median of five image frames
and subtracted from the original image, Fig. 5.4(a), to obtain the silhouette, Fig.
5.4(b). This was possible because the camera used to capture the image sequences
was static and there is no translational motion. Moreover, the subjects were walking
at a constant pace. The Sobel operator was then applied to the image in Fig. 5.4(b)
to derive its edge-map, Fig. 5.4(c). The edge-map was thresholded so as to set all
points beneath a chosen threshold to zero, to reduce noise and remove weak
edges. These processes reduce the amount of computation in the symmetry
calculation. (Note that since optical flow also gives magnitude and direction, it can
replace the edge operation in the symmetry calculation. This does not take advantage
of data reduction unless the output of an edge operator is used to mask the optical
flow data inside the silhouette.) The symmetry operator is then applied to give the
symmetry map, Fig. 5.4(d). For each image sequence, the gait signature, GS, is
obtained by averaging all N symmetry maps SM_t in the sequence, that is

GS = (1/N) · Σ_{t=1..N} SM_t    (5.8)

The components retained after a low-pass filter (of selected radius) was applied to
a Fourier transform of GS form the feature vector for each subject.
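The final signature step can be sketched as follows: average the symmetry maps (Eqn. 5.8), then retain the Fourier components inside a circular low-pass region. The radius value and function name are illustrative assumptions.

```python
import numpy as np

def gait_signature(symmetry_maps, radius=10):
    """Average the symmetry maps over the cycle (Eqn. 5.8), then keep the
    Fourier components inside a circular low-pass filter of the given
    radius; the retained components form the feature vector."""
    gs = np.mean(symmetry_maps, axis=0)           # Eqn. 5.8
    spectrum = np.fft.fftshift(np.fft.fft2(gs))
    h, w = gs.shape
    yy, xx = np.ogrid[:h, :w]
    inside = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return spectrum[inside]                       # complex feature vector
```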

(a) original (b) silhouette (c) edges (d) symmetry map
Figure 5.4 Applying the Symmetry Operator

5.2.3 Velocity Moments

One method of developing a technique to analyze image sequences is to stack the
images into a three-dimensional XYT (space plus time) block, and then apply a 3D
descriptor to this data [114, 138]. Data in this form could be described using
conventional 3D moments (i.e. Cartesian 3D moments), treating time as the depth
axis. However, this method confounds the separation of the time and space
information, as they are embedded in the data. Time is fundamentally different from
the spatial dimensions; thus, we intend to acknowledge this by treating it separately.
Further, analyzing motion concerns the ability to separate time and space (viz:
body shape and movement), enabling description of motion and/or spatial attributes.
An alternative way to analyze image sequences is to reformulate the descriptor to
incorporate time, enabling the separation of the time and spatial descriptions (if
desired), resulting in a more versatile descriptor. To achieve this, a method of
motion description within the moment basis is required. The COM (center of mass)
describes a unique global position within the field of view, forming the basis for the
centralized moments. Using the COM descriptions between consecutive images, a
description of global motion in either axis is possible. Further, the COM is
guaranteed to exist (unless the image is completely empty), independent of the
distribution (unlike alternative higher order moments), justifying the use of this low
order moment as the basis of a generic framework.
The velocity moments are based around the COM description and are primarily
designed to describe a moving and/or changing shape in an image sequence. The
method enables the structure of a moving shape to be described, together with
motion information. The velocity moments are calculated from a sequence of I
images. Their generalized structure is:

vm_{m,n,α,γ} = Σ_{i=2..I} Σ_x Σ_y U(i, α, γ) · S(i, m, n) · P_{i_{x,y}}    (5.9)

for image i, where the moments are of order m, n, α, γ. The term from standard
spatial moments is denoted by S(i, m, n) and the motion, or velocity, is introduced
through U(i, α, γ), which is calculated from the differences between consecutive
COMs in the image sequence.

U(i, α, γ) = (x̄_i − x̄_{i−1})^α · (ȳ_i − ȳ_{i−1})^γ    (5.10)

where x̄_i is the current COM in the x direction, x̄_{i−1} is the previous COM in the
x direction, and ȳ_i and ȳ_{i−1} are the equivalent values for the y direction. In this
way, moving shapes are characterized by their shape and motion concurrently (but
separated shape and motion descriptions are available through vm_{m,n,0,0} and
vm_{0,0,α,γ}, respectively).
The shape's structure contributes through each pixel P_{i_{x,y}}, and the weighting
function S is either a centralized Cartesian or Zernike polynomial. Cartesian
monomials were first studied due to their simplicity and ease of computation.
Further to this, the orthogonal Zernike moments are a well established and proven
standard technique (in both image noise and pattern recognition), providing an ideal
platform to enable the analysis of the new framework on an orthogonal set. In
Cartesian velocity moments,

S(i, m, n) = (x − x̄_i)^m · (y − ȳ_i)^n    (5.11)

The Zernike formulation essentially replaces the shape structure with a Zernike
polynomial computed over the unit circle. The moment set is computed from the
sequence of silhouettes within a gait cycle; as with symmetry, the silhouettes can
either be binary or grayscale derived from the magnitude of the optical flow. The
moment set was reduced by Anova analysis and the remaining moments serve as a
feature vector describing different subjects. As expected, the resulting moments
described not only shape, but also motion. The motion used was horizontal, as is to
be expected.
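A sketch of a single Cartesian velocity moment, following Eqns. 5.9 to 5.11, is shown below; the Zernike variant would replace the centralized monomial with a Zernike polynomial over the unit disc. Function names are ours, and integer orders are assumed.

```python
import numpy as np

def velocity_moment(frames, m, n, alpha, gamma):
    """Cartesian velocity moment vm_{m,n,alpha,gamma} of Eqns. 5.9-5.11.

    `frames` is a sequence of binary silhouettes. The centralized spatial
    term (x - xbar_i)^m (y - ybar_i)^n is weighted by the COM velocity term
    (xbar_i - xbar_{i-1})^alpha (ybar_i - ybar_{i-1})^gamma and summed over
    all silhouette pixels of all frames from the second frame onward.
    """
    def com(frame):
        ys, xs = np.nonzero(frame)
        return xs.mean(), ys.mean()

    total = 0.0
    prev = com(frames[0])
    for frame in frames[1:]:
        xbar, ybar = com(frame)
        U = (xbar - prev[0]) ** alpha * (ybar - prev[1]) ** gamma  # Eqn. 5.10
        ys, xs = np.nonzero(frame)
        S = ((xs - xbar) ** m * (ys - ybar) ** n).sum()            # Eqn. 5.11
        total += U * S                                             # Eqn. 5.9
        prev = (xbar, ybar)
    return total
```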

5.2.4 Results

The main criterion for assessing performance is naturally the recognition capability,
the correct classification rate (CCR). This is derived by allocating class according to
proximity, in the chosen feature space, to labeled class examples. The k-nearest
neighbor (k-nn) rule is well known for its robustness and performance, with the
added advantage that it can be easily replicated. Given that recognition rates are
usually high when biometric techniques reach publication status, it is common to
include measures of uncertainty in that performance. This allows for some
assessment of recognition capability in larger populations. Further, it is usual to
assess in some way practical capabilities in respect of a potential application scenario.
This includes performance in noisy conditions, as surveillance video tapes are often
re-used leading to undesirable video artifacts, and when a subject might be obscured
or part of their body occluded, say when walking behind a street lamp. Finally, a
unique advantage for gait is capability at low resolution, or at a distance, and this too
should be explored.
The statistical recognition techniques from Southampton have been analyzed
primarily on the Southampton dataset A, but also on a selection of other databases.
The performance has been analyzed by CCR and a selection of measures of
uncertainty is also given. Performance factors have also been investigated and a
selection is described alongside analysis of each technique's recognition capability.

Recognition by Area Masks
Table 5.1 shows the recognition rates derived by application of individual area
masks [47] on the Southampton database A. Six samples of each subject (3 left to
right, 3 right to left) were used to train the database and two samples (1 left to right,
1 right to left) were used as the test data.

Area Mask               Recognition Rate (%)    Expected Error σ
Horizontal Line Mask    29.9%                   3.03%
Vertical Line Mask      16.1%                   2.43%
Bottom Half Mask        27.2%                   2.95%
Full Mask               29.2%                   3.01%

Table 5.1 Recognition Rates from Individual Masks
As can be seen from the results in Table 5.1, the performance of the system using
only individual area masks is unimpressive to an initial view. Clearly, there is no
significant difference between the Horizontal Line, Bottom Half and Full masks.
However, the vertical line mask appears to have significantly less discriminatory
power than any of the others, and its estimated error is then proportionally greater
than for the other masks. It should be noted however, that even though the results
are low they are still significantly better than the chance recognition rate which is
less than 1% on a database of this size. All results are reported in terms of
recognition rate with an estimate of the expected error in that rate. The tests
consisted of N independent measurements leading to a binary outcome, i.e. the
results follow a distribution for Bernoulli trials. The probability of obtaining M
correct results is given by the standard binomial relationship
P(M) = C(N, M) · p^M · (1 − p)^(N−M)    (5.12)

where p is the true recognition rate. The mean and variance for M are Np and
Np(1 − p), respectively. The natural unbiased estimate for the recognition rate is
M/N and the expected error in the unbiased estimate is √(p(1 − p)/N). The observed
recognition rate can be used as an approximation to p. For our database of 114
subjects with 2 test samples of each, N = 228, so the expected error in the mean is
around 3.0%, as shown in Table 5.1.
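The expected error is straightforward to check; the snippet below reproduces the roughly 3% figure for the horizontal line mask.

```python
import math

def expected_error(p, N):
    """Expected error in the recognition-rate estimate M/N for Bernoulli
    trials: sqrt(p * (1 - p) / N), with the observed rate used for p."""
    return math.sqrt(p * (1 - p) / N)

# 114 subjects x 2 test samples: N = 228; for the horizontal line mask,
# p is about 0.299, giving roughly 0.0303 (3.03%), as in Table 5.1.
print(expected_error(0.299, 228))
```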
The advantage of the area mask approach is that the execution speed is very fast,
especially compared with other extant approaches. Since these results are
substantially higher than chance, we must assume that there is clear potential for
area masks to derive discriminatory information.
By combining the results from area masks as discussed previously and using
canonical analysis in the same manner as before the recognition rate increases
markedly. Results are combined by simply concatenating vectors together to
produce a vector of size n x 30 where n denotes the number of area masks used. By
combining masks we greatly increase the information available for recognition and
thus the results improve, as shown in Table 5.2. The achieved recognition rate of over
75% is encouraging and significantly greater than chance alone.

Area Masks Combined                Recognition Rate (%)    Expected Error σ
Horizontal + Vertical Line Mask    49.6%                   3.31%
Bottom Half + Full Mask            50.0%                   3.31%
All Above Masks                    75.4%                   2.85%

Table 5.2 Performance by Combining Multiple Masks

Gait differs from traditional biometrics because it contains a temporal as well as a
spatial component. The spatial component represents the shape of the person,
whilst the temporal component represents the movement of the body over time.
Psychological research [43] would seem to suggest that humans can recognize
movement patterns merely from the temporal component. Using only the DC
component of the area masks' output produces poor recognition results: using the
DC component of the vertical line mask produces a recognition rate of 7.5%.
The full mask component represents the spatial component of the gait sequence.
As can be seen from Table 5.1, the recognition rate using only the spatial
component is approximately 30%. To determine where the recognition capability
was derived, we repeated the tests but this time removing the DC component of gait
sequence. This was accomplished by subtracting the mean of each sequence from
the data. Thus, the sequence only contains temporal information.

Area Mask               Recognition Rate    Expected Error σ
Horizontal Line Mask    8.9%                1.89%
Vertical Line Mask      26.3%               2.92%
Bottom Half Mask        23.7%               2.81%
Full Mask               24.1%               2.83%

Table 5.3 Results when DC Component is Removed

As can be seen from Table 5.3, the results are similar in levels of performance to
those in Table 5.1. However, we can see that the recognition rate using the
horizontal line mask has dropped by almost 20% to less than 10%. As expected,
this shows that the horizontal mask contains very little temporal information which
can be used for recognition. By combining the masks in the manner described
previously, it is possible to increase this recognition rate. Combining the four
masks produced a recognition rate of 52.7%. This suggests that approximately half
of the recognition capability comes from the temporal components of gait, with the
remainder coming from the shape of the silhouettes.
The psychologists' view is that gait is a symmetrical pattern of motion [53, 8].
We have assumed this to be true, and not taken into account the foot on which the
subject starts, or their direction of travel. Does taking this information into account
result in a significant difference in recognition rate? The recognition rates for each
combination of starting leg and walk direction, computed by four area masks (vertical
line, bottom half, horizontal line and full) with leave-one-out cross validation, were
approximately constant for each of the combinations. This was to
be expected as all subjects are able-bodied and none showed a distinct impediment
in their walk, to the human eye.

Analysing the results in more detail indicates that if knowledge of the starting
leg is used, then performance can be improved. Table 5.4 shows the recognition
rates of individual masks on a data set where all the subjects are walking right to left
and starting on the left foot.

Mask Used               Recognition Rate    Expected Error σ
Horizontal Line Mask    52.2%               4.67%
Vertical Line Mask      37.7%               4.54%
Bottom Half Mask        49.6%               4.68%
Full Mask               45.0%               4.66%

Table 5.4 Performance for Same Direction and Starting Leg

If we compare the results in Table 5.4 to those in Table 5.1, we can see there is a
marked increase in the recognition rate when the starting leg and direction of travel
are taken into account. Table 5.5 shows the results on the same data set when the
masks are combined.

Area Masks Combined                CCR (%)    Expected Error σ
Horizontal + Vertical Line Mask    67.7%      4.40%
Bottom Half + Full Mask            61.9%      4.55%
All Masks                          76.6%      3.97%

Table 5.5 Combining Area Masks when Direction and Starting Leg is the Same

From the results in Tables 5.4 and 5.5 we can see that whilst knowledge of the
starting leg and direction of walk can increase performance when single masks are
used, it does not provide significant advantages when masks are combined. Further
research is necessary to determine if knowing the starting leg is a substantial
advantage to the potency of gait as a biometric. Essentially this poses a question
concerning the latent symmetry of gait, which is yet to be determined by automatic
analysis.
The Southampton database A consists of more than 100 subjects, of which 20
are female. Gender discrimination could be a practical first stage in gait
classification if the number of subjects is large, as it would reduce the number of
subjects needed to be searched. To avoid bias in the test sample, 20 people of each
sex were used, giving a total database size of 40; 15 people of each sex were used
for training, with the remainder as test subjects. In this case, eight sequences were
used for each. This gave 120 training sequences and 40 test sequences.
By using canonical analysis, the maximal gender discrimination rate was 64%
which is better than chance, but not by much. Whilst results on gender classification
are disappointing, they are justifiable. With the subject walking normal to the
camera's plane of view there is very little gender discrimination information
available. In mitigation, the clothing of males is far more uniform than that of
females, suggesting that diversity of apparel could increase the potential recognition
rate for women. Previous medical work has actually indicated that male and female
walkers differ in terms of lateral body sway, with males tending to swing their
shoulders from side to side more than their hips, and females tending to swing their
hips more than their shoulders [50, 47]. This sort of information might be better
discerned from a fronto-normal or an overhead view. Further research is needed to

determine whether area masks, using different views of the subject, will be able to
fulfil the task of gender discrimination.
In this way, the area masks have shown capability to discriminate people by
their gait. Further, fusion approaches have been shown to improve performance.
Compared with chance and expected error rates, these early results are of statistical
significance and show that this baseline technique can indeed be deployed to
recognise people. Naturally, the simplicity sacrifices discriminatory ability, but
people can be recognized just by change in area over time. Later approaches will
offer much better performance, but usually at the expense of complexity. The
measures have been shown to use shape and motion independently, again a feature
of later approaches. Finally, there is in principle a consideration of the potency of
the measurements. This is a topic of current interest in all of biometrics, and we
shall describe gait's contribution to this growing debate later.

Recognition by Symmetry
After our earlier work [135, 137], the same method was applied to a much larger
database of 28 subjects, part of the Southampton database A. At the time, this new
database equaled in size the largest (published) contemporaneous gait database,
though new and larger databases will emerge in the near future. Each subject has at
least four image sequences, giving in total 112 image sequences. With this part of
the database, only the silhouette information is used and the recognition rates
obtained are also shown in Table 5.6. Clearly, using a larger database still gave the
same good recognition rates, which is very encouraging.

Database                 #Subjects    #Sequences    Data            CCR (%), k = 1    k = 3
Original Southampton     4            16            Silhouette      100               100
                                                    Optical flow    100               100
UCSD                     6            42            Silhouette      97.6              92.9
                                                    Optical flow    92.9              92.9
Part of Southampton A    28           112           Silhouette      97.3              96.4

Table 5.6 Initial Results by Symmetry obtained from Three Databases

For the low pass filter, all possible values of radius were used to investigate the
number of components that can be used (covering 0.1 to 100% of the Fourier data).
Though the results of Table 5.6 were achieved for all radii greater than 3 (using the
early Southampton database), selecting fewer Fourier components might affect the
recognition rates on a larger database of subjects, and this needs to be investigated
in future. Fig. 5.5 shows the general trend of the recognition rate against the
different cut-off frequencies. The figure shows that selecting about 1.1% of the
Fourier descriptions is enough to give the best recognition rates for both k = 1 and
k = 3, but this is only on a small (early) part of the database.

[Plot: recognition rate (%) against low-pass filter radius, showing the effect of
low-pass filtering using different radii.]

Figure 5.5 Recognition Rates for Different Cut-off Frequencies

Performance was evaluated with respect to missing spatial data, missing frames and
noise, using the early Southampton database. Out of the 16 image sequences in the
database, one (from subject 4) was used as the test subject with the remainder for
training. The evaluation, aimed at simulating time lapse, was done by omitting a
consecutive number of frames. For a range of percentages of omitted frames, Fig.
5.6, the results showed no effect on the recognition rates for k = 1 or k = 3. This is
due to the averaging associated with the symmetry operator. Fig. 5.6 shows the
general trend of deviation of the best match of each subject from the test subject. The
increase in the similarity difference measures as more frames are omitted appears to
be due to the apparent increase in high frequency components when averaging with
fewer frames.

[Plot: similarity difference of each subject's best match against the percentage of
missing frames.]

Figure 5.6 Effects of Omitting Image Frames

The evaluation was done by masking with a rectangular bar of different widths: 5,
10 and 15 pixels, in each image frame of the test subject and at the same position.
The area masked was on average 13.2%, 26.3% and 39.5% of the image silhouettes,
respectively. The bar either had the same color as the image silhouette or as the
background color, as shown in Fig. 5.7, simulating omission and addition of spatial
data, respectively. In both cases, recognition rates of 100% were obtained for a bar
width of 5 pixels for both k = 1 and k = 3. For a bar width of 10 pixels, Fig. 5.7c
failed but Fig. 5.7a gave the correct recognition for k = 3 but not for k = 1. Fig.
5.7c failed as the black bar changes effective body shape, and recognition uses body
shape and body dynamics. For bar widths of 15 pixels (i.e. Figs. 5.7b and d) and
above, the test subject could not be recognized, as the subject is completely covered
in most of the image frames. This suggests that recognition is unlikely to be
adversely affected when a subject walks behind a vertically placed object, such as a
lamp post, when it is thinner than the human shape.
To investigate the effects of noise (to simulate poor quality video data, usual in
many surveillance applications), synthetic noise was added to each image frame of a
test subject and the resulting signature compared with those of the other subjects in
the database. Figs. 5.7e and f show samples of the noise levels used. The evaluation
was carried out under two conditions. First, using the same values of σ and μ
(Eqn. 5.7) as earlier, for a noise level of 5% the recognition rates for both k = 1 and
k = 3 were 100%. For 10% added noise, the test subject could still be recognized
correctly for k = 1 but not for k = 3. This suggests that the recognition is not greatly
affected by limited amounts of failure in background extraction, though those errors
are less likely to be uncorrelated. With added noise levels of 20% and above, the
test subject could not be recognized for k = 1 or k = 3. In order to reduce the
contribution of distant (noisy) points, the values of σ and μ were then made
relatively small. Now the recognition rates (100%) were not affected for both k = 1
and k = 3 for added noise levels even exceeding 60%. Fig. 5.5b shows how in this
case the best match of each subject deviated from that for the test subject as more
noise was added. Here, the high recognition rate is consistent with the same order
being achieved despite the addition of noise. As such, the symmetry operator can be
arranged with the aim of including noise tolerance, but the ramifications of this
await further investigation.

(a) 10 pixels (b) 15 pixels (c) 10 pixels (d) 15 pixels

(e) 10% (f) 50%

Figure 5.7 Simulating Occlusion and Noise for Symmetry Analysis

Recognition by Velocity Moments
The velocity moments were analyzed first on the CMU data [138], and later on
part of the Southampton data. The CMU data is silhouette data only, without the
original images, of subjects walking on a treadmill. This restricted deployment of
the velocity moments, but they still showed good performance, with a recognition
rate exceeding 90%.
The part of the Southampton database A used for evaluation of the velocity
moments consisted of 50 subjects, with 4 sequences of each subject (~600 images).
The sequences studied contained the subjects walking from left to right for one and
a half gait cycles (three consecutive heel strikes). Moments were computed from
silhouettes and optical flow. Due to the increased resolution of the images and the
distance over which the subjects walked, the optical flow was computed within a
moving window, moving at the subject's average velocity, a method already used in
gait recognition by Little [88]. The average velocity was calculated using the center
of mass (COM) information from the silhouette moments. As before, the moments
of flow are (effectively) windowed data, so the average velocity is placed back into
the velocity moment calculation (as done for the Southampton and UCSD
databases). However, due to the increased resolution and size of the dataset, the
Zernike moment scaling was switched off to avoid problems through scaling large,
non-binary images (i.e. the mapping could scale the subject causing it to exceed the
unit disc's area). The images were instead scaled to appear visually central to the
unit disc, i.e. the thresholded image's COM was used in the mapping (in place of the
actual grayscale COM). 234 Zernike velocity moments were computed on the
silhouettes and windowed optical flow.

Database       #Subjects    STs      TTs      STs+TTs    Chance
CMU 03 7 s     25           90%      -        90%        0.04
CMU 03 7 f     25           87%      -        87%        0.04
CMU 05 7 s     25           92%      -        92%        0.04
CMU 05 7 f     25           87%      -        87%        0.04
Southampton    50           83.5%    50.5%    96%        0.16×10⁻⁶

Table 5.7 Recognition Rates from Velocity Moments [138]
A comparative analysis is shown in Table 5.7 for the CMU data and half of the
Southampton database A. Results on both are quite encouraging, especially as this
analysis was derived just as the newer and larger databases were starting to arrive.
Temporal templates were not available for the CMU data, but they increased
performance when classifying the Southampton data.
The moments described both horizontal and vertical velocity components along
with spatial information. The list was manually constructed using the results from
previous database analyses, reducing the computation time (in place of computing a
full list), and contained moments up to and including order/rotations m, n = 12. The
classification results using all 234 moments were 74% and 53% for the nearest
neighbor results on silhouette and optical flow, respectively. These classification
results are low, suggesting the need for feature selection. Results for a subset of
eight ST moments selected using the Anova technique improved k = 1 capability to
87% and 70%, respectively, a considerable improvement. The F statistic was
investigated for discriminatory capability and only moments with a large F statistic
were used; it is worth noting that using just two velocity moments achieves nearly
40% discrimination capability. The classification rates for the five selected velocity
moments computed from optical flow were relatively low in comparison to those
calculated from the silhouettes. The selected velocity moments based on flow
actually favor those holding solely spatial information. A similar result was found
upon analysis of the UCSD data (Section 3.1.1), supporting the hypothesis that
using optical flow gives detailed information about a subject's limb motion (which
may not vary enough between subjects on its own to allow for good subject
separation), while the silhouettes hold global shape/motion differences. The k = 1
classification results exceed those for k = 3, which suggests that the feature space is
closely packed (with respect to subject clusters). There are two obvious solutions to
overcome this problem. The first is to increase the dimensionality, using more
features to increase cluster separation; the second is to use a more sophisticated
classifier. Though this effect may be caused by the normalization of the moment
values, the normalization is actually used to stop biasing of the k-nn classifier by
moments which naturally produce larger values. However, if one subject produces
significantly different feature values to the rest of the database, the remaining
subjects within the database (and the differences between them) will be compressed
into a small area of the feature space. This highlights the possible need for an
alternative classifier when analyzing larger datasets, or in situations where the
features' orders of magnitude may vary. Alternatively, we can combine the two
results by silhouette and optical flow, which results in higher classification rates of
96% and 93% for k = 1 and k = 3, respectively. The proximity of these results
suggests that the feature space is less packed, with respect to between-subject
differences, than it was when using just silhouettes or optical flow alone. Finally, it
is worth noting that the probability of correctly classifying by chance this part of the
Southampton database A (for four independent samples of each subject) is
approximately 0.16×10⁻⁶.
Some performance factors have already been addressed in other analyses. One
potent facet of gait as a biometric is the capability for recognition at a distance. This
can be simulated by reducing the resolution of the silhouette. In fact, symmetry
retains 90% recognition capability even when the silhouette is reduced to a size of
16 × 16. A similar analysis was conducted more comprehensively on the velocity
moments. The analysis was applied to the Southampton silhouettes, as the re-scaling
is relatively simple due to their binary nature. For each sequence, the Zernike
velocity moments used for classification were re-calculated for each different image
resolution. The normalized mean squared error (NMSE) is then calculated between
the original velocity moment values (O_i), the full resolution versions, and the new
'altered' values (W_i) at each incremental step. The NMSE was defined as
NMSE = Σ_{i=1..K} (O_i − W_i)² / Σ_{i=1..K} O_i²    (5.13)

where K is the number of features, or moments, and an NMSE value of 1 indicates
100% variation from the original feature values. Describing the performance
characteristics in terms of an error rate produces an analysis that is independent of
the characteristics of the database. For example, if this analysis was instead
conducted using the classification rate, the results may depend on subject cluster
compactness and separation. Additionally, the subject clusters may all shift in the
feature space relative to each other, representing no change in the classification rate,
even though the features themselves have changed. If this reduction in resolution is
performed before the mapping function, then the results are dependent both on the
mapping calculation and the Zernike moment calculation. However, the mapping
process will have a positive effect on the handling of lower resolution images. It
effectively ensures that there is no loss in the accuracy of the Zernike polynomial
calculations, by mapping the reduced resolution image to the same grid size as the
original resolution calculation, thus making the two results directly comparable. If
the reduction in resolution is applied after the mapping process, theory suggests that
the errors will rapidly increase due to loss of both image and calculation precision.
Therefore, the effects of applying the reduction in resolution before the mapping
process were studied. Assuming that the original image is the highest resolution
available, the images were progressively re-sampled to reduce their resolution. Sub-
pixel estimation is allowed, enabling any re-sampling size to be achieved. Eleven
different resolutions were analyzed, from 50% of the original resolution through to
2%. In recognition, the errors began to diverge as the pixel size increased beyond
10. However, the NMSE errors are still low. Degrading the resolution effectively
adds noise to the perimeter of the silhouette, up to the point where each image
loses its overall shape.
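Under the reconstruction of Eqn. 5.13 given above, the NMSE can be computed as in the following sketch; the vectorized form and function name are our own.

```python
import numpy as np

def nmse(original, altered):
    """Normalized mean squared error between the full-resolution moment
    values O_i and the reduced-resolution values W_i (cf. Eqn. 5.13):
    sum((O - W)^2) / sum(O^2). A value of 1 indicates 100% variation
    from the original feature values."""
    o = np.asarray(original, dtype=float)
    w = np.asarray(altered, dtype=float)
    return np.sum((o - w) ** 2) / np.sum(o ** 2)
```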

5.2.5 Potency of Measurements of Silhouette

It is interesting that only recently has face recognition come to concentrate on
feature potency. In this respect gait research has learned from face recognition, since
the databases were constructed with such an aim in mind. The covariate database
was recorded explicitly to explore variation in walking style. An alternative
interpretation is that the database also allows for exploration of those factors which
offer the most potent description of generalized walking. Accordingly, the
silhouettes for Southampton databases A, AS and a time version TS were analyzed
for potency [147].
Two of the simplest approaches were used to obtain a gait signature for a given
sequence, aiming not to lose any information through the invariant properties of,
say, symmetry or moments. First, a sum of all silhouettes in a full gait cycle was
used to obtain an average appearance for each recorded sequence, which is simply
the average of all the binary silhouettes, which are centralized within each image.
The alternative way used a differencing operation on adjacent silhouettes to obtain a
basic estimate of motion.
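These two signatures are simple to state in code; the sketch below assumes the binary silhouettes have already been centralized within each frame, and the function names are ours.

```python
import numpy as np

def average_appearance(silhouettes):
    """Average of the centralized binary silhouettes over a gait cycle,
    giving the average-appearance signature for the sequence."""
    return np.mean(np.stack(silhouettes), axis=0)

def motion_estimate(silhouettes):
    """Basic motion estimate: mean absolute difference between adjacent
    silhouettes in the sequence (the differential silhouette)."""
    stack = np.stack(silhouettes).astype(float)
    return np.mean(np.abs(stack[1:] - stack[:-1]), axis=0)
```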
First, all three databases were analyzed separately using Anova and PCA to find
out which image information (features) is completely redundant, which features
have a relatively high variation between the subjects, and how the original feature
set could be reduced without a large reduction in the variance explained and the
recognition rate. All three databases have redundant features and they are not
necessarily all the same. This is important in application, since it suggests areas on
which a camera or feature extraction approach might concentrate. However, to
jointly compare databases A, AS and TS, the datasets have to be reduced to the same
number of features. Therefore, the important features shared between the three
databases were determined, and the effect on recognition rate of reducing each
database to these features was investigated. The recognition rate was calculated using

Euclidean distance and the nearest neighbor rule. A more sophisticated classifier
was not used, since the important factor at this stage was only the relative reduction
or increase in recognition rate. It is worth noting that, to be able to display all results
in an easily understandable way, the initial feature sets were extracted from 64 × 64
images, including zeros where there is no silhouette. However, to simplify the
statistical analysis, the following mask of each database was constructed: features
which contain only zeros throughout all of the considered databases were removed,
and their locations in the original set were recorded for later display purposes.
However, there are feature vectors which still contain zeros for some subjects.
Therefore, PCA was run on the covariance matrix rather than the correlation matrix.
[Figure: top left, the 115 shared features projected onto the silhouette (features
explaining 100% of variance); top right, the 79 shared features (explaining
approximately 84% of variance); bottom, recognition rate (%) against the number
of features for each case.]

Figure 5.8 Analyzing the Potency of Silhouette Measures [147]

Two sets of important features, which are the same in all three databases, were
considered. First, the features which explain 100% of the variance in each data set,
i.e. 236, 1001 and 217 features from the three databases. Of these, 115 important
features are shared among the three datasets. Fig. 5.8 (top left) shows the location
of the 115 shared features on the silhouette. The shared features cover the contours
of the head and body, parts of the legs and some features of the arms. To find out
what role these 115 features play in recognition, database A was considered as a
gallery and database AS as a probe, and recognition rates were calculated both for
all 4096 features and for the 115 shared features. The bottom left picture in Fig. 5.8
shows how the recognition rate changes as features are added. Here the solid line
describes the recognition rate against the number of significant features
(46.3% for 115 features), while the dashed line corresponds to the recognition rate
when all features are considered (56.4%). In this case 17.9% of the recognition rate
was lost. Further reduction was then tried: from each dataset the 150 features
obtained by PCA earlier were compared, and 79 shared features were selected. It
was found that these 79 features explain approximately 84% of the variance in each
database. These features were projected onto the silhouette and presented in Fig.
5.8, top right. In this case the most important shared features are the contours of the
head and body. The recognition rate against the shared features is presented in the
bottom right picture of Fig. 5.8. In this case the recognition rate for 79 features was
41.3%, in comparison to 56.4% for all 4096 features, i.e. a reduction of 26.8%.
Practically, this means that it is not enough for a differential silhouette to include
only the static component of gait, in spite of the fact that static components of gait
account for 84% of the explained variance. The legs play an important role in a
differential silhouette; however, a practically negligible number of features
describing the legs is shared through time, i.e. through all three databases. This then
suggests that the motion estimation is crude and should be improved in future
analysis.

5.3 Procrustes and Spatiotemporal Silhouette Analysis

5.3.1 Automatic Gait Recognition Based on Procrustes Shape Analysis

Human gait is usually determined by the walker's weight, limb length, habitual
posture and so on. It includes both the body appearance and the dynamics of human
walking motion. In theory, joint angle changes are sufficient for recognition by gait.
However, their recovery from a video of a walking person is difficult for current
vision techniques. The particular difficulties of joint angle computation from
monocular video sequences are self-occlusions of limbs and joint angle
singularities. Empirically, recognizing humans by gait can be achieved by applying
statistical analysis to the temporal patterns of individual subjects, which has been
well demonstrated in gait recognition. These techniques remain statistical in
essence, describing human motion by a compact representation of motion or
structural statistics of a sequence of area distributions, rather than attempting to
match the data to a model. Intuitively, recognizing people through gait depends
greatly on how the silhouette shape of an individual changes over time. Therefore,
we may consider gait as being composed of a set of static poses, whose temporal
variations can be analyzed to obtain distinguishable signatures. Based upon the
above considerations, here we present a model-free automatic gait recognition
algorithm using the Procrustes shape analysis method [214]. The algorithm was
developed at the CAS Institute of Automation.
Fig. 5.9 gives an overview of the proposed method [121]. For each input
sequence, an improved background subtraction procedure is first used to extract the
spatial silhouettes of walking figures from the background. Pose changes of these
segmented silhouettes over time are then represented as an associated sequence of
complex configurations in a 2D shape space and are further analyzed by the
Procrustes shape analysis method to obtain an eigenshape as gait signature.
Standard pattern classification techniques such as the k-nearest neighbor classifier
and the nearest exemplar classifier based on the full Procrustes distance measure are
respectively adopted for recognition. Like many previous works, this approach does not directly analyze gait dynamics; it includes the appearance as part of the gait recognition features. It is in essence holistic, because gait is implicitly characterized by the structural statistics of the spatiotemporal patterns generated by the silhouette of the walking person in image sequences.

[Figure: flow from the input gait sequence, through the motion blob sequence and a trained database, to recognition]

Figure 5.9 Overview of Gait Recognition based on Procrustes Shape Analysis

5.3.2 Silhouette Detection and Representation for Procrustes Analysis

Silhouette Extraction
Human detection and tracking is the first step to gait analysis. To extract and
track moving silhouettes of a walking figure from the background image in each
frame, change detection and tracking is adopted which is based on background
subtraction. The main assumption made here is that the camera is static and the only moving object in the video sequences is the walker. Although this integrated method basically performs well on the CASIA dataset, it should be noted that robust motion detection in unconstrained environments is an unsolved problem for current vision techniques, because it involves a number of difficult issues such as shadows and motion clutter.
Background subtraction has been widely used in foreground detection, where a fixed camera is usually used to observe dynamic scenes. How to reliably generate the background image from video sequences is critical. Here the LMedS (Least Median of Squares) method [215] is used to construct the background from a small portion of the image sequence, even one including moving objects. Its advantage is that it is especially efficient for 1D data processing, returning the correct result even when many outliers are present. Let I represent a sequence including N images. The resulting background b_{xy} can be computed by

b_{xy} = \arg\min_{p} \operatorname{med}_{t} \left( I^{t}_{xy} - p \right)^{2}    (5.14)

where p is the background brightness value to be determined for the pixel location (x, y), med represents the median value, and t represents the frame index ranging within 1-N. It is found that N over 60 is sufficient for the CASIA dataset to generate a reliable background.
The brightness change is often obtained through differencing between the
background and current image. However, the selection of a suitable threshold for
binarization is very difficult, especially in the case of low contrast images as most of
moving objects may be missed out since the brightness change is too low to
distinguish regions of moving objects from noise. To solve this problem, we use the
following extraction function to indirectly perform differencing [215]

f(a, b) = 1 - \frac{2\sqrt{(a+1)(b+1)}}{(a+1) + (b+1)} \cdot \frac{2\sqrt{(256-a)(256-b)}}{(256-a) + (256-b)}    (5.15)

where a(x, y) and b(x, y) are the brightness of the current image and the background at the pixel position (x, y) respectively, with 0 \le a(x, y), b(x, y) \le 255 and 0 \le f(a, b) < 1. This function adapts the change sensitivity of the difference value according to the brightness level of each pixel in the background image.
For each image, the distribution of the above extraction function f(a(x, y), b(x, y)) over x and y can be easily obtained. Then, the moving pixels can be extracted by comparing this distribution against a threshold value decided by the conventional histogram method.
It should be noted that the above process is performed independently for each
component R, G and B in an image. For a given pixel, if one of the three
components determines it as the changing point, then it will be set to the
foreground . This produces a mask that is considered as a region of interest for
further processing.
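A minimal numpy sketch of this detection step, applying Eqn. (5.15) per color channel and OR-ing the channel decisions; the function names and the fixed `threshold` argument are illustrative only (in the text the threshold comes from a histogram method).

```python
import numpy as np

def extraction_function(a, b):
    # Eqn. (5.15): brightness-sensitive change measure between the current
    # image a and the background b, with pixel values in [0, 255].
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    low = 2.0 * np.sqrt((a + 1) * (b + 1)) / ((a + 1) + (b + 1))
    high = 2.0 * np.sqrt((256 - a) * (256 - b)) / ((256 - a) + (256 - b))
    return 1.0 - low * high          # 0 <= f < 1, equal to 0 when a == b

def foreground_mask(image, background, threshold):
    # Evaluate f independently for R, G and B; a pixel is foreground if
    # any one channel flags it as changed.
    f = extraction_function(image, background)   # shape (H, W, 3)
    return np.any(f > threshold, axis=2)
```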
No change detection algorithm is perfect. Hence, it is imperative to remove as
much noise and distortion as possible from the segmented foreground.
Morphological operators such as erosion and dilation are first used to further filter
spurious pixels, and small holes inside the extracted silhouettes are then filled. A
binary connected component analysis is finally applied to extract a single highly
compact connected region with the largest size. An example of gait detection is
shown in Fig. 5.10.

Figure 5.10 Example of Gait Detection: (a) background image, (b) original image, (c) extracted silhouette

Representation of Silhouette Shapes
An important cue in determining underlying motion of a walking figure is the
temporal changes in the walker's silhouette shape. To make the proposed method
insensitive to changes of color and texture of clothing, we ignore the color of the
foreground objects and only use the binary silhouette. Further, for the sake of
reducing redundant information, we use spatial edge contours to approximate
temporal patterns of gaits (see Fig. 5.11).

[Figure: silhouette boundary unwrapped as complex coordinates in the Re-Im plane]

Figure 5.11 Illustration of Silhouette Shape Representation

Once the spatial silhouette of a walking subject is extracted, its boundary can be easily obtained using a border following algorithm based on connectivity. Then, we can compute its shape centroid (x_c, y_c) by

x_c = \frac{1}{N_b} \sum_{i=1}^{N_b} x_i, \qquad y_c = \frac{1}{N_b} \sum_{i=1}^{N_b} y_i    (5.16)

where N_b is the total number of boundary pixels and (x_i, y_i) is a pixel on the boundary. Let the centroid be the origin of the 2D shape space. We can then unwrap each shape anticlockwise into a set of boundary pixel points sampled along its outer contour in a common complex coordinate system. That is, each shape can be described as a vector of ordered complex numbers with N_b elements

z = [z_1, z_2, ..., z_{N_b}]^T    (5.17)

where z_i = x_i + j \cdot y_i. The extraction and representation process of the silhouette's boundary is illustrated in Fig. 5.11, where the black dot indicates the shape centroid, and the two axes Re and Im represent the real and imaginary parts of a complex number respectively. Therefore, each gait sequence will accordingly be converted into an associated sequence of such 2D shape configurations.
We need a method that allows us to compare the set of static pose shapes in a gait pattern and is robust to changes of position, scale and rotation. A mathematically elegant way of aligning point sets in a common coordinate system is Procrustes shape analysis [214], so it is expected that it can be easily adapted to handle spatial patterns of gait motion. In the following section, we give a brief introduction to the Procrustes shape analysis method and show its application in gait signature extraction and classification.

5.3.3 Procrustes Gait Feature Extraction and Classification

Procrustes Shape Analysis


Procrustes shape analysis [214] is a particularly popular method in directional
statistics. It is intended to cope with two-dimensional shapes and provides a good
method to find mean shapes. The following is mainly taken from [89] for providing
a brief introduction to Procrustes shape analysis.
A shape in 2D space can be described by a vector of k complex numbers, z = [z_1, z_2, ..., z_k]^T, called a configuration. For two shapes z_1 and z_2, if their configurations are equal through a combination of translation, scaling and rotation

z_1 = \alpha 1_k + \beta z_2, \quad \alpha, \beta \in \mathbb{C}, \quad \beta = |\beta| e^{j \angle \beta}    (5.18)

where \alpha 1_k translates z_2, and |\beta| and \angle\beta scale and rotate z_2 respectively, we may consider that they represent the same shape [89].
It is very convenient to center shapes by defining the centered configuration u = [u_1, u_2, ..., u_k]^T, u_j = z_j - \bar{z}, \bar{z} = \sum_{j=1}^{k} z_j / k. The full Procrustes distance d_F(u_1, u_2) between two configurations can be defined as [89]

d_F(u_1, u_2) = \left( 1 - \frac{u_1^* u_2 \, u_2^* u_1}{u_1^* u_1 \, u_2^* u_2} \right)^{1/2}    (5.19)

which minimizes

\inf_{\beta, \theta, a, b} \left\| \frac{u_2}{\| u_2 \|} - \frac{u_1}{\| u_1 \|} \beta e^{j\theta} - (a + jb) 1_k \right\|    (5.20)

Note that the superscript * represents the complex conjugate transpose and 0 \le d_F \le 1. The Procrustes distance allows us to compare two shapes independently of position, scale and rotation.
Given a set of n shapes, we can find their mean by finding \hat{u} that minimizes the objective function [89]

\sum_{i=1}^{n} d_F^2 (u_i, \hat{u})    (5.21)

To find \hat{u}, we compute the following matrix

S_u = \sum_{i=1}^{n} \frac{u_i u_i^*}{u_i^* u_i}    (5.22)

The Procrustes mean shape \hat{u} is the dominant eigenvector of S_u, i.e., the eigenvector that corresponds to the greatest eigenvalue of S_u [89].
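A small numpy sketch of Eqns. (5.19) and (5.22), assuming the inputs are already centered complex configurations of equal length; all names are illustrative.

```python
import numpy as np

def procrustes_mean(shapes):
    # shapes: list of centered complex configurations u_i (1D arrays, length k).
    # Build S_u = sum_i (u_i u_i*) / (u_i* u_i) as in Eqn. (5.22); the mean
    # shape is the eigenvector with the largest eigenvalue.
    k = shapes[0].shape[0]
    S = np.zeros((k, k), dtype=complex)
    for u in shapes:
        S += np.outer(u, u.conj()) / np.vdot(u, u).real
    eigvals, eigvecs = np.linalg.eigh(S)     # S is Hermitian; ascending order
    return eigvecs[:, -1]

def full_procrustes_distance(u1, u2):
    # Eqn. (5.19): d_F lies in [0, 1]; 0 means the two shapes are identical
    # up to translation, scaling and rotation.
    num = abs(np.vdot(u1, u2)) ** 2          # |u1* u2|^2 = u1* u2 u2* u1
    den = np.vdot(u1, u1).real * np.vdot(u2, u2).real
    return float(np.sqrt(1.0 - num / den))
```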

Gait Signature Extraction


Our approach uses these single shape representations from a gait sequence to find their mean shape as a gait signature for recognition. Similar to eigenface analysis, we call this gait signature an "eigenshape". The following summarizes the major steps in determining the Procrustes mean shape for a sequence of shapes from n frames, e.g., a gait pattern.
1. Select a set of k points from the boundary to represent a 2D shape as a vector configuration z_i, in the manner discussed in Section 5.3.2. We tackle the point correspondence problem through interpolation of boundary pixels so that the point set is the same for each image;
2. Set the centered configuration. When representing the silhouette shape, we have used the shape centroid as the origin of the 2D space to move all shapes to a common center to handle translational invariance. So, we can directly set u_i = z_i, i = 1, 2, ..., n;
3. Compute the matrix S_u using Eqn. (5.22). Then, compute the eigenvalues and the associated eigenvectors of S_u;
4. Set the Procrustes mean shape \hat{u} as the eigenvector that corresponds to the maximum eigenvalue, and use this mean shape as the gait signature.

Similarity Measure and Classifier


To measure similarity between two gait sequences, we make use of the Procrustes mean shape distance (MSD) in the following way:
1. Compute the Procrustes mean shapes \hat{u}_1 and \hat{u}_2 of the two gait sequences.
2. Find the full Procrustes distance between the two mean shapes by

d(\hat{u}_1, \hat{u}_2) = d_F(\hat{u}_1, \hat{u}_2)    (5.23)

The smaller the above distance measure is, the more similar the two gaits are.
Gait recognition is a traditional pattern classification problem which can be solved by measuring similarities among gait sequences. We try three different simple classification methods, namely the nearest neighbor classifier (NN), the k-nearest-neighbor classifier (kNN), and the nearest neighbor classifier with class exemplar (ENN).
Let T represent a test sequence and R_i represent the ith reference sequence; we may classify the test sequence into the class c that minimizes the similarity distance between the test sequence and all reference patterns by

c = \arg\min_i d(T, R_i)    (5.24)

where d is the similarity measure defined in Eqn. (5.23).


No doubt a more sophisticated classifier could be employed, but the interest here is to evaluate the genuine discriminatory ability of the features. We use the leave-one-out cross-validation rule in our experiments in order to obtain an unbiased estimate of recognition accuracy.

5.3.4 Spatiotemporal Silhouette Analysis Based Gait Recognition

This study at the CAS Institute of Automation aimed to establish an automatic gait
recognition method based upon spatiotemporal silhouette analysis measured during walking. Gait includes both the body appearance and the dynamics of human walking motion. Intuitively, recognizing people by gait depends greatly on how the silhouette shape of an individual changes over time in an image sequence. So, we may consider gait motion as being composed of a sequence of static body poses and expect that distinguishable signatures with respect to those static body poses can be extracted and used for recognition by considering temporal variations of those observations. Also, eigenspace transformation based on PCA has been demonstrated to be a potent approach in face recognition (i.e., eigenface) and gait analysis. Based on these observations, we proposed a silhouette analysis based gait recognition algorithm using traditional PCA. The algorithm implicitly captures the structural and transitional characteristics of gait, especially the shape cues of body biometrics. Although it is very simple in essence, the experimental results are surprisingly promising.

[Figure: pipeline of human detection and tracking, feature extraction, projection into the eigenspace, and training or classification for recognition]

Figure 5.12 Overview of the Spatiotemporal Silhouette Analysis Based Method

An overview of the proposed algorithm is shown in Fig. 5.12 [122]. It consists of three major modules, namely human detection and tracking, feature extraction, and training or classification. The first module serves to detect and track the walking figure in an image sequence. A background subtraction procedure is performed to segment motion from the background, and the moving region corresponding to the spatial silhouette of the walking figure is successively tracked through a simple correspondence method. The second module is used to extract the binary silhouette from each frame and map the 2D silhouette image into a 1D normalized distance signal by contour unwrapping with respect to the silhouette centroid. Accordingly, the shape changes of these silhouettes over time are transformed into a sequence of 1D distance signals to approximate temporal changes of the gait pattern. The third module either applies PCA on those time-varying distance signals to compute the predominant components of gait signatures (training phase) or determines the person's identity using standard non-parametric pattern classification techniques in the lower-dimensional eigenspace (classification phase).
Spatiotemporal Feature Extraction
Before training and recognition, each image sequence including a walking figure is
converted into an associated temporal sequence of distance signals at the
preprocessing stage.

Human Detection and Silhouette Representation

Here, we use the same human detection method as in the preceding section. To eliminate inaccuracy due to segmentation error, each foreground region is then tracked from frame to frame by a simple correspondence method based on the overlap of the respective bounding boxes in any two consecutive frames [217]. That is, we perform binary edge correlation between the current and previous silhouette profiles over a small set of displacements. We find that the human detection and tracking procedure performs well on the CASIA data as a whole. It does not significantly affect subsequent feature extraction, though there is a small portion of silhouette distortions such as the partial loss of body parts.
Figure 5.13 Silhouette Representation: (a) illustration of boundary extraction and anti-clockwise unwrapping, and (b) the normalized distance signal consisting of all distances between the centroid and the pixels on the boundary

An important cue in determining the underlying motion of a walking figure is the change of the walker's silhouette over time. Here we convert these 2D silhouette changes into an associated sequence of 1D signals to approximate the temporal pattern of gait. This process is illustrated in Fig. 5.13.
After the moving silhouette of a walking figure has been tracked, its outer contour can be easily obtained using a border following algorithm based on connectivity. Then, we may compute its shape centroid (x_c, y_c). By choosing the centroid as a reference origin, we unwrap the outer contour anti-clockwise to turn it into a distance signal S = {d_1, d_2, ..., d_i, ..., d_{N_b}} composed of all distances d_i between each boundary pixel (x_i, y_i) and the centroid

d_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}    (5.25)

This signal indirectly represents the original 2D silhouette shape in the 1D space. To eliminate the influence of spatial scale and signal length, we normalize these distance signals with respect to magnitude and size. First, we normalize the signal magnitude through its L_1-norm. Then, equally spaced re-sampling is used to normalize its size to a fixed length (360 in our experiments). Additionally, we regularize the walking direction of sequences taken from the same view based upon the symmetry of gait motion during shape representation (e.g., from left to right for all sequences with the lateral view). By converting such a sequence of silhouette images into an associated sequence of 1D signal patterns that can be further analyzed using robust signal analysis techniques, we no longer need to cope with the likely noisy silhouette data directly.
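The unwrapping and normalization above might be sketched as follows; `boundary` is assumed to be an ordered (N_b, 2) array of contour pixels, and the names are illustrative.

```python
import numpy as np

def distance_signal(boundary, length=360):
    # boundary: ordered (N_b, 2) array of (x, y) contour pixels.
    centroid = boundary.mean(axis=0)                  # shape centroid (x_c, y_c)
    d = np.linalg.norm(boundary - centroid, axis=1)   # Eqn. (5.25)
    d = d / d.sum()                                   # L1 magnitude normalization
    # Equally spaced re-sampling to a fixed signal length.
    src = np.linspace(0.0, 1.0, num=d.shape[0])
    dst = np.linspace(0.0, 1.0, num=length)
    return np.interp(dst, src, d)
```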
Feature Extraction and Classification
Training and Feature Projection
The purpose of PCA training is to obtain several principal components that re-represent the original gait features, mapping them from a high-dimensional measurement space to a low-dimensional eigenspace. The training process is as follows.
Given s classes for training, each class represents a sequence of distance signals of one subject's gait. Multiple sequences of each person can be freely added for training without altering the following method. Let D_{i,j} be the jth distance signal in class i and N_i the number of such distance signals in the ith class. The total number of training samples is N_t = N_1 + N_2 + ... + N_s, and the whole training set can be represented by [D_{1,1}, D_{1,2}, ..., D_{1,N_1}, D_{2,1}, ..., D_{s,N_s}]. We can easily obtain the mean m_d and the global covariance matrix \Sigma of such a data set by

m_d = \frac{1}{N_t} \sum_{i=1}^{s} \sum_{j=1}^{N_i} D_{i,j}    (5.26)

\Sigma = \frac{1}{N_t} \sum_{i=1}^{s} \sum_{j=1}^{N_i} (D_{i,j} - m_d)(D_{i,j} - m_d)^T    (5.27)

If the rank of the matrix \Sigma is N, then we can compute N nonzero eigenvalues \lambda_1, \lambda_2, ..., \lambda_N and the associated eigenvectors e_1, e_2, ..., e_N based on SVD (Singular Value Decomposition).
Generally speaking, the first few eigenvectors correspond to large changes in training patterns, and the higher-order eigenvectors represent small changes. Therefore, for the sake of memory efficiency in practical applications, we may ignore the small eigenvalues and their corresponding eigenvectors using a threshold value T_v,

W_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \ge T_v    (5.28)

where W_k is the accumulated variance of the first k largest eigenvalues with respect to all eigenvalues. In our experiments, T_v is chosen as 0.95 for obtaining steady results.
Taking only the k < N largest eigenvalues and their associated eigenvectors, the transform matrix E = [e_1, e_2, ..., e_k] can be constructed to project an original distance signal D_{i,j} into a point P_{i,j} in the k-dimensional eigenspace spanned by this partial set of k eigenvectors

P_{i,j} = [e_1, e_2, ..., e_k]^T D_{i,j}    (5.29)

Accordingly, a sequential movement of gait can be mapped into a manifold trajectory in such a parametric eigenspace.
It is well known that k is usually much smaller than the original data dimension N. That is to say, eigenspace analysis can drastically reduce the dimensionality of input samples while keeping the most effective principal components to summarize the original samples. For each training sequence, the projection centroid C_i in the eigenspace is accordingly given by averaging all single projections corresponding to each frame in the sequence

C_i = \frac{1}{N_i} \sum_{j=1}^{N_i} P_{i,j}    (5.30)
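A compact sketch of this training and projection step (Eqns. (5.26)-(5.30)) via SVD; the names are illustrative, and the signals are centered before projection, which Eqn. (5.29) leaves implicit.

```python
import numpy as np

def train_eigenspace(signals, t_v=0.95):
    # signals: (N_t, 360) array stacking all training distance signals D_{i,j}.
    m_d = signals.mean(axis=0)                        # Eqn. (5.26)
    centered = signals - m_d
    # Eigen-decomposition of the global covariance (Eqn. (5.27)) via SVD.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    lam = s ** 2 / signals.shape[0]                   # eigenvalues of Sigma
    w = np.cumsum(lam) / lam.sum()                    # accumulated variance W_k
    k = int(np.searchsorted(w, t_v)) + 1              # smallest k with W_k >= T_v
    return m_d, vt[:k].T                              # transform matrix E

def project(signal, m_d, E):
    # Eqn. (5.29): map a distance signal into the k-dimensional eigenspace.
    return E.T @ (signal - m_d)
```

The projection centroid of Eqn. (5.30) is then simply the mean of a sequence's projected frames.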

Similarity Measurement and Classifier


Gait recognition is a traditional pattern classification problem which can be solved by measuring similarities between reference patterns and test samples in the parametric eigenspace.
Gait is a kind of spatiotemporal motion pattern, so we use STC (Spatial-Temporal Correlation, an extension of 2D image correlation to 3D correlation in the space and time domain [90]) to better capture its spatial structural and temporal transitional characteristics.
For two input sequences, we first convert them into sequences of distance signals I_1(t) and I_2(t) at the preprocessing stage. Then, they are respectively projected into trajectories P_1(t) and P_2(t) in the eigenspace. The similarity measure between two such input vector sequences can be computed by

d = \min_{a, b} \sum_{t=1}^{T} \left\| P_1(t) - \tilde{P}_2(at + b) \right\|    (5.31)

where \tilde{P}_2(at + b) is a dynamic time warping of P_2(t) with respect to time stretching and shifting, for an approximation of the temporal alignment between the two sequences. The selection of the parameters a and b depends respectively on the relative stride frequency and the phase difference within a stride (two steps). Let f_1 and f_2 denote the frequencies of the two gait sequences; then a = f_2 / f_1. By repeatedly cropping a sub-sequence of length f_2 from the second sequence and expanding or contracting it with a, we may obtain its correlation with P_1(t). The average minimum of all prominent valleys of the correlation results determines their similarity.
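One plausible reading of Eqn. (5.31) in code, warping the second trajectory by the frequency ratio a and scanning the phase shift b; this is a sketch under those assumptions, not the exact implementation, and all names are illustrative.

```python
import numpy as np

def stc_distance(p1, p2, f1, f2):
    # p1: (T, k) and p2: (T2, k) eigenspace trajectories; f1, f2 are the
    # stride frequencies estimated by gait period analysis.
    a = f2 / f1                            # relative stride frequency
    T = p1.shape[0]
    best = np.inf
    for b in range(p2.shape[0]):           # candidate phase shifts
        # Sample P2(a*t + b) with linear interpolation, wrapping around
        # the quasi-periodic trajectory.
        idx = (a * np.arange(T) + b) % p2.shape[0]
        lo = np.floor(idx).astype(int)
        hi = (lo + 1) % p2.shape[0]
        frac = (idx - lo)[:, None]
        warped = (1 - frac) * p2[lo] + frac * p2[hi]
        best = min(best, float(np.linalg.norm(p1 - warped, axis=1).sum()))
    return best
```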

Figure 5.14 Gait Period Analysis for (1) lateral view, (2) oblique view and (3) frontal view: (a) input sequences, (b) aspect ratio signals of moving silhouettes, (c) signals after removing the background, (d) autocorrelation signals, (e) first-order derivative signals of autocorrelations, and (f) the positions of peaks.

Gait period analysis has been explored in previous work [102, 110]; it serves to determine the frequency and phase of each observed sequence so as to perform dynamic time warping to align sequences before matching. Note that a step is the motion between successive heel strikes of opposite feet and that a complete gait period comprises two steps. In [102], the width time signal of the bounding box of the moving silhouette derived from an image sequence is used to analyze gait period. In [110], either the width time signal or the height time signal is used, because the silhouette width for frontal views is less informative, while the silhouette height as a function of time plays an analogous role in periodicity. Different from these, here we choose the aspect ratio of the bounding box of the moving silhouette as a function of time, so as to cope effectively with both lateral and frontal views.
The process of period analysis of each gait sequence is shown in Fig. 5.14. For an input sequence (Fig. 5.14(a)), once the person has been tracked for a certain number of frames, its spatiotemporal gait parameters such as the aspect ratio signal of the moving silhouette can be estimated (Fig. 5.14(b)). We may remove its background component by subtracting its mean and dividing by its standard deviation, and then smooth it with a symmetric average filter (Fig. 5.14(c)). Further, we compute its autocorrelation to find peaks occurring at a fundamental frequency (Fig. 5.14(d)). Finally, we compute its first-order derivative (Fig. 5.14(e)) to find peak positions by seeking the positive-to-negative zero-crossing points (Fig. 5.14(f)). Due to the bilateral symmetry of human gait, the autocorrelation will sometimes have minor peaks half way between each pair of major peaks. The strength of these minor peaks diminishes from nearly the height of the major peaks to zero as the camera viewpoint deviates from fronto-parallel to perpendicular with respect to the image plane (see each column from left to right in Fig. 5.14). Hence, we estimate the real period as the average distance between each pair of consecutive major peaks. This process has been demonstrated to be computationally feasible with respect to our background subtraction results.
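A sketch of this period estimator; it assumes `aspect_ratio` is the per-frame width/height of the silhouette bounding box and glosses over the separation of minor half-period peaks, which a real implementation must filter out.

```python
import numpy as np

def gait_period(aspect_ratio, smooth=5):
    s = np.asarray(aspect_ratio, dtype=float)
    s = (s - s.mean()) / s.std()                        # remove background component
    s = np.convolve(s, np.ones(smooth) / smooth, mode='same')  # symmetric average filter
    ac = np.correlate(s, s, mode='full')[s.size - 1:]   # autocorrelation (lags >= 0)
    d = np.diff(ac)                                     # first-order derivative
    # Peak positions: positive-to-negative zero crossings of the derivative.
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    # Period: average distance between consecutive (major) peaks.
    return float(np.mean(np.diff(peaks))) if peaks.size > 1 else None
```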
Note that the computational cost increases quickly if the comparison is performed in the spatiotemporal domain, especially when time stretching and shifting are taken into account. Hence, we turn to the NED (Normalized Euclidean Distance) between the projection centroids of two gait sequences as the similarity measure, to eliminate the matching problems caused by velocity change and phase shift in the spatiotemporal correlation.
Assuming that the trajectories of any two sequences in the eigenspace are P_1(t) and P_2(t) respectively, we can easily obtain their associated projection centroids C_1 and C_2. Each projection centroid implicitly represents a principal structural shape of a certain subject in the eigenspace. The normalized Euclidean distance between the two sequential projection centroids can be defined by

d(C_1, C_2) = \left\| \frac{C_1}{\| C_1 \|} - \frac{C_2}{\| C_2 \|} \right\|    (5.32)

Furthermore, for multiple sequences of the same subject, we may also obtain an exemplar projection centroid by further averaging the projection centroids of the single sequences as a reference template for that class. This exemplar centroid is also used for gait classification in our experiments.
The classification process is carried out via two simple classification methods, namely the nearest neighbor classifier (NN) and the nearest neighbor classifier with respect to class exemplars (ENN) derived from the mean projection centroid of the training sequences for a given subject.
Let T represent a test sequence and R_i represent the ith reference sequence. We may classify the test sequence into the class c that minimizes the similarity distance between the test sequence and all reference patterns by

c = \arg\min_i d(T, R_i)    (5.33)

where d is the similarity measure. Note that d can only be the NED if ENN is used.
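Putting Eqns. (5.32) and (5.33) together, an ENN decision might be sketched as below; the unit-normalization in `ned` follows the reconstruction of Eqn. (5.32) above, and all names are illustrative.

```python
import numpy as np

def ned(c1, c2):
    # Eqn. (5.32), as reconstructed above: Euclidean distance between the
    # unit-normalized projection centroids.
    return float(np.linalg.norm(c1 / np.linalg.norm(c1) - c2 / np.linalg.norm(c2)))

def classify(test_centroid, exemplars):
    # exemplars: dict mapping class label -> exemplar projection centroid
    # (the mean of that class's training centroids). Eqn. (5.33).
    return min(exemplars, key=lambda label: ned(test_centroid, exemplars[label]))
```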

Figure 5.15 Temporal Changes of the Moving Silhouette in a Gait Pattern (Frames 28-35)
5.3.5 Experimental Results and Analysis

Procrustes Shape Analysis


To verify the usefulness of the proposed algorithm, we have performed a number
of experiments with the CASIA database (described in Section 3.2.4). We also
present detailed analysis and discussion on the experimental results.
Processing
For each sequence, we perform motion segmentation. An example of temporal
changes of moving silhouettes in a gait pattern is shown in Fig. 5.15.
Figure 5.16 Mean Shapes and the Exemplars for Different Views: (a) mean shapes and their exemplar for four sequences of the same subject, and (b) exemplars of five different subjects
Each sequence is accordingly converted into a sequence of shape representations with the associated configurations in 2D space (the vector dimensionality is set to 360 here). Then, we can obtain the associated mean shapes. Note that the walking direction is pre-normalized here to avoid any effect on recognition performance; e.g., all sequences with the lateral view are flipped from right to left. Further, we use the class average of the mean shapes derived from the same-view sequences of a subject as an exemplar for that class, which avoids selecting a single, random reference sample. Fig. 5.16 shows plots of the mean shapes and their exemplar for four sequences of the same subject, and plots of the exemplars of five different subjects (Exemplars 1-5 correspond to Subjects 5, 8, 14, 16 and 20 respectively). From these we can see that the intra-subject changes in eigenshapes are very small, while the inter-subject changes are more significant. This result implies that the mean shapes have considerable discriminating power for identifying individuals.

Results and Analysis


We have tried three classification methods. In the NN test, each sequence is
classified as belonging to the class of its nearest neighbor. In the kNN test (k = 3),
we find the three nearest neighbors, and choose the class of the majority, or if no
majority, simply the nearest neighbor. The exemplar method (ENN) classifies a
sequence as the class of its nearest-neighbor exemplar.
First, we evaluate the performance of our approach using classification error in
identification mode in which the classifier determines which class a given
measurement belongs to. For a small number of examples, we expect to compute an
unbiased estimate of the true classification rate using the leave-one-out cross-
validation rule since the leave-one-out error rate estimator is known to be an almost
unbiased estimator of the true error rate of the classifier. We label the 80 same-view gait sequences subject by subject from 1 to 80. Then we leave one example out, train on the rest, and classify the left-out element according to its MSD differences with respect to the remaining examples. This process is repeated 80 times, and the recognition rate is obtained as the ratio of the number of correctly classified test samples to the total of 80 for each viewing angle.
(CCR) are summarized in Table 5.8. Note that here testing and training are
consistent with respect to viewpoints. From Table 5.8, it can be seen that the recognition performance under frontal walking is better than for the other two views. This is probably due to the averaging associated with the mean shape analysis, owing to less severe shape variations in such gait patterns. It can also be seen that the ENN classifier consistently outperforms the other two. For each subject, although his or her gaits at different times are perceived to be almost invariable, there are still slight changes between them. The average of multiple sample sequences may serve to provide a more standard gait pattern for that specific person than a single, random sample sequence. Although as a whole the results are very encouraging, further experiments on a more realistic database are needed in order to be more conclusive.

Table 5.8 CCRs of Different Classifiers under Different Viewing Angles

Classifier     0°        45°       90°
NN (k=1)       71.25%    72.50%    81.25%
3NN (k=3)      72.50%    73.75%    80.00%
ENN            88.75%    87.50%    90.00%

Another classification performance measure, probably more general than classification error, is the rank order statistic, which was first introduced by the FERET protocol for the evaluation of face recognition algorithms [218]. It is defined as the cumulative probability P(k) that the real class of a test measurement is among its top k matches. The basic models for evaluating the performance of an algorithm are the closed and open universes. In the closed universe, every probe (unknown measurement) is in the gallery (known measurements), while in an open universe some probes are not in the gallery. The performance statistics are reported as cumulative match scores: the rank is plotted along the horizontal axis, and the vertical axis is the percentage of correct matches [218]. Here, we use the closed-universe model and the leave-one-out cross-validation rule with the CASIA database to estimate the identification performance of the proposed method. Fig. 5.17 shows the cumulative match scores (CMS) for ranks up to 20 in Fig. 5.17(a) based on NN, and up to 10 in Fig. 5.17(b) based on ENN. Note that the correct classification rate is equivalent to P(1) (i.e., rank = 1).

[Figure: cumulative match score curves; (a) NN, (b) ENN]

Figure 5.17 Identification Performance Results in Terms of the FERET Protocol's CMS Curve

For completeness, we also use the ROC (Receiver Operating Characteristic) curve to report results. In verification mode, the pattern classifier is asked to verify whether a new measurement really belongs to a certain claimed class. As before, we estimate FAR (False Acceptance Rate) and FRR (False Rejection Rate) via the leave-one-out rule. That is, we train the classifier using all but one left-out sample, and then verify the left-out sample against all 20 classes. Note that in each of these 80 iterations for each viewing angle, there is one genuine attempt and 19 impostors, since the left-out sample is known a priori to belong to one of the 20 classes. By varying the decision threshold for acceptance, we can produce various combination pairs of FAR and FRR. Fig. 5.18 shows the ROC curves, from which we see that the EERs (Equal Error Rates) are about 8%, 12% and 14% for the 0°, 90° and 45° views respectively.

[Figure: ROC curves for the 0°, 45° and 90° views; FAR plotted against FRR, with the EER marked]
Figure 5.18 Verification Performance Results Reported by the ROC Curves

Evaluation
The performance of the algorithm is further evaluated with respect to the length
of the training sequence and the vector dimensionality of shape representation on
the CASIA database with the lateral view.

1) Influence of the dimensionality of shape representation


The influence of the dimensionality of shape representation (i.e., the number of points sampled along the boundary contour) is examined by changing the sampling interval. Fig. 5.19 shows the general trend of correct classification rate versus the dimensionality of shape representation, from which we can see that the CCR starts to level off at 36 points. That is, 36 points may be sufficient to represent a shape in 2D space as far as gait recognition is concerned. Clearly, the reduced dimensionality results in a concomitant decrease in computational cost.

[Figure: CCR versus the number of sampled boundary points (20-160)]
Figure 5.19 Performance Evaluations with respect to the Shape Dimensionality

2) Influence of the training sequence length

To evaluate the effect of the length of the training samples, we conducted five tests which respectively make use of the first 15, 30, 45, 60 and 75 frames, corresponding approximately to one, two, three, four and five walking cycles of each gait sequence captured at a rate of 25 frames per second. (Note that 1 stride period = 2 cycles; an average cycle is typically 15 frames at a frame rate of 25 fps according to studies of biomechanics, although cadence differs a little between people.) The comparisons of recognition performance are shown in Fig. 5.20. The results reveal that the best performance is achieved using just over four walking cycles of training samples from each subject (i.e., 60 frames). Furthermore, the recognition performance improves as the number of training samples increases. The results thus appear to confirm recognition sensitivity to the sequence length, and imply that in a more extended analysis care must be taken to include sufficient samples in the training database.

[Figure: CCR versus training sequence length (15, 30, 45, 60 and 75 frames)]

Figure 5.20 Performance Evaluations with respect to the Training Sequence Length

Comparisons
Identification of people by gait is a challenging problem and has attracted growing interest in the computer vision community. However, there is no baseline algorithm or standard database for measuring and determining what factors affect performance. The early unavailability of an accredited common database of a reasonable size (e.g., something like the FERET database in face recognition) and of an evaluation methodology was a limitation in the development of gait recognition algorithms. A large number of early papers reported good recognition results, usually on a small database, but few of them made informed comparisons among algorithms due to the lack of a standard test protocol. To examine the performance of the proposed algorithm, here we provide some basic comparative experiments.
This comparative experiment compared the performance of Procrustes shape analysis with those of five recent methods from Maryland [101, 102], CMU [110], MIT [119] and USF [75] respectively, which to some extent reflect the best work of these research groups in gait recognition. The results are summarized in Table 5.9, where Procrustes shape analysis compares favorably with the others, with performance very similar to [75]. The gait feature vector of [119] is composed of parameters of moment features in image regions containing the walking person, aggregated over time. Intuitively, the mean features describe the average-looking ellipses for each of the regions of the body; taken together, the 7 ellipses describe the average shape of the body, which is in essence similar to the idea here. Procrustes shape analysis outperforms the methods described in [101, 102, 110, 75]. From experiments it was also found that the computational cost of [110] and [75] was relatively higher than that of [101, 102, 119] and Procrustes shape analysis.

Table 5.9 Comparison of Several Recent Algorithms on the CASIA database (0°)

Method                         Rank 1    Rank 5    Rank 10
BenAbdelkader 2001 [101]       72.50%    88.75%    96.25%
BenAbdelkader 2002 [102]       82.50%    93.75%    100%
Collins 2002 [110]             71.25%    78.75%    87.50%
Lee 2002 [119]                 87.50%    98.75%    100%
Phillips 2002 [75]             78.75%    91.25%    98.75%
Procrustes shape analysis      88.75%    96.25%    100%

The above provides only preliminary comparative results and may not be generalized to say that a certain algorithm is always better than others. Algorithm performance depends on the gallery and probe sets, so further evaluation and comparisons on a larger and more realistic database are needed in future work.

Spatiotemporal Silhouette Analysis

Extensive experiments were also carried out to verify the effectiveness of the spatiotemporal silhouette analysis based method. The following describes the details of the experiments. For data, the CASIA database was again used.

Preprocessing and Training


For each image sequence, we performed motion segmentation and tracking to
extract the walking figure from the background image. Those extracted 2D
silhouettes are accordingly converted into an associated sequence of 1D distance
signals before training and projection.
We chose a small portion of such distance signal sequences, including all classes, for training. The training process is based on PCA. We keep the first 15 eigenvalues and their associated eigenvectors to form the eigenspace transformation matrix. Fig. 5.21 gives the first three eigenvectors for each viewing angle. From Fig. 5.21, we can see that these eigen-curves are either odd symmetric or even symmetric, which reveals that gait has a characteristic of symmetry.
Once the eigenspace is obtained, each distance signal derived from each silhouette image can be represented by a linear combination of these 15 principal eigenvectors. That is, each distance signal can be mapped into one point in a 15-dimensional eigenspace. Each gait sequence will accordingly be projected into a manifold trajectory in the eigenspace. The projection trajectories of three trained sequences with respect to the lateral view, oblique view and frontal view respectively are shown in Fig. 5.22, where only a three-dimensional eigenspace is used for visualization.

[Figure: the first three eigenvectors plotted for each viewing angle; (a) 0°, (b) 45°, (c) 90°]

Figure 5.21 First Three Eigenvectors for each Viewing Angle obtained by PCA Training

[Figure: (a) lateral view, (b) oblique view, (c) frontal view]

Figure 5.22 The Projection Trajectories of Three Training Gait Sequences (only the three-dimensional eigenspace is used here for clarity)

Results and Analysis


a) Identification Mode
Here, we use the leave-one-out cross-validation rule with the CASIA database to estimate the performance of the proposed method. Each time we leave one image sequence out as a test sample and train on the remainder. After computing the similarity differences between the test sample and the training data, the NN or ENN classifier is then applied. Fig. 5.23 shows the cumulative match scores for ranks up to 20, where Fig. 5.23(a) uses the STC similarity measure and Fig. 5.23(b) uses the NED similarity measure with respect to projection centroids (solid line) and exemplar projection centroids (dotted line) respectively. Note that the correct classification rate is equivalent to P(1) (i.e., rank = 1). That is, for the lateral view, oblique view and frontal view, the correct classification rates are respectively 65%, 63.75% and 77.5% with NN and STC; 65%, 66.25% and 85% with NN and NED; and 75%, 81.25% and 93.75% with ENN and NED.

[Figure: CMS curves for the 0°, 45° and 90° views]

Figure 5.23 Identification Performance based on the Cumulative Match Scores: (a) classifier based on STC, and (b) classifiers based on NED with respect to single projection centroid (solid line) and exemplar projection centroid (dotted line), respectively

From Fig. 5.23, we can draw the following conclusions:
a) The identification performance using NED is in general better than that using STC. In theory, STC can better capture the spatiotemporal characteristics of gait motion than NED, and it would be expected to obtain better recognition accuracy. The experimental results are probably due to the fact that segmentation errors in each frame, caused by noise or by clothing dithering during walking, may accumulate into a rapidly growing match error during spatiotemporal correlation using direct frame-to-frame matching. However, the average projection over a whole temporal sequence provides a powerful means of overcoming noise in individual frames. We hypothesize that a more robust method for silhouette extraction would yield an improvement in STC scores.
b) The NED based on the exemplar projection centroid performs better than NED using only a single projection centroid. For each subject, although his or her gaits at different times are perceived as almost the same, there are slight changes between them. The average of multiple sample sequences may serve to provide a more standard gait pattern for that specific person than only a single, random sample sequence.
c) The recognition performance under frontal walking (90°) is the best. This is probably not what happens with most algorithms. The proposed method in essence implicitly captures more structural appearance information of body biometrics. Therefore, this result is probably due to the averaging associated with the silhouette shape analysis, because there are less severe variations of silhouette appearance in such gait patterns compared with the other views.

[Figure: ROC curves based on NED for the three viewing angles; FAR plotted against FRR, with the EER marked]

Figure 5.24 ROC Curves of Gait Classifier based on NED with respect to
Three Viewing Angles

b) Verification Mode
For completeness, we also estimate FAR (False Acceptance Rate) and FRR (False Rejection Rate) via the leave-one-out rule in verification mode. That is, we leave one example out, train the classifier using the remainder, and then verify the left-out sample against all 20 classes. Note that in each of these 80 iterations for each viewing angle, there is one genuine attempt and 19 impostors, since the left-out sample is known a priori to belong to one of the 20 classes. Fig. 5.24 shows the ROC (Receiver Operating Characteristic) curves using the NED similarity measure with exemplar projection centroids, from which we can see that the EERs are about 20%, 13% and 9% for the 0°, 45° and 90° views respectively. Here, the verification performance for the frontal view is again better than for the other views.

c) Validation Based on Physical Features

In experiments, we find that recognition errors often occur when the two smallest values of the similarity function are very close; that is, the classifier assigns a test sample to the class with the smallest similarity value when the true class is the one with the second smallest similarity value. Therefore, when the absolute difference between the last two minima is lower than a predefined threshold, we use some additional features available from the training sequences to validate the final decision. These features may be pace, body height, build and stride length, some of which have been independently used for personal identification in previous algorithms [106, 119, 102, 103]. Because we did not perform camera calibration when establishing the CASIA database, we cannot extract real parameters in world coordinates as in [102]. To avoid the effect of foreshortening under different views, here we only take image sequences from the lateral view as test examples to examine the effectiveness of this validation. The walking figures in the lateral-view images have approximately the same depth, and it is believed that static body parameters recovered from such a single view provide high discrimination power because the recovered measurements exist in the same parameter space as the training data [106].
Extraction of additional features is illustrated in Fig. 5.25. For each frame, a bounding box is placed around the tracked silhouette, from which we can easily obtain parameters such as its centroid, width and height. To extract the other body points, the vertical positions of the chest and ankles for a body height H are estimated by an anatomical study [47] to be 0.720H and 0.039H respectively. Further, we can obtain the chest width and the distance between the left foot and the right foot by calculating the horizontal coordinates of two border points (note that for the ankles, we choose the leftmost and the rightmost points along the horizontal axis). Once the person has been tracked for a certain number of frames, the spatiotemporal trajectories of these gait parameters can be derived. We select the following additional features to describe aspects of pace, stride and build. The first parameter, the gait period T, can be obtained as mentioned in Section 5.3. From the CASIA database, we find that the gait period usually ranges from 26 to 37 frames. The second parameter is the stride ΔL, which is measured only at the maximal separation point of the feet during the double support phase of a gait cycle. The last two parameters, namely the body height H and the ratio R between the chest width and the body height, are used to reflect build information (tall vs short and thin vs fat). Because gait is a periodic and non-rigid motion, we obtain these parameters by averaging the measurements over particular time instances, namely the instances with minimal separation between the two feet (when the subject is at maximal height) and the instances with maximal separation between the two feet (when the subject has the maximal stride during the walking cycle). These parameters are finally concatenated to form a four-dimensional vector <T, ΔL, H, R> for each sequence. Fig. 5.25(b) shows a distribution of such physical features in 3D space for clarity, where the same markers represent the results of different sequences of the same subject. From Fig. 5.25(b), we can see that an efficient combination of these physical properties brings considerable discriminatory power to gait classification.
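A rough sketch of assembling the <T, ΔL, H, R> vector from per-frame measurements; the percentile-based selection of single- and double-support instants is an assumption for illustration, as are all names.

```python
import numpy as np

def physical_features(heights, chest_widths, feet_separation, period):
    # heights, chest_widths, feet_separation: per-frame pixel measurements
    # from the tracked silhouette; period: gait period T in frames.
    heights = np.asarray(heights, dtype=float)
    sep = np.asarray(feet_separation, dtype=float)
    chest = np.asarray(chest_widths, dtype=float)
    # Double-support instants (maximal foot separation) give the stride;
    # single-support instants (minimal separation, maximal height) give
    # the height and build measurements.
    double_support = sep >= np.percentile(sep, 90)
    single_support = sep <= np.percentile(sep, 10)
    stride = sep[double_support].mean()        # delta-L
    H = heights[single_support].mean()         # body height
    R = chest[single_support].mean() / H       # chest width / body height
    return np.array([period, stride, H, R])    # <T, delta-L, H, R>
```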

Figure 5.25 Extraction of Additional Features: (a) illustration of the human silhouette with the centroid, chest and ankle positions marked, and (b) a distribution of physical features in 3D space (note that ΔL and H are measured in pixel distance)

The effect of the above validation procedure on the CCR (Correct Classification
Rate) is shown in Table 5.10, from which we can see that the recognition
performance after the inclusion of the above additional features is indeed improved.
If we can construct a depth conversion factor as a function of the depth of the subject from the camera as described in [106], or use camera calibration to convert distances measured in the image from pixels to world units, we may extend the validation step to handle the effects of foreshortening in other views. Although the results are very encouraging, further experiments on a larger database need to be conducted in order to be more conclusive.

Table 5.10 Effectiveness of Validation based on Physical Features (the CASIA database (0°))

d) Comparisons
Here, we first compare the performance of the proposed algorithm with that of a closely related method described in [101]. This approach, named Eigengait, is based on PCA; the major difference from our method is that it uses image self-similarity plots as the original measurements. This algorithm was evaluated on a dataset of Little and Boyd [88] and achieved recognition rates of 80%, 82.5% and 90% with respect to k = 1, 3 and 5 using the k-nearest neighbor classifier. We re-implemented this method using the CASIA database with the lateral viewing angle. Its best recognition rate is 72.5% (see Table 5.11), which is a little lower than our method even with no validation (75.00%). Because the database used in [88] is unavailable, we are unable to test the proposed algorithm on that data set.
The performance of the proposed algorithm was also compared with those of a few recent silhouette-based methods described in [110], [119] and [75] respectively. Here, we re-implemented these methods using the same silhouette data from the CASIA database with the lateral view. Based on the FERET protocol with ranks of 1, 5 and 10, the best results of all algorithms are summarized in Table 5.11, from which we can see that our method compares favorably with the others, and outperforms Phillips et al. [75] and Collins et al. [110]. We also found that the computational cost of [110] and [75] was much higher than that of [119] and our method. Here, the listed computational cost is the approximate average time consumed per test sequence using Matlab 6.1 on a PIII processor running at 733 MHz with 256 MB DRAM (note that this includes only feature extraction and matching, and excludes gait segmentation and the training phase).

Table 5.11

The effectiveness of the extracted gait features was also evaluated with respect to different variations on the "gait challenge" dataset described by Phillips et al. [75], as this database is one of the largest available to date in terms of the number of people, the number of video sequences, and the variety of conditions under which a person's gait is collected. Some samples are given in Fig. 5.26. So far, this dataset consists of 452 sequences from 74 subjects walking in elliptical paths on 2 surface types (Concrete and Grass), from 2 camera viewpoints (Left and Right), and with 2 shoe types (A and B). Thus there are 8 possible different conditions for each person (refer to Table 5.12).


Figure 5.26 Sample Images from the HumanID Database [75]

For fairness in comparison, the following points should be made clear:
a) Considering the constraints of processing time and storage space (about 300 GB), we directly used the same silhouette data from the University of South Florida (Ver. 1.0). These data are noisy, e.g., missing body parts, small holes inside the objects and severe shadows around the feet. We therefore preprocess the data to extract a single connected region with the largest size in each frame so as to fit the data to our method, regardless of severe segmentation errors.
b) We put forward a set of challenge experiments in terms of the same gallery and probe sets as in [75]. For each experiment, we also measure performance for the identification scenario following the pattern of the FERET standard evaluations [218].
c) We use an STC similarity measure similar in essence to the baseline algorithm described in [76, 75], with only one difference in the inter-frame similarity representation. Here we set a = 1 without considering the different paces among subjects, because we cannot effectively perform gait period analysis for a non-linear (elliptical) walking pattern.

Table 5.12 Basic Results on the Gait Challenge database using Gallery Set (G, A, R)

Difference between probe and gallery: View; Shoe; Shoe, View; Surface; Surface, Shoe; Surface, View; Surface, Shoe, View

Table 5.12 lists basic performance indicators of the proposed algorithm in the 7 challenge experiments, namely the identification rate (P_I) for ranks of 1 and 5, where the number of subjects in each subset is in square brackets and the optimized performance reported by [75] (an extended version of [76]) is in parentheses for comparison. From Table 5.12, we can draw the same conclusions as [75]: among the three variations, the viewpoint seems to have the least impact and the surface type the most. As a whole, our method is only slightly inferior to [75] in identification rate on the USF database, but far superior in computational cost (see Table 5.11). As for the identification rate, we think that the noisy segmentation results have a great impact on feature training and extraction in our method, and that an elliptical walking path poses a challenge for our view-based algorithm (we assumed a nearly linear walking path in data acquisition, which is consistent with most past databases and probably more realistic in real cases). Most existing algorithms operate on sequences with a linear walking path and a side view, so these two factors seriously affect both the implementation (some methods may not even work effectively) and the resulting performance. As for the computational cost, the baseline algorithm proposed by Phillips et al. [75] essentially involves an unconstrained temporal-spatial correlation process. Unlike other previous work, it repeatedly performs inter-sequence correlation using the segmented silhouette images to measure similarity between any two sequences, because it has no explicit training procedure for extracting a genuinely compact feature vector for each sequence. As stated above, these are just basic comparative results. More detailed statistical studies on larger and more realistic datasets are desirable to further quantify the algorithm performance.

5.4 Modeling, Matching, Shape and Kinematics

5.4.1 HMM Based Gait Recognition

A Hidden Markov Model (HMM) is a statistical tool for modeling generative sequences characterized by a set of observable outputs [198]. The HMM framework is used to model stochastic processes where the non-observable state of the system is governed by a Markov process and the observable outputs of the system have an underlying probabilistic dependence on the states. HMMs have found applications in many areas such as speech processing [199] and natural language processing (NLP).
An HMM of a system is defined in terms of the following parameters: $\lambda = (\pi, A, B)$, where $A$ is the state transition probability matrix, $B$ is the observation probability matrix and $\pi$ is the initial state distribution. The system has $N$ states $(s_1, s_2, \ldots, s_N)$ and $M$ observation symbols $(v_1, v_2, \ldots, v_M)$; $q_t$ denotes the state of the system at time $t$ and $O_t$ denotes the observation symbol of the system at time $t$.
• The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = s_i)$, $1 \le i \le N$
• The state transition probabilities $A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$, $1 \le i, j \le N$
• The observation matrix $B = \{b_j(k)\}$, where $b_j(k) = P(O_t = v_k \mid q_t = s_j)$, $1 \le j \le N$, $1 \le k \le M$

The following are three basic problems that need to be addressed before using an HMM-based model for real-world applications. The first problem involves computing the probability of an observation sequence given the model parameters, i.e. $P(O_1, O_2, \ldots, O_T \mid \lambda)$. The second problem deals with computing the optimal state sequence that best explains the set of observations given the model parameters. The third problem deals with adjusting the model parameters $\lambda$ to maximize $P(O_1, O_2, \ldots, O_T \mid \lambda)$. In the first problem the forward procedure, which involves an induction approach, is used to efficiently compute the probability $P(O \mid \lambda)$. The forward variable $\alpha_t(i) = P(O_1, O_2, \ldots, O_t, q_t = s_i \mid \lambda)$ is computed at every time instant in an inductive fashion and the probability is computed as $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$. The second problem essentially involves maximizing the probability $P(Q \mid O, \lambda)$. The state propagation across time is modeled as a lattice structure, and at every time instant $t$ the quantity $\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_t = s_i, O_1, O_2, \ldots, O_t \mid \lambda)$, which corresponds to the best path undertaken to reach state $s_i$ at time $t$, is computed. Finally, the optimal state sequence is determined by Viterbi decoding (backtracking along the state sequences that maximized the above quantity at every time instant $t$). The third problem, which deals with adjusting the HMM parameters $\lambda = (\pi, A, B)$, is solved using the Baum-Welch algorithm (or equivalently the Expectation-Maximization method). Fig. 5.27(a) illustrates the forward algorithm and Fig. 5.27(b) illustrates the trellis structure and the Viterbi decoded path.

Figure 5.27 (a) Forward algorithm (b) Lattice structure and Viterbi decoded state
sequence
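As an illustration of the first two problems, the sketch below (NumPy assumed; the function and variable names are ours, for illustration only, not those of any particular system) computes the scaled forward log-likelihood and the Viterbi-decoded state sequence for a discrete-observation HMM with strictly positive probabilities.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Problem 1: log P(O | lambda) via the scaled forward recursion.
    pi: (N,) initial distribution; A: (N, N) transitions;
    B: (N, M) observation probabilities; obs: list of symbol indices."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(O_1)
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
        c = alpha.sum()                # scale factor, avoids underflow
        log_p += np.log(c)
        alpha /= c
    return log_p

def viterbi(pi, A, B, obs):
    """Problem 2: most probable state sequence by Viterbi decoding."""
    T, N = len(obs), len(pi)
    logA = np.log(A)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logA             # delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack along the lattice
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```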

Gait Recognition Framework


Human gait comprises a series of stances adopted by an individual that have an underlying probabilistic dependence on one another. Since human gait is quasi-periodic in nature, a gait sequence can be considered a collection of many gait cycles. Each gait cycle can further be subdivided into N clusters of temporally adjacent stances, and one can identify N stances that best characterize the stances in each cluster. Modeling human gait using an HMM essentially involves identifying the N characteristic stances of an individual and modeling the dynamics between these N stances. Analogous to the HMM framework, the N characteristic stances (or exemplars) correspond to the N states of the system; the transition characteristics between the N stances define the transition probability matrix; and the probabilistic dependence of stances from the N clusters that comprise a gait cycle on the N characteristic stances constitutes the observation probability matrix. In the following subsections, we discuss an HMM-based human gait recognition system that was developed at the University of Maryland [99, 100].

Direct Approach
From the video sequences of P individuals walking along a pre-designated path, we extract the background-subtracted silhouettes from each frame of the video sequence. The selection of feature vectors plays a key role in an HMM-based recognition framework. In the first approach, we use the silhouette image in its entirety as the feature vector. Given a sequence of images of the $j$-th individual, $X_j = [x_j(1), x_j(2), \ldots, x_j(T)]$, we identify the $N$ exemplar stances that best characterize the gait sequence and compute the HMM parameters $\lambda_j = (\pi_j, A_j, B_j)$. A plot of the sum of foreground pixels in each frame of the video sequence illustrates the quasi-periodic nature of human gait, as shown in Fig. 5.28. The boundaries of each gait cycle are identified by means of adaptive filtering and each gait cycle is subdivided into $N$ clusters of temporally adjacent stances. The initial estimate of the exemplars $\varepsilon^{(0)} = [e_1, e_2, \ldots, e_N]$ is derived from the stances belonging to each of the $N$ clusters. The initial estimate of the transition probability matrix allows transitions only from a state onto itself or to its adjacent state, i.e. $A^{(0)} = [A_{(i,j)}]$ is such that $A_{(i,i)} = 0.5$ and $A_{(i,\, i \bmod N + 1)} = 0.5$. The initial state probabilities $\pi_i$ are set equal to $1/N$. The observation probability matrix $B$ comprises the probability distribution of the feature vectors conditioned on the state of the system, i.e. $B = [b_n(x(t))]$ with $b_n(x(t)) = P(x(t) \mid e_n)$. If the underlying distribution is non-Gaussian, it could be characterized by a mixture of Gaussians. But due to the high dimensionality of the feature vector, we define $B$ as

$b_n(x(t)) = P(x(t) \mid e_n) = \beta e^{-\alpha D(x(t),\, e_n)}$ (5.34)

where $\alpha$ and $\beta$ are constants and $D$ is a distance metric. The motivation behind using an exemplar-based model is that recognition can be based on the distance between the observed feature vector and the exemplars.
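A minimal sketch of this exemplar-based observation probability (Eqn. 5.34), assuming a Euclidean distance for the unspecified metric D:

```python
import numpy as np

def observation_probs(x, exemplars, alpha=1.0, beta=1.0):
    """b_n(x) = beta * exp(-alpha * D(x, e_n)) for every exemplar e_n.
    x: flattened silhouette feature, shape (d,);
    exemplars: (N, d) array of exemplar stances.
    Euclidean distance is assumed for D; alpha, beta are the constants."""
    D = np.linalg.norm(exemplars - x, axis=1)  # D(x, e_n), one per state
    return beta * np.exp(-alpha * D)
```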

[Plot of the sum of foreground pixels against frame index, showing the original signal together with adaptively filtered, median filtered and differentially smoothed versions]

Figure 5.28 Gait cycles comprising a walking sequence


Next, the HMM parameters thus initialized are refined in an iterative fashion, as described next. Using Viterbi decoding, we estimate the most probable sequence of states $Q^i = q^i(1), q^i(2), \ldots, q^i(T)$ from the current estimates of the exemplars $\varepsilon^i$ and the state transition matrix $A^i$, where $q^i(t)$ corresponds to the estimated state at time $t$ during the $i$-th iteration. Let $T_n^i = \{t \mid q^i(t) = n\}$ be the set of time instants when the $n$-th state was estimated to be the state of the system. The estimated exemplars are refined in the following manner:

$e_n^{(i+1)} = \arg\max_{e} \prod_{t \in T_n^{(i)}} P(x_t \mid e)$ (5.36)

$e_n^{(i+1)} = \arg\min_{e} \sum_{t \in T_n^{(i)}} D(x_t, e)$ (5.37)

[Exemplar images E1 to E6]

Figure 5.29 The Estimated Exemplars for a Walking Sequence

Fig. 5.29 displays the exemplars estimated for a particular sequence. Using the above formulations, the $n$-th exemplar is updated as $e_n^{(i+1)} = \sum_{t \in T_n^i} \tilde{x}(t)$, where $\tilde{x}$ corresponds to the normalized vector $x$. The Baum-Welch algorithm [200] is used to compute $A^{(i+1)}$ using the refined set of exemplars $\varepsilon^{(i+1)}$ and $A^{(i)}$. At every stage of the iteration we re-initialize $\pi_n^{(i+1)} = 1/N$. Thus, the HMM parameters are iteratively refined. Under the HMM framework, this phase corresponds to the training phase.
Following the training phase, every individual $j$ in the gallery has a set of exemplars $\varepsilon^{(j)} = [e_1^j, e_2^j, \ldots, e_N^j]$ and the set of HMM parameters $\lambda_j = (\pi_j, A_j, B_j)$. Given a video sequence $v$ of an unknown individual, the background-subtracted silhouettes of the individual are extracted in the same way adopted for the video sequences that comprised the gallery. Using the forward algorithm, we compute the likelihood that the observed sequence $v$ was produced by the $j$-th individual. The identity of the unknown individual in the video sequence is established as follows:

$p = \arg\max_j \log(P(v \mid \lambda_j))$ (5.38)

Indirect Approach
The problem of the high dimensionality of feature vectors in the direct approach is overcome in the indirect approach. In cases where the quality of the silhouettes extracted through background subtraction is good, the outer contour of the silhouette captures almost all the information contained in the silhouette image in its entirety. Thus, the row-wise width of the outer contour can potentially serve as a good feature vector for recognition purposes. Fig. 5.30 shows the width profiles of two different individuals over several gait cycles. The brightness variations that are evident in the width profiles are due to the arm swings and leg swings that characterize one's gait.


Figure 5.30 Width Profiles of Two Individuals across Gait Cycles

For every individual from the gallery, the exemplar width vectors $E^i = [e_1^i, e_2^i, \ldots, e_N^i]$ are selected as prescribed earlier. Let $X^i = [x^i(1), x^i(2), \ldots, x^i(T)]$ correspond to the width vectors across $T$ frames. We define a quantity, the frame-to-exemplar distance (FED), as the observed signal of the process. The FED vector for the $i$-th individual is defined as

$f_n^i(t) = d(x^i(t), e_n^i), \quad n = 1, \ldots, N$ (5.39)

where $t \in (1, 2, \ldots, T)$, $d$ corresponds to a distance measure, and $x^i(t)$ and $e_n^i$ correspond to the feature vector at time $t$ and the $n$-th exemplar, respectively. Plotting the magnitude of the FED vector components across each frame, as in Fig. 5.31, illustrates the time evolution of the FED vector components. The initialization and refinement of the HMM parameters and the subsequent recognition methods are similar to the ones discussed in the direct approach.


Figure 5.31 Time Evolution of FED Vector Components
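A sketch of the FED computation of Eqn. 5.39 (Euclidean distance assumed for the unspecified measure d; the function name is ours):

```python
import numpy as np

def fed_vectors(X, E):
    """Frame-to-exemplar distances (Eqn. 5.39).
    X: (T, d) width vectors across T frames; E: (N, d) exemplar width vectors.
    Returns a (T, N) array whose t-th row is the FED vector at frame t."""
    return np.linalg.norm(X[:, None, :] - E[None, :, :], axis=2)
```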

5.4.2 DTW Based Gait Recognition

Dynamic Time Warping (DTW) is a technique that computes the non-linear warping function that optimally aligns two variable-length time sequences. The warping function can be used to compute the similarity between the two time series or to find corresponding regions between the two time series. Speech recognition applications use DTW to determine whether two waveforms correspond to the same spoken word [201]. In applications such as data mining, gesture recognition, robotics, medicine etc., DTW is used to compute the distance between two time series. Fig. 5.32 shows the differences in computing the distance between two time series, before and after optimal alignment, and thereby illustrates the significance of the time-warping operation.

(a) (b)

Figure 5.32 Comparison of Two Time Series (a) before alignment (b) after
alignment with DTW

Given two time series $X = [x_1, x_2, \ldots, x_m]$ and $Y = [y_1, y_2, \ldots, y_n]$, the DTW algorithm computes the warping function $W$ that optimally aligns the elements of $X$ and $Y$. A cost matrix $D$ of size $m \times n$ is constructed, where the $(i,j)$-th element of the matrix corresponds to the distance between the samples $x_i$ and $y_j$ from $X$ and $Y$ respectively. A warping path $W$ is a contiguous set of matrix elements that defines a mapping between $X$ and $Y$, denoted as $W = [w_1, w_2, \ldots, w_K]$. The $k$-th element of $W$ is defined as $w_k = (i,j)_k$ and the length $K$ of the warping path is such that $\max(m,n) \le K < (m+n)$. The warping path is typically subject to the following constraints:
• Endpoint constraints: The endpoints of the warping path $W$ are fixed at $w_1 = (1,1)$ and $w_K = (m,n)$. The warping path starts and finishes at the diagonally opposite corner cells of the cost matrix.
• Monotonicity constraints: The warping path $W$ monotonically increases in time, i.e. given $w_k = (i_1, j_1)$ and $w_{k-1} = (i_2, j_2)$, then $(i_1 - i_2) \ge 0$ and $(j_1 - j_2) \ge 0$ for all $k \in [1:K]$.
• Continuity constraints: Continuity constraints are imposed to restrict the allowable steps in the warping path to adjacent or nearly adjacent cells, i.e. if $w_k = (i_1, j_1)$ and $w_{k-1} = (i_2, j_2)$, then $(i_1 - i_2) \le 1$ and $(j_1 - j_2) \le 1$.

The optimal warping path $W^*$ that minimizes the cumulative cost is computed using dynamic programming. We construct a cumulative cost matrix $C$, the $(i,j)$-th element of which corresponds to the cost of the minimum-distance warp path that can be constructed from the two time series $X_i = [x_1, \ldots, x_i]$ and $Y_j = [y_1, \ldots, y_j]$. Due to the continuity and monotonicity constraints imposed on the warping path, $C(i,j)$ is computed as

$C(i,j) = D(i,j) + \min(C(i-1,j-1),\, C(i-1,j),\, C(i,j-1))$ (5.40)

The cumulative cost matrix $C$ is filled first column-wise and then row-wise. $C(1,1)$ is initialized to $D(1,1)$. Upon computation of the cumulative cost matrix $C$, the optimal path $W^*$ is easily obtained by backtracking the warping path from the $(m,n)$-th element of $C$, keeping in mind the warping path constraints defined above. Since DTW is an $O(N^2)$ operation, global constraints are often imposed on the warping path to limit the computational cost. The Sakoe-Chiba band and the Itakura parallelogram are examples of constraints imposed on the number of cells of the cost matrix that are to be evaluated. Figure 5.33 gives an example of the optimal warping path derived by using Dynamic Time Warping to align two time series X and Y, and illustrates the constraints imposed on the cells of the cost matrix.
[(a) optimal warping path obtained by DTW on time series X and Y; (b) optimal alignment of X and Y; (c) (i) Sakoe-Chiba band, (ii) Itakura parallelogram]
Figure 5.33 Example of the Optimal Warping Path
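The recurrence of Eqn. 5.40 and the backtracking step translate directly into code; the following is a minimal, unoptimized sketch (no Sakoe-Chiba or Itakura global constraint is applied, and the names are ours):

```python
import numpy as np

def dtw(X, Y):
    """DTW under the endpoint, monotonicity and continuity constraints.
    X, Y: 1-D arrays (or sequences of feature vectors). Returns the optimal
    cumulative cost C(m, n) and the warping path as (i, j) index pairs."""
    m, n = len(X), len(Y)
    D = np.array([[np.linalg.norm(np.atleast_1d(X[i] - Y[j]))
                   for j in range(n)] for i in range(m)])   # cost matrix
    C = np.full((m, n), np.inf)
    C[0, 0] = D[0, 0]                                       # C(1,1) = D(1,1)
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(C[i-1, j-1] if i > 0 and j > 0 else np.inf,
                       C[i-1, j] if i > 0 else np.inf,
                       C[i, j-1] if j > 0 else np.inf)
            C[i, j] = D[i, j] + prev                        # Eqn. 5.40
    path, (i, j) = [(m - 1, n - 1)], (m - 1, n - 1)
    while (i, j) != (0, 0):                                 # backtrack from (m, n)
        steps = [(i-1, j-1), (i-1, j), (i, j-1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: C[s])
        path.append((i, j))
    return C[m-1, n-1], path[::-1]
```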

Gait Recognition Framework


In this section, we discuss a DTW-based framework for gait-based human recognition [184], developed at UMD. From the background-subtracted images of a walking sequence, the silhouette is centered within the bounding box and compact feature vectors, namely the left-projection vector, the right-projection vector and the width vector, are extracted. The left-projection vector $L = \{l_t\}$, $t \in (1, 2, \ldots, T)$ and the right-projection vector $R = \{r_t\}$, $t \in (1, 2, \ldots, T)$ correspond to the distances traversed to reach the rightmost and the leftmost foreground pixels along each row of the image, respectively. The width vector $w_t$ is computed as

$w_t = l_t + r_t - M$ (5.41)

where $M$ is the number of columns in the image. Fig. 5.34 illustrates the computation of the feature vectors discussed above.

(a) (b) (c)

Figure 5.34 Illustration of the Generation of (a) left projection vector (b) right
projection vector (c) width vector
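A sketch of this feature extraction, following the definitions above and Eqn. 5.41 (binary silhouette assumed; the function name is ours):

```python
import numpy as np

def projection_features(silhouette):
    """Left/right projection vectors and width vector (Eqn. 5.41).
    silhouette: (H, M) binary array, subject centered in its bounding box.
    l[y]: distance traversed from the left edge to the rightmost pixel;
    r[y]: distance traversed from the right edge to the leftmost pixel;
    w = l + r - M is then the row-wise silhouette width."""
    H, M = silhouette.shape
    l, r = np.zeros(H), np.zeros(H)
    for y in range(H):
        cols = np.flatnonzero(silhouette[y])
        if cols.size:
            l[y] = cols[-1] + 1
            r[y] = M - cols[0]
    w = l + r - M
    w[(l == 0) & (r == 0)] = 0   # rows with no foreground get zero width
    return l, r, w
```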

5.4.3 Shape and Kinematics

As discussed in previous sections, the shape of the human silhouette and the transition characteristics from one stance to another act as significant cues for gait-based human recognition. This section deals with understanding the relative significance of shape and kinematics in gait-based human identification approaches [6, 148]. In addition to their implications for the recognition framework, we also study the relative significance of shape and kinematics in human activity classification. In the UMD approach, we represent shape by adopting Kendall's shape theory and represent dynamics using ARMA models.
Shape Analysis
Dryden describes the shape of an object as the geometric information that remains after removing translational, rotational and scale information from the object's representation [206]. Kendall's representation of shape describes the shape configuration of $k$ landmark points in an $m$-dimensional space as a $k \times m$ matrix containing the coordinates of the landmarks. In our approach, we extract $k$ landmark points from the outer contour of the human silhouette in each frame and represent them as pre-shape vectors, i.e., a representation from which translation and scale have been filtered out. The centered pre-shape is given by

$Z_c = \frac{C X}{\| C X \|}, \quad C = I_k - \frac{1}{k} 1_k 1_k^T$ (5.42)

where $I_k$ is a $k \times k$ identity matrix and $1_k$ is a vector of $k$ ones. Since the pre-shape vectors thus obtained lie on a $(k-1)$-dimensional complex hyper-sphere of unit radius, the distance between two pre-shape vectors is non-Euclidean in nature.
Consider two complex configurations $X$ and $Y$ with corresponding pre-shapes $\alpha$ and $\beta$. The full Procrustes fit between $X$ and $Y$ is chosen so as to minimize

$d(Y,X) = \| \beta - \alpha s e^{j\theta} - (a + jb) 1_k \|$ (5.43)

where $s$ is a scale, $\theta$ is the rotation and $(a + jb)$ is the translation. The full Procrustes distance $d_F(Y,X)$ and the partial Procrustes distance $d_P(X,Y)$ are defined as

$d_F(Y,X) = \inf_{s, \theta, a, b} d(Y,X)$ (5.44)

$d_P(X,Y) = \inf_{\Gamma \in SO(m)} \| \beta - \alpha \Gamma \|$ (5.45)

While the translation that minimizes the full Procrustes fit is given by $(a + jb) = 0$, the scale $s = |\alpha^* \beta|$ is close to unity. The rotation angle $\theta$ that minimizes the full Procrustes fit is given by $\theta = \arg(\alpha^* \beta)$. The partial Procrustes distance between the configurations $X$ and $Y$ is obtained by matching their respective pre-shapes $\alpha$ and $\beta$ over different rotation parameters. The Procrustes distance $\rho(X,Y)$ is the closest great-circle distance between $\alpha$ and $\beta$ on the pre-shape sphere. The Procrustes distance, the full Procrustes distance and the partial Procrustes distance are trigonometrically related to one another:

$d_P(X,Y) = 2 \sin(\rho / 2)$ (5.46)

We observed that the distance measures defined above resulted in similar recognition results. The Procrustes mean shape $\mu$ is defined as $\mu = \arg \inf_{\mu} \sum_i d_F^2(Y_i, \mu)$, where the $Y_i$ correspond to shape vectors. A tangent space is constructed with the Procrustes mean shape serving as the pole, and this acts as an approximate linear space for the data. The Euclidean distance in this tangent space is a good approximation to the various Procrustes distances $d_F$, $d_P$ and $\rho$ in shape space in the vicinity of the pole. The Procrustes tangent coordinates of a pre-shape $\alpha$ are given by $v(\alpha, \mu) = \alpha \alpha^* \mu - \mu |\alpha^* \mu|^2$, where $\mu$ is the Procrustes mean shape of the data.
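A minimal sketch of these shape computations, with planar landmarks encoded as complex numbers (a common device for 2-D shapes; the function names are ours):

```python
import numpy as np

def preshape(X):
    """Centered pre-shape of k planar landmarks given as a complex k-vector:
    centering removes translation, unit-norm scaling removes scale (Eqn. 5.42)."""
    z = X - X.mean()
    return z / np.linalg.norm(z)

def procrustes_distances(a, b):
    """Procrustes distance rho between two pre-shapes a and b, with the
    trigonometrically related full and partial Procrustes distances."""
    rho = np.arccos(np.clip(np.abs(np.vdot(a, b)), 0.0, 1.0))  # great-circle distance
    d_full = np.sin(rho)             # d_F
    d_partial = 2 * np.sin(rho / 2)  # d_P (Eqn. 5.46)
    return rho, d_full, d_partial
```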
We developed three independent systems to study the role of shape in human gait recognition. In the first approach, from a sequence of frames we extract six stances from each gait cycle that best characterize the walking pattern of the individual. We compute the exemplar stances for the walking sequence by computing the mean of the six stances extracted from each gait cycle. We compute the similarity measure between two sequences by computing the correlation between their exemplar stances; recognition is based on the similarity scores thus derived. Secondly, we employ a DTW algorithm: Kendall's shape vector is used as the feature vector and the Procrustes distance measures are used to align the two sequences. Apart from these variations, the recognition framework is very similar to the one discussed in the previous section. Finally, with Kendall's shape vector as the feature vector, we perform recognition using Hidden Markov Models. The stance correlation method uses only shape cues for recognition; the DTW and HMM based gait recognition approaches exploit kinematics in addition to shape.

Dynamical Models
We use two time series models, autoregressive (AR) and autoregressive moving average (ARMA), to study the role of kinematics in human gait recognition.
In the stance-based AR model, a video sequence of an individual walking is divided into N distinct stances. Within each stance, we study the dynamics of the shape vector. We extract the pre-shape vector from each stance and project it onto the tangent space. The time series of the tangent space projections is modeled as a Gauss-Markov process,

$\alpha_j(t) = A_j \alpha_j(t-1) + w(t)$ (5.47)

where $w$ is a zero-mean white Gaussian noise process and $A_j$ is the transition matrix corresponding to the $j$-th stance. For convenience and simplicity, $A_j$ is assumed to be a diagonal matrix. $A_j$ is computed for each $j \in (1, 2, \ldots, N)$.
Given the transition matrices of the gallery and the probe sequence, the distances between the corresponding transition matrices are added to obtain a measure of the distance between their kinematic models. If $A_j$ and $B_j$ (for $j = 1, 2, \ldots, N$) represent the transition matrices for two sequences, then the distance between the models is defined as

$D(A,B) = \sum_{j=1}^{N} \| A_j - B_j \|_F$ (5.48)

where $\| \cdot \|_F$ denotes the Frobenius norm. The model in the gallery that is closest to the model of the given probe decides the identity of the person.
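A sketch of the stance-based AR model: fitting the diagonal transition matrix of Eqn. 5.47 by least squares per tangent-space coordinate, and comparing two models via Eqn. 5.48 (names are ours, for illustration):

```python
import numpy as np

def fit_diagonal_ar(a):
    """Least-squares fit of the diagonal A in a(t) = A a(t-1) + w(t)
    (Eqn. 5.47). a: (T, d) tangent-space projections within one stance."""
    num = (a[1:] * a[:-1]).sum(axis=0)    # per-coordinate <a(t), a(t-1)>
    den = (a[:-1] ** 2).sum(axis=0)
    return np.diag(num / den)

def ar_model_distance(A_list, B_list):
    """Eqn. 5.48: sum of Frobenius norms over the N per-stance matrices."""
    return sum(np.linalg.norm(A - B, ord='fro')
               for A, B in zip(A_list, B_list))
```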
In the ARMA model, we learn a dynamical model for human gait and perform recognition by comparing the learnt dynamical models. The dynamical model is a continuous-state, discrete-time model, the parameters of which lie in a non-Euclidean space. Let us assume that the time series of shapes is given by $\alpha(t)$, $t = 1, 2, \ldots, \tau$. Then an ARMA model is defined as

$\alpha(t) = C x(t) + w(t); \quad w(t) \sim N(0, R)$ (5.49)

$x(t+1) = A x(t) + v(t); \quad v(t) \sim N(0, Q)$ (5.50)

Also, let the cross-correlation between $w$ and $v$ be given by $S$. The parameters of the model are given by the transition matrix $A$ and the state matrix $C$. We note that the choice of the matrices $A, C, R, Q, S$ is not unique. However, we can transform this model to the "innovation representation" [207], which is unique.
The algorithm is described in [207, 208]. Given observations $\alpha(1), \alpha(2), \ldots, \alpha(\tau)$, we learn the parameters of the innovation representation, given by $\hat{A}$, $\hat{C}$ and $\hat{K}$, where $\hat{K}$ is the Kalman gain matrix of the innovation representation. Note that in the innovation representation the state covariance matrix $\lim_{t \to \infty} E[x(t) x^T(t)]$ is asymptotically diagonal. Let $[\alpha(1)\ \alpha(2)\ \alpha(3)\ \ldots\ \alpha(\tau)] = U \Sigma V^T$ be the singular value decomposition of the data. Then

$\hat{C}(\tau) = U$ (5.51)

$\hat{A}(\tau) = \Sigma V^T D_1 V (V^T D_2 V)^{-1} \Sigma^{-1}$ (5.52)

where $D_1 = [0\ 0;\ I_{\tau-1}\ 0]$ and $D_2 = [I_{\tau-1}\ 0;\ 0\ 0]$.


Subspace angles [209] between two ARMA models are defined as the principal
angles (~, i = 1,2,...,n) between the column spaces generated by the observability
spaces of the two models extended with the observability matrices of the inverse

99
models [210]. The subspace angles between two ARMA models ([At,C"Kd and
[Az,Cz,Kz] can be computed by the method described in [210]. Using these subspace
angles 0;, i = 1,2,...,n, three distances namely Martin distance(d M) , gap distance(dg )
and Frobenius distance (dr-) between the ARMA models are defined as follows:
2 1 (5.53)
d M = In Il--:'2--'--
n

cos (OJ)
;=1

dg = sin (Oma.) (5.54)


(5.55)
d~ = 2:t sin 2 ( OJ)
;=1

The results obtained using the different distance measures were comparable. We
later report results arrived at using the Frobenius distance (d/).
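Given the subspace angles (whose computation follows [209, 210] and is outside this sketch), the three distances of Eqns. 5.53-5.55 are direct to evaluate:

```python
import numpy as np

def arma_model_distances(theta):
    """Martin, gap and Frobenius distances from the subspace angles theta_i
    (in radians) between two ARMA models (Eqns. 5.53-5.55)."""
    theta = np.asarray(theta)
    d_martin = np.sqrt(np.log(np.prod(1.0 / np.cos(theta) ** 2)))
    d_gap = np.sin(theta.max())
    d_frobenius = np.sqrt(2.0 * np.sum(np.sin(theta) ** 2))
    return d_martin, d_gap, d_frobenius
```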

5.4.4 Results

HMM Based Gait Recognition


Using the proposed framework, we report recognition experiments on three databases: the UMD database [81], the CMU database [80] and the USF database [79]. The results reported on the CMU and UMD databases use the indirect approach. We report recognition results on the USF database employing both the indirect approach and the direct approach.

Results on the CMU database

[Cumulative match characteristic curves for: train on fast walk, test on fast walk; train on slow walk, test on slow walk; train on fast walk, test on slow walk; train on slow walk, test on fast walk; train on ball walk, test on ball walk (8 cycles each)]
Figure 5.35 HMM Recognition Results on CMU Database

CMU Database
The CMU database comprises video sequences of 25 subjects walking on a treadmill under three conditions: (a) slow pace, (b) fast pace, (c) carrying a ball. We report results on the following experiments performed on this dataset: (1) slow walk vs. slow walk, (2) fast walk vs. fast walk, (3) slow walk vs. fast walk, (4) fast walk vs. slow walk, (5) carrying a ball vs. carrying a ball. For the experiments where the gallery and the probes are identical, training is performed using the first half of the walking sequence and testing is performed using the other half. The recognition results are shown in Fig. 5.35.

(a) cumulative match characteristic (b) recognition confidence
Figure 5.36 HMM Recognition Results on UMD Database

UMD Database
The recognition results on the UMD dataset are plotted in Fig. 5.36(a). We assessed the confidence in the recognition score by adopting a leave-one-out strategy. Fig. 5.36(b) is a plot of the variance in the recognition score at each rank.

[Cumulative match scores, with the training set as gallery, for probes A to G]

Figure 5.37 HMM Recognition Results on USF Database

USF Database
The USF database comprises walking sequences of 122 individuals with variations in viewing direction, walking surface, shoe type etc. The database also comprises walking sequences in which the individual carries a briefcase, and walking sequences captured months apart. The nomenclature adopted to represent the covariates is as follows: the surface variations are indicated as G (grass) and C (concrete); A and B indicate the different shoe types; the camera positions are indicated as L (left) and R (right); B in the fourth position refers to briefcase sequences and t2 refers to sequences captured 6 months after the initial data collection. We report results on 12 experiments: the (G,A,R) sequences comprised the gallery; the 12 probes were: A(G,A,L), B(G,B,R), C(G,B,L), D(C,A,R), E(C,B,R), F(C,A,L), G(C,B,L), H(G,A,R,B), I(G,B,R,B), J(G,A,L,B), K(G,A,R,t2), L(C,A,R,t2). Table 5.13 reports the recognition results on an earlier version of the USF database and draws comparisons with the baseline algorithm [76]. Fig. 5.37 displays the cumulative match scores for each of the 12 experiments on the USF database, using the direct approach.

[Rank 5 identification rates (%): 100, 90, 90, 35, 35, 60, 50]
Table 5.13 HMM Recognition Results on USF Database (version1)

DTW Based Gait Recognition


Upon computation of the feature vectors from the training and test sequences, instead of computing a direct frame-to-frame similarity between the two sequences, we perform DTW on the two sequences before computing the similarity measure between them. We compute two similarity matrices from matching sequences using the left and the right projection vectors independently. Recognition is performed by summing the two similarity matrices thus obtained. Table 5.14 reports the DTW-based recognition results on the CMU database [80]. Fig. 5.38 plots the recognition results on the USF database [79].

Train vs. Test | Left vector | Right vector | Fusion
Fast vs. Fast | 92 | 88 | 92
Slow vs. Slow | 92 | 97 | 100
Slow vs. Fast | 50 | 50 | 50
Fast vs. Slow | 68 | 55 | 59
Table 5.14 DTW based recognition results (rank 1) on CMU database

[Identification rates using the left projection vector, the right projection vector, and their fusion]
Figure 5.38 DTW Identification Rates on USF Database

One can extend the recognition approach discussed above to frontal gait sequences as well. Though arm swings and leg swings are less apparent in frontal gait sequences, the outer contour of the silhouette does contain the signature of the individual. We extract the width vectors from the silhouette after appropriate normalization, taking into account the change in height as the individual walks towards or away from the camera. We employ an approach similar to the one discussed above to perform recognition on frontal gait sequences. On datasets where both the frontal and the side view of one's gait are available, we combine the recognition results obtained from each of the two views [202]. Results are tabulated in Tables 5.15 and 5.16.

Figure 5.39 Rank 1 Recognition Scores from AR, ARMA, Baseline, Stance
Correlation, DTW and HMM

View | Rank 1 (%)
Frontal view | 91
Side view | 93
Fusion of frontal and side | 96
Table 5.15 Effect of Fusion of Frontal and Side View of Gait on CMU Dataset

View | CMS at rank 1 (%) | CMS at rank 5 (%)
Frontal view | 66 | 86
Side view | 58 | 74
Fusion of frontal and side | 85 | 95
Table 5.16 Effect of Fusion of Frontal and Side View of Gait on UMD Dataset

Shape and Kinematics


We conducted recognition experiments on the USF dataset. Fig. 5.39 compares the recognition performance of the different algorithms on the USF database. On the CMU dataset, which comprises 25 individuals performing four activities, namely slow walk, fast walk, walking on an inclined surface and walking with a ball, we performed recognition experiments both within an activity and across activities. Table 5.17 reports identification rates on the CMU data using stance correlation. Further, we performed activity classification on the CMU database. We illustrate the effect of kinematics and shape on activity recognition by building an ARMA model for the top half and the bottom half of the silhouette separately and performing recognition. Fig. 5.40 illustrates the similarity matrices thus obtained. We also report results on activity recognition on the MOCAP dataset (available from Credo Interactive Inc. and Carnegie Mellon University). The MOCAP dataset consists of the locations of 53 joints during a typical realization of several different activities. We use the information on human joints to build an AR and an ARMA model for each activity. The similarity matrix computed in each case is displayed in Fig. 5.41. The ARMA model had better discriminative power than the AR model, as is evident from the figure.

Activity | Identification rate (%)
Slow walk | 48
Fast walk | 28
Walk with ball | 12
Inclined lane | 92
Table 5.17 Identification Rates on the CMU Dataset using Stance Correlation
Method (Figures in Braces denote HMM identification rates)

(a) Top half of silhouette (b) Bottom half of silhouette


Figure 5.40 Similarity matrix using ARMA model on the top and bottom half of
silhouette

(a) AR model (b) ARMA model


Figure 5.41 Similarity Matrix using AR and ARMA Model

6 Model-Based Approaches

6.1 Overview

The model-based approaches aim to derive the movement of the torso and/or the legs. The distinction of a structural approach is that it uses static parameters, illustrated in Fig. 6.1(a), whereas a model can describe the (relative) motion of the angles (α, β and γ) between the limbs, shown in Fig. 6.1(b). As earlier, these angles can also be measured relative to the vertical.

[(a) structural parameters, e.g. height and stride (b) modeling of inter-limb angles]


Figure 6.1 Model-based Approaches to Gait Description

BenAbdelkader et al.'s approach using self-similarity and structural stride parameters (stride and cadence) [102] is a prime example of a model-based approach that uses structural measures. Cadence was estimated via periodicity; stride length was estimated as the ratio of the distance traveled (given calibration) to the number of steps taken. By analysis of the UMD data, the variation in stride length with cadence was found to be linear and unique for different people, and was used not just for recognition, but also for verification.
Bobick et al. from GaTech used structural human stride parameters [106], the other example of a structural model-based approach. The method used the action of walking to derive relative body parameters which described the subject's body and stride. The within-class and between-class variation were analyzed to determine potency and, on motion capture data, the relative body parameters appeared to have greater discriminatory power than the stride parameters. The approach also included a measure of confusion, which evaluated how much the identification probability is reduced following a measurement, as well as a cross-condition mapping that allowed application in conditions which varied from the original analysis; this was an early approach to capitalize on the viewpoint invariance associated with model-based recognition approaches. Another structural approach, by Tanawongsuwan et al., used joint-angle trajectories, derived from markers placed on joint positions in the legs and on the thorax [161]. A simple method was used to estimate the planar offsets between the marker positions and the underlying skeleton, and the variation in joint angles (such as the orientation of the femur relative to the back) with time was then derived. A variance-compensated time warping was used to compensate for temporal variations. Evaluation was conducted on a small database and showed that recognition could be achieved. Given the small size of the database, a confusion metric was derived, aimed at showing likely performance on a larger database. This theme was continued by Johnson et al. [143], who showed how the performance could be predicted for a much larger database from the same data, estimating performance capability on a database five times larger.
Yam et al. from Southampton extended the earlier model-based system to describe both legs and to handle running as well as walking [115]; an alternative model-based system, developed by Wagg et al., uses evidence gathering as an initial step, followed by model-based analysis driven by anatomical constraints and data, and was evaluated on the Southampton dataset with an analysis of feature potency [116]. Both will be described in more detail in the next section.
In the study from the CAS Institute of Automation [117], a model-based approach derived the dynamic information of gait by using a Condensation framework to track the walker and to recover the joint-angle trajectories of the lower limbs. Again, these will be considered in more detail next.
Zhang et al.'s approach [118] concerned the change in orientation of human limbs. In fact, the extraction is model-based and the description is structural, making this a blend of the two model-based approaches described so far. The lower limbs were represented by trapezoids and the upper body was planar, without the arms. Given distances normalized by the height of the thorax, the human body posture was represented by a set of distance measurements and inclinations of its constituent parts. The gait features were extracted from gait sequences by the Metropolis-Hastings method to match body parts to the image data. The sequence fit was achieved by minimizing an energy functional which described: the difference of the body from the silhouette derived from the image; the difference between moving silhouettes; and the difference between the modeled appearance and the silhouette. This allows for the derivation of elevation angles, which describe the dynamics of gait, and trajectories of joint positions, which describe its spatiotemporal history. The approach thus centered on capturing temporal differences by extracting the elevation of the knee and ankle and the width at the knees and ankles. As these are periodic, they were described by Fourier analysis and then classified via an HMM. The procedure was evaluated on the CMU Mobo and on the NIST databases and shown to have discrimination capability, with better results on the Mobo database. Clearly it enjoys the advantages of model-based techniques in that the data used for classification is intimately linked to gait itself.
Again, there are emergent studies of the potency of the various model-based measures, which is important for camera placement in applications and for the development of new recognition techniques, as well as studies of viewpoint invariance, which reflects one of the major advantages of modeling, namely that invariant properties can be achieved.

6.2 Planar Human Modeling

6.2.1 Modeling Walking and Running

The University of Southampton continued its earlier use of pendular models. The extensions aimed to use model-based approaches that could achieve recognition whether the subject was walking or running. These modeled the thigh and the leg as coupled penduli. The process again aimed for a frequency-based description but, rather than the earlier direct extraction of the frequency components, the extraction now started with extraction of the front of the limbs, whose motion was then described by its frequency content. As such, the new model-based approach provided direction for the extraction of the front of the limbs, which was then refined by analysis of the image. The model for the thigh angle was the same pendular model as in the previous approach. The thigh was assumed to drive the motion of the freely-swinging leg.
The evidence gathering technique comprised two stages: i) global/temporal template matching across the whole sequence and ii) local template matching in each image. The aim of the first stage was to search for the best motion model that can describe the leg motion well over a gait cycle, i.e. the gross motion of a complete gait cycle. This matched a line, which moves according either to the structural or to the motion model, to the edge maps of a whole sequence of images to find the desired moving object. This gave estimates of the inclination of the thigh and of the lower leg, which were refined by a local matching stage in each separate image. The first stage determined the values of the parameters that maximize the match of the moving line to the edge data, evaluated across the whole sequence, as

$A, B, C, D, \omega_K, \omega_T = \max \left( \sum_{t \in T} \sum_{x \in Image} \sum_{y \in Image} \left( P_{x,y}(t) = MT_{x,y}(t) \right) \right)$ (6.1)

where $P$ is the image and $MT$ is the motion template, with dynamics derived from either the structural or the motion model; the motion model is based on two parts. The thigh movement $\theta_T$ is

$\theta_T = A \cos(\omega_T t) + B \sin(\omega_T t)$ (6.2)

where $A$ and $B$ are constants and $t$ is a time index. The motion of the knee $\theta_K$ is given by

(6.3)

where $C$ and $D$ are constants. $m_t$ is actually the mass of the thigh; its inclusion is motivated by the differential solution to the motion of the knee, and it is set to unity.
Having found the best set of parameters, the estimated thigh and lower-leg inclinations for each frame were then generated. These angles formed the basis of a local search for the best-fit line to the data in each single image, as

(6.4)

where $\Lambda_T$ is the line resulting from application of the motion template, with variation of up to ±5° in inclination and ±5 pixels in vertical and horizontal translation. This aimed to ensure that the best-fit angle and position were found in each frame. The estimated angles from the first stage vs. their manually derived estimates are shown in Fig. 6.2, and these look encouragingly close. The second-stage match to a sequence of image data is shown in Fig. 6.3; again, high fidelity can be observed.
[Plots of limb rotation over the gait cycle: the motion model superimposed on manually labeled data]

(a) thigh motion for walking (b) leg rotation for running
Figure 6.2 Extended Motion Model Superimposed on Manually Labeled Data [115]

A particular advantage of this approach is that it gave an automated means for recognition by walking and by running gait. The data is actually treadmill-derived, to expedite image sequence capture, and six sequences were captured for each of 20 subjects (in a dataset different from Southampton Database B). The approach showed capability for recognition in each case, and example recognition results, given in Fig. 6.4 for k = 1 and k = 3, show that running appeared more recognizable than walking, with recognition rates exceeding 90% and 80%, respectively. This is presumably because running is a forced motion. This also gives illustrative results that reflect the capability to handle image noise and reduction in image resolution, which naturally detract from recognition performance but continue to confirm the greater perspicacity of running vs. walking.

(a) running (b) walking


Figure 6.3 Leg Motion Extraction Results by Temporal Template Matching [115]

[Bar charts of recognition rate for clean data, 25% noise and low resolution, for k = 1 and k = 3]

(a) walking (b) running


Figure 6.4 Performance Analysis of Walking and Running [115]

This approach also affords greater invariance than silhouette-based approaches. As already shown, it can handle walking and running. Given that the signature is centered on frequency-domain components, these can be related to give a walking/running transform. This transform is unique to each form of running and walking for each subject. Given that running is forced, the transform only applies for each running style, but it can be used not only for recognition by itself (as shown for a small number of subjects in Fig. 6.5) but also to transform signatures from running to walking and vice versa. A further advantage is that the approach affords invariance to small changes in viewpoint, since the limb inclination will not vary. This is limited to small angles only, by the displacement of the hips and by the rotational components of limb movement.


Figure 6.5 Three Components of Walking/Running Transformation in 3D [141]

6.2.2 Model-Based Extraction and Description

More recently, the model-based approaches have been extended further at the University of Southampton [116], with an investigation of discriminatory potential. The basic paradigm was similar: to model the motion of the legs over the whole sequence, but in this case the body shape was explicitly included. Essentially, an estimate of human shape is derived by shifting and accumulating the edge images according to

(6.5)

where $A_v$ is the accumulation for velocity $v$ (in pixels per frame), $E_t$ is the edge strength image at frame $t$, $i$ and $j$ are coordinate indices, $N$ is the number of frames in the gait sequence and $dy_t$ is the y-displacement of the subject from their center of oscillation at frame $t$. Each moving object in the scene will appear as a peak in a plot of maximal accumulation intensity against velocity. If the subject is the most significant moving object in the scene (in terms of edge strength and visibility), their velocity can be inferred by selecting the highest peak in this plot.
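The velocity search just described can be sketched as follows (a hypothetical helper of our own naming; the published formulation in Eqn. 6.5 also compensates for the vertical oscillation dy_t, which is omitted here):

```python
import numpy as np

def estimate_velocity(edge_frames, candidate_velocities):
    """Shift-and-accumulate search for the walker's horizontal velocity.
    edge_frames: list of (H, W) edge-strength images E_t;
    candidate_velocities: iterable of velocities v in pixels per frame.
    For each v, frames are shifted back by v*t and summed; the velocity
    whose accumulation shows the largest peak is returned."""
    best_v, best_peak = None, -np.inf
    for v in candidate_velocities:
        acc = np.zeros_like(edge_frames[0], dtype=float)
        for t, E in enumerate(edge_frames):
            acc += np.roll(E, -int(round(v * t)), axis=1)  # undo the motion
        peak = acc.max()       # maximal accumulation intensity for this v
        if peak > best_peak:
            best_v, best_peak = v, peak
    return best_v
```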

(a) accumulation (b) period and phase (c) global models (d) local model
Figure 6.6 Overview of Extended Accumulation Process [116]
The subject is then extracted by matching a coarse person-shaped template, Fig. 6.6(a). This template is constructed from mean anatomical data [20] scaled to the subject's height. It employs a trapezium enclosing the motion of the legs, and rectangular sections define the expected positions of the thorax and the head. A more accurate model of the subject's bulk shape uses an ellipse each for the torso and for the head. Four line segments are used to model each leg and a rectangle each foot; the parameters describing leg segment lengths and radii are initially set to fixed proportions of the subject's height (where the radii are measurements of leg width at chosen points). The parameters describing the head and torso are determined separately by template matching within the locality of the initial segmentation, constrained by mean anatomical proportions. Although all shape dynamics are lost in the temporal accumulation process, it is still possible to estimate the amplitude of hip rotation, which may be used to aid articulated motion estimation. Gait frequency is determined by finding the frequency and phase that minimize the error function given by Eqn. 6.6. This minimization can be performed very quickly over the typical range of frequency and phase expected for a walking person.

(6.6)

where $\chi_s$ is the energy functional to be minimized, $N$ is the number of frames, $S_t$ is the sum-normalized edge strength measured at frame $t$, $A_s$ is the sinusoid amplitude (a fixed ratio of the signal mean magnitude), and $\omega_i$ and $\varphi_j$ are the proposed gait frequency and phase (the offset phase was determined empirically and holds for the baseline approximation on this large number of subjects). Note that the dominant signal frequency is twice the gait frequency, and the constant offset phase shift is required to align the two sinusoids. Data collected from clinical gait studies [20, 22] was used to build prototypical models for hip, knee, ankle and pelvis rotation. The use of mean gait models allows extraction of approximate joint positions for the subject, but this is not sufficient for recognition purposes; the estimation process assumes average gait motion, or no individuality. To capture individual variation, adaptation of the mean leg motion models (Fig. 6.6(c)) is required. However, before matching the leg shape models to the image data, it is necessary to improve the estimate of the radius of the subject's leg at the hip, knee and ankle. The initial estimates are computed as fixed proportions of the subject's height, which may not be appropriate for certain types of clothing (baggy trousers, shorts or skirts, for example). An improved estimate is obtained by computing a line Hough transform for each frame within the upper and lower leg regions. Within each Hough space we find the peak accumulation satisfying constraints on the expected rotation of the leg and the distance between the two lines (the width of the leg). An overall estimate of leg width is computed as the mean of the best line parameters from each frame, weighted by accumulation intensity.

Table 6.1 Potency of Model-Based Gait Measures [116]

[Cumulative match characteristic curves for the indoor and outdoor datasets]

Figure 6.7 Cumulative Match Characteristic for Model-based Analysis

The recognition measures were analyzed using ANOVA for the performance on the Southampton indoor and outdoor datasets: Database A (Table 3.3) and the outdoor Database E (Table 3.2) [116]. A cumulative match characteristic is shown in Fig. 6.7, for which the Correct Classification Rate (CCR) is given by the CMC at a rank of one, showing a CCR of approximately 84% on the indoor dataset and 64% on the outdoor dataset, reflecting the complexity of outdoor data. This allows for an analysis of potency, shown in Table 6.1, with the highest F-statistic giving the greatest discriminatory capability and hence the highest rank. This is similar to the analyses of the potency of silhouette measures. The analysis suggests that the majority of the system's discriminatory capability is derived from gait frequency (cadence) and from some static shape parameters. Of course, these shape parameters will be highly dependent on clothing, which may limit the utility of performing recognition solely on the basis of these parameters. These results may in part explain why some approaches using primarily static parameters [106] or cadence [102] achieve good recognition capability from few parameters. There is a significant reduction in discriminatory capability in the outdoor dataset compared to the indoor dataset, resulting from the lower extraction accuracy, but there is still a strong case for recognition potential using this data. A further analysis [157] improved the extraction by using snakes to evolve from the evidence-gathering derived approximations, as shown in Fig. 6.8, and this shows improved recognition capability.

Figure 6.8 Extending Model-Based Analysis by Snake Extraction [157]

6.3 Kinematics-based People Tracking and Recognition in 3D Space

6.3.1 Model-based People Tracking using Condensation

We present an effective approach to tracking a walking human, based on both a body model and a motion model, in a Condensation framework [219]. This approach was developed at the CAS Institute of Automation. The Condensation framework is very
attractive because it can handle clutter and the fusion of information. Fig. 6.9 gives the framework of our tracking approach. In tracking, we maintain, at successive time-steps, a sample set of poses that are 12-dimensional vectors. The sample set is derived either from the tracking result of the previous frame or, for the first frame, from a specific initialization process. For each new frame, the sample set is subjected to the predictive steps. First, samples undergo drift according to the previous pose, the motion model and the motion constraints. The second predictive step, diffusion, is random, and the drifted samples may split. Our dynamic model directs the predictive steps. After prediction, our PEF (pose evaluation function) measures the similarities between the image data and the projected human body model with the diffused poses. The posterior mean pose of the tracked person can then be generated from the sample set by weighting with the similarities. The tracking results are finally used for gait recognition. Other tracking approaches are also available, specifically aimed at tracking the human body in monocular video data. In these, a shape-encoded particle filtering technique is used to track body parts [107, 108]. Later, angle motions were estimated via multiple cameras [109] with a kinematic chain to model the human motion. In this, the 2D motion of pixels is related to the 3D motion of body parts via a projection model, a rigid model of the human torso and the kinematic chain modeling the human body parts.

[Initialization → Dynamic Model (Motion Model & Motion Constraints) → Pose Evaluation Function (Body Model) → Gait Recognition]

Figure 6.9 Framework of 3D Model-Based Human Tracking and Gait Recognition

Human Body Model


The human body model used here, similar to [220], is composed of 14 rigid body parts: upper torso, lower torso, neck, two upper arms, two lower arms, two thighs, two legs, two feet and a head. Each body part is represented by a truncated cone, except for the head, which is represented by a sphere. The parts are connected to each other at joints, the angles of which are represented as Euler angles. We do not model the hands because they are very complex and of little importance in human body tracking. Fig. 6.10 gives some perspective views of the human body model used here. This is a generic model, but for person-specific tracking we must adjust its dimensions to individualize the model.

Figure 6.10 Human Body Model Projected into the Image Plane from 5 Viewing
Angles

The above human body model in its general form has 34 DOFs: 2 DOFs for each body part (14 × 2), 3 DOFs for its global position (translation), and 3 DOFs for its orientation (rotation). To search quickly in a 34-dimensional state space is extremely difficult. However, in the case of gait recognition, people are usually captured walking along a line when the camera is installed in a desirable configuration (for convenience, we assume that people walk parallel to the image plane; with little modification, our approach can also be applied to other fixed directions), and the movements of the head, neck and lower torso relative to the upper torso are very small. Therefore the state space can be naturally reduced with such constraints. Here, we assume that only the arms and legs have relative movements when the upper torso moves along a line. Furthermore, each joint thus has only one DOF. Accordingly, this reduces the dimensionality of the state space to 12: 1 DOF for each of the 10 joints mentioned above, plus 2 DOFs for the global position. We represent the position and posture by a 12-dimensional state vector $P = \{x, y, \theta_1, \theta_2, \ldots, \theta_{10}\}$, where $(x, y)$ is the global position of the human body and $\theta_i$ is the $i$-th joint angle. This state vector describes the relative position of the different body parts.
In model-based tracking, we need to synthesize and project the body model into the image plane given the camera parameters and the state vector $P = \{x, y, \theta_1, \theta_2, \ldots, \theta_{10}\}$. In other words, we need to calculate the camera coordinates of each point in the body model and transform them to image coordinates. To locate the positions of the model parts in the camera coordinate frame, each part is defined in a local coordinate frame with the origin at the base of the truncated cone (or the center of the sphere). Each origin corresponds to the center of rotation (the joint position). We represent the human body model as a kinematic tree, with the torso at its root, to order the transformations between the local coordinate frames of the different parts. Therefore, the camera coordinates of each part are formulated as the product of the transformation matrices of all the local coordinates on the path from the root of the kinematic tree to that part. The geometrical optics is modeled as a pinhole camera with a transformation matrix $T$ such that $X_i = T \cdot X_c$, where $X_i$ and $X_c$ are the image and camera coordinates of a point on the human body, respectively (see [181] for more information).
Learning Motion Model and Motion Constraints
A motion model, encoding the dynamics of the human body, can be used in tracking to greatly reduce the computational cost while achieving better results. As a highly constrained activity, the gait pattern of human walking is symmetric, periodic and of little variation across a wide range of people. So it is relatively easy to learn a compact and effective motion model for human gait from limited training data. Here, our motion model for human gait (hereafter referred to as the motion model) is learnt from semi-automatically acquired training examples and formulated as Gaussian distributions. Also, the dependency of joint angles is analyzed to explore the motion constraints that, together with the motion model, are integrated into the dynamical model to focus on the heavily weighted samples in the Condensation framework.

Learning the Motion Model


In the learning process, training data (9 examples from 5 different subjects) corresponding to the motion parameters $P = \{x, y, \theta_1, \theta_2, \ldots, \theta_{10}\}$ were semi-automatically acquired by specially designed software. Several feature points in each frame are marked manually, and the motion parameters derived from these features are computed and analyzed automatically. Some acquired data are illustrated in Fig. 6.11(a), which reveals that the temporal curves have different periods and phases. Therefore the walking cycles in each training example must be rescaled to the same length and aligned to the same phase before learning the motion model.
To compute the period and phase of a sequence, we define the correlation function $corr$ with respect to two matrices $A_{m \times n}$, $B_{m \times n}$ of the same size:

$corr_i(A,B) = \frac{\| A \cdot T_B(i) \|}{\| A \| \| T_B(i) \|}, \quad i = 1, 2, \ldots, m$ (6.7)

where $A \cdot B$ returns the matrix whose elements are the products of the corresponding elements of $A$ and $B$, and $T_B(i)$ removes the first $i-1$ rows of $B$ and adds $i-1$ rows of zeros to the end of $B$. When $A$ and $B$ have different numbers of rows, we append enough rows of zeros to the end of the matrix that has fewer rows to make their row counts equal.
We form a matrix $A'_i$ for each training example $i$, with the row index indicating the time step and the column index indicating the motion parameters, where $i = 1, \ldots, m$. Then the period $T_i$ of the $i$-th example is computed from the autocorrelation $corr(A'_i, A'_i)$; the interval between two dominant peaks is chosen as the period. To rescale the walking cycle to the same length $T$ (here $T = 100$), the B-spline interpolation algorithm is applied to the example $A'_i$ with the scalar $\alpha_i = T / T_i$. Given that $A'_i$ is rescaled to $A_i$, a specific one among the $A_i$ ($i = 1, \ldots, m$), e.g. $A_1$, is selected as the reference. Then the phase $b_i$ of each example $A_i$ relative to the reference example $A_1$ is indicated by the predominant peak in the cross-correlation $corr(A_i, A_1)$. In all,

(6.8)

are the normalized examples with the same period and phase. The segments $B_i(1:T), B_i(T+1:2T), \ldots$, $i = 1, \ldots, m$, renamed as $W_j$ with $j = 1, \ldots, n$, are exactly all of
the normalized walking cycles. Then our motion model is empirically represented as
Gaussian distributions G_{k,t}(u_{k,t}, σ_{k,t}) for each joint angle k (k = 1...10) at any phase t (t = 1...T) in the walking cycle, with

u_{k,t} = (1/n) Σ_{j=1}^{n} W_j(t, k),  k = 1...10, t = 1...T        (6.9)

σ_{k,t}² = (1/n) Σ_{j=1}^{n} (W_j(t, k) − u_{k,t})²,  k = 1...10, t = 1...T        (6.10)

Figs. 6.11(b) and (c) are temporal models of the joint angles of the left thigh and left knee. The learning and representation of our motion model are compact, and the model shows great effectiveness in estimating the prior distribution of the initial pose and in predicting the new pose for the next frame.
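As a concrete illustration, the following Python sketch implements the correlation function of Eqn. 6.7 and the per-phase Gaussians of Eqns. 6.9 and 6.10. The function names, the NumPy realization and the zero-padding details are illustrative assumptions, not the authors' implementation.

import numpy as np

def corr(A, B):
    # Eqn. 6.7: correlate A with B shifted down by i-1 rows, i = 1..m.
    m = max(A.shape[0], B.shape[0])
    A = np.vstack([A, np.zeros((m - A.shape[0], A.shape[1]))])
    B = np.vstack([B, np.zeros((m - B.shape[0], B.shape[1]))])
    out = np.zeros(m)
    for i in range(m):
        TB = np.vstack([B[i:], np.zeros((i, B.shape[1]))])  # T_B(i+1)
        denom = np.linalg.norm(A) * np.linalg.norm(TB)
        out[i] = np.linalg.norm(A * TB) / denom if denom > 0 else 0.0
    return out

def learn_motion_model(cycles):
    # Eqns. 6.9-6.10: per-phase mean and standard deviation of each of
    # the 10 joint angles. cycles: list of n arrays, each T x 10.
    W = np.stack(cycles)            # n x T x 10
    return W.mean(axis=0), W.std(axis=0)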

Figure 6.11 Motion model and motion constraints: (a) joint angles of the left knee of 4 different people walking with various periods and phases; (b), (c) temporal models of joint angles of the left thigh and left knee during a walking cycle; (d) motion constraint for the elbow joint, where dark lines and shaded areas indicate the mean and standard deviation of the corresponding distribution.

Motion Constraints
Although the motion model describes the basic pattern of walking, it does not contain all information about walking. Therefore we derive motion constraints from the training data by further exploring the dependency of neighboring joints: shoulder and elbow, thigh and knee, knee and ankle. Obviously, in a walking activity, the movements of the lower arm and the upper arm are correlated and regular, so the shoulder joint and the elbow joint are not independent. We assume that the lower arm is driven by the upper arm, and accordingly the elbow joint is determined by the shoulder joint except for some noise. So the motion constraint of the elbow joint can be approximated by the conditional distribution p(θ_e | θ_s), where θ_e and θ_s are the joint angles of the elbow and the shoulder respectively. Using the training data in the previous subsection, the distribution can easily be computed by the following procedure. From each walking cycle W_i (i = 1...n), a series of pairs of shoulder and elbow joint angles (θ_s^i(t), θ_e^i(t)) is formed as the time t varies from 1 to T. We classify all pairs according to their first element, i.e., pairs having identical first elements are assigned to the same class. Then for any shoulder joint angle θ_s, provided that the class (θ_s, ·) includes K pairs (θ_s, θ_e^k), k = 1...K, the conditional distribution p(θ_e | θ_s) is represented by a Gaussian distribution G(u, σ²) where u and σ are the mean and standard deviation of θ_e^k, k = 1...K. Fig. 6.11(d) gives the motion constraint for the elbow joint. The motion constraints for the knee and ankle joints are learnt in the same way. Here, a Gaussian representation is assumed for simplicity; it seems to work well, and its further analysis remains future work.
We also derive an interval of valid values for each motion parameter from the training data by specifying its maximal and minimal values. All generated samples are constrained to their associated intervals by clipping out-of-range values to the minimum or maximum.
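The conditional distribution p(θ_e | θ_s) can be sketched as below, assuming the paired angles have been collected from all training cycles; binning by rounding the shoulder angle stands in for "pairs having identical first elements", and all names are illustrative.

import numpy as np
from collections import defaultdict

def learn_constraint(shoulder, elbow, bin_deg=1.0):
    # shoulder, elbow: 1-D arrays of paired joint angles over all cycles.
    # Returns {binned shoulder angle: (mean, std) of elbow angle}.
    groups = defaultdict(list)
    for s, e in zip(shoulder, elbow):
        groups[round(s / bin_deg) * bin_deg].append(e)
    return {s: (np.mean(es), np.std(es)) for s, es in groups.items()}

def sample_elbow(constraint, theta_s, rng=np.random):
    # Draw an elbow angle from the Gaussian of the nearest shoulder bin.
    key = min(constraint, key=lambda s: abs(s - theta_s))
    u, sd = constraint[key]
    return rng.normal(u, sd if sd > 0 else 1e-3)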

Tracking
The main task here is to relate the image data to the pose vector P = {x, y, θ₁, θ₂, ..., θ₁₀}. Since the articulated human body model is naturally formulated as a tree-like structure, a hierarchical estimation, i.e., locating the global position and tracking each limb separately, is suitable here, especially when the total number of parameters is large. Additionally, this approach to decomposing the parameter space is strongly supported by other reasons as follows. Firstly, the global position (x, y) is much more significant than the joint angles with respect to the PEF, so it can be estimated separately with the other motion parameters fixed. Secondly, the joint parameters depend greatly on the global position (x, y): a slight deviation of the global position often causes the joint parameters to deviate drastically from their real values when maximizing the PEF, with the result that a large weight is assigned to an actually unimportant sample. So the global position should be located before the joint angles are sampled. Thirdly, locating the optimal pose in a high-dimensional state space, e.g. 12 DOFs here, is intrinsically difficult; decomposition effectively simplifies this problem. Finally, since the upper limbs can sometimes hardly be segmented from the torso, tracking the upper limbs is more difficult than tracking the lower limbs and accordingly needs a larger sample set.
Given the above considerations, we first predict the global position from the
centroid of the detected moving human and then refine it by searching the
neighborhood of the predicted position. Each limb is tracked under the Condensation framework [219]. As a popular method in visual tracking, the
Condensation algorithm uses learnt dynamical models, together with visual observations, to propagate the random sample set over time. Instead of computing a single most likely value, it evaluates the posterior distribution by factored sampling and thus can represent simultaneous alternative hypotheses. It is therefore more robust than the Kalman filter, a Gaussian-based and unimodal method. Another advantage of the Condensation framework is that it can easily handle the fusion of information, especially temporal fusion, in a principled manner. Later, we will see that the observations and the prior knowledge of the motion model and motion constraints are all straightforwardly fused by the density propagation rule to derive the posterior distribution.
The rule of state density propagation over time is [219]:

p(x_t | Z_t) = k_t p(z_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | Z_{t−1}) dx_{t−1}        (6.11)

where x_t are the motion parameters at time t, Z_t = (z₁, z₂, ..., z_t) is the image sequence up to time t, and k_t is a normalization constant independent of x_t. According to this rule, the posterior distribution p(x_t | Z_t) can be derived from the posterior p(x_{t−1} | Z_{t−1}) at the previous time step and three other components: the prior distribution p(x₀) at time step 0, i.e., the initialization; the dynamical model p(x_t | x_{t−1}), used to predict the motion parameters x_t by drifting and diffusing x_{t−1}; and the observation density p(z_t | x_t), computed from the PEF. These are respectively detailed in the following subsections.
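Before detailing these components, a generic Condensation step following Eqn. 6.11 may be sketched as follows; `dynamics` and `pef` stand for the dynamic model and pose evaluation function described in the next subsections, and the resample-predict-weight structure is the standard factored-sampling loop rather than the exact implementation used here.

import numpy as np

def condensation_step(particles, weights, dynamics, pef, image, rng=np.random):
    # One iteration of factored sampling: resample from the previous
    # posterior p(x_{t-1}|Z_{t-1}), drift/diffuse via p(x_t|x_{t-1}),
    # then re-weight by the observation density p(z_t|x_t).
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    predicted = np.array([dynamics(particles[i], rng) for i in idx])
    new_w = np.array([pef(x, image) for x in predicted])
    return predicted, new_w / new_w.sum()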

1) Initialization
Initialization is concerned with estimating the initial pose of a subject when capturing human motion. Most previous approaches handled initialization by manually adjusting the human body model to approximate the real pose or by arbitrarily assuming that the initial pose follows a uniform distribution. Unlike previous work on initialization that attempts to roughly estimate the pose from a single frame, we accomplish it using the spatiotemporal information of the first N frames. Thus our approach is more robust and, most importantly, it also achieves real-time speed by avoiding evaluation of the cost function. In what follows we describe the initialization procedure, which includes a learning process and an estimation process.
In the learning process, the moving human in each frame of the training data is detected by subtracting the background image, and edges are extracted using the Sobel operator. The moving area is then clipped and normalized to the same size. Similar to the preprocessing for learning the motion model, the normalized examples are adjusted to the same phase and their periods are rescaled to the same length (here the length is T_awc = 30). Also, n normalized walking cycles V_j, with j = 1...n, are segmented from the m preprocessed examples. We use the average walking cycle

V̄ = (1/n) Σ_{j=1}^{n} V_j        (6.12)

as the reference cycle.


The estimation process begins with the same pre-processing of human detection, edge extraction and normalization as in the learning process. The first N frames v (N < T_awc) are then located in the reference cycle by searching for the major peak in the cross-correlation corr(V̄, v). Referring the location (assumed to be t) to the motion model, the pose of the last frame in v is roughly estimated as a 10-dimensional vector (u_{3,t}, u_{4,t}, ..., u_{12,t}). Accordingly, the prior distribution for tracking is the Gaussian distribution G((u_{3,t}, u_{4,t}, ..., u_{12,t})′, σ_t² I₁₀), where I₁₀ is a 10×10 identity matrix and σ_t² = (σ_{3,t}², σ_{4,t}², ..., σ_{12,t}²).

Figure 6.12 Example of initialization (for frame 19 in sequence mp2): (a) cross-correlation between the short sequence (frames 15-19 in sequence mp2) and two concatenated walking cycles; (b) projection of the human body model with the initial pose.

Fig. 6.12 illustrates the estimation process. To estimate the initial pose for frame 19 in sequence mp2, the short sequence (frames 15 to 19) is used to compute the cross-correlation at all displacements between the short sequence and the average walking cycle. Here two average walking cycles are concatenated to make the result more accurate. Assuming the predominant peak over all shifts is located at t (t = 21, see Fig. 6.12(a)), we may derive the phase in the motion model for frame 19: p = t × T / T_awc. The motion model at phase p then indicates the initial pose. In Fig. 6.12(b), the human body model with the initial pose is projected onto the real image data. The result shows that, although there are some errors at the left arm and the left leg, the initialization as a whole is very close to the true pose, demonstrating the effectiveness of our automatic initialization procedure.
This initialization method can also be used to recover from severe tracking failures due to occlusion, accumulated error, or image noise. When a severe failure occurs (when the PEF reaches a predefined threshold), the tracker stops for N frames and reinitializes using the spatio-temporal information derived from those N frames to estimate the current pose. However, it is worth mentioning that the real-time speed and robustness of the initialization and bootstrap come at the expense of the first N−1 frames, during which tracking is stopped.
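A rough sketch of the phase-location step, under the assumption that the frames have already been detected, clipped and normalized; the sliding normalized correlation stands in for corr(V̄, v), and T = 100, T_awc = 30 follow the text.

import numpy as np

def locate_phase(v, V_bar, T=100, T_awc=30):
    # Slide the N-frame snippet v over two concatenated reference cycles
    # and take the shift with the highest normalized correlation.
    V2 = np.vstack([V_bar, V_bar])
    N = v.shape[0]
    best_t, best_score = 0, -np.inf
    for t in range(V2.shape[0] - N + 1):
        w = V2[t:t + N]
        score = np.sum(v * w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-12)
        if score > best_score:
            best_t, best_score = t, score
    return int(round((best_t % T_awc) * T / T_awc)) % T  # phase in motion model

# The initial pose is then read off the motion-model means at this phase.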

2) Dynamic Model


The dynamic model is often carefully designed to improve the efficiency of factored sampling. The idea is to concentrate the samples in the areas of the state space containing the most information about the posterior. The desired effect is to avoid, as far as possible, generating samples that have low weights, since they contribute little to the posterior. Here, the learnt motion model, which serves as a prior, is integrated into the dynamic model for efficient sampling. Given the assumption that the Gaussian distributions at different phases in the motion model are independent, at any time instant t the ith motion parameter θ_{i,t} satisfies the dynamic model

p(θ_{i,t} | θ_{i,t−1}) = G(αu_{i,t} + βu_{i,t−1} + γθ_{i,t−1}, λ((ασ_{i,t})² + (βσ_{i,t−1})²))        (6.13)

where G is a Gaussian distribution, and α + β + γ = 1 makes the drifting of θ_{i,t} depend not only on the tracking history θ_{i,t−1} but also on the motion model, and λ is a scalar that is often set to 1. But when the gait of the tracked person is very normal, a smaller λ is expected to restrict the factored sampling more effectively to the portions of the parameter space that are most likely to correspond to human motion. u_{i,t} and σ_{i,t} are defined above.
This dynamic model is generally sufficient for all motion parameters, but motion constraints can further concentrate the samples for the elbow, knee and ankle joints. For instance, after the shoulder joint θ_{s,t} is sampled, sample positions generated from the conditional distribution p(θ_{e,t} | θ_{s,t}) for the elbow joint θ_{e,t} also contain much information. So a mixed-state Condensation [219] can be included in the factored sampling scheme by choosing, with probability q, to generate samples from the dynamic model (Eqn. 6.13) and, with probability 1−q, to generate samples from the conditional distribution p(θ_{e,t} | θ_{s,t}), i.e., θ_{e,t} satisfies the dynamic model

p(θ_{e,t} | θ_{e,t−1}, θ_{s,t}) = q G(αu_{e,t} + βu_{e,t−1} + γθ_{e,t−1}, λ((ασ_{e,t})² + (βσ_{e,t−1})²)) + (1−q) p(θ_{e,t} | θ_{s,t})        (6.14)

where α, β, γ, λ are defined as above. Equations similar to 6.14 can also be written for the knee and ankle joints.
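A minimal sketch of drawing one elbow-angle sample from the mixed-state model of Eqn. 6.14; the parameter values (q, α, β, γ, λ) are illustrative, with α + β + γ = 1 as required, and `sample_from_constraint` abstracts the learnt conditional p(θ_e | θ_s).

import numpy as np

def sample_elbow_mixed(theta_prev, theta_s, u_t, u_prev, sd_t, sd_prev,
                       sample_from_constraint, q=0.7, alpha=0.4, beta=0.2,
                       gamma=0.4, lam=1.0, rng=np.random):
    # With probability q, drift/diffuse via the dynamic model (Eqn. 6.13);
    # otherwise draw from the learnt constraint p(theta_e | theta_s).
    if rng.rand() < q:
        mean = alpha * u_t + beta * u_prev + gamma * theta_prev
        var = lam * ((alpha * sd_t) ** 2 + (beta * sd_prev) ** 2)
        return rng.normal(mean, np.sqrt(var))
    return sample_from_constraint(theta_s)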

3) Pose Evaluation Function (PEF)


The PEF reveals the observation density p(z_t | x_t) of an image z_t given that the human model has posture x_t at time t. To match the image z_t with the generative model, the model must be projected into the image plane. Furthermore, detection based on background subtraction and the Sobel operator is applied to the image z_t to acquire both region and boundary information. In general, boundary information improves the localization, whereas region information stabilizes the tracking because more of the image information is used. Therefore, we combine them in the PEF by computing a boundary matching error and a region matching error so as to achieve both accuracy and robustness.
Fig. 6.13 shows the procedure for computing the boundary matching error, which is similar to the chamfer distance. For each pixel p_i on the boundary of the projected human model, we search for the corresponding pixel in the edge image along the gradient direction at p_i; in other words, the pixel nearest to p_i along that direction is desired. Given that q_i is the corresponding pixel and that F_i stands for the vector from p_i to q_i, the matching error of pixel p_i to q_i can be measured as the norm ‖F_i‖. Then the average of the matching errors of all pixels on the boundary of the projected human model is defined as the boundary matching error

E_b = (1/N) Σ_{i=1}^{N} ‖F_i‖        (6.15)

where N is the number of pixels on the boundary.

Figure 6.13 Measuring the Boundary Matching Error

In general, the boundary matching error can properly measure the similarity between the human model and the image data, but it is insufficient under certain circumstances. A typical example is given in Fig. 6.14(a), where a model part falls into the gap between two body parts in the edge image. Although it is obviously badly fitted, the model part may have a small boundary matching error. To avoid such ambiguities, region information is further considered in our approach. Fig. 6.14(b) illustrates the region matching. Here the region of the projected human model fitted onto the image data is divided into two parts: P₁ is the region overlapping the image data and P₂ is the rest. Then the matching error with respect to the region information is defined by

E_r = |P₂| / (|P₁| + |P₂|)        (6.16)

where |P_i| (i = 1, 2) is the area, i.e., the number of pixels in the corresponding region.
Both boundary and region matching errors are combined into the PEF, which is modeled in terms of a robust radial term ρ(s, σ) = v e^{−s/σ²} [221]:

S(P) = v e^{−(α E_b + (1−α) E_r)/σ²}        (6.17)

where P = {x, y, θ₁, θ₂, ..., θ₁₀} is the pose vector, and α is a scalar that adjusts the weights of E_b and E_r. Apart from its robustness, the radial term can improve the efficiency of factored sampling because it assigns heavier weights to important samples and reduces the weights of insignificant ones. A smaller σ will magnify this effect, but it also makes the curve of the PEF more peaked, which leads to a lower survival rate of samples, and the required number of samples then increases. Therefore σ must be carefully selected. As far as α is concerned, a larger value is preferred for the upper limbs to diminish the influence of the region matching error. The reason is that the upper limbs and the torso often have clothes with the same texture and also frequently occlude each other, so the region information is of relatively little importance there.
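A sketch of the PEF of Eqns. 6.15-6.17. For brevity, a distance transform supplies the nearest-edge distance instead of the book's search along the gradient direction, and the region error uses the normalized form reconstructed above; both simplifications are assumptions of this sketch.

import numpy as np
from scipy.ndimage import distance_transform_edt

def pose_evaluation(model_boundary, model_region, edge_img, fg_mask,
                    alpha=0.5, sigma=10.0, v=1.0):
    # model_boundary: N x 2 integer (x, y) pixels of the projected model
    # boundary; model_region, fg_mask, edge_img: binary images, equal size.
    d = distance_transform_edt(edge_img == 0)     # distance to nearest edge
    xs, ys = model_boundary[:, 0], model_boundary[:, 1]
    E_b = d[ys, xs].mean()                        # Eqn. 6.15
    P1 = np.logical_and(model_region, fg_mask).sum()
    P2 = model_region.sum() - P1
    E_r = P2 / max(P1 + P2, 1)                    # Eqn. 6.16 (assumed form)
    return v * np.exp(-(alpha * E_b + (1 - alpha) * E_r) / sigma ** 2)  # 6.17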

Figure 6.14 Illustrating the Necessity of Simultaneous Boundary and Region Matching: (a) a typical ambiguity, where a model part falls into the gap between two body parts; (b) measuring the region matching error.

Fig. 6.15 shows the effectiveness of the proposed PEF. Its curve is basically smooth and has no local maxima in the neighborhood of the global maximum. These two properties are very useful for optimization. Furthermore, according to the contour of the PEF and other experiments, we can conclude that the global position (x, y) is more significant than the joint angles with respect to the PEF. This is one of the reasons why the global position can be determined first, with the other parameters fixed.

Figure 6.15 The curve of the PEF with the global position x and the joint angle of the left thigh changing smoothly and the other parameters remaining constant. Also shown is the contour of the function.
Our PEF is insensitive to noise to some extent. It is true that the PEF depends on boundary and motion detection, which is sensitive to noise. But in tracking, each limb is considered as a whole: even if part of a limb is affected by noise, the PEF can still realistically reveal the pose when the whole limb is considered. In other words, the PEF can use the prior knowledge of the body model to reduce the influence of noise to some extent. For example, in Fig. 6.16 the left upper arm and right leg are missing in the edge image (b) due to noise, but the failure was recovered in tracking, where each limb is tracked in its entirety. However, when the whole boundary of a limb cannot be detected, the PEF may fail, resulting in incorrect tracking (see frame 32 in Fig. 6.21, where the left arm is missed due to significant motion blur).

Figure 6.16 Noise Sensitivity for the PEF: (a) original image; (b) edge image with missing data; (c) left upper arm and right leg correctly tracked.

Experiments and Discussions


To examine the tracking performance of the proposed algorithm, we conduct experiments on several persons with various shapes and walking characteristics, in image sequences with low quality and significant self-occlusion. We restrict these persons to walking in a direction parallel to the image plane. The approach is implemented in MATLAB on a desktop workstation. The tracked sequences are selected both from the training data and from new instances. For each new instance, the tracker needs 300 state samples for the upper limbs and 100 state samples for the lower limbs. Comparatively, due to the accuracy of the motion model, each sequence from the training data requires only 100 and 50 state samples for the upper and lower limbs respectively. We use the first 15 frames of each sequence to automatically initialize the tracker. The experiments are carried out on image sequences captured in both indoor and outdoor environments. For the indoor scenes, we use the early Soton gait database; it is noted that the training examples are all selected from this database. The outdoor sequences are from the CASIA database.

Figure 6.17 Tracking Subjects in Training Data: (c) frames 19, 23, 33 and 43 in sequence 2 of subject 3 (vin2)

Tracking Results and Discussions


After being started automatically by the initialization procedure, the system tracks successfully through the entire image sequences, though it sometimes stops and reinitializes after severe failures, most likely due to occlusion, accumulated errors, or drastic image noise. Here the tracking results of 3 sequences from the training data (see Fig. 6.17) and two new instances (see Fig. 6.18) are shown. Due to space constraints, only the human areas clipped from the original image sequences are given. Some sequences include challenging configurations in which the two legs and thighs occlude each other severely (e.g. frame 27 in Fig. 6.17(a)), so that most of one leg or thigh is unseen. These difficult data verify the effectiveness of our approach. Other challenges include shadow under the feet, the arm and the torso having the same color, various colors and styles of clothes, different shapes of the tracked people, and the low quality of the image sequences. It is worth mentioning that the arms on the far side from the camera in these sequences were sometimes lost due to severe occlusion by the torso (see frame 19 in Fig. 6.17(c)). However, their motion parameters can usually be properly estimated using the motion model (see frame 27 in Fig. 6.18(b)) or using the symmetric value of the other arm.

(b) frames 19, 27, 38 and 53 in sequence 1 of subject 1 (dh1)


Figure 6.18 Tracking Subjects not in Training Data

Further experiments are also carried out to analyze our approach in more depth. It was mentioned that tracking new sequences requires many more state samples than tracking sequences from the training data. The chief reason is that our motion model is learnt from limited training data and cannot accurately represent the variations of all gait sequences, especially abnormal ones. When encountering a novel instance, the deficiency of the motion model reduces the accuracy of the prediction of the dynamic model. Fortunately, increasing the number of state samples compensates for this, as demonstrated by an experiment. In Fig. 6.19(a), when the sequence is included in the training data, the prediction is very close to the refined results. The good prediction, which also proves the effectiveness of the dynamic model, requires only a small set of samples. In contrast, in Fig. 6.19(b), when the same sequence is intentionally removed from the training data, the prediction is less accurate but the larger sample set offsets the inaccuracy.
A question then arises: what role does the motion model play in tracking? We track a sequence without sampling, so that the motion parameters are estimated entirely from the motion model. The tracking results illustrated in Fig. 6.20 reveal that, although roughly correct, the results are wrong at some body parts in contrast to those in Fig. 6.17(c). Therefore the motion model does not unduly affect the tracking, and the sample set can offset the prediction errors.
The tracked sequences in Fig. 6.17 and Fig. 6.18 are all selected from the early Southampton and the CASIA gait databases. In these two databases, the backgrounds are basically clean. So we further test our approach on some additional sequences in more complex real-world outdoor scenes. These sequences were captured on different days and have significant motion blur and changing background due to wind. Two samples of such sequences are shown in Fig. 6.21. It can be seen that the tracking results are fairly accurate except for some errors where the upper limbs superimpose on the torso (e.g. frame 32 in Fig. 6.21(a) and frame 38 in Fig. 6.21(b)). These errors are mainly caused by the noticeable motion blur, in which the edges of most of the limbs can hardly be found when the limbs superimpose on other body parts (e.g. frame 32 in Fig. 6.21(a) and frame 38 in Fig. 6.21(b), where the arms are pulled to the torso edges). For the specific case where the arms are bare, a skin color model may be used to segment out the arms, but a more widely applicable method is to use image restoration to remove the motion blur, or to reduce the exposure time. Finally, it is worth mentioning the applicability of our motion model. The motion model is learnt from the early Southampton gait database, where the subjects are mainly male and European, but it is still applicable to the two walkers shown in Fig. 6.21. However, it often fails when applied to recover the poses of the arms on the far side from the camera. This means that we should extend the motion model to cover more individuals.

Figure 6.19 Estimation Results: predicted results are in thin lines with markers and refined results by factored sampling are in bold lines. (a) angle of the left knee of dh1, which is included in the training data; (b) angle of the left shoulder joint of the same sequence when it is removed from the training data.

Figure 6.20 Analyzing the Significance of the Motion Model: tracking results of frames 19, 23, 33 and 43 in sequence 2 of subject 3 (vin2) without sampling

(b) frames 15, 25, 38 and 46 in sequence 3 of mh


Figure 6.21 Tracking Subjects in Complex Real-world Outdoor Scenes

Dynamic Signature Acquisition


Estimating an underlying skeleton from the tracking results enables us to measure joint-angle trajectories. Fig. 6.22 shows the temporal changes of the angles of four joints, the left and right hips and the left and right knees, from a walking instance, where the smoothed curves are the results after median filtering. It is the variation in these joint signals that we wish to consider as the dynamic information of body biometrics, i.e., gait dynamics, for recognition.

Figure 6.22 Joint-angle Trajectories of Lower Limbs (left and right hip and knee angles plotted against frame number)

Differences in body structure and dynamics cause joint-angle trajectories to vary in both magnitude and time. To analyze these signals for identity recognition, we need to normalize them with respect to duration and walking cycles, similar to [161]. Here, we select only one walking cycle from each sequence. Rather than using the joint-angle trajectories directly, we first perform variance normalization, subtracting the mean of each signal and then dividing by the estimated standard deviation, to reduce the effect of noise. DTW (Dynamic Time Warping) is then applied to temporally align the signals to a fixed reference phase. Fig. 6.23 shows the resulting time-normalized signals of thigh rotation, from which we find that there is little variation among sequences from the same subject, whereas there are apparent variations among different subjects. We choose the four normalized signals from the left and right hips and knees to form a dynamic feature vector. We also average multiple vectors from the same subject to obtain an exemplar, which is regarded as a dynamic template for that class.
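The normalization pipeline can be sketched as follows: per-signal variance normalization followed by the textbook O(nm) DTW dynamic program used to align each signal to a reference cycle; names and the distance measure are illustrative.

import numpy as np

def variance_normalize(x):
    # Subtract the mean and divide by the estimated standard deviation.
    return (x - x.mean()) / (x.std() + 1e-8)

def dtw_cost(a, b):
    # Accumulated-cost matrix for aligning 1-D signals a and b;
    # backtracking through it yields the warping path.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

# Usage: variance-normalize each joint-angle signal, warp it onto a fixed
# reference cycle, then stack the four aligned signals into one vector.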

Figure 6.23 Time-normalized Signals of Joint Angles (left: thigh rotation after time normalization and alignment; right: thigh rotation from four different subjects)

Recognition Results
We give recognition results in both identification and verification modes (see Fig. 6.24), using the normalized joint-angle trajectories from the left and right knee and thigh joints as the gait features. Here, we collected 80 sequences from 20 subjects, with 4 sequences per subject. From Fig. 6.24, we can see that there is indeed identity information in such dynamic features derived from walking video that can be exploited for the recognition task.

Figure 6.24 Performance of Joint-angle Trajectory based Recognition and Verification

6.4 Other Approaches


Despite the invariance advantages of model-based techniques, the stock of remaining techniques is not large, as reflected in the overview of current approaches in Section 5.1. We shall review here the two main flavors. One approach has already been described as a structural model-based approach; the other is purely model-based.

6.4.1 Structure by Body Parameters

A structural approach [106, 142], developed at the Georgia Institute of Technology, aimed to identify people by static body parameters determined from a subject walking across multiple views. These measurements are specific to the periodic gait cycle. Further, an adjustment procedure was defined which could accommodate subjects walking at different angles to the viewing plane of the camera; as such it was an early viewpoint invariant technique, one not restricted to the side view alone. This depth correction was achieved by measuring a subject's height when the feet were closest together, the point of maximum subject height in the walking cycle. This gave a conversion factor to correct any distance measure, given a known optical configuration.
The measures used were four dimensional: the height of the bounding box surrounding a subject; the distance between the head and the pelvis; the maximum of the distances from the pelvis to the left and the right foot; and the distance between the two feet. The mean of these measures was used to form a description vector for each walk sequence. The walk vector from a motion capture system was then used to normalize the data for recognition.
The approach was also formulated to determine a confusion measure which allowed prediction of how well the features used could filter identity over a large population, as an alternative to recognition performance. This measure can be formulated in a number of ways and was here achieved in a way analogous to comparing the (Gaussian) distributions that characterize the group variation versus the individual variation. This is then similar to an assessment of the area of intersection of the distributions of feature vectors for different subjects: a large intersection (the within-class variation is large compared with the group variation) implies poor recognition capability, which is reflected in high confusion. The confusion rates achieved were around 6%, which is equivalent to a CCR exceeding 90% on their database. Although there is deployment difficulty with this approach, it certainly shows that human identification by gait can be achieved by structural parameters. A later study [143] showed how the performance could be predicted for a much larger database from the same data, estimating performance capability on a database five times larger.

6.4.2 Structural Model-based Recognition

The team from Rutgers University will feature later for their work on animating humans. By inference, that work led to an interest in automatic recognition by gait. Like the refined Southampton approach, Zhang et al.'s approach [118] concerned the change in orientation of human limbs, Fig. 5.2(b). In fact, the extraction is model-based and the description is structural, making this a blend of the two model-based approaches described so far. The lower limbs were represented by trapezoids and the upper body was planar, without the arms. Given distances normalized by the height of the thorax, the human body posture was represented by a set of distance measurements and inclinations of its constituent parts. The gait features were extracted from gait sequences using the Metropolis-Hastings method to match body parts to the image data. The sequence fit was achieved by minimizing an energy functional which described: the difference of the body from the silhouette derived from the image; the difference between moving silhouettes; and the difference between the modeled appearance and the silhouette. This allows derivation of elevation angles, which describe the dynamics of gait, and trajectories of joint positions, which describe the spatiotemporal history.
Unlike Southampton's approach, based essentially on the elevation angles, this approach centered on capturing temporal differences by extracting the elevation of the knee and ankle and the width at the knees and ankles. As these are periodic, they were described by Fourier analysis and then classified via an HMM. The procedure was evaluated on the CMU MoBo and the NIST HiD databases and shown to have discrimination capability, with better results on the MoBo database. Clearly it enjoys the advantages of model-based techniques in that the data used for classification is intimately linked to gait itself.

7 Further Gait Developments

7.1 View Invariant Gait Recognition

Human gait characteristics are best analyzed when presented in the canonical (side) view. The dynamics behind an individual's gait, the quasi-periodicity of the gait, and other properties which contribute significantly towards establishing identity are most evident in the canonical view. Thus, the availability of such a canonical view is crucial to the success of many gait recognition systems. We observe that in surveillance applications a non-canonical view of an individual's gait is captured more often, and this necessitates developing gait recognition algorithms that are view invariant. When individuals walk at an angle oblique to the camera, the performance of gait recognition systems suffers due to the gradual change in the individual's height and stride length as captured by the camera.
One approach could be to build a 3D model of an individual and extract view-invariant features that best characterize the individual's gait. But such approaches, typically based on Structure from Motion (SfM) and stereo reconstruction techniques, suffer from many shortcomings. Shakhnarovich et al. [182] compute a visual hull from a set of monocular views and render virtual canonical views for tracking and recognition. They extract moment-based image features from the silhouettes of the synthesized video to perform gait recognition.
Naturally, a model of walking is implicitly invariant to viewpoint for a small change in viewing angle. On the structural side, Bobick et al. [106] developed a gait recognition technique that recovers static body and stride parameters of subjects from their walking sequences. The parameters are evaluated by means of a confusion metric which predicts the effectiveness of the feature vector in identifying an individual. They define a mapping function across different views and use it to perform gait recognition. The stride parameters extracted from each subject were shown to be more resilient to different viewing directions. Naturally, BenAbdelkader's stride and cadence structural approach equally has viewpoint invariant properties [173]. On the pure modeling side, Carter et al. showed [166] that variations in gait data could be corrected after the data had been gathered, given that the trajectory is known. This was achieved by considerations of geometry and use of a simple pendulum modeling the thigh, with correction by rotation of the thigh swing axis. A Fourier analysis showed that viewpoint correction had been achieved. The approach was then reformulated to provide a pose invariant biometric signature which did not require knowledge of the subject's trajectory. Spencer et al. extended these notions [167], developing a geometric correction to the measurement of the hip rotation angle based on known orientation to the camera. Results on synthesized data showed that simple pose correction for geometric targets generalizes well for objects on the optical axis. Later, these techniques were refined for analysis of walking subjects and showed that the approach can work well, given that target features can be extracted and tracked successfully [168].
Here, we present a gait recognition algorithm developed at UMD where, prior to recognition, a canonical view of an individual's gait is synthesized from the non-canonical view, without the explicit computation of 3D depth information [169, 170]. Figure 7.1 presents a framework for this approach.
Figure 7.1 Framework for View Invariant Gait Recognition: video of an unknown subject walking at an arbitrary angle is processed by tracking and background subtraction, estimation of the 3-D walking direction (with reliability analysis of the track), view synthesis, and finally gait recognition to output an identity.

7.1.1 Overview of the Algorithm

Consider the imaging geometry presented in Fig. 7.2. The world coordinate frame is such that its origin is at the center of perspective projection, and the Z-axis is perpendicular to the image plane. A translational velocity of the subject of the form V = [v_x, 0, 0]ᵀ implies that the subject is walking along AB parallel to the image plane and thereby presents a canonical view to the camera. A translational velocity of the form V = [v_x, 0, v_z]ᵀ implies that the subject is walking along AC at an angle θ to the image plane and thereby presents a non-canonical view to the camera. θ, referred to as the azimuth angle, is the angle of rotation about the vertical axis. In our notation, [X, Y, Z] denotes the coordinates of a point in 3D and [x, y] denotes its projection on the image plane. In our formulation we assume that the subject is far from the camera and hence can be approximated by a planar object. Given a video of a person walking at an angle θ to the image plane, we propose the following two approaches to estimate θ.

Figure 7.2 Imaging Geometry

7.1.2 Optical flow based SfM approach

We begin by tracking the individual across each frame by employing a Sequential Monte Carlo particle filter [211] and thereby estimate the direction of motion α. Assuming the individual's motion between two consecutive frames to be negligible, we adopt an optical flow based SfM approach. Let p(x(t), y(t)) and q(x(t), y(t)) represent the horizontal and vertical velocity fields of a point (x(t), y(t)) (e.g. the centroid of the head) at time instant t. Let f denote the focal length and (x_f(t), y_f(t)) denote the focus of expansion (FOE). Since the individual walks along the straight line AC, p and q are related to the 3D object motion and scene depth by [212]:

p(x(t), y(t)) = [x(t) − x_f(t)] v_z(t) / Z(x(t), y(t))        (7.1)

q(x(t), y(t)) = y(t) v_z(t) / Z(x(t), y(t))        (7.2)

For constant velocity models, v_z(t) = v_z and v_x(t) = v_x, with cot(θ) = v_x / v_z. Thus, given the initial position of the tracked point (x₁, y₁) and the focal length f (computed by means of calibration), θ can be computed from

cot(α)(x₁, y₁) = p(x(t), y(t)) / q(x(t), y(t)) = (x₁ − f cot(θ)) / y₁        (7.3)

Let [X_θ, Y_θ, Z_θ]ᵀ denote the coordinates of a point on the person walking at angle θ ≠ 0 to the plane that is parallel to the image plane and passes through the reference point [X_ref, Y_ref, Z_ref]ᵀ = [X₁, Y₁, Z₁]ᵀ. The 3-D coordinates of the synthesized point are computed by means of a rotation operation:

[X₀, Y₀, Z₀]ᵀ = R(θ) · ([X_θ, Y_θ, Z_θ]ᵀ − [X_ref, Y_ref, Z_ref]ᵀ) + [X_ref, Y_ref, Z_ref]ᵀ        (7.4)

where

R(θ) =
⎡ cos(θ)   0  sin(θ) ⎤
⎢   0      1    0    ⎥
⎣ −sin(θ)  0  cos(θ) ⎦

Denoting the corresponding image plane coordinates as [x_θ, y_θ]ᵀ and [x₀, y₀]ᵀ (for θ = 0) and using the perspective transformation, we can obtain the equations for [x₀, y₀]ᵀ as [169]:

x₀ = f [x_θ cos(θ) + x_ref (1 − cos(θ))] / [−sin(θ)(x_θ + x_ref) + f]        (7.5)

y₀ = f y_θ / [−sin(θ)(x_θ + x_ref) + f]

where x = fX/Z and y = fY/Z. Thus the estimated azimuth angle θ is used to synthesize canonical views by means of a direct transformation of the 2D image plane coordinates in the non-canonical view.
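A sketch of Eqn. 7.5 as a direct coordinate transformation; f, x_ref and the function name follow the notation above, and no claims are made about the exact implementation in [169].

import numpy as np

def synthesize_canonical(x_t, y_t, theta, f, x_ref):
    # Map (x_theta, y_theta) in the non-canonical view to (x0, y0) in the
    # synthesized canonical view per Eqn. 7.5; x_t, y_t may be arrays.
    denom = -np.sin(theta) * (x_t + x_ref) + f
    x0 = f * (x_t * np.cos(theta) + x_ref * (1 - np.cos(theta))) / denom
    y0 = f * y_t / denom
    return x0, y0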

7.1.3 Homography based approach

Homography based approaches are used to synthesize the canonical view in applications where the apparent translation of the walking individual is negligible, for example when the individual walks on a treadmill. In such applications, given point correspondences for a planar surface between the canonical and non-canonical views over a set of training images, one can compute the homography between the two views and apply it to the binary silhouettes in the non-canonical view to synthesize the canonical view. The homography is of the form
the form
f cos( 0)
~ !XI (1- ~os( 0))1 (7.6)
H(O)= 0

r-sin(O) o -x]sin(O)+f
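Applying H(θ) to silhouette pixels is then a standard homogeneous-coordinate mapping, as in the illustrative sketch below (function name and arguments are assumptions).

import numpy as np

def apply_homography(points, theta, f, x1):
    # points: N x 2 array of (x, y) pixels in the non-canonical view.
    H = np.array([[f * np.cos(theta), 0.0, f * x1 * (1 - np.cos(theta))],
                  [0.0,               f,   0.0],
                  [-np.sin(theta),    0.0, -x1 * np.sin(theta) + f]])
    homog = np.hstack([points, np.ones((len(points), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]   # back to inhomogeneous (x, y)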

7.1.4 Experimental Results

We present the recognition performance on two publicly available gait databases, the NIST database and the CMU database; the SfM-based approach was used for the NIST database, while the homography approach was used for the CMU database. Keeping in view the limited quantity of training data, the DTW algorithm [144] is used for gait recognition, where either the gallery images or the probe images are synthesized depending upon the viewing direction of the camera. A by-product of the above method is a simple algorithm to synthesize novel views of a planar scene.

Figure 7.3 Sigma Shaped Path adopted in the NIST Dataset (gait gallery and probe segments indicated)

The NIST database consists of 30 people walking along the Σ-shaped pattern displayed in Fig. 7.3. The gait sequences were captured by two cameras. The camera that was farther away from the subjects was chosen for our experiments, as the planar approximation of an individual is valid in such cases. The probe and gallery segments are displayed in Fig. 7.3, and the probe segment is at an angle of 33° from the gallery segment. A few images from the NIST database are shown in Fig. 7.4. The implicit SfM approach discussed above was used to synthesize canonical views on the NIST database; some of the results of the synthesis are shown in Fig. 7.4. Since it has been shown that the dynamics of the lower half of a silhouette contribute significantly towards gait recognition, we adopted the fusion of leg dynamics and height when performing gait recognition. The gait recognition result is shown in Fig. 7.5. The performance is significantly better than that without the synthesis of canonical views.

Figure 7.4 Sample Images from the NIST Database: (a) gallery images of a person walking parallel to the camera; (b) space-unnormalized images of a person walking at 33° to the camera; (c) space-synthesized image for (b)

Figure 7.5 Recognition Results on NIST/USF Database

In the CMU MoBo database there are six synchronized cameras evenly distributed around the subject walking on a treadmill. For testing our algorithm we considered views of the slow walk sequences from cameras 3 and 13. The views captured by camera 3 and camera 13 are displayed in Fig. 7.6. The sequence from camera 3 was used as the gallery while the sequence from camera 13 was used as the probe. Since the subjects walk on a treadmill, the apparent translation was minimal and hence the homography approach discussed above was adopted to synthesize canonical views from the views captured by camera 13. We considered several points S = {(x₁, y₁), ..., (x_n, y_n)} on the side of the treadmill, which is a planar rectangular surface. We then constructed the view of this rectangular patch as it would appear in the canonical view; a set of points S′ = {(x₁′, y₁′), ..., (x_n′, y_n′)} is considered on this hypothesized patch. The homography [81] between the two views was estimated using the sets S and S′ [213]. Gait recognition performance using the unsynthesized and synthesized images is shown in Fig. 7.7. Again we found that canonical view synthesis results in better gait recognition performance. Though the validity of the planar assumption in this case was questionable, the results proved to be satisfactory.

Figure 7.6 (a) View from Camera 3 (b) View from Camera 13

Figure 7.7 Recognition Results on CMU Database

7.2 Gait Biometric Fusion

Identification of humans from arbitrary views is an important requirement for different tasks, including perceptual interfaces for intelligent environments and surveillance applications. The effectiveness of biometrics such as fingerprints, iris, face and gait under differing conditions has been studied extensively. Based on the nature of the environment, different modalities can be employed for human identification. In most surveillance applications, when individuals are far away from the surveillance system, human gait proves to be more effective than other biometrics due to the sheer absence of information pertaining to those biometrics. But in such surveillance applications, the human face contributes significantly towards recognition when the individual is physically closer to the surveillance system. Optimal recognition systems must be designed so that they use the maximum available cues for human identification and combine the recognition performance in meaningful ways. Information may be fused in two ways: the available data could be fused and decisions made on the fused data (data fusion); or decisions could be based on the fusion of many decisions made by analyzing each signal/feature individually.

Figure 7.8 Arrangements for Sigma Shaped Path adopted in the NIST Dataset (segment A: gait gallery; segment C: face probe)

As discussed earlier, human gait recognition systems perform better when presented with a canonical view of an individual's gait. Face recognition systems, on the contrary, perform better when presented with a frontal view of the individual. Shakhnarovich et al. [182, 171] extensively studied the fusion of face and gait cues. We develop an approach [184] that combines the view-invariant gait recognition system [169] described earlier and the probabilistic approach to face recognition from video [172]. Experiments at UMD were based on the NIST database, which consists of 30 individuals walking along the path shown in Fig. 7.8. Walking sequences along segment A are used as the gallery for gait recognition, while those from segment B, which is at an angle of 33° to segment A, are used as the probe. The gait recognition results obtained using the view-invariant approach discussed earlier are shown in Figs. 7.9(a) and (d). The last part of the sequence, along segment C, was used as the probe for the video face recognition system, the gallery of which consists of still face images of the 30 subjects. The results for face recognition are shown in Figs. 7.9(b) and (e).

Figure 7.9 Face and Gait Fusion Results on NIST Database

Fusion of multiple sources of evidence is likely to yield tangible benefits in terms of improved efficiency and accuracy of the identification system. To improve the efficiency of a multimodal biometric system, one can adopt multistage combination rules whereby subjects are first coarsely classified by a less accurate classifier, which passes a smaller set of likely candidates to a more accurate classifier. The results of the gait classifier, for example, can be used to pass a smaller number of candidates to the more accurate face recognition unit. Alternatively, decisions from the different classifiers can be combined directly using simple rules like SUM, PRODUCT etc. To effectively combine the scores of the face and gait recognition systems, it is necessary to make the scores comparable. We use an exponential transformation to convert the scores obtained from gait recognition: given that the match scores for a probe X against the gallery gaits are s₁, ..., s_N, we obtain the transformed scores exp(−s₁), ..., exp(−s_N). Finally we normalize the transformed scores to sum to unity. Next we discuss two forms of fusion: hierarchical and holistic.
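The score transformation amounts to the following few lines (illustrative names):

import numpy as np

def transform_gait_scores(s):
    # s: distance scores of probe X against the N gallery gaits.
    t = np.exp(-np.asarray(s, dtype=float))
    return t / t.sum()   # normalized to sum to unity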

Hierarchical Fusion
Given a normalized similarity matrix obtained from the gait recognition system, we use the histogram distribution of its diagonal and off-diagonal elements to decide a threshold that determines which individuals are to be screened by the face recognition system. The threshold for the NIST database is chosen such that the top six matches from the gait recognition system are passed on to the face recognition system. The CMC plot for the resulting hierarchical fusion is shown in Fig. 7.9(c). Note that the top match performance has gone up from 93% to 97%. This approach reduces the number of computations significantly.

Holistic Fusion
If the requirement of the fusion is accuracy rather than computational speed, alternative fusion strategies can be employed. Assuming that gait and face can be considered independent cues, a simple way of combining the scores is to use the SUM or PRODUCT rule [202]. The CMC curve for either case is shown in Fig. 7.9(f).

7.3 Fusion of Static and Dynamic Body Biometrics for Gait Recognition

7.3.1 Overview of Approach


To obtain optimal performance, an automatic person identification system should incorporate as many informative cues as are available. There are many properties of gait that might serve as recognition features. In research at the CAS Institute of Automation, we categorize them as static features and dynamic features. The former usually reflect geometry-based measurements such as body height and build, while the latter refer to the joint-angle trajectories of the main limbs. Intuitively, recognizing people by gait depends greatly on how the static silhouette shape changes over time. So most previous work on gait recognition mainly adopted low-level information such as silhouettes. Due to the difficulty of automatic parameter recovery from video, few methods have used higher-level information, e.g., temporal features of joint angles reflecting gait dynamics. Based on the idea that body biometrics includes both the appearance of the human body and the dynamics of gait motion measured during walking, here we attempt to fuse these two quite different sources of information available from walking video for personal recognition.
The proposed method is shown schematically in Fig. 7.10 [117]. For each image sequence, background subtraction is used to extract the moving silhouettes of the walker. Temporal pose changes of these silhouettes are represented as an associated sequence of complex vector configurations in a common coordinate frame, and are then analyzed using the Procrustes shape analysis method to obtain an eigen-shape reflecting the body appearance, i.e., static information. Also, a model-based approach under the Condensation framework, together with the human body model, motion model and constraints, is presented to track the walker in image sequences. From the tracking results, we can easily calculate the joint-angle trajectories of the main lower limbs, i.e. the dynamics of gait as discussed in previous sections. Both static and dynamic information may be used independently for recognition with the nearest exemplar pattern classifier. They are also combined effectively at the decision level to improve recognition performance. This method is in essence a combination of model-based and motion-based approaches. It not only analyzes the spatiotemporal motion pattern of gait dynamics but also derives a compact statistical appearance description of gait as a continuum. So it implicitly captures both structural (appearance) and transitional (dynamics) characteristics of gait.
Figure 7.10 Schematic Illustration of Gait Recognition based on Fusion of Static and Dynamic Information

7.3.2 Classifiers and Fusion Rules


The main reasons for combining classifiers are efficiency and accuracy. A variety of fusion approaches for biometric recognition are available, a few of which are mentioned here [202, 203, 204, 205]. For example, Hong and Jain [204] integrated face and fingerprint for personal recognition, and a theoretical framework for combining independent classifiers was developed in [205]. Here we investigate the following approaches to classifier combination.
Having obtained the score for each modality given the observations, one generally cannot directly combine these scores in a statistically meaningful way, because the scores are usually not direct estimates of the posterior but rather measures of the distance between the test examples and the reference example [171, 182]. These scores, with quite different ranges and distributions, must be transformed to be comparable before fusion (the logistic function e^(α+βx)/(1 + e^(α+βx)) is used here).
First, we investigate the rank-summation-based and score-summation-based approaches described in [205]. Rank-based strategies are a generalization of simple voting methods: we compute the sum of the ranks for every class in the combination set, and the class with the lowest rank sum is the first choice of the combination classifier. Let r(n, R_i) be the rank of the class with name n in the ranking R_i; this rule is defined as arg min_{n_k} Σ_{i=1}^{R} r(n_k, R_i). If the score functions are directly comparable, score-based strategies are a good way to fuse decisions. The simplest way to combine classifiers using the score is to compute the sum of the score functions. Let s(n, S_i) be the score of the class with name n in S_i; this rule is defined as arg min_{n_k} Σ_{i=1}^{R} s(n_k, S_i), i.e., the class with the lowest score sum will be the final choice.
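Both rules reduce to simple array reductions; the sketch below assumes an R × C matrix of scores where lower is better, with the double argsort producing per-classifier ranks.

import numpy as np

def rank_sum_fusion(score_matrix):
    # score_matrix: R x C (classifiers x classes), lower score = better.
    ranks = np.argsort(np.argsort(score_matrix, axis=1), axis=1)
    return int(np.argmin(ranks.sum(axis=0)))   # class with lowest rank sum

def score_sum_fusion(score_matrix):
    return int(np.argmin(score_matrix.sum(axis=0)))  # lowest score sum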
Figure 7.11 Modeling the Probability Distributions of Scores (empirical probability distributions using static features and dynamic features)

Following the theoretical framework presented in [202], we also compare the max, min, mean, and product rules for classifier combination. Let the input feature to the jth classifier (j = 1, ..., R) be x_j, and the winning label be l. We assume a uniform prior across all classes. These rules are summarized as follows:
1. The product rule: l = arg max_k Π_{j=1}^{R} p(ω_k | x_j).
2. The mean rule (sum): l = arg max_k Σ_{j=1}^{R} p(ω_k | x_j).
3. The max rule: l = arg max_k max_j p(ω_k | x_j).
4. The min rule: l = arg max_k min_j p(ω_k | x_j).
In order to justify the above rules statistically, a monotonic transformation function over the scores S needs to be applied to reflect the posterior probability. We used the approach proposed in [182]: we estimate a probability distribution over the scores assigned to the correct labels by a mapping function T from scores to the empirical distribution, and treat T(S) as the estimate of the posterior (see Fig. 7.11).
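Given the matrix of estimated posteriors, the four rules are one-liners; stacking the mapped scores T(S_j) row-wise into `P` is an assumption of this sketch.

import numpy as np

def combine(P, rule="sum"):
    # P: R x C matrix, row j holding the posterior estimates T(S_j) of
    # classifier j over the C classes; returns the winning class index.
    agg = {"product": np.prod, "sum": np.sum,
           "max": np.max, "min": np.min}[rule]
    return int(np.argmax(agg(P, axis=0)))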

7.3.3 Experimental Results and Analysis

We collected 80 sequences from 20 different subjects, four sequences per subject, for our experiments. Each sequence includes a walking figure, and the walker moves laterally with respect to the image plane at normal cadence in the field of view without occlusion. All image sequences are captured by a stationary digital camera at a rate of 25 frames per second. The length of each sequence varies with pace, averaging about 80 frames. The sequences were captured on two different days so that the influence of time is considered.

For each image sequence, we first extract static features in the manner described
in Section 5.3. In addition, we perform the model-based tracking and recover
dynamic features in the manner described in Section 6.3. It should be noted that
self-occlusion of body parts, shadow under the feet, and the arm and the torso
having the same color, and low quality of the image sequences bring challenges to
our tracking method. For a small portion of failed tracking sequences, we manually
obtain the motion parameters as the focus here is not on tracking per se but on gait
recognition using the tracking data as dynamic features.
Due to a small number of examples, we hope to compute an unbiased estimate of
the true recognition rate using a leave-one-out cross-validation method. That is, we
first leave one example out, train on the remainder, and then classify or verify the
omitted element according to its differences with respect to the rest examples.
First, we use static and dynamic features separately for recognition. In
identification mode, the classifier determines which class a given measurement
belongs to. One useful measure of classification performance that is more general
than classification error is CMS (Cumulative Match Scores). Here, we use it to
report the results of identification. For completeness , we also use the ROC
(Receiver Operating Characteristic) curves to report verification results. Figs.
7.l2(a) and (b) respectively show performance of identification (for ranks up to 20)
and verification using a single modality. It should be mentioned that the CCR
(Correct Classification Rate) is equivalent to P(l) (i.e., rank = 1).
Based on the combination rules described above, we examine the results after
fusing both static features and dynamic features. Figs. 7.13(a) and (b) show the
results of identification and verification using rank-summation-based and score-
summation-based combination rules respectively, and Figs. 7.14(a) and (b) give the
fusion results using the product, sum, max and min combination rules respectively.
For comparison, we also plot the results using a single modality in Fig. 7.13 and
Fig. 7.14.

Figure 7.12 Results using a Single Modality: (a) identification (CMS curves of the individual static and dynamic features); (b) verification (ROC curves of the individual features, with the EER marked)

Figure 7.13 Results of Rank and Score Based Summation Rules: (a) identification (CMS curves); (b) verification (ROC curves, with the EER marked). Each panel shows the static feature, the dynamic feature, fusion based on rank summation, and fusion based on score summation.

Figure 7.14 Results using the Product, Sum, Max and Min Combination Rules: (a) identification (CMS curves); (b) verification (ROC curves)

From Fig. 7.12, we can see that there is indeed identity information in both the static and dynamic features derived from walking video that can be exploited for the recognition task. The results using dynamic information are somewhat better than those using static information, likely because the dynamics reflect more of the essential information of gait motion. Tanawongsuwan and Bobick [161] also used dynamic features (joint trajectories) for gait recognition, but their work is quite different from ours: they used a motion capture system to obtain motion data, whereas here the motion parameters were recovered automatically. They achieved a recognition rate of 73% on a database of 18 subjects, while our recognition rate is 87.5% on a database of 20 subjects.
Figs. 7.13 and 7.14 demonstrate that the fusion step improves both identification and verification performance over any single modality. A summary of CCRs and EERs is given in Table 7.1 for clarity. Another observation from the comparative results is that the score-summation-based rule outperforms the other combination schemes as a whole. Of the last four statistical combination rules, the sum rule is the best for identification; this agrees with [202], where a sensitivity analysis showed the sum rule to be the most resilient to estimation errors. However, the product rule is best for verification. The main reason for the poor performance of the min rule is probably that it suffers more from noise in the score assignment than the relatively robust mean and product rules. Better results could also be expected given sufficient data to model the probability distributions of scores for the two pattern classifiers more precisely. In all, these studies highlight the importance of a careful choice of the whole combination strategy.

Table 7.1 Summary of CCRs and EERs

Method             CCR (rank 1)   CCR (rank 5)   EER
Static features    83.75%         92.50%         10.0%
Dynamic features   87.50%         97.50%         8.42%
Rank-summation     87.50%         100%           3.75%
Score-summation    97.50%         100%           3.75%
Product            92.50%         97.50%         3.54%
Sum                96.25%         100%           5.00%
Max                95.00%         100%           4.70%
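Given the genuine (same-subject) and impostor (different-subject) score lists behind such ROC curves, EER values can be estimated with a short routine like the following (a minimal sketch of the standard computation, not our original evaluation code):

import numpy as np

def eer(genuine, impostor):
    # Sweep a decision threshold over all observed similarity scores
    # and return the error rate where FAR and FRR (approximately) meet.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2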

Although the results are very encouraging, experiments on a larger and more realistic database are needed before more conclusive claims can be made. Accordingly, much remains to be done: 1) To establish a larger and more realistic database. Unlike face recognition, gait recognition currently lacks a common evaluation dataset; researchers have often established their own gait databases and then reported a recognition rate, and direct comparison of such reported rates means little. We have compared some recent algorithms using static features on our dataset, but this comparison should be treated with reservation. We therefore strongly hope that a common dataset will be established so that everyone can make a reasonable comparison with other work. 2) To develop more robust segmentation algorithms and to improve 3D tracking, which are critical to the accurate and automatic extraction of gait features. 3) To design more sophisticated classifiers and combination rules. 4) To use a dynamic silhouette description in future work, to obtain a better description of the spatiotemporal silhouette changes in a gait pattern than the static silhouette description used here. 5) To further analyze the correlation of the two types of features. As a general rule, the higher the correlation, the lower the recognition rate after fusion. In this study, human parameters were divided into two categories, static and dynamic, and our two features reveal these two categories to some extent. Since the static parameters of the body are basically uncorrelated with the dynamic ones, the silhouette features should be largely uncorrelated with the trajectories; this is perhaps why the fusion performs well. Nevertheless, a thorough analysis of this correlation remains future work; a simple starting point is sketched below.
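One way to begin that analysis is simply to correlate the match scores that the two feature types assign to the same probe-gallery pairs; near-zero correlation would support the complementarity argued above. A minimal, hypothetical Python sketch:

import numpy as np

def score_correlation(static_scores, dynamic_scores):
    # Pearson correlation between the match scores assigned by the
    # static and dynamic classifiers to the same probe-gallery pairs.
    s = np.asarray(static_scores, dtype=float).ravel()
    d = np.asarray(dynamic_scores, dtype=float).ravel()
    return np.corrcoef(s, d)[0, 1]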

8 Future Challenges

Although a large number of gait recognition algorithms have been reported, it should be noted that gait recognition is still in its infancy. This is because previous work was mainly carried out under simplified and controlled conditions, e.g., no occlusion during walking, a relatively plain background, and a lack of view generality except for [142, 103, 102] (see Table 8.1). Although good recognition performance has been reported, most existing algorithms have been evaluated on small and unrealistic databases. Further, recognition rates are still far from a possible upper bound, even under the above conditions.
As such, results from early attempts suggest that developing highly reliable gait-based human identification systems for real-world applications is, and will continue to be, very challenging. The challenges involved in vision-based gait recognition include imperfect foreground segmentation, variations in viewing angle and clothing, and changes in gait as a result of mood, walking speed or carrying objects, etc.
Actually, the overall performance of a biometric system is based upon a variety of
factors, e.g., the situational characteristics of the given scenario, the size and quality
of collected data, algorithm capability, sensor performance, both neutral and
uncooperative subjects, etc. It is thus very clear that gait research is still in an
exploratory stage. Rich opportunities and many open problems exist. In particular,
the following issues deserve more attention in future work.

Table 8.1 Summary of Typical Assumptions in Previous Work and Real Possible Affecting Factors

Typical assumptions in previous work:
• No camera motion
• Only one person in the field of view
• No occlusion
• No carried objects
• Normal walking motion
• Moving on a flat ground plane
• Constrained walking path
• Plain background or environment
• Specific viewing angle
• Data recorded over one time span

Real possible affecting factors:
• Changes of clothing style
• Distance between the camera and the walker
• Background or environment
• Carried objects such as a briefcase
• Abnormal walking style
• Variations in the camera viewing angles
• Walking surface such as flat ground, grass or slope
• Mood
• Walking speed
• Change with time

1) Automatic tracking and extraction of gait features


Undoubtedly, automatic feature tracking and extraction in uncontrolled environments is very important to marker-less gait analysis systems [174]. Ideally, automatic gait analysis systems should be fast and require no intervention, which demands that computers themselves be able to automatically extract and recognize the features of interest. However, this task is very hard for current vision technologies owing to many factors, such as self-occlusion between body parts, occlusion between objects (e.g., a figure walking behind a lamp post), different types of clothing (e.g., skirts and overcoats), body segmentation, skeleton fitting, and 3D body shape estimation. The following lists several of the hard issues involved in automatic tracking and extraction of gait features.
Segmentation: Nearly every vision system for human motion analysis starts with segmentation, so segmentation is of fundamental importance for automatic gait analysis. Fast and accurate motion segmentation is a significant but difficult problem: images captured in dynamic scenes are affected by factors such as weather, lighting, clutter, shadow, occlusion, and even camera motion. How to develop more reliable background models that adapt to dynamic changes in real scenes is a challenge; modeling the statistics of natural scenes is probably a promising way.
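As one concrete (and deliberately simple) example of an adaptive background model, the sketch below maintains a per-pixel running mean and variance and updates them only at pixels currently judged to be background; it is our minimal illustration of the idea, not a state-of-the-art scene model.

import numpy as np

def update_background(model, frame, alpha=0.05, k=2.5):
    # model: dict with per-pixel 'mean' and 'var' float arrays.
    # frame: current grayscale frame as a float array.
    mean, var = model["mean"], model["var"]
    diff = frame - mean
    foreground = diff ** 2 > (k ** 2) * var    # deviation test
    bg = ~foreground
    # Update only background pixels, so walkers are not absorbed
    # into the model.
    mean[bg] += alpha * diff[bg]
    var[bg] += alpha * (diff[bg] ** 2 - var[bg])
    return foreground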
Occlusion: At present, the majority of vision systems cannot effectively handle self-occlusion of body parts and mutual occlusion between objects, especially in the detection and tracking of multiple people. During occlusion, only portions of each person are visible, often at very low resolution, which renders segmentation based on background subtraction unreliable. To reduce ambiguities due to occlusion, better models need to be developed to cope with the correspondence problem between features and body parts. Interesting progress is being made using statistical methods, which essentially try to predict body pose, position, and so on from the available image information. Tracking with a single camera easily generates ambiguity due to occlusion or loss of depth; however, this may be resolved from another view, so the most promising method for addressing occlusion is the use of multiple cameras.
Body and motion modeling: Human body models for vision have been adequately parameterized by various shape parameters; however, few have incorporated joint constraints and the dynamical properties of body parts. Also, almost all previous work assumes that a 3D model is fully specified in advance according to prior knowledge, whereas in practice the shape parameters of a 3D model need to be estimated from real images. Constraints on human motion or human models are usually imposed to reduce ambiguity, but these assumptions may not hold under general imaging conditions. The detection and tracking of humans in cluttered environments requires models of how people appear and move, as well as models of the appearance and motion of general scenes; yet constructing such models is difficult. Modeling the statistics of natural scenes and of people in images (i.e., exploiting a new framework for learning probabilistic models of objects and scenes) is probably a promising direction.
Tracking: 2D approaches have shown some early success in human motion analysis, especially for low-resolution applications where precise posture reconstruction is not needed. However, the major drawback of 2D approaches is their restriction on camera angle. Compared with 2D approaches, 3D approaches are more effective for accurate estimation in physical space, for handling occlusion, and for high-level discrimination between various complex human movements. 3D tracking is difficult because of the non-rigid and articulated nature of the human body, the lack of reliable features, frequent occlusion, and loose clothing. This has led some researchers to employ multiple cameras or to estimate only some parameters rather than attempt to recover a full volumetric model. If gait is observed from a great distance and the captured video images are of low resolution, 3D model-based tracking becomes inapplicable; in such cases, 2D tracking with its low computational cost is preferable, which is why silhouette-based gait recognition methods are currently popular. Generally, the processing speed of a biometric system such as gait recognition deserves more attention, especially for surveillance. Researchers look forward to new tracking techniques that improve performance while decreasing computational cost.
It is believed that, driven by the needs of automatic gait recognition systems, vision techniques for the automatic extraction of gait dynamics will develop quickly.

2) Improvement of the evaluation methods


One apparent limitation of current gait recognition is the lack of a suitable evaluation database. Another is that it is not known whether the variation between individuals, or within an individual, is consistent across experimental conditions and real cases. To improve the evaluation methods and determine the influence of various datasets and factors on recognition performance, it is critical to establish a standard protocol for collecting data and designing experiments.
• Evaluating the potential of gait as a biometric
We know from experience that people can often recognize others, especially people they know, simply by observing the way they walk. This raises the question of whether gait is sufficiently distinctive across a large population. Gait recognition is really just getting started: one needs to confirm that the notion that people can be recognized by their gait still holds when evaluated on databases that are much larger in size and more challenging in quality than those currently available, that is, to determine how unique gait is to individuals, though early studies in gait recognition have shown promising results. As stated above, the current limitations on gait as a biometric are the lack of a common evaluation database and the lack of knowledge of whether inter- and intra-subject variability is consistent under the factors arising from imaging conditions. It is therefore of great importance to further evaluate the potential of gait as a biometric at a distance in future research.
• Improving the quality and size of databases
The quality and size of databases are two extremely important factors in the human identification field. Establishing a common database of considerable size under realistic conditions is particularly necessary for evaluating gait recognition algorithms. A database with an independent test set, as in the FERET protocol, would be particularly convincing. Certainly it needs to provide within- and between-class variation across a suitable number of subjects. A good database should include as many as possible of the realistic factors affecting gait perception, e.g., clothing, environments, distance, and carried objects such as a briefcase [104]. Only such a database, with a large number of subjects, sequences, and variations in conditions, allows one to evaluate the effects of varied datasets on human identification solutions in a wide variety of experiments and to explore the limitations of the extracted gait signatures. It is exciting to see that some databases

have already tried to address this problem by setting up standard data sets with which to measure or determine what factors affect performance.
• Evaluating the key factors that affect performance
To develop reliable recognition techniques for uncontrolled environments, the key factors that may affect identification performance need to be examined using scientific methods. As shown in Table 8.1, and as discussed in Section 2.3, gait and its recognition are affected by many factors, e.g., viewing angle, environment, distance, clothing, age [164, 165], walking speed [163], etc.; a few of these are discussed below. A detailed study of the potential effects of these factors on recognition awaits future investigation.
Distance: The goal here is to determine whether the decrease of image resolution with viewing distance has a strongly adverse effect on recognition performance. Gait recognition aims to develop recognition systems that can function at a great distance; one thus needs to translate the performance obtained on low-resolution images into the associated recognition performance as a function of viewing distance. Some researchers are currently setting up experiments to explore this mapping.
Viewing angles: So far, images in all databases are two-dimensional and depend greatly on the viewing angle. When a system tries to compare two shots of the same person taken from different angles, it is far less effective. The obvious way to generalize an algorithm is to store training sequences taken from multiple viewpoints, and to classify both the subject and the viewpoint simultaneously [110]. Another interesting development is that the view-dependence constraint can be removed by synthesizing a virtual walking sequence in a canonical view, as in [171, 169]. Indeed, 3D body modeling, tracking and recognition [123, 180, 178, 153] may be significant for extracting view-invariant features such as motion trajectories and physical parameters.
Clothing: The dependence of gait features on the appearance of a subject is difficult to assess unless the joint movements of a walker can be detected or appearance-insensitive features can be obtained. However, much previous work has demonstrated that the appearance of a walking subject does contain information about identity. To allow accurate recognition of a person despite changes of clothing style, multiple appearance representations of the person in different clothes are required.
Weather: An all-day biometric recognition system operating at a distance must contend with bad weather such as fog, snow and rain. Reliably extracting information about moving humans from dynamic scenes under poor viewing conditions is critical. Currently, some researchers are trying to improve human detection and recognition in bad weather to enhance the robustness of gait recognition techniques. Other advanced sensors such as infrared, hyper-spectral imaging and radar are also being investigated because, unlike visible-band video, they can be used at night and in other low-visibility conditions.
In a word, advanced evaluation methodologies will make it possible to a) determine the principal limits of gait biometrics; b) observe the effects of datasets of differing size and quality on performance; c) establish standard protocols for collecting datasets and evaluating algorithms; and d) scientifically identify the critical affecting factors.

3) Best-view signature selection

The performance of any image-based gait analysis method is inherently view-dependent, since cameras inevitably capture only a planar projection of gait dynamics. Thus, for the best recognition performance, it is intuitively ideal to use a camera that is nearly fronto-parallel to the walking person, so as to capture the more apparent gait dynamics dominated by arm and leg movements. This view-selection capability can be provided by a distributed multi-camera system or by synthesizing a virtual canonical view [171, 169]. Certainly, determining the sensitivity of the extracted features to changes of viewing angle will, in practice, enable a multi-camera tracking system to select an optimal view for recognition. Moreover, it is possible to identify the direction in which an individual is walking and to use that information to select the appropriate set of exemplars corresponding to the estimated direction. However, for multi-camera tracking systems, one needs to decide which camera to use at each time instant; that is, the coordination and information fusion between cameras are significant problems. A simple view-selection heuristic is sketched below.
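One heuristic consistent with the fronto-parallel intuition above is to pick, at each time instant, the camera whose ground-plane optical axis is most nearly perpendicular to the estimated walking direction. This is our hypothetical sketch, not a published selection rule:

import numpy as np

def best_camera(walk_dir, cam_axes):
    # walk_dir: 2-D ground-plane walking direction.
    # cam_axes: (C, 2) ground-plane optical axes of the C cameras.
    w = np.asarray(walk_dir, dtype=float)
    w /= np.linalg.norm(w)
    a = np.asarray(cam_axes, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    cos = np.abs(a @ w)            # |cos| of angle to walking direction
    return int(np.argmin(cos))     # smallest |cos| = closest to fronto-parallel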

4) Fusion of multiple features of gait itself


Different features have different recognition power, and how to efficiently select the more discriminating features is a critical problem. Using extended features from different feature combinations may help to improve recognition performance [184, 117, 124]. To recognize a person more accurately, one may try to combine as much information as possible from the dynamic and static features of gait, such as posture, arm/leg swing, hip/upper-body sway, or some unique characteristic of that person. For example, in [184], Cuntoor et al. systematically analyzed different components of human gait: they investigated dynamic features such as the swing of the hands and legs and the sway of the upper body, and static features such as height, in both frontal and side views, and used both probabilistic and non-probabilistic techniques for matching the features. Various combination strategies may be used depending upon the gait features being combined. The classifier and similarity measure used are also directly related to performance, so it is of great importance to seek the most suitable distance measures and classifiers for the extracted gait signatures.

5) Fusion of gait with other biometrics


Gait has a potential contribution to make to multi-modal biometric authentication systems. A multi-biometric system generally either fuses multiple biometric features or automatically switches among different biometric features according to the operational conditions. For example, both face and gait have the advantage of being non-invasive, so they may be selected together for dynamic identity recognition in surveillance applications: at a distance, gait can be used; as the individual approaches, face images alternatively provide a powerful cue; and at near distance, the two may be fused to improve recognition accuracy (a toy distance-adaptive scheme is sketched below). To effectively combine multiple measurements from various active cameras, optimal techniques need to be developed to produce a highly reliable signal for detection and recognition of the features.
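A toy illustration of such distance-adaptive behaviour is given below; the blending ranges and the linear weighting are purely hypothetical, chosen only to show the switch-then-fuse idea:

def fused_score(gait_score, face_score, distance, near=5.0, far=15.0):
    # Rely on gait at long range, on face at close range, and blend
    # linearly in between; 'near'/'far' (metres) are illustrative only.
    if distance >= far:
        w = 1.0                    # gait only
    elif distance <= near:
        w = 0.0                    # face only
    else:
        w = (distance - near) / (far - near)
    return w * gait_score + (1 - w) * face_score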
In a word, developing and implementing highly reliable and robust gait identification techniques is, and will continue to be, challenging. As stated above, these challenges bring new opportunities for future research and development of gait recognition systems.

As gait is fundamental to human motion, it is quite possible that gait will find deployment in many other areas. Here we concentrate on deployment in surveillance and in animation, as two likely contenders. In surveillance, gait has yet to find use. This is in part due to the state of development of the technique: it is only recently that gait has been demonstrated to be able to recognize people on large databases of outdoor data. Even then, its use in surveillance video analysis for forensic purposes mandates the ability to perform 3D analysis from images derived from a single camera. There are viewpoint-invariant approaches, Section 7.3, and the model-based approaches do have limited viewpoint invariance, but these are insufficiently general for forensic analysis. There is a consideration that the likelihood of error in analyzing single frames derived from low-resolution surveillance video may be sufficiently high as to preclude forensic use. This error will be reduced by analyzing sequences of video data, though experimentation will be required to determine by how much. As such, we await developments enabling forensic deployment, though one very recent study has reported that "Surveillance images from a bank robbery were analyzed and compared with images of a suspect. Based on general bodily features, gait and anthropometric measurements, we were able to conclude that one of the perpetrators showed strong resemblance to the suspect." [188]. There is of course concern at such developments: DARPA's HumanID at a Distance program was nominated as "Privacy Villain of the Week" in 2002.
It is much more likely that the use of gait in surveillance video will be to signal events likely to be of interest. By way of example, the ability to discriminate between litter blowing into a perimeter fence and a person walking near it or climbing it would let an operator focus on relevant data; so long as the false alarm rate is sufficiently low, this will improve the overall response of a surveillance system. In this respect, some of the approaches to gait are already sufficiently simple, such as the averaged silhouette or the use of area, to make video-rate analysis possible. Certainly, the frameworks require viewpoint invariance and the ability to disambiguate articulated motion, but this is a much simpler target than full recognition by gait. These technologies are largely ready for such deployment now.
The use in animation is likely to be further away. Computer vision researchers have been synthesizing human faces for some time [189, 192], and this has been used for face recognition [190] (as a vehicle to achieve viewpoint invariance). Further, approaches have moved to realistic depiction of the human shape [191, 196]. This can reduce cost in film production and reduce risk in special effects (or even make effects possible). There have been other vision approaches with this aim [193, 192, 195], but none has used gait modeling to improve the depiction of animated humans. One of these has already used elevation angles [197] as a basis for modeling, the very angles that were shown earlier to have discriminatory capability in model-based approaches to human identification by gait. As such, we believe there is a considerable future for gait, not just in biometrics but also in other application domains. We look forward with great interest to the continuing developments in this field.

References

Literature

[1] Aristotle, On the Motion of Animals, B.C. 350 (available at http://classics.mit.edu/Aristotle/motion_animals.html, 15/4/2004)
[2] The Oxford English Dictionary: Fourth Edition, Oxford University Press (Oxford UK), 1951
[3] The American Heritage Dictionary of the English Language: Fourth Edition, Houghton Mifflin Company (Boston USA), 2000
[4] S. Johnson, Dictionary of the English Language, 10th Edition (London UK), 1792
[5] S. E. Ambrose, Band of Brothers, Pocket Books, 2001

Medicine and Biomechanics

[6] M. P. Murray, A. B. Drought, and R. C. Kory, Walking Patterns of Normal Men,


Journal ofBone and Joint Surgery, 46-A(2), pp 335-360, 1964
[7] M. P. Murray, Gait as a total pattern of movement, American Journal of
Physical Medicine, 46(1), pp 290-332, 1967
[8] M. Sadeghi, P. Allard, F. Prince, and H. Labelle, Symmetry and Limb
Dominance in Able-Bodied Gait: a Review, Gait and Posture, 12(1), pp 34-45,
2000
[9] K. Berg, and K. E. Norman, Functional Assessment of Balance and Gait, Clinics in Geriatric Medicine, 12(4), pp 705, 1996
[10] D. H. Sutherland, The evolution of clinical gait analysis - Part II - Kinematics, Gait and Posture, 16(2), pp 159-179, 2002
[11] T. A. Gore, G. R. Higginson, and J. Stevens, The Kinematics of Hip Joints, Clinical Physics and Physiological Measurement, 5(4), pp 233-252, 1984
[12] V. M. Zatsiorsky, and C. L. Werner, Basic Kinematics of Walking Step Length and Step Frequency: a Review, The Journal of Sports, Medicine and Physical Fitness, 39(2), pp 109-134, 1994
[13] L. Li, E. C. H. van den Bogert, G. E. Caldwell, R. E. A. van Emmerik, and J. Hamill, Coordination Patterns of Walking and Running at Similar Speed and Stride Frequency, Human Movement Science, 18, pp 67-85, 1999
[14] T. Chau, and K. Parker, On the Robustness of Stride Frequency Estimation, IEEE Transactions on Biomedical Engineering, 51(2), pp 294-303, 2004
[15] C. Angeloni, P. O. Riley, and D. E. Krebs, Frequency Content of Whole Body Gait Kinematic Data, IEEE Transactions on Rehabilitation Engineering, 2(1), pp. 40-46, 1994
[16] C. Anglin, U. P. Wyss, Review of Arm Motion Analyses, Proceedings of the Institution of Mechanical Engineers H, 214(H5), pp 541-555, 2000
[17] C. Y. Yam, Model-based Approaches for Recognizing People by the Way They Walk or Run, PhD Thesis, University of Southampton, 2002
[18] S. Ounpuu, The Biomechanics of Walking and Running, Clinics in Sports Medicine, 13(4), pp 843-863, 1994
[19] D. B. Thordarson, Running Biomechanics, Clinics in Sports Medicine, 16(2), pp 239-247, 1997
[20] D. Winter, The Biomechanics and Motor Control of Human Gait, 2nd Ed., Waterloo Biomechanics (Ottawa, Canada), 1991
[21] V. T. Inman, H. J. Ralston and F. Todd, Human Walking, Williams and Wilkins (Baltimore USA/London UK), 1981
[22] M. W. Whittle and D. Levine, Three-dimensional Relationships between the Movements of the Pelvis and Lumbar Spine during Normal Gait, Human Movement Science, 18, pp. 681-692, 1999
[23] B. Toro, C. Nester, and P. Hudson, A review of observational gait assessment in clinical practice, Physiotherapy Theory and Practice, 19, pp 137-149, 2003
[24] G. J. van Ingen Schenau, Some Fundamental Aspects of the Biomechanics of Over Ground versus Treadmill Locomotion, Medicine & Science in Sports & Exercise, 12, pp 257-261, 1980
[25] M. P. Murray, G. B. Spurr, S. B. Sepic, G. M. Gardner, and L. A. Mollinger, Treadmill vs. Floor Walking: Kinematics, Electromyogram and Heart Rate, Journal of Applied Physiology, 59(1), pp 87-91, 1985
[26] J. C. Wall and J. Charteris, The Process of Habituation to Treadmill Walking at Different Velocities, Ergonomics, 23, pp 425-435, 1980

Covariate factors

[27] Y. Bhambhani, S. Buckley, and R. Maikala, Physiological and Biomechanical Responses during Treadmill Walking with Graded Loads, European Journal of Applied Physiology and Occupational Physiology, 76(6), pp 544-551, 1997
[28] E. Steele, A. Bialocerkowski, K. Grimmer, The Postural Effects of Load Carriage on Young People - a Systematic Review, BMC Musculoskeletal Disorders, 4: article number 12, 2003
[29] J. M. Burnfield, C. D. Few, F. S. Mohamed, J. Perry, The Influence of Walking Speed and Footwear on Plantar Pressures in Older Adults, Clinical Biomechanics, 19(1), pp 78-84, 2004
[30] S. A. Arnadottir, V. S. Mercer, Effects of Footwear on Measurements of Balance and Gait in Women Between the Ages of 65 and 93 Years, Physical Therapy, 80(1), pp 17-27, 2000
[31] M. Linder, C. L. Saltzman, A History of Medical Scientists on High Heels, International Journal of Health Services, 28(2), pp 201-225, 1998
[32] T. V. Jones, G. M. Karst, P. A. Hageman, K. D. Patil, S. H. Bunner, D. D. Mundy, Effects of Alcohol on Gait in Elderly Women, Journal of the American Geriatrics Society, 45(9), pp P167, 1997

[33] E. C. Jansen, H. H. Thyssen, J. Brynskov, Gait Analysis after Intake of Increasing Amounts of Alcohol, Zeitschrift für Rechtsmedizin - Journal of Legal Medicine, 2, pp 103-107, 1985
[34] S. Monsell and F. Tennant, Walking Problems in Young Children, Hospital Medicine, 65(1), pp 34-38, 2004
[35] P. C. Grabiner, S. T. Biswas, and M. D. Grabiner, Age-Related Changes in Spatial and Temporal Gait Variables, Archives of Physical Medicine and Rehabilitation, 82(1), pp 31-35, 2001
[36] M. M. Samson, A. Crow, P. L. de Vreede, J. A. G. Dessens, S. A. Duursma and H. J. Verhaar, Differences in Gait Parameters at a Preferred Walking Speed in Healthy Subjects due to Age, Height and Body Weight, Aging - Clinical and Experimental Research, 13(1), pp 16-21, 2001
[37] J. Quinn, J. Kaye, The Neurology of Aging, Neurologist, 7(2), pp 98-112, 2001
[38] F. A. Rubino, Gait Disorders, Neurologist, 8(4), pp 254-262, 2002
[39] C. A. McGibbon, Toward a Better Understanding of Gait Changes with Age and Disablement: Neuromuscular Adaptation, Exercise and Sport Sciences Reviews, 31(2), pp 102-108, 2003
[40] M. B. van Iersel, W. Hoefsloot, M. Munneke, B. R. Bloem, M. G. M. O. Rikkert, Systematic Review of Quantitative Clinical Gait Analysis in Patients with Dementia, Zeitschrift für Gerontologie und Geriatrie, 37(1), pp 27-32, 2004
[41] L. Sloman, M. Pierrynowski, M. Berridge, S. Tupling, J. Flowers, Mood, Depressive-Illness and Gait Patterns, Canadian Journal of Psychiatry - Revue Canadienne de Psychiatrie, 32(3), pp 190-193, 1987
[42] N. Becker, C. Chambliss, C. Marsh, and R. Montemayor, Effects of Mellow and Frenetic Music and Stimulating and Relaxing Scents on Walking by Seniors, Perceptual and Motor Skills, 80(2), pp 411-415, 1995

Psychology

[43] G. Johansson, Visual Perception of Biological Motion and a Model for its Analysis, Perception and Psychophysics, 14, pp 201-211, 1973
[44] W. H. Dittrich, Action Categories and the Perception of Biological Motion, Perception, 22, pp. 15-22, 1993
[45] G. P. Bingham, R. C. Schmidt and L. D. Rosenblum, Dynamics and the Orientation of Kinematic Forms for Visual Event Recognition, Journal of Experimental Psychology: Human Perception and Performance, 21(6), pp. 1473-1493, 1995
[46] G. L. Pellechia and G. E. Garrett, Assessing lumbar stabilisation from point light and normal video displays of lumbar lifting, Perceptual and Motor Skills, 85(3), pp. 931-937, 1997
[47] L. T. Kozlowski and J. E. Cutting, Recognizing the Sex of a Walker from a Dynamic Point Light Display, Perception and Psychophysics, 21, pp. 575-580, 1977

[48] S. Runeson and G. Frykholm, Kinematic Specification of Dynamics as an Informational Basis for Person-and-Action Perception: Expectation, Gender Recognition and Deceptive Intention, Journal of Experimental Psychology: General, 112, pp. 585-615, 1983
[49] J. E. Cutting and D. R. Proffitt, Gait Perception as an Example of How we Perceive Events, In R. D. Walk and H. L. Pick Eds., Intersensory Perception and Sensory Integration, Plenum Press, London UK, 1981
[50] G. Mather and L. Murdock, Gender discrimination in biological motion displays based on dynamic cues, Proceedings of the Royal Society London, B: 258, pp. 273-279, 1994
[51] J. E. Cutting and L. T. Kozlowski, Recognising friends by their walk, Bulletin of the Psychonomic Society, 9(5), pp. 353-356, 1977
[52] D. D. Hoffman and B. E. Flinchbaugh, The Interpretation of Biological Motion, Biological Cybernetics, 42, pp. 195-204, 1982
[53] J. E. Cutting, D. R. Proffitt and L. T. Kozlowski, A biomechanical invariant for gait perception, Journal of Experimental Psychology: Human Perception and Performance, 4, pp. 357-372, 1978
[54] R. F. Rashid, Towards a system for the interpretation of moving light displays, IEEE Trans. on PAMI, 2(6), pp. 574-581, 1980
[55] S. V. Stevenage, M. S. Nixon and K. Vince, Visual Analysis of Gait as a Cue to Identity, Applied Cognitive Psychology, 13(6), pp 513-526, 1999
[56] F. E. Pollick, J. Kay, K. Heim et al., A review of gender recognition from gait, Perception, 31, Suppl. S, p 118, 2002

Computer Vision-Based Analysis of Human Motion

[57] K. Akita, Image Sequence Analysis of Real-World Human Motion, Pattern Recognition, 17(1), pp 73-83, 1984
[58] C. Cedras and M. Shah, Motion-based recognition, a Survey, Image and Vision Computing, 13(2), pp. 129-155, 1995
[59] J. K. Aggarwal, Q. Cai, W. Liao, B. Sabata, Nonrigid Motion Analysis: Articulated and Elastic Motion, Computer Vision and Image Understanding, 70(2), pp 142-156, 1998
[60] D. M. Gavrila, The Visual Analysis of Human Movement: A Survey, Computer Vision and Image Understanding, 73(1), pp 82-98, 1999
[61] J. K. Aggarwal, and Q. Cai, Human Motion Analysis: A Review, Computer Vision and Image Understanding, 73(3), pp 428-440, 1999
[62] T. B. Moeslund, E. Granum, A Survey of Computer Vision-Based Human Motion Capture, Computer Vision and Image Understanding, 81(3), pp 231-268, 2001
[63] L. Wang, W. M. Hu, T. N. Tan, Recent Developments in Human Motion Analysis, Pattern Recognition, 36(3), pp 585-601, 2003
[64] W. M. Hu, T. N. Tan, L. Wang, S. Maybank, A Survey on Visual Surveillance of Object Motion and Behaviors, IEEE Transactions on Systems, Man and Cybernetics Part C - Applications and Reviews, 34(3), pp 334-352, 2004

[65] K. Akita, Image Sequence Analysis of Real World Human Motion, Pattern Recognition, 17(1), pp. 73-83, 1984
[66] D. Hogg, Model-based vision - a program to see a walking person, Image and Vision Computing, 1(1), pp. 5-20, 1983
[67] R. J. Kauth, A. P. Pentland and G. S. Thomas, Blob: an unsupervised clustering approach to spatial pre-processing of MSS imagery, 11th Int. Symposium on Remote Sensing of the Environment, April, Ann Arbor, MI, USA, 1977
[68] S. Kurakake and R. Nevatia, Description and tracking of moving articulated objects, Systems and Computers in Japan, 25(8), pp. 16-26, 1994
[69] H-J. Lee and Z. Chen, Determination of 3D human body postures from a single view, Computer Vision, Graphics, and Image Processing, 30, pp. 148-168, 1985
[70] D. Marr and H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proc. Royal Society of London B, 200, pp. 269-294, 1978
[71] J. O'Rourke and N. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. Pattern Analysis and Machine Intelligence, 2(6), pp. 522-536, 1980
[72] R. Polana and R. Nelson, Detecting activities, Proc. Conf. on Computer Vision and Pattern Recognition, New York, USA, pp. 2-7, June 1993
[73] R. F. Rashid, Towards a system for the interpretation of moving light displays, IEEE Trans. Pattern Analysis and Machine Intelligence, 2(6), pp. 574-581, 1980
[74] K. Rohr, Towards model-based recognition of human movements in image sequences, Computer Vision, Graphics, and Image Processing, 59(1), pp. 94-115, 1994

Databases

[75] P. J. Phillips, S. Sarkar, I. Robledo, P. Grother and K. Bowyer, The Gait Identification Challenge Problem: Data Sets and Baseline Algorithm, Proceedings 16th International Conference on Pattern Recognition, pp 385-389, 2002
[76] P. J. Phillips, S. Sarkar, I. Robledo, P. Grother and K. Bowyer, Baseline Results for the Challenge Problem of Human ID Using Gait Analysis, Proceedings of the IEEE International Conference Face and Gesture Recognition '02, pp 137-143, 2002
[77] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother and K. Bowyer, The HumanID Gait Challenge Problem: Data Sets, Performance and Analysis, IEEE Trans. on PAMI, 27(2), pp 162-177, 2005
[78] I. R. Vega, and S. Sarkar, Statistical motion model based on the change of feature relationships: human gait-based recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), pp 1323-1328, 2003
[79] http://www.GaitChallenge.org
[80] R. Gross and J. Shi, The CMU Motion of Body (MoBo) Database, CMU-RI-TR-01-18, 2001
[81] http://www.umiacs.umd.edu/labs/pirl/hid/data.html
[82] http://www.sinobiometrics.com/resources.htm
[83] J. D. Shutler, M. G. Grant, M. S. Nixon, and J. N. Carter, On a Large Sequence-Based Human Gait Database, Special Session on Biometrics, Proceedings of the 7th International Conference on Recent Advances in Soft Computing, Nottingham (UK), 2002
[84] http://www.gait.ecs.soton.ac.uk/data.php3

Early work

[85] M. S. Nixon, J. N. Carter, D. Cunado, P. S. Huang and S. V. Stevenage, Automatic Gait Recognition, In: A. K. Jain, R. Bolle and S. Pankanti Eds., Biometrics: Personal Identification in Networked Society, pp. 231-250, Kluwer Academic Publishing (Dordrecht, Netherlands), 1999
[86] S. A. Niyogi, and E. H. Adelson, Analyzing and Recognizing Walking Figures in XYT, Proc. IEEE Computer Vision and Pattern Recognition, pp 469-474, 1994
[87] J. Little and J. Boyd, Describing motion for recognition, Proceedings of the International Symposium on Computer Vision, pp 235-240, 1995
[88] J. Little and J. Boyd, Recognizing People by Their Gait: the Shape of Motion, Videre, 1(2), pp 1-32, 1998
[89] J. E. Boyd, Video Phase-Locked Loops in Gait Recognition, Proc. of International Conference on Computer Vision, pp 696-703, 2001
[90] H. Murase and R. Sakai, Moving Object Recognition in Eigenspace Representation: Gait Analysis and Lip Reading, Pattern Recognition Letters, 17, pp 155-162, 1996
[91] P. S. Huang, C. J. Harris and M. S. Nixon, Recognizing Humans by Gait via Parametric Canonical Space, Proc. International ICSC Workshop on Engineering of Intelligent Systems EIS '98, 3, pp 384-389, 1998
[92] P. S. Huang, C. J. Harris and M. S. Nixon, Recognizing Humans by Gait via Parametric Canonical Space, Artificial Intelligence in Engineering, 13(4), pp 359-366, 1999
[93] P. S. Huang, C. J. Harris and M. S. Nixon, A Statistical Approach for Recognizing Humans by Gait using Spatial-Temporal Templates, Proc. IEEE International Conference on Image Processing ICIP '98, III, pp 178-182, 1998
[94] P. S. Huang, C. J. Harris and M. S. Nixon, Human gait recognition in canonical space using temporal templates, IEE Proceedings Vision, Image and Signal Processing, 146(2), pp. 93-100, 1999
[95] P. S. Huang, Automatic Gait Recognition via Statistical Approaches for Extended Template Features, IEEE Transactions on SMC-Part B: Cybernetics, 31(5), pp 818-824, 2001
[96] D. Cunado, M. S. Nixon and J. N. Carter, Using gait as a biometric, via phase-weighted magnitude spectra, Lecture Notes in Computer Science (Proceedings of the First International Conference on Audio Visual Biometric Person Authentication) 1206, pp 95-102, 1997

[97] D. Cunado, J. M. Nash, M. S. Nixon and J. N. Carter, Gait Extraction and Description by Evidence-Gathering, Proceedings of the Second International Conference on Audio- and Video-Based Biometric Person Authentication AVBPA99, Washington D.C., pp 43-48, 1999
[98] D. Cunado, M. S. Nixon and J. N. Carter, Automatic Extraction and Description of Human Gait Models for Recognition Purposes, Computer Vision and Image Understanding, 90(1), pp 1-41, 2003

Current approaches
[99] A. Sundaresan, A. RoyChowdhury, and R. Chellappa, A Hidden Markov
Model Based Framework for Recognition of Humans from Gait Sequences,
Proceedings IEEE International Conference on Image Processing, pp. 143-50,
2003
[100] A. Kale, A. N. Rajagopalan, A. Sundaresan, N. Cuntoor, A. RoyChowdhury, V. Kruger, and R. Chellappa, Identification of Humans using Gait, IEEE Transactions on Image Processing, pp 1163-1173, Sept. 2004
[101] C. BenAbdelkader, R. Cutler, H. Nanda and L. Davis, EigenGait: Motion-
Based Recognition Using Image Self-Similarity, Lecture Notes in Computer
Science (Proceedings of the Third International Conference on Audio Visual
Biometric Person Authentication) 2091, pp 289-294,2001
[102] C. BenAbdelkader, R. Cutler and L. Davis, Stride and Cadence as a Biometric in Automatic Person Identification and Verification, Proceedings IEEE Face and Gesture Recognition, pp 372-377, 2002
[103] C. BenAbdelkader, R. Cutler and L. Davis, Person identification using automatic height and stride estimation, Proc. of Intl. Conf. on Pattern Recognition, Quebec, Canada, 2002
[104] C. BenAbdelkader and L. Davis, Detection of load-carrying people for gait and activity recognition, in Proc. of Intl. Conf. on Automatic Face and Gesture Recognition, Washington, DC, USA, 2002
[105] Z. Liu and S. Sarkar, Simplest Representation Yet for Gait Recognition:
Averaged Silhouette, Proceedings 17th International Conference on Pattern
Recognition, 2004
[106] A. F. Bobick and A. Y. Johnson, Gait Recognition Using Static, Activity-
Specific Parameters, Proceedings IEEE Computer Vision and Pattern
Recognition 2001, I, pp 423-430,2001
[107] H. Moon, R. Chellappa, and A. Rosenfeld, 3D Object Tracking Using Shape-
Encoded Particle Propagation, Proceedings International Conference on
Computer Vision, II, pp 307-314, 2001.
[108] T. Yamamoto and R. Chellappa, Shape and Motion Driven Particle Filtering
for Human Body Tracking, Proc. Inti. Conf. on Multimedia and Expo, 3, pp. 61-
64, July 2003.
[109] A. Sundaresan, A. Roy Chowdhury and R. Chellappa, Multiple View Tracking of Human Motion Modelled by Kinematic Chains, Proc. Int. Conf. on Image Processing, October 2004
[110] R. Collins, R. Gross and J. Shi, Silhouette-based Human Identification from
Body Shape and Gait, Proceedings of the IEEE International Conference Face
and Gesture Recognition '02, pp 366-371, 2002

[111] Y. Liu, R. Collins, and Y. Tsin, Gait Sequence Analysis using Frieze Patterns, Proc. European Conference on Computer Vision, May 2002
[112] J. P. Foster, M. S. Nixon, and A. Prugel-Bennett, Automatic Gait Recognition using Area-Based Metrics, Pattern Recognition Letters, 24, pp 2489-2497, 2003
[113] J. B. Hayfron-Acquah, M. S. Nixon and J. N. Carter, Human Identification by Spatiotemporal Symmetry, Proceedings 16th International Conference on Pattern Recognition, 1, pp 632-635, 2002
[114] J. D. Shutler, and M. S. Nixon, Zernike Velocity Moments for Description and Recognition of Moving Shapes, Proceedings of the British Machine Vision Conference 2001, pp 705-714, 2001
[115] C-Y. Yam, M. S. Nixon and J. N. Carter, Automated Person Recognition by Walking and Running via Model-Based Approaches, Pattern Recognition, 37(5), pp 1057-1072, 2004
[116] D. K. Wagg, and M. S. Nixon, On Automated Model-Based Extraction and Analysis of Gait, Proceedings of the IEEE International Conference Face and Gesture Recognition '04, 2004
[117] L. Wang, T. Tan, H. Ning, and W. Hu, Fusion of Static and Dynamic Body Biometrics for Gait Recognition, IEEE Transactions on Circuits and Systems for Video Technology Special Issue on Image- and Video-Based Biometrics, 14(2), pp. 149-158, 2004
[118] R. Zhang, C. Vogler, D. Metaxas, Human Gait Recognition, Proceedings IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[119] L. Lee and W. E. L. Grimson, Gait Analysis for Recognition and Classification, Proceedings of the IEEE International Conference Face and Gesture Recognition '02, pp 155-162, 2002
[120] E. Tassone, G. West and S. Venkatesh, Temporal PDMs for Gait Classification, Proceedings 16th International Conference on Pattern Recognition, pp 1065-1069, 2002
[121] L. Wang, T. N. Tan, W. M. Hu, and H. Z. Ning, Automatic Gait Recognition Based on Statistical Shape Analysis, IEEE Transactions on Image Processing, 12(9), pp 1120-1131, 2003
[122] L. Wang, T. Tan, H. Z. Ning, and W. M. Hu, Silhouette Analysis-Based Gait Recognition for Human Identification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), pp 1505-1518, 2003
[123] B. Bhanu and J. Han, Individual Recognition by Kinematic-Based Gait Analysis, Proceedings 16th International Conference on Pattern Recognition, 3, pp 343-346, 2002
[124] B. Bhanu and J. Han, Human Recognition on Combining Kinematic and Stationary Features, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication) 2688, pp 600-608, 2003
[125] J. Han and B. Bhanu, Statistical Feature Fusion for Gait-based Human Recognition, Proc. IEEE Computer Vision and Pattern Recognition 2004, Washington, 2004
[126] G. Zhao, R. Chen, G. Liu, L. Hua, Amplitude Spectrum-based Gait Recognition, Proceedings of the IEEE International Conference Face and Gesture Recognition '04, pp 23-28, 2004

[127] C-S. Lee, A. Elgammal, Gait Style and Gait Content: Bilinear Models for Gait Recognition using Gait Re-Sampling, Proceedings of the IEEE International Conference Face and Gesture Recognition '04, pp 147-152, 2004
[128] T. Kobayashi, and N. Otsu, Action and Simultaneous Multiple-Person Identification using Cubic Higher-order Local Auto-Correlation, Proceedings 17th International Conference on Pattern Recognition, 2004
[129] I. R. Vega, and S. Sarkar, Statistical Motion Model Based on the Change of Feature Relationships: Human Gait-Based Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), pp 1323-1328, 2003
[130] J. E. Boyd, Synchronization of Oscillations for Machine Perception of Gaits, Computer Vision and Image Understanding, 96, pp 35-59, 2004
[131] M. Hild, Estimation of 3D Motion Trajectory and Velocity from Monocular Image Sequences in the Context of Human Gait Recognition, Proceedings 17th International Conference on Pattern Recognition, 2004
[132] J. W. Davis, Visual categorization of children and adult walking styles, in Proc. of Intl. Conf. on Audio- and Video-based Biometric Person Authentication, 2001, pp. 295-300
[133] J. W. Davis, S. R. Taylor, Analysis and Recognition of Walking Movements, Proceedings 16th International Conference on Pattern Recognition, pp 10315-10319, 2002
[134] J. P. Foster, M. S. Nixon, and A. Prugel-Bennett, Recognising Movement and Gait by Masking Functions, Lecture Notes in Computer Science (Proceedings of the Third International Conference on Audio Visual Biometric Person Authentication) 2091, pp 278-283, 2001
[135] J. B. Hayfron-Acquah, M. S. Nixon, J. N. Carter, Automatic Gait Recognition by Symmetry Analysis, Lecture Notes in Computer Science (Proceedings of the Third International Conference on Audio Visual Biometric Person Authentication) 2091, pp 312-317, 2001
[136] J. B. Hayfron-Acquah, M. S. Nixon, and J. N. Carter, Recognising Human and Animal Movement by Symmetry, Proceedings of the IEEE International Conference on Image Processing ICIP '01, Thessaloniki, pp 290-293, 2001
[137] J. B. Hayfron-Acquah, M. S. Nixon and J. N. Carter, Automatic Gait Recognition by Symmetry Analysis, Pattern Recognition Letters, 24(13), pp 2175-2183, 2003
[138] J. D. Shutler, Zernike Velocity Moments for Holistic Shape Description of Moving Features, PhD Thesis, University of Southampton, 2002
[139] C. Y. Yam, M. S. Nixon, and J. N. Carter, Extended Model-Based Automatic Gait Recognition of Walking and Running, Lecture Notes in Computer Science (Proceedings of the Third International Conference on Audio Visual Biometric Person Authentication), 2091, pp 272-277, 2001
[140] C-Y. Yam, M. S. Nixon and J. N. Carter, Gait Recognition by Walking and Running: a Model-Based Approach, Proceedings of the Asian Conference on Computer Vision ACCV 2002, pp 1-6, 2002
[141] C-Y. Yam, M. S. Nixon and J. N. Carter, On the Relationship of Human Walking and Running: Automatic Person Identification by Gait, Proceedings 16th International Conference on Pattern Recognition, 1, pp 287-290, 2002
[142] A. Y. Johnson and A. F. Bobick, A Multi-View Method for Gait Recognition Using Static Body Parameters, Lecture Notes in Computer Science (Proceedings of the Third International Conference on Audio Visual Biometric Person Authentication), 2091, pp 301-311, 2001
[143] A. Y. Johnson, J. Sun and A. F. Bobick, Predicting Large Population Data Cumulative Match Characteristic Performance from Small Population Data, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication), 2688, pp 821-829, 2003
[144] A. Kale, N. Cuntoor, B. Yegnanarayana, A. Rajagopalan, and R. Chellappa, Gait Analysis for Human Identification, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication) 2688, pp 706-714, 2003
[145] A. Kale, A. N. Rajagopalan, N. Cuntoor, and V. Kruger, Identification of Humans using Gait, Proceedings of the IEEE International Conference Face and Gesture Recognition '02, pp 366-371, 2002

Further Analysis

[146] J. E. Boyd and J. Little, Biometric Gait Recognition, Proc. Summer School on Biometrics, forthcoming as Lecture Notes in Computer Science, 2003
[147] G. Veres, J. N. Carter and M. S. Nixon, What image information is important in silhouette-based gait recognition, Proceedings IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[148] A. Veeraraghavan, R. Chellappa and A. Roy Chowdhury, Role of Shape and Kinematics in Human Movement Analysis, Proceedings IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[149] Z. Liu, L. Malave, A. Osuntugun, P. Sudhakar, and S. Sarkar, Towards Understanding the Limits of Gait Recognition, Proc. SPIE International Symposium on Defense and Security: Biometric Technology for Human Identification, pp 195-205, April 2004
[150] Z. Liu, L. Malave, and S. Sarkar, Studies on Silhouette Quality and Gait Recognition, Proceedings IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[151] D. Tolliver and R. Collins, Gait shape estimation for identification, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication) 2688, pp 734-742, 2003
[152] M. S. Nixon, J. N. Carter, J. D. Shutler and M. G. Grant, Automatic Recognition by Gait: Progress and Prospects, Sensor Review, 23(4), pp 323-331, 2003
[153] L. Lee, G. Dalley and K. Tieu, Learning pedestrian models for silhouette refinement, Proceedings 9th International Conference on Computer Vision, pp 663-670, 2003
[154] S. D. Mowbray and M. S. Nixon, Automatic Gait Recognition via Fourier Descriptors of Deformable Objects, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication), 2688, pp 566-573, June 2003
[155] S. D. Mowbray and M. S. Nixon, Extraction and Recognition of Periodically Deforming Objects by Continuous, Spatio-temporal Shape Description, Proc. IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004

166
[156] J. Zhang, R. Collins, and Y. Liu, Representation and Matching of Articulated Shapes, Proc. IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[157] D. K. Wagg, and M. S. Nixon, Automated Markerless Extraction of Walking People Using Deformable Contour Models, Computer Animation and Virtual Worlds, 15(3-4), pp. 399-406, 2004
[158] R. Urtasun and P. Fua, 3D Tracking for Gait Characterization and Recognition, Proceedings of the IEEE International Conference Face and Gesture Recognition '04, 2004
[159] J-H. Yoo, M. S. Nixon and C. J. Harris, Model-Driven Statistical Analysis of Human Gait Motion, Proceedings IEEE International Conference on Image Processing, pp 285-288, 2002
[160] V. Laxmi, J. N. Carter and R. I. Damper, Support Vector Machines and Human Gait Classification, Proceedings IEEE Workshop Automatic Identification Advanced Technologies (AutoID '02), pp 17-22, 2002
[161] R. Tanawongsuwan and A. Bobick, Gait Recognition from Time-normalized Joint-Angle Trajectories in the Walking Plane, Proceedings IEEE Computer Vision and Pattern Recognition 2001, II, pp 726-731, 2001
[162] R. Tanawongsuwan, and A. F. Bobick, Performance Analysis of Time-Distance Gait Parameters under Different Speeds, Lecture Notes in Computer Science (Proceedings of the Fourth International Conference on Audio Visual Biometric Person Authentication) 2688, pp 715-724, 2003
[163] R. Tanawongsuwan and A. Bobick, Modeling the Effects of Walking Speed on Appearance-based Gait Recognition, Proceedings IEEE Computer Vision and Pattern Recognition 2004, Washington, July 2004
[164] J. W. Davis, Visual categorization of children and adult walking styles, in Proc. of Intl. Conf. on Audio- and Video-based Biometric Person Authentication, 2001, pp. 295-300
[165] G. V. Veres, M. S. Nixon, and J. N. Carter, Modelling the time-variant covariates for gait recognition, Proc. AVBPA 2005, Springer, 2005
[166] J. N. Carter and M. S. Nixon, On Measuring Gait Signatures which are Invariant to their Trajectory, Measurement and Control, 32(9), pp 265-269, 1999
[167] N. Spencer and J. N. Carter, Viewpoint Invariance in Automatic Gait Recognition, Proceedings IEEE Workshop Automatic Identification Advanced Technologies (AutoID '02), pp 1-6, 2002
[168] N. Spencer and J. N. Carter, Pose Invariant Gait Reconstruction, Proc. ICIP, 2005
[169] A. Kale, A. K. R. Chowdhury, and R. Chellappa, Towards a View Invariant Gait Recognition Algorithm, Proceedings Advanced Video and Signal Based Surveillance, pp 143-150, 2003
[170] A. Kale, A. Roy-Chowdhury, and R. Chellappa, Gait-based human identification from a monocular video sequence, in Handbook on Pattern Recognition and Computer Vision (3rd Edition), C. H. Cheng and P. S. P. Wang, Eds., World Scientific Publishing Company Pvt. Ltd., in press
[171] G. Shakhnarovich, L. Lee, and T. Darrell, Integrated face and gait recognition from multiple views, in Proc. Conf. on Computer Vision and Pattern Recognition, I, pp. 439-446, 2001

167
[172] S.Zhou and R.Chellappa, Probabilistic human recognition from video Proc. of
European Conference on Computer Vision, 2002
[173] C. BenAbdelkader, R. Cutler, and L. Davis View-invariant Estimation of
Height and Stride for Gait Recognition, Lecture Notes in Computer Science,
2359 , pp 155-159,2002
[174] K. J. Sharman, M. S. Nixon and J. N. Carter, Extraction and Description of
3D (Articulated) Moving Objects, Proc. 3D Data Processing Visualization and
Transmission, pp 664-667, 2002
[175] S. P. Prismall, M. S. Nixon and J. N. Carter, On Moving Object Reconstruction
by Moments, Proceedings of the British Machine Vision Conference 2002,
pp 73-82, 2002
[176] S. P. Prismall, M. S. Nixon, and J. N. Carter, Novel Temporal Views of
Moving Objects for Gait Biometrics, Lecture Notes in Computer Science
(Proceedings of the Fourth International Conference on Audio Visual Biometric
Person Authentication) 2688, pp 725-733, Guildford (UK), 2003
[177] B. Bhanu and J. Han, Kinematic-based human motion analysis in infrared
sequences, Proceedings of the Sixth IEEE Workshop on Applications of
Computer Vision, pp 208-212, Orlando (USA), 2002
[178] B. Bhanu and J. Han, Bayesian based performance prediction for gait
recognition, in Proc. of Workshop on Motion and Video Computing
(MOTION'02), Orlando, Florida, 2002
[179] Q. Jiang and C. Daniell, Recognition of Human and Animal Movement Using
Infrared Video Streams, Proceedings IEEE International Conference on Image
Processing 2004, 2004
[180] L. Wang, T. Tan, H. Ning, and W. Hu, Fusion of Static and Dynamic Body
Biometrics for Gait Recognition, IEEE Transactions on Circuits and Systems
for Video Technology, Special Issue on Image- and Video-Based Biometrics,
14(2), pp 149-158, 2004
[181] H. Ning, T. Tan, L. Wang and W. Hu, People tracking based on motion model
and motion constraints with automatic initialization, Pattern Recognition
(accepted)
[182] G. Shakhnarovich and T. Darrell, On Probabilistic Combination of Face and
Gait Cues for Identification, Proceedings of the IEEE International Conference
Face and Gesture Recognition '02, pp 176-181, Washington (USA), 2002
[183] P. C. Cattin, D. Zlatnik and R. Borer, Sensor Fusion for a Biometric System
using Gait, Proc. International Conference on Multisensor Fusion and
Integration for Intelligent Systems, pp 233-238, 2001
[184] N. Cuntoor, A. Kale, and R. Chellappa, Combining Multiple Evidences for
Gait Recognition, Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, 3, pp 6-10, 2003
[185] D. Wagg and M. S. Nixon, Model-Based Gait Enrolment in Real-World
Imagery, Proceedings 2003 Workshop on Multimodal User Authentication
MMUA, pp 189-195, Santa Barbara (USA), 2003
[186] C-Y. Yam, M. S. Nixon and J. N. Carter, Automated Markerless Analysis of
Human Walking and Running by Computer Vision, Proceedings of the World
Congress on Biomechanics, Calgary (Canada), 2002
[187] J-H. Yoo and M. S. Nixon, Markerless Human Gait Analysis via Image
Sequences, Proceedings International Society of Biomechanics XIXth Congress,
Dunedin (NZ), 2003

Other Related Work

[188] N. Lynnerup and J. Vedel, Person Identification by Gait Analysis and
Photogrammetry, Journal of Forensic Science, 50(1), Jan. 2005
[189] A. J. O'Toole, T. Price, T. Vetter, J. C. Bartlett and V. Blanz, Three-
dimensional shape and two-dimensional surface textures of human faces: The
role of "averages" in attractiveness and age, Image and Vision Computing
Journal, 18, pp. 9-19, 1999
[190] V. Blanz and T. Vetter, Face Recognition Based on Fitting a 3D Morphable
Model, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9),
2003
[191] R. Plaenkers and P. Fua, Articulated Soft Objects for Multi-View Shape and
Motion Capture, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 25(10), 2003
[192] K. Waters and D. Terzopoulos, The Computer Synthesis of Expressive Faces,
Philosophical Transactions of the Royal Society of London, B, 335(1273),
pp 87-93, 1992
[193] P. Faloutsos, M. van de Panne and D. Terzopoulos, The Virtual Stuntman:
Dynamic Characters with a Repertoire of Motor Skills, Computers and
Graphics, 25(6), pp 933-953, 2001
[194] F. Multon, L. France, M-P. Cani-Gascuel, and G. Debunne, Computer
Animation of Human Walking: a Survey, Journal of Visualization and
Computer Animation, 10, pp 39-54, 1999
[195] D. Terzopoulos, Visual modeling for computer animation: Graphics with a
vision, Computer Graphics, 33(4), pp 42-45, 1999
[196] L. P. Nedel and D. Thalmann, Anatomic modeling of deformable human
bodies, Visual Computer, 16(6), pp 306-321, 2000
[197] H. C. Sun, and D. N. Metaxas, Automating Gait Generation, Proceedings
ACM SIGGRAPH 2001, pp 261-269, 2001

General
[198] L. R. Rabiner and B. H. Juang, An Introduction to Hidden Markov Models,
IEEE ASSP Magazine, January, pp. 4-16, 1986
[199] L. Rabiner and H. Juang, Fundamentals of Speech Recognition, Prentice Hall,
1993
[200] L. R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Proceedings of the IEEE, 77(2), pp. 257-285, February
1989
[201] B. H. Juang, On the Hidden Markov Model and Dynamic Time Warping for
Speech Recognition - a Unified View, AT&T Bell Laboratories Technical
Journal, vol. 63, pp. 1213-1243, 1984
[202] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, On combining classifiers,
IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(3), pp. 226-239,
March 1998
[203] R. Brunelli and D. Falavigna, Person identification using multiple cues, IEEE
Trans. on Pattern Analysis and Machine Intelligence, 17(10): 955-966, 1995

[204] L. Hong and A. Jain, Integrating faces and fingerprints for personal
identification, IEEE Trans. on Pattern Analysis and Machine Intelligence,
20(12): 1295-1307, 1998
[205] B. Achermann and H. Bunke, Combination of classifiers on the decision level
for face recognition, Technical Report IAM-96-002, University Bern, 1996
[206] I. L. Dryden and K. V. Mardia, Statistical Shape Analysis, John Wiley and
Sons, 1998
[207] P. V. Overschee and B. D. Moor, Subspace algorithms for the stochastic
identification problem, Automatica, vol. 29, pp. 649-660, 1993
[208] S. Soatto, G. Doretto, and Y. N. Wu, Dynamic textures, Proc. of International
Conf. on Computer Vision, 2, pp. 439-446, 2001
[209] G. Golub and C. V. Loan, Matrix Computations, The Johns Hopkins
University Press, Baltimore, 1989.
[210] K. D. Cock and B. D. Moor, Subspace angles and distances between ARMA
models, Proc. of the Intl. Symp. of Math. Theory of Networks and Systems, 2000
[211] M. Isard and A. Blake, Contour tracking by stochastic propagation of
conditional density, Proc. of European Conference on Computer Vision, 1,
pp. 343-356, 1996
[212] V. Nalwa, A Guided Tour of Computer Vision, Addison-Wesley, 1993
[213] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision,
Cambridge University Press, 2000
[214] J. Kent, New directions in shape analysis, The art of statistical science: a
tribute to G. S. Watson, pp. 115-127, Wiley, Chichester, 1992
[215] Y. Yang and M. Levine, The background primal sketch: an approach for
tracking moving objects, Machine Vision and Applications, 5: 17-34, 1992
[216] Y. Kuno, T. Watanabe, Y. Shimosakoda and S. Nakagawa, Automated
detection of human for visual surveillance system, in Proc. of Intl. Conf. on
Pattern Recognition, pp. 865-869, 1996
[217] I. Haritaoglu, D. Harwood and L. Davis, W4: real-time surveillance of people
and their activities, IEEE Trans. on Pattern Analysis and Machine Intelligence,
22(8): 809-830, 2000
[218] J. Phillips, H. Moon, S. Rizvi and P. Rauss, The FERET evaluation
methodology for face recognition algorithms, IEEE Trans. on Pattern Analysis
and Machine Intelligence, 22(10): 1090-1104, 2000
[219] M. Isard and A. Blake, Condensation - conditional density propagation for
visual tracking, International Journal of Computer Vision, 29(1): 5-28, 1998
[220] S. Wachter and H. Nagel, Tracking persons in monocular image sequences,
Computer Vision and Image Understanding, 74(3): 174-192, 1999
[221] C. Sminchisescu and B. Triggs, Covariance scaled sampling for monocular 3D
body tracking, Proc. of Intl. Conf. on Computer Vision and Pattern Recognition,
2001

9 Appendices

Appendix 9.1 Southampton Data Acquisition Forms

Appendix 9.1.1 Laboratory Set-up Forms

Session Number ________

Check list and general points for each 1 hour data grab session
Please read carefully, then tick each corresponding box once the check has been
completed. The session comments form (located at the end of this checklist)
needs to be filled in at the end of each session. All the completed forms then
need to be stored for future reference.

General notes for each session:


• Ask everyone to wipe their feet before starting.

• Thank everyone at the end, give them their book token!

• Don't tell the subjects when you're filming.

• Switch the cameras on at the beginning of the session, leave all of them
recording until the end of the session.

• Try not to talk to the subjects as they walk, as they will look in your
direction.

• When subjects use the treadmill, make sure they use the safety clip.

• When subjects are using the track, don't have anyone in the main lab,
as this encourages them to talk, and makes them feel more
self-conscious.

• Order of data grab: outside, inside, training on treadmill, treadmill.

• 'Outside' camera settings are different, see camera set-up sheets.

• Don't let the subjects use the treadmill until they have walked outside
and inside on the track, as it may temporarily affect their walk.

• Remember that getting background data is just as important as subject
data (especially outside).

• Ensure that no subjects are left unattended while they are in the lab, or
around any of the equipment.

• Ensure that someone is with the equipment outside at all times
(cameras Emily and Felix).

• Subjects need to write the session number and subject number on the
back of their questionnaire (in a thick black marker), which they then
need to hold up in front of the camera for each batch of walking
(outside, inside track and treadmill).

• If any of the cameras auto-turn off, or get knocked once they have
been set-up, they must be re-checked (all settings!), especially
exposure, focus and zoom.

• There is a first aid kit in the main lab, on the lowest shelf by the
cameras, in a green pouch.
Prior to a session beginning
□ Start this checklist (at least) 45 minutes before the subjects are due to
arrive, it may take some time.
□ Cone off the area behind the cameras (outside) to allow
pedestrians to walk behind the cameras. (Do this the night before?)
□ Switch on all lights, starting with the light switch on the right as
you enter the lab (wall lighting for the main track backdrop). There is
no need to use the ceiling fluorescent lights.
□ There is no need to use the light switches on the
lights/treadmill/camera, use the wall plugs for everything. All wall
plugs that are needed are outlined in green tape, all should be
switched on 15 mins before the start of the session. In addition there
is a set behind the right hand side of the treadmill backdrop cloth and
a single light switch for the left hand side track backdrop light
(located on the wall on the left as you walk into the main lab).
□ Check for evidence of any light stands or camera tripods having
moved.
□ Plug in the laptops and cameras into the mains (each camera has a
specific position, labelled on the floor directly below each tripod).
□ Switch on the radio.
□ Go through each of the camera set-up sheets.
□ Load labelled tapes into each camera (camera id, session number
and date). Refer to the tape labelling scheme.
□ Confirm on the camera set-up sheets that the tapes have been
loaded.
□ Do you have a copy of the instructions to subjects?
Walking track
□ Clean the track floor with a brush, to remove anything that might get
ground into the carpet.
□ Ensure the Canon camera (Amy) is normal to the track, not the
screen.

□ Turn off the air conditioning.
□ Ensure that the green material on the floor is flat with no ripples.
If you need to walk on the material to do this, then remove your shoes
before doing so as they will leave foot prints.
□ Re-tension the background cloth, if needed.
□ Ensure background & subject lighting don't interfere.
□ Walk along track twice looking for shadows on ground or background.
□ Ensure no light is leaking out of the edges of the lights bouncing
off of the ceiling. There should be pieces of card placed in the shutters
to stop this. You should not be able to see the lights themselves as you
walk along the track (both directions).

Treadmill
□ Lubricate both treadmills (training and filming).
□ Ensure the Canon camera (Bettie) is normal to the treadmill.
□ Re-tension the background cloth, if needed.
□ Hide the treadmill display.
□ Ensure the mirror is in front of the treadmill.
□ Check for direct shadows on belt.
□ Ensure smooth join of background floor & vertical cloth.

Room
□ Ensure that the air conditioning is off - you may have to re-adjust the
backdrops if it is not.
□ Ensure that all of the fluorescent lights are off.

Indoor Tripods

□ Visual checks only:

o Min height on top leg extension

o Max height on bottom leg extension

o Central height extension at min

o Dolly feet locked (important because this raises tripod slightly)

o Level on both planes (check spirit levels on tripod head)

□ Ensure that all the tripods are located correctly (triangular gaffer tape
markers on the floor).

Outdoor Tripods

□ Check that the spikes at the ends of the legs of both tripods are just
below the black feet, i.e. at the point where they disappear from view,
when viewing across the end of the black foot.
□ Min height on top leg extension.
□ Max height on bottom leg extension.
□ Central height extension at minimum.
□ All tripods located with feet and centres lined up with painted
markers on the tarmac.
□ Mounting heads attached to cameras with "lens ^" marker aligned
along camera axis (i.e. pointing at lens!)
□ Level on both planes (check spirit levels on tripod head with the
cameras mounted).
□ Check both cameras are aimed at the centre point (marked with a
cross) of the walking area.
At the end of each session
□ Before the subjects leave, check that the correct number of subject
consent forms and questionnaires are present.
□ Remove tapes, and set to write protect, ensure they are labelled, if
not label them, see the tape labelling scheme.
□ Turn off all wall power points (lights, cameras, laptop etc).
□ Remove cameras and place back in the storage cupboard in the
office, or on charge (cameras Emily and Felix).
□ Fill in session comments form (see below), making a note of any
problems.
□ Lock doors of the lab on the way out.
□ Put the 'outside' cameras on charge.
□ Put the write protected tapes away.
□ Record the session number on this checklist.
□ Increment the counter on the wipe board in the office (session and
subject numbers).
□ Store this completed checklist and session comment form in the
folder.

Appendix 9.1.2 Camera Set-up Forms

Camera set-ups
Session Number ________
Please tick each corresponding box once the check has been completed. Note:
Not all the cameras have the same settings!

Camera A (Amy) - Canon (same for B (Bertie) - Canon)


• Canon

□ Ensure that the mounting head is attached to the camera with "lens ^"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Plug into the mains.
□ Lens cap off.
□ Put into progressive scan mode ('P Scan' mode).
□ Check PRO.SCAN appears on the screen.
□ Shutter speed 1/250 sec (menu).
□ Digital zoom off (menu).
□ Image stabiliser off (menu).
□ White balance set to "indoor" (menu), appears as a light bulb on view
screen.
□ Exp lock (exp button on back of camera) set to +/-0.
□ Check view is fully zoomed out.
□ Auto focus off - button at back ("focus") near power switch. Press &
adjust focus using dial.
□ Set the manual focus. When setting the manual focus, it is easier to auto
focus onto someone (i.e. a set-up person) and then switch to manual mode.
Probably more accurate than by hand...

• Appearing on the display screen:

□ PRO.SCAN
□ 1/250
□ Light bulb (white balance)
□ M.Focus
□ SP
□ E.Lock +/- 0 (exposure setting)

Camera C (Cathy) - Sony (same for D (Dave) - Sony)

• Sony

□ Ensure that the mounting head is attached to the camera with "lens ^"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Plug into the mains.
□ Lens cap off.
□ Interlaced mode ('Camera' mode).
□ Check view is fully zoomed out.
□ Progressive scan off (menu).
□ Auto shutter off (menu).
□ Digital zoom off (menu).
□ Steady shot off (menu).
□ 0 dB gain (menu).
□ 1/300 sec shutter.
□ Aperture F2.8 (set on screen).
□ White balance "indoor" (hit white balance button on back of camera,
spin dial until a light bulb icon appears on view screen).
□ Set the manual focus. Auto focus switch off at front near lens. It is
easier to auto focus onto someone (i.e. a set-up person) and then switch to
manual mode. Probably more accurate than by hand...

• Appearing on the display screen:

□ 300
□ F2.8
□ 0 dB gain
□ SP
□ Light bulb (white balance)
□ Hand with an 'F' over it (manual focus).
□ Hand with an 'OFF' over it (steady shot).

Outside Cameras: Note settings are different!
Camera E (Emily) - Canon

• Canon

□ Ensure that the mounting head is attached to the camera with "lens ^"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Lens cap off.
□ Check battery is fully charged, if not swap the battery with one of the
other cameras.
□ Put into progressive scan mode ('P Scan' mode).
□ Check PRO.SCAN appears on the screen.
□ Shutter speed 1/250 sec (menu).
□ Digital zoom off (menu).
□ Image stabiliser off (menu).
□ White balance set to "outdoor" (menu), appears as a sun on view screen.
□ Auto exposure.
□ Check view is fully zoomed out.
□ Auto focus off - button at back ("focus") near power switch. Press &
adjust focus using dial.
□ Set the manual focus. When setting the manual focus, it is easier to auto
focus onto someone (i.e. a set-up person) and then switch to manual mode.
Probably more accurate than by hand...

• Appearing on the display screen:

□ PRO.SCAN
□ 1/250
□ Sun (white balance)
□ M.Focus
□ SP
□ Auto (exposure, i.e. no E.Lock)

Camera F (Felix) - Sony

• Sony

□ Ensure that the mounting head is attached to the camera with "lens ^"
marker aligned along camera axis (i.e. pointing at lens!). There should be a
marked line to help you check the alignment.
□ Ensure you have the long life battery.
□ Lens cap off.
□ Interlaced mode ('Camera' mode).
□ Check view is fully zoomed out.
□ Progressive scan off (menu).
□ Auto shutter off (menu).
□ Digital zoom off (menu).
□ Steady shot off (menu).
□ 0 dB gain (menu).
□ 1/300 sec shutter.
□ Auto exposure.
□ White balance "outdoor" (hit white balance button on back of camera,
spin dial until a sun icon appears on view screen).
□ Set the manual focus. Auto focus switch off at front near lens. It is
easier to auto focus onto someone (i.e. a set-up person) and then switch to
manual mode. Probably more accurate than by hand...

• Appearing on the display screen:

□ 300
□ Auto exposure (i.e. no 'F')
□ SP
□ Sun (white balance)
□ Hand with an 'F' over it (manual focus).
□ Hand with an 'OFF' over it (steady shot).
Return to the checklist.

• To be completed after the tapes have been loaded:

□ Camera A shows loaded tape symbol
□ Camera B shows loaded tape symbol
□ Camera C shows loaded tape symbol
□ Camera D shows loaded tape symbol
□ Camera E shows loaded tape symbol
□ Camera F shows loaded tape symbol

General notes about the outdoor cameras

• White balance set to outdoor (picture of sun on the Sony display, on
the Canon this is a menu option).

• Keep an eye on the battery levels, especially the Sony camera. Use the
viewfinders, not the LCD displays.

• Worry about battery levels!
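
For readers who later automate checks of this protocol, the set-up sheets above can be captured in software. The following Python sketch is purely illustrative and not part of the original forms: the dictionary keys and function names are our own invention, and the values are transcribed from the sheets above.

# Illustrative encoding of the camera set-up sheets (not part of the
# original protocol; field names are invented for this sketch).

INDOOR_CANON = {
    "mode": "progressive scan",   # 'P Scan' mode, PRO.SCAN on screen
    "shutter": "1/250 sec",
    "digital_zoom": False,
    "image_stabiliser": False,
    "white_balance": "indoor",    # light bulb icon on the view screen
    "exposure": "E.Lock +/-0",
    "focus": "manual",
    "zoom": "fully out",
}

INDOOR_SONY = {
    "mode": "interlaced",         # 'Camera' mode, progressive scan off
    "shutter": "1/300 sec",
    "aperture": "F2.8",
    "gain": "0 dB",
    "digital_zoom": False,
    "steady_shot": False,
    "white_balance": "indoor",
    "exposure": "manual",         # auto shutter off, fixed shutter/aperture
    "focus": "manual",
    "zoom": "fully out",
}

# The outdoor sheets differ only in white balance and exposure control.
OUTDOOR_CANON = dict(INDOOR_CANON, white_balance="outdoor", exposure="auto")
OUTDOOR_SONY = dict(INDOOR_SONY, white_balance="outdoor", exposure="auto")
del OUTDOOR_SONY["aperture"]      # no fixed F2.8 when exposure is automatic

CAMERAS = {
    "A (Amy)": INDOOR_CANON,
    "B (Bertie)": INDOOR_CANON,
    "C (Cathy)": INDOOR_SONY,
    "D (Dave)": INDOOR_SONY,
    "E (Emily)": OUTDOOR_CANON,
    "F (Felix)": OUTDOOR_SONY,
}

def problems(cfg):
    """Return the settings that would invalidate a capture session."""
    issues = []
    if cfg["digital_zoom"]:
        issues.append("digital zoom must be off")
    if cfg["focus"] != "manual":
        issues.append("focus must be set manually")
    if cfg["zoom"] != "fully out":
        issues.append("view must be fully zoomed out")
    return issues

for name, cfg in CAMERAS.items():
    assert not problems(cfg), (name, problems(cfg))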

Appendix 9.1.3 Session Coordinator's Instructions
Instructions to Subjects:
General
• Ask the subjects to only walk within the grey/green dotted lines.
• Ask them not to run anywhere.

• Then take them into the lab and explain what we would like them to
do.

• Don't let them use the treadmill until they have walked outside and
inside on the track, as it may temporarily affect their walk.

• Take them back into the coffee room/side office and ask them to fill
in the consent and questionnaire forms prior to data grab, issuing each
subject with a subject and session number, both of which they need to
record on the questionnaire.

• Ask them to write the session number and their subject number in
large numbers on the back of their questionnaire (using a thick black
marker). This enables them to hold it up for the still shots.

• Then ask them, one at a time, to walk outside (weather permitting) and
on the track inside.

• Train and get them used to the training treadmill in the side
lab/storage room and then film them on the treadmill in the main lab.

• Encourage the subjects who are not walking to sit down on the chairs
provided.

• Remind the subjects to only walk within the grey/green dotted lines.
Treadmill
• Ask them to look in the mirror when walking.
• They can stop the treadmill by pulling out the safety tag, or by hitting
stop.

• Hitting start increases the speed slowly up to the programmed speed.

• Before they start, you will want to take side and frontal pictures using
the still camera.

• Ask the subjects to start walking whenever they are ready.

• When subjects have finished on the treadmill, replace the yellow cord
and safety clip with the single red tab to enable background data
collection.

• Ensure that adequate background data is captured with the yellow
safety cord removed.
Walking Track
• They need to walk around the loop at least 8 times (8 times each
way).
• Walk on the green track and follow the dotted grey lines at either end
of the straight track.

• Keep count of the number of times they have walked past the
cameras.
Outside
• The subjects need to walk to enable 4 good runs across the track with
no interference, so they may need to walk for longer, depending on
how busy it is. A rule of thumb is 8 times each way, 16 in total.
• Don't forget the background data.

• Keep count of the number of times they have walked past the
cameras.
Back to main checklist (html link).
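
The pass counts above are simple to tally; the short Python sketch below (ours, purely illustrative; the thresholds are transcribed from these instructions, the names are invented) makes the completion criteria explicit.

# Illustrative completion criteria for a subject's walking data,
# transcribed from the coordinator's instructions above.

def track_done(passes_each_way):
    """Indoor track: at least 8 laps of the loop (8 passes each way)."""
    return min(passes_each_way.values()) >= 8

def outside_done(clean_runs, total_passes):
    """Outside: 4 good runs with no interference; the rule of thumb is
    8 passes each way, 16 in total, more if the area is busy."""
    return clean_runs >= 4 and total_passes >= 16

# Example tallies kept by the session coordinator:
print(track_done({"towards": 8, "away": 8}))        # True
print(outside_done(clean_runs=4, total_passes=16))  # True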

Appendix 9.1.4 Subject Information Form

Subject Information Form

For purposes of quantifying the gait data we are collecting, we would like to
record personal data that might affect your gait. Please note that neither your
name nor your identity will ever be associated with any data we collect. Please
do not feel compelled to answer any question, especially if you consider it to
be too personal in nature.
A. Gait is actually controlled by gender. Women swing their hips more, and
men their shoulders.

As such, please circle if you are male or female.

B. We will need to make recognition at a distance, as such invariant with size.

As such, what is your approximate height? _

what is your approximate weight? _

C. Gait might be affected by limb dominance. By way of example, if you are
right-footed (right limb dominant) you would best kick a football with your
right foot.

As such, please circle the one that best applies to you:

are you left-footed? right-footed? ambidextrous (ambifootrous?) or
don't know.

Although we don't know any precise relationship between hand and foot
dominance (and you might have circled don't know), please circle for your
dominant hand:

are you left-handed? right-handed? ambidextrous? or don't know.

D. We don't know if ethnicity affects gait.

As such, how would you describe your ethnic origin?

E. Physical and mental condition can affect gait.

As such, are you taking any medication that might affect your gait? YES / NO

If YES, what is it? _

does any physical condition affect your gait? YES / NO

If YES, what is it? _

Do you know of any other factors that might affect your gait, which might be of
interest in our research? If so, please describe below.
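
Where the questionnaire responses are stored alongside the video data, a typed record keeps these covariates consistent across sessions. The Python sketch below is a minimal illustration only: the field names and units are our own, not the actual database schema, and every field except the identifiers is optional because subjects may decline any question.

# Minimal record for the self-reported covariates on the form above.
# Field names and units are illustrative, not the database's schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubjectInfo:
    subject_id: int                       # assigned at the session; never the name
    session_id: int
    sex: Optional[str] = None             # question A: 'male' or 'female'
    height: Optional[float] = None        # question B: approximate height
    weight: Optional[float] = None        # question B: approximate weight
    foot_dominance: Optional[str] = None  # question C: 'left'/'right'/'ambidextrous'
    hand_dominance: Optional[str] = None
    ethnicity: Optional[str] = None       # question D: free text
    medication: Optional[str] = None      # question E: medication affecting gait
    physical_condition: Optional[str] = None
    other_factors: List[str] = field(default_factory=list)

# All optional fields default to None, matching the form's note that no
# subject should feel compelled to answer any question.
record = SubjectInfo(subject_id=42, session_id=7, sex="male", height=1.78)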

Appendix 9.1.5 Subject Consent Form

Subject Consent Form

Automatic Gait Recognition Database Collection

I willingly take part in the database collection for evaluation of automatic gait
recognition. I consent to the use of images taken of me for this database to be
used by researchers in biometric technology for purposes of evaluation of
biometric technologies, and that this imagery might be available over the World
Wide Web (and will therefore be transferred to countries which may not ensure
an adequate level of protection for the rights and freedom of data in relation to
the processing of personal data). I understand that neither my name nor identity
will be associated with these data. I certify that I have read these terms of
consent for these data.

Signature ________ Date ________

Witness ________ Date ________

Index

3D 47,114-131,134-142,151
Age 11,22,155
Alcohol 11
Alzheimer's 11
Analysis
  Covariate analysis 10,46,64-66
  Marker-based 6,10,20,108
Ancillary data 32-34
Animation 15,160
Anova 55,64,105
Area measures 49-51
Aristotle 6
ARMA model 98-100,104-105
Autoregressive model 94-98
Background subtraction 8,23,27,67,153
Bilateral symmetry 8,51
Biomechanics 9-10
Biometrics 1-3
Body parameters 132
Borelli 6
Cadence 10,33,46,107
California (Riverside) 47-48
California (San Diego) see UCSD
Canonical
  Transformation 36-38,51,55
  View 135-142,155
Carnegie Mellon University see CMU
CAS Institute of Automation (Beijing)
  Database 33,78-79,125-131
  Silhouette-based recognition 47,65-106
  Model-based recognition 108,115-131
Chinese Academy of Sciences CASIA see CAS Institute of Automation
Chromakey 23-27
Clothing 11,154
CMU 3,47
  Approach 47
  Database 3,20,61,81,100,104,134
Confusion matrix 43,105
Coupled oscillator model 109-110
Covariates 10-11,21
  Analysis 10,48,64-66,152
Cycle 7,50,52,54,110
DARPA 2
Database 17-34
  Laboratory 18,24-25
  Maryland (UMD) 34,105
  CAS Institute of Automation CASIA 33,78-79,125-131
  CMU 3,20,61,81,100,104,134
  NIST 21-23,47-49,89-90,101-105,134
  Outdoor 17,20,27-29,33-34
  South Florida (USF) 21-23,47-49,89-90,101-105,134
  Southampton 22,33,56,65,116
  UCSD 17-18
Dictionary definition 5
Disorders 11
Distance 1,21-23,61,154
Double support 7
Double float 9
DTW 96-98,102-104
Dynamical model 98-100,104-105
Dynamic Time Warping see DTW
Eigenspace transformation EST 36-37,47,71-82
Ellipsoidal fit 48
Energy image 49
Face recognition 1-2,10,142
Feature selection 46,64
  see Anova
Float 9
Fusion 48,58,102,120,141-147
Gait change
  Age 11,22,152
  Alcohol 11
  Clothes 11,154
  Luggage 11,151
  Mood 11
  Shoes 11
  Time 11,151
Gait challenge 20-22,48,87-88,145-147
Gait disorder 11
Gait model 7,13-15,18-20,36,107-108
  ARMA 98-100,104-105
  Autoregressive 94-98
  Coupled oscillator 109-110
  Dynamical 98-100,104-105
  Hip 39-43
  Human body 13-15,109-131
  Kinematics 114-132
  Running 5,9,46,109-111
  Structural 133
  Walking 5,8,39-42,111-114
Galileo 6
GaTech 3,133
Gender 12,21,47,57
Georgia Institute of Technology see GaTech
Heel strike 6,9,24,32,52
Hidden Markov Model see HMM
Hip model 39-43
HMM 46,89-93,100,108
Human body model 13-15,109-131
Human ID at a Distance 3,21
k nearest neighbor 62,70,78,87,110
Kinematics 49,97-98,114-132
Knee 9
Laboratory database 18,24-25
LDA 36-37,51,55
Least squares 40
Least median squares 67
Literature 5-6
Linear Discriminant Analysis see LDA
Load 11,151
Luggage 11,151
Manual labeling 22
Marker-based systems 6,10,12,20,108
Maryland see UMD
Medical studies 6-9
MIT 3,47
Model 7,13-15,18-20,36,107-108
  ARMA 98-100,104-105
  Autoregressive 94-98
  Coupled oscillator 109-110
  Dynamical 98-100,104-105
  Human body 7,13-15,107-108
  Running 109-111
  Structural 133
  Walking 39-42,111-114
Model-based recognition 109-131
Moments 36
  Zernike 54,61
  Velocity 53-54,61
Mood 11
Murray 6-9
Muybridge 6
National Institute of Standards and Technology see NIST
National Lab. of Pattern Recognition see CAS Institute of Automation
Nearest neighbor 62,70,78,87,110
NIST 3,47-48,83
  Database 21-23,47-49,89-90,101-105,134
Neurology 11
Notre Dame 3
Noise 53,61,64,85,100,123,126,150
Occlusion 14,19,25,34,61,66,123,129,147,153
Optical flow 15,37,53,59,62,137-138
Oscillators 39,109
Outdoor database 17,20,27-29,33-34
Overview of recognition approaches
  Model-based 107-108
  Silhouette-based 46-47
Parkinson's disease 11
PCA 36-37,47,71-82
Pelvis 7,9
Performance analysis 20-22,47,153
Phase 15,37,42-44,47,52,75,113,119
  Stance 6-9
  Swing 6-9
Phase locked loops 49
Phillips, Jonathon 3
Principal Components Analysis see PCA
Procrustes shape analysis 69-70,77-81
Podiatry 15
Point distribution model 48
Potency 47,57,63,107,113
Psychology 12-13
Recognition performance
  Noise 53,61,64,85,100,123,126,150
  Occlusion 14,19,25,34,61,66,123,129,147,153
  Size (distance) 1,21-23,61,154
Recognition approaches
  Model-based 107-108
  Silhouette-based 46-47
Relational statistics 47
Riverside (California) 47-48
Running 5,9,46,109-111
Rutgers University 107,133
Self similarity 47
Shakespeare 5
Shoes 11
Silhouette
  Generation 18,23,27,67,153
  Recognition 46-47
Single support 9
Similarity measure 71,75,77,85,99,103,156
Size (distance) 1,21-23,61,154
Soton see Southampton
South Florida see USF
Southampton 3
  Database 23-33
  Silhouette-based recognition 36-39,49-65
  Model-based recognition 40-45,110-115
Spatiotemporal
  Feature extraction 62,73
  Silhouette 46,83
Stance 6-9
Stride 10,82,108
  Length 47,49,86,136
  Frequency 9,75,113
Support (single- or double limb) 9
Swing 6,8,20,58,110,156
Structural model 133
Surveillance 2,19,34
Symmetry
  Bilateral 8,51
  Recognition 51,58-59
Three dimensional imaging see 3D
Time 11,155
Treadmill 10,30-31,110
Trendelenburg gait 11
UCSD
  Database 17-18,46,58
  Early work 36
USF 3,47-48,83
  Database 21-23,47-49,89-90,101-105,134
  Silhouette-based recognition 47-48
US-VISIT 2
UMD 3,46,107
  Database 34,105
  Silhouette-based recognition 89-106
Velocity moments 53-54,61
View
  canonical 135-142,155
Viewpoint invariance 139-140
da Vinci 6
Walking 5,8,39-42,111-114
Weber 6
Zernike moments 54,61

