
4th Year Project Report

Hand Gesture Recognition Using Computer Vision


Ray Lockton, Balliol College, Oxford University
Supervisor: Dr. A. W. Fitzgibbon, Department of Engineering Science

Figure 1: Successful recognition of a series of gestures


1 Contents
2 Introduction .... 3
  2.1 Report Overview .... 4
  2.2 Project Summary .... 4
  2.3 Existing Systems .... 5
3 Detection .... 6
  3.1 Choice of sensors .... 6
  3.2 Hardware setup .... 7
  3.3 Choice of visual data format .... 8
  3.4 Colour calibration .... 9
  3.5 Method of colour detection .... 15
  3.6 Conclusion .... 15
4 Refinement .... 17
  4.1 Analysis of distortion .... 17
  4.2 Removal of skin pixels detected as wrist band pixels .... 19
  4.3 Removal of skin pixels detected from forearm .... 19
  4.4 Conclusion .... 23
5 Recognition .... 24
  5.1 Choice of recognition strategy .... 24
  5.2 Selection of test gesture set .... 24
  5.3 Analysis of recognition problem .... 25
  5.4 Recognition method 1: Area metric .... 25
  5.5 Recognition method 2: Radial length signature .... 26
  5.6 Recognition method 3: Template matching in the canonical frame .... 34
  5.7 Refinement of the canonical frame .... 40
  5.8 Refinement of the training data .... 41
  5.9 Method of differentiation (in canonical frame) .... 43
  5.10 Refinement of template score method (no quantization) .... 48
  5.11 Conclusion .... 51
6 Application: Gesture driven interface .... 52
  6.1 Setup .... 52
  6.2 Demonstration .... 53
7 Conclusion .... 56
  7.1 Project goals .... 56
  7.2 Further work .... 56
8 References .... 58
9 Appendix .... 59
  9.1 Appendix A - Glossary .... 59
  9.2 Appendix B - Entire gesture set .... 60
  9.3 Appendix C - Algorithms .... 61


2 Introduction

This project will design and build a man-machine interface using a video camera to interpret the American one-handed sign language alphabet and number gestures (plus others for additional keyboard and mouse control). The keyboard and mouse are currently the main interfaces between man and computer. In other areas where 3D information is required, such as computer games, robotics and design, other mechanical devices such as roller-balls, joysticks and data-gloves are used. Humans communicate mainly by vision and sound; therefore, a man-machine interface would be more intuitive if it made greater use of vision and audio recognition. Another advantage is that the user not only can communicate from a distance, but need have no physical contact with the computer. Unlike audio commands, a visual system would also be preferable in noisy environments or in situations where sound would cause a disturbance. The visual system chosen was the recognition of hand gestures. The amount of computation required to process hand gestures is much greater than that of the mechanical devices; however, standard desktop computers are now quick enough to make hand gesture recognition using computer vision a viable proposition. A gesture recognition system could be used in any of the following areas:

Man-machine interface: using hand gestures to control the computer mouse and/or keyboard functions. An example of this, which has been implemented in this project, controls various keyboard and mouse functions using gestures alone.

3D animation: rapid and simple conversion of hand movements into 3D computer space for the purposes of computer animation.

Visualisation: just as objects can be visually examined by rotating them with the hand, so it would be advantageous if virtual 3D objects (displayed on the computer screen) could be manipulated by rotating the hand in space [Bretzner & Lindeberg, 1998].

Computer games: using the hand to interact with computer games would be more natural for many applications.

Control of mechanical systems (such as robotics): using the hand to remotely control a manipulator.


2.1 Report Overview


Chapter 3 onwards describes the various design options, rationale and conclusions. The structure of the write-up and completed project architecture is outlined below. The system will use a single, colour camera mounted above a neutral coloured desk surface next to the computer (see Figure 2). The output of the camera will be displayed on the monitor. The user will be required to wear a white wrist band and will interact with the system by gesturing in the view of the camera. Shape and position information about the hand will be gathered using detection of skin and wrist band colour. The detection will be illustrated by a colour change on the display. The design of the detection process will be covered in Chapter 3. The shape information will then be refined using spatial knowledge of the hand and wrist band. This will be discussed in Chapter 4.

Figure 2 Picture of system in use (note wrist band and neutral coloured background)

The refined shape information will then be compared with a set of predefined training data (in the form of templates) to recognise which gesture is being signed. In particular, the contribution of this project is a novel way of speeding up the comparison process. A label corresponding to the recognised gesture will be displayed on the monitor screen. Figure 1 (front cover) shows the successful recognition of a series of gestures. The design process for the recognition will be discussed in Chapter 5. Chapter 6 describes an application of the system: a gesture-driven Windows interface. Finally, Chapter 7 describes how the project has achieved the goals set and further work that could be carried out.

2.2 Project Summary


In order to detect hand gestures, data about the hand will have to be collected. A decision has to be made as to the nature and source of the data. Two possible technologies to provide this information are a glove with sensors attached that measure the position of the finger joints, or an optical method.


An optical method has been chosen, since this is more practical (many modern computers come with a camera attached), cost-effective and has no moving parts, so is less likely to be damaged through use. The first step in any recognition system is collection of relevant data. In this case the raw image information will have to be processed to differentiate the skin of the hand (and various markers) from the background. Chapter 3 deals with this step. Once the data has been collected it is then possible to use prior information about the hand (for example, the fingers are always separated from the wrist by the palm) to refine the data and remove as much noise as possible. This step is important because as the number of gestures to be distinguished increases, the data collected has to be more and more accurate and noise-free in order to permit recognition. Chapter 4 deals with this step. The next step will be to take the refined data and determine what gesture it represents. Any recognition system will have to simplify the data to allow calculation in a reasonable amount of time (the target recognition rate for a set of 36 gestures is 25 frames per second). Obvious ways to simplify the data include translating, rotating and scaling the hand so that it is always presented with the same position, orientation and effective hand-camera distance to the recognition system. Chapter 5 deals with this step.

2.3 Existing Systems


A simplification used in this project, which was not found in any recognition methods researched, is the use of a wrist band to remove several degrees of freedom. This enabled three new recognition methods to be devised. The recognition frame rate achieved is comparable to most of the systems in existence (after allowance for processor speed) but the number of different gestures recognised and the recognition accuracy are amongst the best found. Figure 3 shows several of the existing gesture recognition systems along with recognition statistics and method.
Paper; primary method of recognition; number of gestures recognised; background to gesture images; additional markers required (such as wrist band); number of training images; accuracy; frame rate:

- [Bauer & Hienz, 2000]: Hidden Markov Models; 97 gestures; general background; multicoloured gloves; 7 hours of signing; 91.7% accuracy.
- [Starner, Weaver & Pentland, 1998]: Hidden Markov Models; 40 gestures; general background; no markers; 400 training sentences; 97.6% accuracy; 10 fps.
- [Bowden & Sarhadi, 2000]: linear approximation to non-linear point distribution models; 26 gestures; blue-screen background; no markers; 7441 training images.
- [Davis & Shah, 1994]: finite state machine / model matching; static background; markers on a glove; 10 sequences of 200 frames each; 98% accuracy; 10 fps.
- This project: fast template matching; 46 gestures; static background; wrist band; 100 training examples per gesture; 99.1% accuracy; 15 fps.

Figure 3 Table showing existing gesture recognition systems found during research.


3 Detection

In order to recognise hand gestures it is first necessary to collect information about the hand from raw data provided by any sensors used. This section deals with the selection of suitable sensors and compares various methods of returning only the data that pertains to the hand.

3.1 Choice of sensors


Since the hand is by nature a three-dimensional object, the first optical data collection method considered was a stereographic multiple camera system. Alternatively, using prior information about the anatomy of the hand it would be possible to garner the same gesture information using either a single camera or multiple two-dimensional views provided by several cameras. These three options are considered below.

Stereographic system: The stereographic system would provide pixellated depth information for any point in the fields of view of the cameras. This would provide a great deal of information about the hand. Features that would otherwise be hard to distinguish using a 2D system, such as a finger against a background of skin, would be differentiable since the finger would be closer to the camera than the background. However, the 3D data would require a great deal of processor time to calculate and reliable real-time stereo algorithms are not easily obtained or implemented.

Multiple two-dimensional view system: This system would provide less information than the stereographic system and, if the number of cameras used was not great, would also use less processor time. With this system two or more 2D views of the same hand, provided by separate cameras, could be combined after gesture recognition. Although each view would suffer from similar problems to that of the finger example above, the combined views of enough cameras would reveal sufficient data to approximate any gesture.

Single camera system: This system would provide considerably less information about the hand. Some features (such as the finger against a background of skin in the example above) would be very hard to distinguish since no depth information would be recoverable. Essentially only silhouette information (see Glossary) could be accurately extracted. The silhouette data would be relatively noise free (given a background sufficiently distinguishable from the hand) and would require considerably less processor time to compute than either multiple camera system.

It is possible to detect a large subset of gestures using silhouette information alone, and the single camera system is less noisy, less expensive and less processor hungry. Although the system exhibits more ambiguity than either of the other systems, this disadvantage is more than outweighed by the advantages mentioned above. Therefore, it was decided to use the single camera system.


3.2 Hardware setup


The output of the camera system chosen in Section 3.1 comprises a 2D array of RGB pixels provided at regular time intervals. In order to detect silhouette information it will be necessary to differentiate skin from background pixels. It is also likely that other markers will be needed to provide extra information about the hand (such as hand yaw; see Glossary) and the marker pixels will also have to be differentiated from the background (and skin pixels). To make this process as achievable as possible it is essential that the hardware setup is chosen carefully. The various options are discussed below.

Lighting: The task of differentiating the skin pixels from those of the background and markers is made considerably easier by a careful choice of lighting. If the lighting is constant across the view of the camera then the effects of self-shadowing can be reduced to a minimum (see Figure 4). The intensity should also be set to provide sufficient light for the CCD in the camera.
A B

Figure 4 The effect of self shadowing (A) and cast shadowing (B). The top three images were lit by a single light source situated off to the left. A self-shadowing effect can be seen on all three, especially marked on the right image where the hand is angled away from the source. The bottom three images are more uniformly lit, with little self-shadowing. Cast shadows do not affect the skin for any of the images and therefore should not degrade detection. Note how an increase of illumination in the bottom three images results in a greater contrast between skin and background.

However, since this system is intended to be used by the consumer it would be a disadvantage if special lighting equipment was required. It was decided to attempt to extract the hand and marker information using standard room lighting (in this case a 100 watt bulb and shade mounted on the ceiling). This would permit the system to be used in a non-specialist environment.

Camera orientation: It is important to choose carefully the direction in which the camera points to permit an easy choice of background. The two realistic options are to point the camera towards a wall or towards the floor (or desktop). However, since the lighting was a single overhead bulb, light intensity would be highest and shadowing effects smallest if the camera was pointed downwards.


Background: In order to maximise differentiation it is important that the colour of the background differs as much as possible from that of the skin. The floor colour in the project room was a dull brown. It was decided that this colour would suffice initially.

3.3 Choice of visual data format


An important trade-off when implementing a computer vision system is to select whether to differentiate objects using colour or black and white and, if colour, to decide what colour space to use (red, green, blue or hue, saturation, luminosity). For the purposes of this project, the detection of skin and marker pixels is required, so the colour space chosen should best facilitate this. Colour or black and white: The camera and video card available permitted the detection of colour information. Although using intensity alone (black and white) reduces the amount of data to analyse and therefore decreases processor load it also makes differentiating skin and markers from the background much harder (since black and white data exhibits less variation than colour data). Therefore it was decided to use colour differentiation. RGB or HSL: The raw data provided by the video card was in the RGB (red, green, blue) format. However, since the detection system relies on changes in colour (or hue), it could be an advantage to use HSL (hue, saturation, luminosity- see Glossary) to permit the separation of the hue from luminosity (light level). To test this the maximum and minimum HSL pixel colour values of a small test area of skin were manually calculated. These HSL ranges were then used to detect skin pixels in a subsequent frame (detection was indicated by a change of pixel colour to white). The test was carried out three times using either hue, saturation or luminosity colour ranges to detect the skin pixels. Next, histograms were drawn of the number of skin pixels of each value of hue, saturation and luminosity within the test area. Histograms were also drawn for an equal sized area of non-skin pixels. The results are shown in Figure 5:

Figure 5 Results of detection using individual ranges of hue (left), saturation (centre) and luminosity (right) as well as histograms showing the number of pixels detected for each value of skin (top) and background (bottom). Images and graphs show that hue is a poor variable to use to detect skin as the range of values for skin hue and background hue demonstrate significant overlap (although this may have been due to the choice of hue of the background). Saturation is slightly better and luminosity is the best variable. However, a combination of saturation and luminosity would provide the best skin detection in this case.


The histogram test was repeated using the RGB colour space. The results are shown in Figure 6.

Figure 6 Histograms showing the number of pixels detected for each value of red (left), green (centre) and blue (right) colour components for skin pixels (top) and background pixels (bottom). The ranges for each of the colour components are well separated. This, combined with the fact that using the RGB colour space is considerably quicker than using HSL suggests that RGB is the best colour space to use.

Figure 7 shows recognition using red, green and blue colour ranges in combination:

Figure 7 Skin detection using red, green and blue colour ranges in combination. Detection is adequate and the frame rate is over twice that of the HSL option.

Hue, when compared with saturation and luminosity, is surprisingly bad at skin differentiation (with the chosen background) and thus HSL shows no significant advantage over RGB. Moreover, since conversion of the colour data from RGB to HSL took considerable processor time it was decided to use RGB.
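As an illustration of the histogram comparison described above, the following is a minimal sketch in Python with NumPy (not the language used for the project); the function names, bin count and patch format are illustrative assumptions only.

```python
import colorsys
import numpy as np

def channel_histograms(patch_rgb, bins=64):
    """Per-channel histograms for an RGB patch (H x W x 3, values 0-255).

    Histograms are produced for red, green and blue and for hue, saturation
    and luminosity, so the separability of skin and background can be compared
    in both colour spaces (cf. Figures 5 and 6).
    """
    pixels = patch_rgb.reshape(-1, 3).astype(float) / 255.0
    # colorsys returns (hue, lightness, saturation) for each pixel
    hls = np.array([colorsys.rgb_to_hls(r, g, b) for r, g, b in pixels])
    channels = {"red": pixels[:, 0], "green": pixels[:, 1], "blue": pixels[:, 2],
                "hue": hls[:, 0], "luminosity": hls[:, 1], "saturation": hls[:, 2]}
    return {name: np.histogram(values, bins=bins, range=(0.0, 1.0))[0]
            for name, values in channels.items()}

def overlap(hist_skin, hist_background):
    """Fraction of the skin histogram that overlaps the background histogram."""
    return np.minimum(hist_skin, hist_background).sum() / max(hist_skin.sum(), 1)
```

A large overlap for a channel (as seen for hue with the chosen background) indicates that the channel separates skin from background poorly.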

3.4 Colour calibration


It is likely that the detection system will be subjected to varying lighting conditions (for example, due to time of day or position of camera relative to light sources). Therefore it is likely that an occasional recalibration will have to be performed. The various calibration techniques are discussed below:


3.4.1 Initial Calibration

The method of skin and marker detection selected above (Section 3.3) involves checking the RGB values of the pixels to see if they fall within red, green and blue ranges (these ranges are different for skin and marker). The choice of how to calculate these ranges is an important one. Not only does the calibration have to result in the detection of all hand and marker pixels at varying light levels, but also the detection of erroneous background pixels has to be reduced to a minimum. In order to automatically calculate the colour ranges, an area of the screen was demarcated for calibration. It was then a simple matter to position the hand or marker (in this case a wrist band) within this area and then scan it to find the maximum and minimum RGB values of the ranges (see Figure 8). A formal description of the initial calibration method is as follows: The image is a 2D array of pixels:

$$I(x, y) = \begin{pmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{pmatrix}$$

The calibration area is a set of 2D points:

$$J = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \quad \text{where } \mathbf{x}_i = (x, y)$$

The colour ranges can then be defined for this area:

$$r_{\max} = \max_{\mathbf{x} \in J} r(\mathbf{x}), \quad r_{\min} = \min_{\mathbf{x} \in J} r(\mathbf{x})$$

$$g_{\max} = \max_{\mathbf{x} \in J} g(\mathbf{x}), \quad g_{\min} = \min_{\mathbf{x} \in J} g(\mathbf{x})$$

$$b_{\max} = \max_{\mathbf{x} \in J} b(\mathbf{x}), \quad b_{\min} = \min_{\mathbf{x} \in J} b(\mathbf{x})$$

A formal description of skin detection is then as follows. The skin pixels are those pixels $(r, g, b)$ such that:

$$(r_{\min} \le r \le r_{\max}) \wedge (g_{\min} \le g \le g_{\max}) \wedge (b_{\min} \le b \le b_{\max})$$

Call this predicate $S(r, g, b)$. The set of all skin pixel locations is then:

$$L = \{\mathbf{x} \mid S(r(\mathbf{x}), g(\mathbf{x}), b(\mathbf{x})) = 1\}$$

Using this method skin pixels were detected at a rate of 15 fps on a 600 MHz laptop (see Figure 9).
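The calibration and detection just described can be summarised by the following sketch, written in Python with NumPy for brevity rather than in the project's own code; the function names and the box format are illustrative assumptions.

```python
import numpy as np

def calibrate_ranges(frame, box):
    """Scan a rectangular calibration area and return (min, max) RGB ranges.

    frame: H x W x 3 uint8 image; box: (x0, y0, x1, y1) calibration area.
    """
    x0, y0, x1, y1 = box
    patch = frame[y0:y1, x0:x1].reshape(-1, 3)
    return patch.min(axis=0), patch.max(axis=0)

def detect(frame, lo, hi):
    """Boolean mask of pixels whose R, G and B values all lie within the ranges."""
    return np.all((frame >= lo) & (frame <= hi), axis=2)

# Typical use: calibrate skin and wrist band ranges once, then detect every frame.
# skin_lo, skin_hi = calibrate_ranges(first_frame, skin_box)
# band_lo, band_hi = calibrate_ranges(first_frame, band_box)
# skin_mask = detect(frame, skin_lo, skin_hi)
# band_mask = detect(frame, band_lo, band_hi) & ~skin_mask  # skin given priority (Section 4.2)
```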



Figure 8 Image A shows the colour calibration areas for wrist band (green) and skin (orange). Calibration is performed by positioning the wrist band under the green calibration area and the hand under the orange calibration area (image B shows a partially positioned hand). The calibration algorithm then reads the colour values of both areas and calculates the ranges by repeatedly updating maximum and minimum RGB values for each pixel. Images C and D show the pixel colour values for the skin and wrist band areas. The colour ranges calculated for each colour component are indicated by double headed arrows.

Figure 9 After calibration of skin and wrist band pixels, colour ranges are used to detect all subsequent frames. A detected frame is shown here with skin pixel detection indicated in white and wrist band pixel detection indicated in red.


It was decided to use the calibration routine, discussed in Section 3.4.1, to find the initial values. However, the ranges returned by this method were less than perfect for the reasons below. First, the calibration was carried out using a single frame; hence pixel colour variations over time, due to camera noise, would not be accounted for. Second, to ensure that the sampled area only contained skin pixels, by necessity it had to be smaller than the hand itself. The extremities of the hand (where, due to self-shadowing, the colour variation is the greatest) were therefore not included in the calibration.

3.4.2 Improving calibration

In order to improve the calibration, four further methods were considered: 1. Multiple frame calibration: If the calibration was repeated over several frames and the overall maximum and minimum colour values calculated, then the variation over time due to camera noise would be included in those ranges and its effect thus negated. The method would require the hand to be held stationary during the calibration process. The routine was thus modified to perform the calibration over 10 frames instead of one. Figure 10 shows the results.

Figure 10 Results of multiple frame calibration. Stage A is the result of the initial calibration. Stage B is the result of calibration over 10 frames. There is no discernible difference in the skin fit.

Calibration of several frames does little to improve skin detection. Therefore this method was not retained. 2. Region-growing: A second method would be to query pixels close to the detected skin pixels found using the initial method. If the colour components of these fell just outside the calibration ranges then the ranges could be increased to include them. This process could then be repeated a number of times until the skin detection was adequate. Figure 11 shows how the process works (simplified); a code sketch of the process is given at the end of this list.


Figure 11 A simplified illustration of how region-growing works. Image A shows the initial captured hand. Image B shows the result of initial calibration; detected pixels are shown in white. For simplicity's sake the pixels that fall within the initial colour ranges have been drawn as a square. In practice, all pixels within the ranges will have been identified (these pixels would be scattered throughout the hand area). Next, any pixels in the neighbourhood of those already detected are scanned (the area within the black box of image C). If their colour values lie just outside the current colour ranges, the ranges are increased to include them. The result is shown in image D (again simplified). Although the pixels between the index and middle fingers fell within the boundary, their values did not fall close to the ranges, so they were ignored. The process is then repeated (images E and F) until, in theory, the ranges are such that all skin pixels are detected.

A program was written to repeat the region-growing process a number of times on a single frame. The results are shown in Figure 12.


Figure 12 Results of region-growing. Stage A is the result of the initial calibration. Stage B is the result of 50 repetitions of the region-growing algorithm (the fit is better still but a single erroneous pixel, circled and arrowed, has been detected in the background). Stage C is the result of 100 repetitions. The background noise is growing even though the shadowed areas of the hand are still not detected adequately. Finally, by Stage D with 200 repetitions there is a considerable amount of background noise.

The results show that performing the region-growing process a small number of times results in slightly better detection but the process becomes noisy if the number of repetitions is too high (>100). It was decided to keep this method but restrict its growth to a maximum of 50 repetitions.

3. Background subtraction: With this method an image of the background is stored. This information is then subtracted from any subsequent frames. In theory this would negate the background, leaving only the hand and marker information and making the detection process much easier. However, although performing background subtraction with a black and white system worked well, doing the same with colour proved much more difficult as a simple subtraction of the colour components made the remaining hand and marker colour information uneven over the frame. This method also made the system considerably slower and was adversely influenced by the automatic aperture adjust of the camera. As the current system worked adequately it was decided not to proceed with this calibration step.

4. Removal of persistent aberrant pixels: Although it is a valid design choice to select a background that differs greatly in hue both from the skin and from the wrist band colour, it is possible that imperfections in the background colour or camera function could result in aberrant pixels falling within the calibrated ranges and therefore being repeatedly misinterpreted as skin or wrist band pixels. It would be possible to scan the image when the hand is not in the frame and store any (aberrant) pixels detected. Simply ignoring these pixels would affect the recognition depending on where the hand was in the frame. It would therefore be necessary to choose the correct value for each aberrant pixel based on the values of those surrounding it (if all surrounding pixels are skin then detect as skin, else background). However, neither the camera nor the background exhibited such pixels when a hand was in frame so it was decided not to proceed in programming this calibration step.
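The following is a sketch of the region-growing refinement described in method 2 above, again in Python with NumPy. The iteration limit of 50 matches the limit adopted in the report; the one-pixel neighbourhood and the 'slack' tolerance (how far outside the current ranges a neighbouring pixel may lie and still be accepted) are assumed values, since the report does not specify them.

```python
import numpy as np

def grow_ranges(frame, lo, hi, iterations=50, slack=10):
    """Widen the calibrated RGB ranges using pixels adjacent to detected skin."""
    lo, hi = lo.astype(int), hi.astype(int)
    for _ in range(iterations):
        mask = np.all((frame >= lo) & (frame <= hi), axis=2)
        # One-pixel dilation of the detected mask gives the neighbourhood to query.
        neigh = mask.copy()
        neigh[1:, :] |= mask[:-1, :]
        neigh[:-1, :] |= mask[1:, :]
        neigh[:, 1:] |= mask[:, :-1]
        neigh[:, :-1] |= mask[:, 1:]
        border = neigh & ~mask
        if not border.any():
            break
        candidates = frame[border].astype(int)
        # Accept only neighbours whose channels lie just outside the current ranges.
        close = np.all((candidates >= lo - slack) & (candidates <= hi + slack), axis=1)
        if not close.any():
            break
        accepted = candidates[close]
        lo = np.minimum(lo, accepted.min(axis=0))
        hi = np.maximum(hi, accepted.max(axis=0))
    return lo, hi
```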

3.5 Method of colour detection


Until now a simple RGB bounding box has been used in the classification of the skin and marker pixels. However, if a plot is drawn of the detected skin pixels (see Figure 13) it can be seen that they lie not within a cuboid (the principle used by the current detection system) but within an ellipsoid.

Figure 13 Plots of different combinations of skin pixel colour values (green) and background pixel colour values (red). The skin pixels are well separated from the background pixels in all three colour components but lie within an ellipsoid as opposed to a cuboid. The values are well enough separated, however, for a cuboid colour range system to work adequately.

In order to improve accuracy it would be necessary to check if the colour components of the skin and wrist band pixels fell within this ellipsoid. However, this was considered computationally intensive and given that the current cuboid system works adequately it was not implemented.
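One way the ellipsoid test could have been implemented is as a Mahalanobis distance threshold on the pixel colour; the sketch below illustrates that idea and is not something implemented in the project (which retained the cuboid test). The threshold value is an arbitrary assumption.

```python
import numpy as np

def fit_colour_ellipsoid(calibration_pixels):
    """Fit an ellipsoid to calibration pixels (N x 3 RGB) via mean and covariance."""
    pixels = calibration_pixels.reshape(-1, 3).astype(float)
    mean = pixels.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(pixels, rowvar=False))
    return mean, cov_inv

def inside_ellipsoid(frame, mean, cov_inv, threshold=3.0):
    """Mark pixels whose Mahalanobis distance from the calibration mean is small."""
    diff = frame.reshape(-1, 3).astype(float) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return (d2 <= threshold ** 2).reshape(frame.shape[:2])
```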

3.6 Conclusion
This chapter has described the choice and setup of hardware and the methods of calibration and detection in order to detect as many of the skin and marker pixels within the frame as possible. The hardware chosen was a single colour camera pointing down towards a desk (or floor) surface of a constant colour with no special lighting. Calibration is performed by scanning the RGB colour values of pixels within a preset area of the frame and improved using a limited amount of region-growing. Detection is performed by comparing each RGB pixel value with the ranges found during calibration. Figure 12 Stage B shows the successful detection of the majority of the hand area.


4 Refinement

Using the methods discussed in the previous chapter it is possible to detect the majority of the skin and band pixels in the frame whilst detecting very few aberrant pixels in the background. However, some complications were noticed which could reduce the accuracy of recognition at a later stage. These are:

1. Image distortion: If the camera's visual axis is not perpendicular to the floor plane, a given gesture would appear different depending on the position and yaw of the hand (a given length in one area of the frame would appear longer or shorter in another area of the frame). This is termed projective distortion. Also, if the camera lens is of poor quality then the straight sides of a true square in the frame would appear curved. This is termed radial distortion.

2. Skin pixels detected as wrist band pixels: If the wrist band colour ranges are increased sufficiently for all pixels to be detected then areas of skin that are more reflective (such as the knuckles) start to be incorrectly identified as band pixels. This is disadvantageous as it leads to inaccurate recognition information.

3. Skin pixels of the arm being detected: Any skin pixels above the wrist band will also be detected as skin. It would be preferable if these pixels could be ignored, as they play no part in the gesture. Wearing a long sleeve top helps solve the problem but forearm pixels are still detected between the wrist band and the sleeve (which has a tendency to move up and down the arm as different gestures are made, leading to variations in the amount of skin detected).

It was decided to reduce the effects of these complications as much as possible.

4.1 Analysis of distortion


Tests were devised to check for the presence of both radial and projective distortion. These are discussed below.

4.1.1 Radial distortion

In order to assess whether radial distortion was present, a rectangular piece of card was placed in the frame. It was then a simple matter to check the edges of the frame against the edges of the card (see Figure 14).


Figure 14 A4 card placed in the frame. If the camera had significant radial distortion, the straight edges of the paper would appear as curves. This is not the case so radial distortion is not significant.

The straight sides of the paper are imaged not as curves but as straight lines; therefore radial distortion is not present.

4.1.2 Projective distortion

To check for projective distortion a strip of paper was placed in the frame at various positions. By measuring its length (in pixels) at each location, any vertical or horizontal distortion could be found (see Figure 15).

Measured strip lengths: A, 101 pixels; B, 102 pixels; C, 99 pixels; D, 98 pixels; E, 97 pixels; F, 95 pixels.

Figure 15 Paper strip placed in the frame at different positions (with superimposed lines to aid measurement). From the measured strip lengths it can be seen that there is only a small amount of projective distortion present. Overall, there is only 6% deviation in apparent strip length anywhere in the frame, therefore it was considered unnecessary to correct for projective distortion.


There is slight image distortion present but its effect is limited to only 6% and therefore it was not considered serious enough to attempt to remove (removal would involve transforming a distorted rectangle to a regular one, which would be processor intensive).

4.2 Removal of skin pixels detected as wrist band pixels


Although some skin pixels were incorrectly detected as wrist band pixels when the wrist band colour ranges were increased, no wrist band pixels were incorrectly detected as skin. It was a simple matter, therefore, to permit pixels to be detected as wrist band only if they had not previously been detected as skin. This reduced the number of aberrant wrist band pixels considerably.

4.3 Removal of skin pixels detected from forearm


As there is no difference between the colour ranges of a skin pixel of the hand and a skin pixel of the forearm, position information will have to be used to remove forearm skin pixels.

4.3.1 Centroid calculation

By averaging the position of the pixels detected it is possible to calculate the centroid of both the hand and the wrist band. A formal description of centroid calculation is as follows: From before, the set of all skin pixel locations was defined as:

$$L = \{\mathbf{x} \mid S(r(\mathbf{x}), g(\mathbf{x}), b(\mathbf{x})) = 1\}$$

Denote the number of elements of $L$ by $|L|$. This gives the hand centroid as:

$$\mathbf{c}_{hand} = \frac{1}{|L|} \sum_{\mathbf{x} \in L} \mathbf{x}$$

The wrist band centroid is calculated in the same way:

$$\mathbf{c}_{band} = \frac{1}{|L_{band}|} \sum_{\mathbf{x} \in L_{band}} \mathbf{x}$$

Figure 16 shows an original image and the image with the detected skin pixels, wrist band pixels and centroids visible.


Figure 16 Original image before skin and wrist band pixel detection (A) and after (B). Detected skin pixels are shown in blue and wrist band pixels in red. Centroids are displayed as black dots.

Notice how even with priority given to skin pixels over wrist band pixels, a number of wrist band pixels are erroneously detected near the knuckles (where skin has not been detected due to the higher reflectivity of those areas).
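A direct transcription of the centroid formulae into Python with NumPy might look as follows; the boolean mask representation is an assumption.

```python
import numpy as np

def centroid(mask):
    """Mean pixel position (x, y) of the True entries of a boolean mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

# c_hand = centroid(skin_mask)    # skin_mask and band_mask are the boolean
# c_band = centroid(band_mask)    # detection images from Chapter 3
```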

4.3.2 Localising the wrist band

It was considered that if the distance and angle of the edges of the wrist band relative to the hand centroid could be found, the forearm skin pixels could be removed by comparing their distances and angles with them. The edges of the wrist band can be found by scanning lines parallel to the line joining the two centroids. Define the vector joining the two centroids as:

$$\mathbf{c}_{dif} = (x_{dif}, y_{dif}) = \mathbf{c}_{hand} - \mathbf{c}_{band}$$

The yaw angle of the hand is therefore:

$$\theta_{hand} = \tan^{-1}\left(\frac{y_{dif}}{x_{dif}}\right)$$

The edges of the band are then found as follows. For each point $\mathbf{p}_1(s_1)$ along the line

$$\mathbf{p}_1(s_1) = \mathbf{c}_{band} + s_1 \begin{pmatrix} \cos(\theta_{hand} + \frac{\pi}{2}) \\ \sin(\theta_{hand} + \frac{\pi}{2}) \end{pmatrix}, \quad -50 \le s_1 \le 50$$

count the number of wrist band pixels $n(s_1)$ along the line:

$$\mathbf{p}_2(s_1, s_2) = \mathbf{p}_1 + s_2 \begin{pmatrix} \cos\theta_{hand} \\ \sin\theta_{hand} \end{pmatrix}, \quad -50 \le s_2 \le 50$$

The two points defining the edges of the band, $\mathbf{b}_{left} = (x_{left}, y_{left})$ and $\mathbf{b}_{right} = (x_{right}, y_{right})$, are then equal to $\mathbf{p}_1(s_1)$ where $n(s_1)$ falls below a certain threshold.


Figure 17 shows a number of the lines scanned (reduced for clarity) along with a graph showing the thresholds used in the program to detect the band edges.
[Graph: number of wrist band pixels detected against distance along the line perpendicular to the line joining the centroids (pixels).]

Figure 17 The left image shows the lines scanned to detect the edges of the wrist band. The number of wrist band pixels detected along each line is counted. The edges have been detected when the number falls below a certain threshold. The graph on the right shows the number of pixels detected along each of the lines with the detected edges marked in red.

Using these thresholds it is then possible to utilize only those wrist band pixels that are within the band's width. This removes any remaining erroneous wrist band pixels detected near the knuckles. The radius of the band is:

$$r_{band} = \max\left(\left\|\mathbf{b}_{left} - \mathbf{c}_{band}\right\|, \left\|\mathbf{b}_{right} - \mathbf{c}_{band}\right\|\right)$$

Any band pixels further than $r_{band}$ from $\mathbf{c}_{band}$ can then be disqualified. The wrist band centroid can then be recalculated. Figure 18 shows the wrist band pixels that have passed this radius test and the recalculated centroid (passed pixels shown in yellow, radius indicated by black circle).

Figure 18 Radius test applied to wrist band pixels. Any pixels that are further from the wrist band centroid than the band radius (black circle) previously calculated can be ignored (pixels that pass shown in yellow, those that fail in red)
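A sketch of the band-edge search described above is given below (Python with NumPy). The scan range of ±50 pixels follows the formulae; the pixel-count threshold and the rounding details are assumptions.

```python
import numpy as np

def band_edges(band_mask, c_hand, c_band, half_width=50, threshold=5):
    """Locate the wrist band edges by scanning lines parallel to the centroid axis."""
    h, w = band_mask.shape
    diff = c_hand - c_band
    theta = np.arctan2(diff[1], diff[0])                         # hand yaw angle
    along = np.array([np.cos(theta), np.sin(theta)])             # parallel to the centroid axis
    across = np.array([np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2)])

    counts = []
    for s1 in range(-half_width, half_width + 1):
        p1 = c_band + s1 * across
        n = 0
        for s2 in range(-half_width, half_width + 1):
            x, y = np.rint(p1 + s2 * along).astype(int)
            if 0 <= x < w and 0 <= y < h and band_mask[y, x]:
                n += 1
        counts.append(n)

    # Edges are taken where the count first and last stays above the threshold.
    occupied = np.nonzero(np.array(counts) >= threshold)[0]
    s_left = occupied[0] - half_width
    s_right = occupied[-1] - half_width
    return c_band + s_left * across, c_band + s_right * across
```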


4.3.3 Removing skin pixels of the forearm

Finally, using the angle and distance from the hand centroid to the wrist band edges it is possible to differentiate the skin pixels of the forearm and remove them. The minimum distance between the hand centroid and the edges of the band is:

$$r_{hand} = \min\left(\left\|\mathbf{b}_{left} - \mathbf{c}_{hand}\right\|, \left\|\mathbf{b}_{right} - \mathbf{c}_{hand}\right\|\right)$$

The maximum and minimum angles of the band ($\theta_{band\,max}$ and $\theta_{band\,min}$) relative to $\mathbf{c}_{hand}$ are:

$$\theta_{band\,max} = \max\left(\tan^{-1}\frac{y_{left}}{x_{left}}, \tan^{-1}\frac{y_{right}}{x_{right}}\right)$$

$$\theta_{band\,min} = \min\left(\tan^{-1}\frac{y_{left}}{x_{left}}, \tan^{-1}\frac{y_{right}}{x_{right}}\right)$$

Any hand pixels further than $r_{hand}$ from $\mathbf{c}_{hand}$ and lying at an angle between $\theta_{band\,min}$ and $\theta_{band\,max}$ relative to $\mathbf{c}_{hand}$ can then be disqualified (a case statement deals with the situation that occurs when the band angles lie either side of 0 radians). Figure 19 shows the angle and distance criterion being applied, with skin pixels that fail highlighted in green.

Figure 19 Distance and angle criterion applied to skin pixels. The two straight black lines show the angle in which the radius criterion is applied. The curved black line shows the radius beyond which skin pixels are disqualified. In this example failed skin pixels are shown in green.

Finally the hand centroid can be recalculated. This is shown in Figure 20.


Figure 20 Image showing recalculated hand and wrist band centroids. Invalid wrist band pixels have been ignored (passed pixels shown in yellow, failed pixels in red) and skin pixels up the forearm have also been ignored.
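The distance-and-angle test of this section could be implemented roughly as follows (Python with NumPy); the handling of angles that wrap around 0 radians is simplified here, whereas the project used a case statement.

```python
import numpy as np

def remove_forearm(skin_mask, c_hand, b_left, b_right):
    """Discard skin pixels beyond the band edges in the wrist direction."""
    r_hand = min(np.linalg.norm(b_left - c_hand), np.linalg.norm(b_right - c_hand))
    band_angles = [np.arctan2(b[1] - c_hand[1], b[0] - c_hand[0]) for b in (b_left, b_right)]
    a_min, a_max = min(band_angles), max(band_angles)

    ys, xs = np.nonzero(skin_mask)
    dist = np.hypot(xs - c_hand[0], ys - c_hand[1])
    ang = np.arctan2(ys - c_hand[1], xs - c_hand[0])
    forearm = (dist > r_hand) & (ang >= a_min) & (ang <= a_max)

    cleaned = skin_mask.copy()
    cleaned[ys[forearm], xs[forearm]] = False   # reject forearm pixels
    return cleaned
```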

4.4 Conclusion
This chapter has described several techniques to improve the hand detection. A combination of pixel position and priority-based information was used to remove any erroneously detected pixels. Figure 21 shows that the process was very successful.

Figure 21 Detected pixels before and after refinement. The detected wrist band pixels are shown in red. Notice how after refinement the erroneous wrist band pixels detected on the knuckles have been ignored, with a corresponding shift in wrist band centroid. The detected skin pixels are shown in blue. All of the hand pixels are detected except those in areas of higher reflectivity (near the knuckles), which naturally show up as white. Notice how after refinement all skin pixels detected up the forearm have been ignored, with a corresponding shift in hand centroid.


5 Recognition

In the previous two chapters, methods were devised to obtain accurate information about the position of skin and wrist band pixels. This information can then be used to calculate the hand and wrist band centroids with subsequent data pertaining to hand rotation and scaling. The next step is to use all of this information to recognise the gesture within the frame.

5.1 Choice of recognition strategy


Two methods present themselves by which a given gesture could be recognised from two dimensional silhouette information: Direct method based on geometry: Knowing that the hand is made up of bones of fixed width connected by joints which can only flex in certain directions and by limited angles it would be possible to calculate the silhouettes for a large number of hand gestures. Thus, it would be possible to take the silhouette information provided by the detection method and find the most likely gesture that corresponds to it by direct comparison. The advantages of this method are that it would require very little training and would be easy to extend to any number of gestures as required. However, the model for calculating the silhouette for any given gesture would be hard to construct and in order to attain a high degree of accuracy it would be necessary to model the effect of all light sources in the room on the shadows cast on the hand by itself. Learning method: With this method the gesture set to be recognised would be taught to the system beforehand. Any given gesture could then be compared with the stored gestures and a match score calculated. The highest scoring gesture could then be displayed if its score was greater than some match quality threshold. The advantage of this system is that no prior information is required about the lighting conditions or the geometry of the hand for the system to work, as this information would be encoded into the system during training. The system would be faster than the above method if the gesture set was kept small. The disadvantage with this system is that each gesture would need to be trained at least once and for any degree of accuracy, several times. The gesture set is also likely to be user specific. It was decided to proceed with the learning method for reasons of computation speed and ease of implementation.

5.2 Selection of test gesture set


In order to test any comparison metric devised it is important to have a constant set of easily reproducible gestures. It is also important to ensure that the gestures are not chosen to be as dissimilar as possible (so that the system is tested robustly). Sign language gestures are an excellent test, but sign language normally involves both hands, with one hand regularly occluding the other. This is outside the project remit. However, there is an American one-handed sign language alphabet which, with slight modification, can be used (see Appendix B).

5.3 Analysis of recognition problem


In order for any comparison method to work, it is essential to remove as many degrees of freedom as possible in order to make the comparison realistic. For instance, if a given gesture has to be taught for every position in the frame, every hand yaw angle and for various distances from the camera, then the comparison task becomes impossibly large. However, the inclusion of a wrist band in detection helps simplify the process by removing these degrees of freedom. The angle between the centroids of the wrist band and the hand designates the yaw of the hand, so this degree of freedom can be removed. The distance between the centroids allows the hand to be scaled to a constant size so the hand-to-camera distance degree of freedom can be removed. Finally, since the centre of the hand is indicated by the hand centroid the hand position degree of freedom can also be removed by centring detection about this point. The only degree of freedom that cannot be removed is the roll angle of the hand (see Glossary). However it could be argued that if the roll angle is changed (wrist is rotated) then this represents a different gesture and should be detected as such. Three recognition methods will be considered within this chapter. The first, developed mainly to design the comparison architecture, is based on gesture skin area. The second uses the amount of skin under a series of radials emanating from the hand centroid to generate a signature for each gesture. The third is based upon matching templates generated during training with a given test mask in the canonical frame. The three methods are discussed below.

5.4 Recognition method 1: Area metric


A very simple comparison metric would be hand area, which would have the advantage of not being affected by the yaw of the hand. However, the area of any given gesture is unlikely to be unique within the test set. Nevertheless it was decided to proceed with the analysis of this method in order to focus the attention on the comparison architecture of any future system and the testing methodology. See Appendix C Section 1 for a formal description of this method. In order to test this method a program was devised to measure the area of a given gesture (after scaling to keep the hand centroid to wrist band centroid distance constant). Several examples from the one-handed sign language were presented and the average areas of each calculated and stored. A test gesture was then presented to the system and the differences in area between it and those previously stored calculated. The recognition results for the sign language letter c are shown in Figure 22 (compared with letters a through to i):


[Bar chart: comparison of area for gesture 'c' with pairs of trained examples of letters 'a' through to 'i'; vertical axis: area difference.]

Figure 22 Comparison of test letter 'c' with pairs of trained examples from 'a' through to 'i'. Although the score is low for the letter 'c', the scores for several of the other gestures are also low. Any of the gestures below the broken line could be misinterpreted as the letter 'c'. This suggests, as predicted, that area is not a good comparison metric to use (although the letters 'a', 'e', 'g' and 'i' are well differentiated from 'c').

As predicted, area is not a good comparison metric as several other trained gestures (b, d and h) also exhibited a similar area to the test letter c.

5.5 Recognition method 2: Radial length signature


A simple method to assess the gesture would be to measure the distance from the hand centroid to the edges of the hand along a number of radials equally spaced around a circle. This would provide information on the general shape of the gesture that could be easily rotated to account for hand yaw (since any radial could be used as datum). Figure 23 shows a gesture with example radials (simplified).


Figure 23 Example gesture with radials marked. The black radial lengths can easily be measured (length in pixels shown). However, the red radials present a problem in that they either cross between fingers or palm and finger.

However, a problem (as shown in Figure 23) is how to measure when the radial crosses a gap between fingers or between the palm and a finger. To remedy this it was decided to count the total number of skin pixels along a given radial. This is shown in Figure 24.

Figure 24 One of the problem radials with outlined solution. If only the skin pixels along any given radial are counted then the sum is the effective length of that radial. In this case the radial length is 46 + 21 = 67.

All of the radial measurements could then be scaled so that the longest radial was of constant length. By doing this, any alteration in the hand camera distance would not affect the radial length signature generated. See Appendix C Section 2 for a formal description of the radial length calculation.
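A sketch of the radial length calculation (the formal description is in Appendix C Section 2) is given below in Python with NumPy; the number of radials, the maximum radius and the yaw datum argument (which anticipates Section 5.5.2) are illustrative choices.

```python
import numpy as np

def radial_signature(skin_mask, c_hand, theta0=0.0, n_radials=200, max_r=200):
    """Count skin pixels along radials emanating from the hand centroid.

    theta0 is the datum direction (Section 5.5.2 sets it towards the wrist
    band centroid); n_radials and max_r are illustrative values.
    """
    h, w = skin_mask.shape
    signature = np.zeros(n_radials)
    for i in range(n_radials):
        theta = theta0 + 2.0 * np.pi * i / n_radials
        dx, dy = np.cos(theta), np.sin(theta)
        count = 0
        for r in range(max_r):
            x = int(round(c_hand[0] + r * dx))
            y = int(round(c_hand[1] + r * dy))
            if 0 <= x < w and 0 <= y < h and skin_mask[y, x]:
                count += 1
        signature[i] = count
    # Scale so the longest radial has a constant (unit) length, removing the
    # dependence on hand-to-camera distance.
    longest = signature.max()
    return signature / longest if longest > 0 else signature
```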

5.5.1 Evaluation of radial length metric

To evaluate this method a program was written to calculate the radial length signature of a given gesture and display it in the form of a histogram. Figure 25 shows the skin count of the radials from 0 to 2π radians for an open hand gesture at several different yaw angles and distances from the camera.


Figure 25 Open hand gesture in several different positions and yaw angles. The histogram for each gesture is largely the same shape but shifted dependent on the yaw of the hand.

The measurement is not affected by hand-to-camera distance. The measurement is affected by the yaw of the hand, but this only shifts the readings to the left or right and does not affect their shape. Figure 26, however, shows that the measurements are considerably different for different gestures.

Figure 26 Images showing the histogram for two different gestures. The two histograms are sufficiently different to permit differentiation.

5.5.2 Removing the hand yaw degree of freedom

In order to counter the shifting effect of hand yaw, a wrist marker was used. The angle between the centroid of this marker and the centroid of the hand was then used as the initial radial direction. This, along with the maximum radial length scaling, makes the system robust against changes in hand position, yaw and distance from camera. Figure 27 shows the same open hand gesture (as in Figure 25) in a variety of positions and yaw angles.

Figure 27 The same open hand gesture as before in a variety of different positions and yaw angles, but with hand yaw independence. The histograms for all the gestures are similar so it should be possible to recognise this gesture from a set of different gestures.

The radial measurements are very similar no matter how the hand is positioned.

5.5.3 Comparison of radial signatures

Now that an invariant signature exists for each gesture it is possible to compare the signature of a test gesture with those of a set of trained gestures. A match score for each trained gesture was then calculated by adding up the differences between corresponding radial lengths. The trained signature with the smallest difference could then be presented as the match. See Appendix C Section 3 for a formal description of the radial signature comparison. A program was written to display an image of the trained gesture with the best score at the top left of the image window. Figure 28 shows the successful recognition of several gestures.


Figure 28 Successful recognition of several different gestures. Gesture recognised is shown at the top left of the frame. The gestures are recognised correctly even though the yaw of the test hand is different from that taught.
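A sketch of the signature comparison is given below; the use of absolute differences and the optional rejection threshold are assumptions, as the report only states that the differences between corresponding radial lengths are summed.

```python
import numpy as np

def best_match(test_sig, trained_sigs, threshold=None):
    """Return the label of the trained signature closest to the test signature.

    trained_sigs: dict mapping gesture label to signature array (same length
    as test_sig). A lower score is a better match; if 'threshold' is given and
    the best score exceeds it, None (a blank return) is produced instead.
    """
    scores = {label: float(np.abs(test_sig - sig).sum())
              for label, sig in trained_sigs.items()}
    label = min(scores, key=scores.get)
    if threshold is not None and scores[label] > threshold:
        return None, scores
    return label, scores
```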

5.5.4 Improving radial distribution

During tests it was noticed that the quality of recognition depended on the number of radials used (in the example in Figure 28 only 100 radials were used where previously the number was 200). It was also noticed that most of the significant data was concentrated around the fingers, thus it would be more efficient to group radials in these areas. Figure 29 shows the radials in their original grouping and after reorganisation.


Figure 29 The left image shows 100 radials in their original pattern. However, this pattern does not give the necessary concentration bias towards the fingers. The image on the right shows 200 radials reorganised so that twice as many lie over the fingers as the rest of the hand (150 over the fingers and 50 elsewhere).

5.5.5 Re-evaluation of radial length metric

Using this improved system the sign language letters a through to o were taught to the system. This enabled a very limited sign language word processor to be made (see Figure 30).

Figure 30 Successful implementation of a simple sign language word processor. Clicking a button whilst gesturing in the frame added the highest scoring gesture to the output window.

The graph in Figure 31 shows that the radial length metric is considerably better than the area metric at differentiating this series of gestures. However, c and i have very similar low scores even though the signs are physically different.


[Bar chart: comparison of radial length signature for gesture 'c' with trained letters 'a' through to 'i'; vertical axis: total of differences in number of pixels along radials.]

Figure 31 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The score is low for the letter 'c' and high for most of the other gestures. However, one example of the letter 'i' also gets a good comparison score even though the gesture corresponding to the letter 'i' is dissimilar to that of the letter 'c'. Nevertheless, the range of scores is considerably better than that of the area recognition method discussed earlier.

5.5.6 Analysis of data provided by system

To examine why the scores were so similar for the physically different gestures c and i (see Figure 31), the recognition program was altered so that only a single pixel was displayed along a given radial at a distance proportional to the number of pixels detected (along that radial). This provided a good illustration of the information presented to the recognition process (see Figure 32).

Figure 32 On the left is the original image and on the right is a representation of the data provided by the radial length recognition system. The amount of information provided about individual fingers is dependent on the angle of the radial covering that finger which means that gestures involving the poorly represented fingers will not be well differentiated.


Due to the organisation of the radials, the amount of information provided about individual fingers is dependent on the relative angle of the radial and the long axis of the finger (the shallower the angle the more information is provided). This is obviously an inadequate situation as gestures involving the parts of the hand that are not well covered would be hard to differentiate.

5.5.7 Test of system using American sign-language gestures

The effects of the problem highlighted in Section 5.5.6 are further illustrated by the recognition statistics in Figure 33, for a considerably larger gesture set involving all the sign language letters and numbers as well as five mouse commands (left click (lc), right click (rc), open hand (op), closed hand (cl), double click (dc); see Figure 50) and space (sp). The test procedure involved signing all of the gestures as well as transition gestures interleaved between them. For a perfect score the system would not only have to correctly recognise all the gestures but also provide a blank return for the transition gestures. A false positive is where the system returns a gesture label even though the input was a transition gesture. A false negative is where the system returns a blank even though the input was a valid gesture.

Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP R K 3 5 SP Q U I C K SP B R O U T SP F O X E N SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E U SP T H E SP 6 7 8 9 J SP L A Z Y SP D O G OP CL LC RC DC

Correct: 55/64    Incorrect: 9/64    False positives: 33/64    False negatives: 2/64

Figure 33 Results from a test of the radial length recognition method. Several of the test gestures were incorrectly recognised. There were also a number of false positives and two false negatives (the number of false positives and negatives is dependent on a threshold above which a score is considered to have been caused by a valid gesture).


5.6 Recognition method 3: Template matching in the canonical frame


In this section an alternative recognition strategy is discussed which involves first transforming the hand into the canonical frame and then performing a comparison of the test and taught transformed data. Using the hand yaw and scaling information it is possible to transform the entire hand into a frame where it always has the same yaw angle and scaling (this is called the canonical frame). For each skin pixel the distance and angle to the hand centroid is calculated. The distance can then be scaled by the hand centroid to wrist band centroid distance and the angle rotated by the angle of a line joining the centroids. The translated pixel can then be placed in the canonical frame using another point as a reference, say the centre of the screen. See Appendix C Section 4 for a pseudocode description of this process.
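As an illustration only, the "pixel push" described above can be sketched in Python/NumPy as below. The canonical radius of 100 pixels and the anchor of (160, 120) are taken from Appendix C Section 4; the function and argument names are hypothetical and the rotation sign convention is arbitrary for the sketch, not the project's actual code.

    import numpy as np

    def push_to_canonical(skin_pixels, hand_centroid, band_centroid,
                          anchor=(160.0, 120.0), canonical_radius=100.0):
        """Map each detected skin pixel (x, y) into the canonical frame (pixel push).

        Scaling uses the hand-centroid to band-centroid distance; rotation uses
        the angle of the line joining the two centroids (the hand yaw).
        """
        hand = np.asarray(hand_centroid, dtype=float)
        band = np.asarray(band_centroid, dtype=float)
        diff = hand - band
        scale = canonical_radius / np.linalg.norm(diff)   # radius scaling factor
        shift = np.arctan2(diff[1], diff[0])              # canonical angle shift

        out = []
        for p in np.asarray(skin_pixels, dtype=float):
            v = p - hand
            r = np.linalg.norm(v) * scale                 # pixel distance scaling
            theta = np.arctan2(v[1], v[0]) - shift        # pixel angle rotation
            out.append((anchor[0] + r * np.cos(theta),
                        anchor[1] + r * np.sin(theta)))
        return np.array(out)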

5.6.1 Evaluation of the transformation into the canonical frame

A program was written to perform the transformation. The results are shown in Figure 34.

Figure 34 On the left is the original image and on the right is the image after transformation into the canonical frame. However, after scaling up from the original frame, gaps appear between the pixels which would make the recognition comparison unreliable.

The problem is that scaling up from the original frame to the canonical frame results in gaps between pixels. This would be disadvantageous in recognition as a specific pixel in the trained set may not match up with a corresponding pixel in the test gesture and as such would not score.

5.6.2 Modification of the transformation method

A solution to the problem highlighted in Section 5.6.1 would be to change the algorithm from using a pixel push from the original to the canonical to using a pixel pull. With this method the distance and angle between every pixel in the canonical frame and some anchor point (such as the centre of the screen) is calculated. The inverse scaling and angle rotation is then performed and the corresponding pixel in the original frame, relative to the hand centroid, queried. If this pixel is skin then the pixel in the canonical frame is coloured blue. If it is not skin it is coloured black. A disadvantage is that any given pixel in the original frame may be queried several times, reducing efficiency. See Appendix C Section 5 for a pseudocode description of the pixel pull from the canonical frame.
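A hedged sketch of the pixel pull, under the same assumptions as the previous sketch (all names are hypothetical): every pixel of the canonical frame is mapped back into the original frame and the skin mask queried there, so no gaps can appear.

    import numpy as np

    def pull_to_canonical(skin_mask, hand_centroid, scale_factor, shift_angle,
                          out_shape=(240, 320), anchor=(160.0, 120.0),
                          canonical_radius=100.0):
        """Fill the canonical frame by querying the original skin mask (pixel pull).

        skin_mask is a boolean array indexed [row, col]; hand_centroid is (x, y);
        scale_factor is the hand-to-band centroid distance, shift_angle the hand yaw.
        """
        canonical = np.zeros(out_shape, dtype=bool)
        cx, cy = float(hand_centroid[0]), float(hand_centroid[1])
        for row in range(out_shape[0]):
            for col in range(out_shape[1]):
                vx, vy = col - anchor[0], row - anchor[1]
                r = np.hypot(vx, vy) * scale_factor / canonical_radius  # inverse scaling
                theta = np.arctan2(vy, vx) + shift_angle                # inverse rotation
                x = int(round(cx + r * np.cos(theta)))
                y = int(round(cy + r * np.sin(theta)))
                if 0 <= y < skin_mask.shape[0] and 0 <= x < skin_mask.shape[1]:
                    canonical[row, col] = skin_mask[y, x]
        return canonical

The cost of the method, as noted above, is that a given original-frame pixel may be queried several times.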


5.6.3 Evaluation of the modified transformation

A program was written to perform the modified transformation. Figure 35 illustrates how a given gesture in two different positions in the original frame looks very similar in the canonical frame. Notice that shadowing still affects the gesture similarity.

Figure 35 The left two images show two different examples of the same gesture at different positions and rotations. The right two images show the corresponding images in the canonical frame. Performing a pixel pull rather than a pixel push means that the problem of gaps between pixels no longer occurs. The two gestures look similar in the canonical frame, most of the differences being caused by shadowing.

5.6.4 Analysis of methods of representation of training data

The question is now how to compare training data with a test gesture in the canonical frame. Unlike the radial length metric the amount of data to be compared for each gesture is large (>40,000 pixels). Therefore, although it would be possible to directly compare the canonical frame information of a test gesture with all of those trained, this process would be inefficient and slow. It is evident that some pixels are better at differentiating a given set of gestures than others (pixels near the wrist band are likely to be skin for the entirety of the gesture set and those far from it never). It is also the case that some pixels are not reliable in identifying a given gesture (such as pixels near the edge of the hand or those intermittently affected by shadowing). To address this problem a program was written to take a number of example images of a given gesture and compare every pixel over the set. The value for the amount of variation of each pixel was then calculated and displayed by a colour from blue (small amount of variation) to red (large amount of variation). These images were termed jitter maps. See Appendix C Section 6 for a pseudocode description of the creation of these jitter maps.
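A minimal NumPy sketch of the jitter-map statistic, assuming the binary masks are already aligned in the canonical frame (the colour mapping is omitted; only the per-pixel variation value is computed):

    import numpy as np

    def jitter_map(masks):
        """masks: array of shape (n, rows, cols) holding 0/1 masks of one gesture.

        Returns a per-pixel variation value: 0 where every example agrees,
        1 where the examples are split exactly half skin / half background.
        """
        masks = np.asarray(masks, dtype=float)
        n = masks.shape[0]
        n_skin = masks.sum(axis=0)
        n_background = n - n_skin
        return 1.0 - np.abs(n_skin - n_background) / n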


Figure 36 shows the jitter maps for the one handed sign language letters m, n and l (40 examples of each gesture were used).

Figure 36 Jitter maps for the letters m, n and l respectively (40 examples of each gesture were used). The most variation (most red) occurs near the edges of the hand. Greater influence should therefore be given to the bluer pixels for the purposes of recognition.

As expected, the largest amount of variation occurs near the edges of the hand. Therefore, in the recognition of these gestures, greater weight should be given to the bluer pixels. It would also be advantageous to combine the information given by maps such as those in Figure 36 to find the pixels that best differentiate them. In order to facilitate this a program was first written to create a map where the value of each pixel is dictated by the proportion that the corresponding pixels across the training set were skin. These images were termed skin concentration maps (SCMs). See Appendix C Section 7 for a pseudocode description of the creation of these skin concentration maps. A simple subtraction of the SCMs for two sets of gestures could then be performed to find the pixels that best differentiate the two (the best pixels being those that are mostly background on one set and mostly skin on the other). See Appendix C Section 8 for a pseudocode description of the creation of a skin concentration difference map. Figure 37 shows the skin concentration maps for the letters m and n and the result of the subtraction of the two.
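Both maps are simple per-pixel statistics over the training masks; a minimal sketch under the same assumptions as above, with hypothetical function names:

    import numpy as np

    def skin_concentration_map(masks):
        """Fraction of the training examples in which each pixel was skin (0..1)."""
        return np.asarray(masks, dtype=float).mean(axis=0)

    def scm_difference(scm_a, scm_b):
        """Per-pixel absolute difference of two SCMs; values near 1 mark pixels
        that are mostly skin for one gesture and mostly background for the other."""
        return np.abs(np.asarray(scm_a) - np.asarray(scm_b))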


Figure 37 The top two images are skin concentration maps for the letters m and n respectively. As expected the skin is most concentrated at the centre of the hand (blue areas) and least concentrated near the edges (red areas). The bottom image is the result of an image subtraction of the top two. The best pixel areas to differentiate these two gestures lie just beyond the knuckles of the letter n and in the shadowed area of the letter m (coloured red).

The best pixels to differentiate the letters m and n (coloured red) lie just beyond the knuckles of the letter n and in the shadowed area of the letter m. Both jitter and skin concentration maps are a compact way of representing the large amount of data created during training. However, skin concentration maps proved more useful for the purposes of gesture comparison and so were chosen.

5.6.5 Evaluation of template matching in the canonical frame recognition method

Now that a skin concentration map could be formed for any gesture trained, a method had to be found to compare a test gesture mask with each of them. Fundamentally, a trained and test gesture are a good match if all the areas of skin and background match up. However, a skin concentration map has no skin or background as such, but rather a value between these two limits. Therefore, in order to evaluate this recognition method a program was written to quantize the skin concentration maps so that all areas above a certain threshold were considered skin, all those below a second threshold were considered background and all other pixels were ignored. A direct skin-to-skin and background-to-background comparison then became possible. See Appendix C Section 9 for a pseudocode description of the creation of the quantized skin concentration maps. Figure 38 shows an example skin concentration map before and after quantization.

Figure 38 An example SCM of the letter e before and after quantization (left and right respectively). Any areas whose skin concentration lies above an upper threshold are considered skin (coloured blue), any that lie below a lower threshold are considered background (coloured red), and all other areas are ignored (coloured white).
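A minimal sketch of this thresholding step, assuming the SCM is a NumPy array of per-pixel skin fractions; the 0.8 and 0.2 values mirror the illustrative thresholds in Appendix C Section 9 and are not fixed constants:

    import numpy as np

    def quantize_scm(scm, upper=0.8, lower=0.2):
        """Return a QSCM: 2 where mostly skin, 0 where mostly background, 1 where ignored."""
        scm = np.asarray(scm, dtype=float)
        q = np.ones_like(scm, dtype=np.int8)   # 1 = neither, ignored when scoring
        q[scm >= upper] = 2
        q[scm <= lower] = 0
        return q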

A score was then calculated by comparing the test gesture mask with each quantized skin concentration map (QSCM). A point was awarded if a test mask skin pixel coincided with a skin pixel of the QSCM and a point subtracted if it coincided with a background pixel. Similarly, a point was awarded if a test mask background pixel coincided with a background pixel of the QSCM and a point subtracted if it coincided with a skin pixel. See Appendix C Section 10 for a pseudocode description of the comparison of a test gesture mask and set of QSCMs. Figure 39 shows the comparison of the QSCM for the letter e (Figure 38 right) with example masks of the letters c and e.
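The comparison then reduces to counting agreements and disagreements between the binary test mask and the quantized map; a vectorised sketch with hypothetical names:

    import numpy as np

    def qscm_score(test_mask, qscm):
        """test_mask: 0/1 array; qscm: array of {0, 1, 2} from quantize_scm.

        +1 for skin on 'mostly skin' and background on 'mostly background',
        -1 for the two mismatches; pixels quantized to 1 contribute nothing.
        """
        skin = np.asarray(test_mask).astype(bool)
        qscm = np.asarray(qscm)
        score = np.sum(skin & (qscm == 2)) + np.sum(~skin & (qscm == 0))
        score -= np.sum(skin & (qscm == 0)) + np.sum(~skin & (qscm == 2))
        return int(score)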


Figure 39 The comparison of the QSCM for the letter e (Figure 38 right) with example masks of the letters c and e. Areas that achieve positive scores (background to background or skin to skin match) are shown in green and those with negative scores (background to skin or skin to background) are shown in yellow. The mask for the letter e has many more areas of positive score and fewer areas of negative score than the mask for the letter c.

The graph in Figure 40 shows the scores of a test gesture c compared with the QSCMs of gestures from a through to i.


[Graph: Comparison of QSCM match score for gesture 'c' with trained letters 'a' through to 'i'. Vertical axis: QSCM match score.]

Figure 40 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The examples of the letter c achieve the top two comparison scores and none of the others achieve similar scores except the letter d which, although close, is still a minimum 1,400 points different. This suggests that the template matching in the canonical frame recognition method is better than both the area and radial length recognition methods.

Both stored examples of the letter c matched the test gesture better than any of the others. Based on the results obtained for the three metrics it was decided to use the template matching in the canonical frame recognition method, as it was the only method that provided sufficient information to differentiate the similar gestures reliably and because it was the easiest to adapt to using multiple training examples of each gesture.

5.7 Refinement of the canonical frame


In order to make the differentiation of a large number of gestures accurate, it is essential that the canonical frame is as invariant as possible to movements of a gesture in the original frame. The current system uses the hand centroid as an anchor in the original frame and the hand centroid to wrist band centroid distance as a scaling factor. However, although the centroids are calculated using the average of a large number of pixels they are not as robust as other methods considered below.

5.7.1 Scaling using average radial distance

With this method the scaling factor is obtained using the average distance from the hand centroid to every skin pixel detected. This is more robust than the hand centroid to wrist band centroid distance scaling factor as it does not involve the use of the wrist band centroid (which is less robust as it is calculated using a smaller number of pixels). See Appendix C Section 11 for a pseudocode description of scaling using the average radial distance.
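As a sketch, the scale factor is simply the mean distance of the detected skin pixels from the hand centroid (names are hypothetical):

    import numpy as np

    def average_radial_distance(skin_pixels, hand_centroid):
        """Mean distance from the hand centroid to every detected skin pixel."""
        d = np.asarray(skin_pixels, dtype=float) - np.asarray(hand_centroid, dtype=float)
        return float(np.linalg.norm(d, axis=1).mean())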

5.7.2 Shifting the hand in the canonical frame

This method translates the hand in the canonical frame based upon simple rules (e.g. shift up until there are at least 40 skin pixels in the uppermost row). Once again this method makes the canonical frame method more robust as it reduces the reliance on the hand centroid as an anchor point. Several rules were considered, but the one that produced the best results involved shifting the image in the canonical frame to the right until the wrist band was just off the edge of the screen. This was performed by scanning columns of the canonical frame from the right until the number of wrist band pixels detected fell to zero. The positioning in the y-direction was calculated using the hand centroid as before. Figure 41 shows a gesture in the canonical frame before and after translation.

Figure 41 Images showing the canonical frame before (left) and after (right) the x-axis shift. The y-axis position of the hand is dictated by the hand centroid as before.

It was decided to use both these methods.

5.8 Refinement of the training data


It was noticed that some gestures from the one handed sign language set exhibited more variation than others. This was primarily the gestures where the fingers of the hand cast shadows on the palm (such as the letters e and f). The shadows cast varied greatly with a small change of hand roll angle causing different areas of the palm not to be detected within the set. It was considered that these gestures would be at a disadvantage relative to those with less variation, as the skin concentration map would have more red areas, which therefore gives less credence to the comparison. For example, an extreme case would be a gesture that has no pixels common to any of the teaching frames. The skin concentration map for this gesture would therefore be entirely red. Any comparison method should give these high variation pixels less weight so for this extreme example none of the pixels would cause a high score even if the test gesture was an example of a taught gesture. A solution to this problem would be to cluster the training set for this gesture into several different exemplars (or sub-groups), all of which would share the same gesture label. The exemplars could be formed using the most similar gestures from the main group. This would then guarantee that the amount of variation within any of the exemplars would be kept low and therefore solve the problem. A simplified example of the clustering process is shown in Figure 42.


A: Input

B: Output SCM without clustering

C: Output SCMs with clustering

Figure 42 A simplified example of how clustering improves recognition. In this case several examples of each of three valid representations of the letter c have been taught to the system. An example of each of the three representations is shown (column A). The resultant SCM (column B) has a large amount of red area. Any comparison method should give these areas less weight, so this gesture would be at a disadvantage relative to those with less variation. Column C shows the SCMs produced after clustering. The three types of gesture input have been split into three separate SCMs, each with much less red area.

A greedy algorithm was devised to take the first gesture image in the training group and compare it pixel by pixel with all other members of the group. See Appendix C Section 12 for a pseudocode description of the comparison. Any gesture images whose compared difference (in pixels) fell below a set threshold, t_max, were then added to a sub-group and removed from the main group. Once all the gesture images in the main group had been compared, the new first member of the main group could be compared with all the remaining images, and so on. A threshold was also set to define the minimum number of gesture images permitted in an exemplar. In the event that the number of images in an exemplar fell below this threshold, the first member of the main group was simply removed entirely, on the logic that if it was so dissimilar from all the rest then it must be an outlier and as such could be safely removed without greatly affecting recognition quality. The process continued until no gesture images remained in the main group. See Appendix C Section 13 for a pseudocode description of the clustering process. Figure 43 shows the result of running the algorithm on sets of 100 gesture images of the sign language letters a through to e. The value of t_max in this case was 2500 pixels different and a minimum of four gesture images were allowed in an exemplar.
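A compact sketch of this greedy clustering, assuming the training images are equal-sized binary NumPy masks; the threshold values shown are the illustrative ones from the text:

    import numpy as np

    def cluster_masks(masks, t_max=2500, min_size=4):
        """masks: list of equal-sized 0/1 arrays for one gesture label.

        Returns (exemplars, outliers): each exemplar is a list of similar masks,
        and a seed whose group is too small is discarded as an outlier.
        """
        remaining = list(masks)
        exemplars, outliers = [], []
        while remaining:
            seed = remaining.pop(0)
            group, keep = [seed], []
            for m in remaining:
                if np.sum(seed != m) <= t_max:   # pixel-count difference
                    group.append(m)
                else:
                    keep.append(m)
            remaining = keep
            if len(group) >= min_size:
                exemplars.append(group)
            else:
                outliers.append(seed)
                remaining = group[1:] + remaining  # return the others to the pool
        return exemplars, outliers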


Sign language letter   Number of exemplars   Number of gesture images in each exemplar
A                      6                     48, 12, 11, 5, 14, 5 (5 outliers)
B                      3                     81, 12, 6 (1 outlier)
C                      3                     41, 53, 6 (0 outliers)
D                      3                     73, 15, 12 (0 outliers)
E                      13                    11, 12, 10, 7, 12, 8, 6, 5, 7, 5, 4, 4, 4 (5 outliers)

Figure 43 Table showing the result of applying the segmentation algorithm to sets of 100 gesture images of the sign language letters a through to e. The gestures with the greatest amount of shadowing are a (due to the fingers resting against the palm) and e (due to the suspended fingers above the palm). Notice also how each of these gestures has five outliers. However, this is only 5% of the total number of gesture images in the set so was not considered too large. The gestures with no shadowing (c and d) are still clustered into more than one exemplar. This is due to the range of positions the fingers can occupy and still present a valid version of this gesture.

All of the training gesture image sets are clustered into at least three exemplars. As expected, the gestures with the largest number of exemplars are those with the most shadowing (letters a and e). Those with no shadowing (c and d) are also clustered into a small number of exemplars as they involve a range of possible finger positions that still present a valid gesture. A problem with clustering the training gesture image sets in this way is that it increases the number of SCMs that need to be compared per frame in order to recognise a test gesture. For instance, with no clustering, a set of 24 gestures would produce 24 SCMs to compare per frame. If clustering produces 10 exemplars per gesture, then the number of SCMs increases to 240, with subsequent decrease in recognition frame rate. The choice of how much clustering to perform is a trade-off between speed (less clustering) and accuracy (more clustering) and should be chosen depending on the application. A compromise between the two was chosen here.

5.9 Method of differentiation (in canonical frame)


As mentioned in Section 5.6.4, it is important for the comparison of the stored and test canonical frames to be efficient. Three methods were considered:

5.9.1 Tree method with quantization

In Section 5.6.4 a method was discussed whereby a series of images of a given gesture can be combined to form a skin concentration map (SCM). By subtracting two SCMs it is possible to score each pixel on how effective it is at differentiating one gesture from the other (see Figure 37). This method cannot be easily extended to more than two gestures. However, if a set of skin concentration maps are quantized into three values, say two for mostly skin, zero for mostly background and one if neither, then the equivalent pixel in each of the maps can be examined and that pixel added to a list if the quantized values over all the maps consisted entirely of twos and zeros. The same pixel of a test gesture can then be queried. If it is skin, then that would suggest that it is one of the gestures with mostly skin in that position, if not, then one with mostly background. See Figure 44 for a simplified example of this process and see Appendix C Section 14 for a pseudocode description.
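A sketch of the split-pixel search over a stack of quantized maps, assuming each map uses the 0/1/2 coding described above (function and variable names are hypothetical):

    import numpy as np

    def splitting_pixels(qscms):
        """qscms: array (n_exemplars, rows, cols) with values in {0, 1, 2}.

        A pixel is 'polarised' if no exemplar quantizes it to 1 (ambiguous).
        For each such pixel, return the set of exemplars that are mostly skin
        there; the remaining exemplars are mostly background at that pixel.
        """
        qscms = np.asarray(qscms)
        polarised = ~(qscms == 1).any(axis=0)
        skin_groups = {}
        for r, c in zip(*np.nonzero(polarised)):
            skin_groups[(int(r), int(c))] = frozenset(
                np.nonzero(qscms[:, r, c] == 2)[0].tolist())
        return polarised, skin_groups

Pixels whose sets of "mostly skin" exemplars coincide, or are exact complements, split the exemplar set in the same way and can therefore be pooled when filling a tree node.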


Figure 44 Simplified example of how the pixels that split the set can be found. The four tables on the left represent skin concentration maps. After quantization, the value of each pixel in the quantized skin concentration map is either 0, 1 or 2. The pixels that are either 0 or 2 across the set can then be found.

Although the process of quantization means that there is no strict guarantee provided by the analysis of each individual pixel, the combined influence of the many pixels in the list provides a better estimate. With the tree method a group of pixels that split the set of exemplars roughly in two is found. The greater the number of pixels the better the accuracy of the decision, so a compromise has to be found between splitting the set into two halves and finding enough pixels to accurately do so. See Appendix C Section 15 for a formal description of this compromise. Once the set is split the two subsets can be stored in the left and right branch of a tree structure. The same process (of finding pixels that split the set in two) can be applied to both subsets. The process continues until all subsets consist of a single gesture. A program was written to perform the quantization and then scan all the pixels from all the QSCMs for those that split the set roughly in two. Priority was given to finding sufficient pixels so if on a given pass insufficient were found then the process was repeated but with less emphasis on splitting the set exactly in two. After each split the location and value of all the qualifying pixels was stored and a node of a tree structure filled. Both reduced sets of gestures were then passed back into the splitting algorithm. The process was repeated until all the bottom nodes of the tree consisted of a single gesture. See Appendix C Section 16 for a pseudocode description of filling the tree structure. Figure 45 shows the output of the algorithm for a set of five gestures from the one handed sign language set (letters l, b, o, n and m).


Input Gestures

Set of pixels found that split the set and tree structure filled

Figure 45 An example of how the tree method works. At each level of the tree the number of skin pixels under the green and yellow masks is counted. If the number under the green mask is larger than that under the yellow mask the green branch is chosen. Alternatively the yellow branch is chosen. The process is repeated until the bottom of the tree is reached.


The advantage of this system is that after the tree structure is filled, only a small number of pixels need be analysed before the descent to the next tree level. As, at each stage, the number of possible exemplars is split roughly in two, this method is very quick to execute. The disadvantage of this method is that at the levels of the tree near the root, when the number of exemplars is large, the number of pixels that split the set (even to split off a single exemplar) is very small. During testing it was found that for a set of just 16 exemplars only 200 pixels could be found to split off a single exemplar at the first level of the tree, greatly increasing the possibility of error at this level. Another problem is that the tree can only be traversed downwards: once it is decided to travel down one side of the tree, the exemplars represented on the other side cannot be compared even if they would provide a better match at a later stage. For example, if the probability of correct branch traversal at each node is 98% or 0.98 (which corresponds to a 2% probability of failure) and the tree has 10 levels (all of which must be traversed correctly), then the probability of success at the bottom is 0.98^10 ≈ 0.82 (which corresponds to a failure probability of 18%). This was reflected in the fact that, for a set of more than eight different exemplars, the correct one was rarely recognised.

5.9.2 Template score method with quantization

With this method the quantization of the SCMs is performed as with the previous method. In order to recognise the test gesture a score is calculated for each QSCM by looking at each pixel in turn. Every pixel is scored as follows (see Appendix C Section 17 for a pseudocode description):
If the test gesture pixel is skin then a point is awarded to each of the QSCMs if the value of that pixel is mostly skin.
If the test gesture pixel is skin then a point is subtracted from each of the QSCMs if the value of that pixel is mostly background.
If the test gesture pixel is background then a point is awarded to each of the QSCMs if the value of that pixel is mostly background.
If the test gesture pixel is background then a point is subtracted from each of the QSCMs if the value of that pixel is mostly skin.
Otherwise no change is performed.

The final score for each QSCM can then be calculated by dividing the total score by the maximum score possible (equal to the number of pixels over the template which are either mostly skin or mostly background). An advantage of this system is that each exemplar is judged separately so, unlike the tree method, errors do not accumulate. A disadvantage is that a very large number of pixels have to be examined for each of the QSCMs for a match to be made. Also, if a given training gesture has a large amount of variation then there will be a large number of pixels which are neither mostly skin nor mostly background in the QSCM (equivalent to a large amount of white area in Figure 38 right), leaving large areas where no score can be awarded, increasing the possibility that two exemplars will be difficult to differentiate. To test the system, the same training and test gesture sets that were used with the radial length metric were fed to the system. Figure 46 shows the results:


Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP 1 2 3 4 5 SP Q U I C K SP B U O W N SP F O X E S SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC

Correct: 63/64    Incorrect: 1/64    False positives: 60/64    False negatives: 0/64

Figure 46 Results from a test of the template score method with quantization. All but one of the test gestures was correctly identified and there were no false negatives. However, there were a considerable number of false positives. This is due to the fact that the recognition score for a couple of the gestures was low even though the correct gesture obtained the highest score. This meant that the recognition threshold had to be set low and as such a number of intermediary frames were incorrectly recognised as gestures.

5.9.3 Template score method with no quantization

With this method no quantization is performed. Instead, the amount of skin present over the set of images within the exemplar is represented, for each pixel, by a floating point number between -0.5 and 0.5 (-0.5 representing all background over the set and 0.5 representing all skin). The score is then calculated as follows (see Appendix C Section 18 for a pseudocode description):
Add this floating point number when the corresponding test gesture pixel is skin.
Subtract this floating point number when the corresponding test gesture pixel is background.
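With floating-point maps the score becomes a single signed sum between the test mask and the centred SCM; a minimal sketch (the centring to the -0.5..0.5 range follows the description above, the names are hypothetical):

    import numpy as np

    def unquantized_score(test_mask, scm):
        """test_mask: 0/1 array; scm: skin concentration map with values in 0..1."""
        centred = np.asarray(scm, dtype=float) - 0.5   # -0.5 all background, +0.5 all skin
        sign = np.where(np.asarray(test_mask).astype(bool), 1.0, -1.0)
        return float(np.sum(sign * centred))           # high-variation pixels (~0) barely count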

Pixels that have a large amount of variation do not affect the score by a significant amount as their value is close to zero. The advantage of this method is that no pixels are ignored, so even exemplars with a large amount of gesture image variation are fully considered. A disadvantage is that many pixels have to be considered for each SCM (as with the quantization method). This method will also be slower than the previous method as many floating point calculations have to be performed (rather than integer ones). Once again the system was tested using the same gesture sets as before. The results are shown in Figure 47:
Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC

Correct: 64/64    Incorrect: 0/64    False positives: 48/64    False negatives: 0/64

Figure 47 Results from a test of the template score method with no quantization. All of the test gestures were correctly identified this time and once again there were no false negatives. There were a considerable number of false positives. This is for the same reason as with the previous figure.

From looking at the results of each of the recognition methods it was clear that the method with the best recognition score was the template score method with no quantization. Therefore this method was chosen.

5.10 Refinement of template score method (no quantization)


Although the template score method correctly recognised all of the gestures in the test set it seemed unnecessary to query such a large number of pixels to make the decision. Therefore two methods were considered to perform the same task with the same accuracy but more efficiently:


5.10.1 Removal of pixels that perform the same function


With the method described above, all of the pixels that are skin for any of the trained gesture images are queried (roughly 38,000 pixels for each of the 300 exemplars that result after clustering). It is likely, however, that many of these pixels perform largely the same job and as such any duplicates need not be queried at all. A simple example of this is a training set with only two gestures, say A and B. After the skin concentration maps are processed two types of pixel will result, those that are mostly skin for A and mostly background for B and those that are mostly skin for B and mostly background for A. However, in order to correctly identify which gesture is presented it is only necessary to look at a single pixel from one of the groups, preferably one which is always skin for one of the gestures and always background for the other. It was decided to create an algorithm to find any duplicate pixels and ignore them. In order to make the process simpler it was decided to first quantize the pixel values into three groups, 1 if the skin concentration fell above a certain threshold, 0 if it fell below another threshold and X otherwise. An identification string could then be generated for each of the pixels across all of the groups of exemplars (one character in the string per exemplar and one string per pixel). See Appendix C Section 19 for a pseudocode description of the creation of these strings. A procedure was written to compare each pixel string with all the others. A variable containing the number of bits different was only incremented if a 1 in one string matched with a 0 in the other or vice-versa. In other words the value X was taken to mean either a 1 or a 0. If the number of bits different in the string fell below a certain threshold then the two pixels were considered identical and as such one of the pixels could be discarded. If this was the case then the procedure returned the string with the most Xs so that this one could be discarded (as this pixel contained less information). See Appendix C Section 20 for a pseudocode description of the comparison. The procedure was run several times on the set of exemplars with different threshold values. As the upper and lower thresholds were moved closer and closer to all skin and all background respectively the identification strings contained more and more Xs and as such more and more pixels were identified as duplicate and therefore discarded. Similarly as the minimum number of pixels different allowed was increased, more and more pixel identification strings were considered duplicate and were also discarded. Eventually so many pixels were discarded that some of the gestures were no longer recognised correctly. The widest thresholds that still permitted all of the gestures to be correctly recognised were chosen. This reduced the number of queried pixels from 34,788 to 1,199 with a corresponding 30-fold increase of recognition speed. Figure 48 shows the pixels queried in order to identify one of the exemplars for the letter a before and after the duplicates were removed.
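A sketch of the identification-string idea, assuming the exemplar SCMs are stacked in a NumPy array; the thresholds and the zero-difference duplicate criterion are illustrative:

    import numpy as np

    def pixel_id_strings(scms, upper=0.8, lower=0.2):
        """scms: array (n_exemplars, rows, cols) of skin concentrations in 0..1.
        Returns one string per pixel, one character per exemplar:
        '1' mostly skin, '0' mostly background, 'X' otherwise."""
        flat = np.asarray(scms, dtype=float).reshape(len(scms), -1)
        chars = np.where(flat >= upper, '1', np.where(flat <= lower, '0', 'X'))
        return [''.join(chars[:, i]) for i in range(chars.shape[1])]

    def near_duplicates(s1, s2, max_diff=0):
        """Two pixels are duplicates if their definite bits disagree in at most
        max_diff positions; 'X' is treated as matching anything."""
        diff = sum(1 for a, b in zip(s1, s2) if 'X' not in (a, b) and a != b)
        return diff <= max_diff

When two strings come out as duplicates, discarding the one with more Xs keeps the more informative pixel, mirroring the procedure described above.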


Figure 48 Images showing the pixels queried in order to detect one of the exemplars for the letter a before and after removal of duplicates. After removal, the remaining pixels are spread fairly evenly over the recognition area. Notice how the pixels near the wrist band are less concentrated, as the pixels in this area are skin for almost all of the trained gestures.

The results show that, after removal of duplicate pixels, the remaining pixels are evenly spread over the recognition area, except for the area near the wrist band where a larger number of duplicates exist. This is because most of the pixels near the wrist band are skin for all of the trained gestures.

5.10.2 Sorting the pixels


If each pixel could be given a score based on how much information it provides about which test gesture is being presented, then it would be possible to sort the pixels by this value. The advantage of having a sorted pixel set is that the pixels with the worst scores need not be queried at all, as they provide little extra information. This would make the system more efficient. Once again, to make the problem simpler, it was decided to use the quantized information from the previous method. The best pixels are those that have the fewest Xs (as an X does not give any extra information) but also those which have similar numbers of 1s and 0s. The reason for this is that, given a test gesture, repeated application of pixels such as these most rapidly cuts down the number of exemplars that could match. Therefore, the score for each pixel was calculated as follows:

score = abs(no_of_ones - no_of_zeros) + no_of_Xs
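A sketch of this ranking over the identification strings from the previous step; lower scores are queried first because balanced, rarely-ambiguous pixels cut the candidate set down fastest (the pairing of locations with strings is assumed):

    def pixel_sort_score(id_string):
        """Lower is better: balanced 1s/0s split the exemplar set fastest, and
        X (ambiguous) entries carry no information."""
        ones = id_string.count('1')
        zeros = id_string.count('0')
        xs = id_string.count('X')
        return abs(ones - zeros) + xs

    # e.g. given pixels as a list of (location, id_string) pairs:
    # ordered = sorted(pixels, key=lambda p: pixel_sort_score(p[1]))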


It was then a simple matter to sort the pixels using this score, the pixels with the lowest score being placed first. After sorting, the lowest percentage of the pixels that still permitted all the gestures to be detected was found by repeated tests. Using a combination of the two methods the number of pixels queried was reduced from 34,788 to 1,026 (85.5% of 1,199). This corresponds to a change in system frame rate from 0.5fps to 12.5fps (near real-time). After application of both these methods all of the test gestures were still correctly recognised. Therefore it was decided to use both. The results from the application of the set of test gestures from before are shown in Figure 49:


Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC

Correct: 64/64    Incorrect: 0/64    False positives: 43/64    False negatives: 0/64

Figure 49 Results of a test to the template score method with no quantization after sorting and removal of duplicate pixels. All of the test gestures were correctly identified and there were no false negatives. There were a considerable number of false positives. This is for the same reason as before.

5.11 Conclusion
In this section, three methods of recognition have been discussed. Firstly, area comparison was considered. Although this was considered an unsuitable metric it was used in order to focus the attention on the comparison architecture of any future system and the testing methodology. The second method involved the comparison of radial length signatures. This was more suitable, but it was found that the amount of information provided about individual fingers was dependent on the relative angle of the radial and the long axis of the finger, making some gestures hard to differentiate. Finally, template matching in the canonical frame was considered and chosen as it provided the best results. Various refinements were then made to increase recognition speed. Using the methods chosen a set of 42 gestures were all correctly recognised at a frame rate of 12.5fps.


6 Application: Gesture driven interface

As a demonstration of the capabilities of the system, a standard Microsoft Windows computer was modified so that the only input device necessary was the hand.

6.1 Setup
The system was set up as in Figure 2. The template score (with no quantization) recognition method was modified so that the recognised gesture generated mouse and keyboard events, as shown in Figure 50.

Gesture label   Event
A to Z          Press key A to Z (one gesture per letter)
0 to 9          Press key 0 to 9 (one gesture per digit)
CA              Press caps-lock key
RE              Press return key
DO              Press key .
SP              Press spacebar
BS              Press backspace key
LC              Left mouse click
RC              Right mouse click
DC              Left double mouse click
OP              Move mouse pointer relative to hand centroid position
CL              Left mouse button hold and move mouse pointer relative to hand centroid position

Figure 50 Table showing the gesture labels and corresponding mouse or keyboard event.


In order to ignore transition movements of the hand, an event was only queued if five identical contiguous gestures were recognised. Thereafter, further events were only processed if the gesture changed (therefore, to type two identical letters a brief gesture change would need to be interleaved).
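The transition filter is a simple per-frame debounce; a minimal sketch in Python, where the five-frame threshold is from the text and the emit callback is a hypothetical stand-in for the code that generates the mouse or keyboard event:

    def make_debouncer(required=5, emit=print):
        """Return a per-frame callback: it calls emit(label) only once the same
        label has been seen `required` frames in a row, and not again until the
        label changes (so repeated letters need a brief gesture change between them)."""
        state = {"label": None, "count": 0, "fired": False}

        def feed(label):
            if label == state["label"]:
                state["count"] += 1
            else:
                state["label"], state["count"], state["fired"] = label, 1, False
            if label is not None and state["count"] >= required and not state["fired"]:
                state["fired"] = True
                emit(label)

        return feed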

6.2 Demonstration
To demonstrate the system in use, the following sequence of actions were performed using the hand alone: The explorer icon on the task bar was clicked in order to restore it.

The floppy drive was selected. A right click brought up a menu and a new text document was created.


This document was renamed my demo.txt.

A right click brought up a menu and a new folder was created. This folder was renamed demo folder.

The text document was then dragged into the folder.

The folder was double clicked to open it. The text document was then double clicked to edit it.


The following text was then typed into the document: This is a demo of my 4th year project. I CAN TURN CAPS LOCK ON and off. I can also use the space and backspace keys. Finally I can control the mouse. ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890. The document was then closed and the changes saved.

Finally, the folder was closed and dragged to the top left of the directory window.

During the demonstration six letter errors were made, two of which were due to operator error. An AVI movie file of a similar sequence is available at: http://users.ox.ac.uk/~ball0622/index_files/demo.avi (Recognition frame rate in the video example is slightly reduced due to the effect of the screen capture software.)


7 Conclusion

7.1 Project Goals


The goal of this project was to create a system to recognise a set of 36 gestures at a rate of 25fps. The developed template matching in the canonical frame system accurately recognised a set of 46 gestures at a rate of 12.5fps on a 600 MHz system. It was considered that a modern computer system would therefore allow the project goals to be exceeded. Furthermore, the system performance is amongst the best reported in existing literature.

7.2 Further Work


Collection of additional gesture information: The final system developed recognised gestures using silhouette information alone. Although this was sufficient for the number of trained gestures, the accuracy would doubtless suffer if the number of gestures were increased. In order to remedy this, extra information about the test gesture would have to be gathered, such as edge information.

Combination of area, radial length and template matching in the canonical frame: It was noticed that each of the different recognition metrics demonstrated different benefits. For instance, the area metric differentiated a and c well, the radial metric differentiated b and c well and template matching in the canonical frame differentiated d and c well. Therefore a weighted combination of all three metrics would result in the highest accuracy.

Removal of wrist band: The system relies on the user wearing a coloured wrist band to remove various degrees of freedom, making recognition, via comparison, possible. It would be advantageous if this were not the case. There are methods (see Section 2.3) that could be used to perform the recognition without a wrist band, but they would be unlikely to be as accurate.

Using temporal coherence to improve recognition accuracy: English written text has temporal coherence in that each letter has a probability of being followed by a given letter. For instance, the letter q is often followed by the letter u but rarely any other letter. These probabilities could be used to improve recognition accuracy by combining the list of top scoring exemplars with the probability of each following the preceding letter. The same process could also be used to permit standard American one-handed sign language to be used (where the letters O, V and W are the same as the numbers 0, 2 and 6 respectively; see Appendix B) instead of the modified version.

Increase of the number of recognised gestures: For the purposes of a man-machine interface a relatively small set of gestures (100) would be sufficient and is therefore within the bounds of the final system developed. However, if detection of hand gestures for computer animation is required (for instance), then the number of trained gestures would need to be in the thousands. A system which relies on both training and comparison of all gestures used would not be sufficient for this task. Further work, therefore, could involve the implementation of a gesture recognition system which does not require training. An example of this is the direct method based on hand geometry considered in Section 5.1.

Multi-stage gestures: It would be possible to represent a much larger number of labels if each label consisted of two or more gestures combined with hand position changes. For instance, the wave hello label could correspond to the open hand gesture with an alternating increase and decrease of hand yaw angle and the thumbs-up label could correspond to the letter m followed by the space gesture.

Two-handed sign language: It would be possible, using two different coloured gloves and two different coloured wrist bands, to detect the gesture signed by both hands whilst both are in the frame. A method would have to be devised to detect a gesture (or range of gestures) that is represented by a partially occluded hand. This method would be considerably harder to implement. It is important to note, however, that although the gesture of both hands could be recognised this would not permit the recognition of the full American sign language as this involves recognising many other features including facial expression and arm position.


8 References
[Bauer & Hienz, 2000] Relevant feature for video-based continuous sign language recognition. Department of Technical Computer Science, Aachen University of Technology, Aachen, Germany, 2000.

[Bowden & Sarhadi, 2000] Building temporal models for gesture recognition. In proceedings British Machine Vision Conference, 2000, pages 32-41.

[Bretzner & Lindeberg, 1998] Use your hand as a 3-D mouse or relative orientation from extended sequences of sparse point and line correspondences using the affine trifocal tensor. In proceedings 5th European Conference on Computer Vision, 1998, pages 141-157.

[Davis & Shah, 1994] Visual gesture recognition. In proceedings IEEE Visual Image Signal Process, 1994, vol. 141, no. 2, pages 101-106.

[Starner, Weaver & Pentland, 1998] Real-time American sign language recognition using a desk- and wearable computer-based video. In proceedings IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, pages 1371-1375.


9 Appendix
9.1 Appendix A- Glossary
Hand roll: The rotation of the hand about an axis defined by the wrist. The following three images show the same gesture with increasing roll.

Hand yaw: The rotation of the hand about an axis defined by the camera view direction. The following three images show the same gesture with increasing yaw.

HSL: Colour space defined by hue, saturation and luminosity. Also called HSV (hue, saturation and intensity value).

Jitter map: A map created using a number of examples of the same gesture. The colour of each pixel in the map is defined by the amount of variation exhibited by the corresponding pixel across all of the examples (the greatest variation is where the pixel is skin for half of the examples and background for the other half).

Silhouette information: Detection of all skin within the hand without any feature detection (the same information that would be contained in a silhouette of the hand).

Skin concentration map: A map created using a number of examples of the same gesture. The colour of each pixel in the map is defined by the amount the corresponding pixel across all of the examples was skin (the greatest skin concentration is where the pixel is skin for all of the examples).


9.2 Appendix B- Entire Gesture Set


The letters and number gestures are based on the American one-handed sign language. Letters J and Z were modified as they were moving gestures. Numbers 0, 2 and 6 were modified as they were identical to the letters O, V and W respectively.

[Images of the ten additional gestures: DO, RE, BS, SP, CA, LC, RC, DC, OP and CL.]


9.3 Appendix C- Algorithms


C.1 Area of gesture detection method
A formal description of the area of gesture detection method is as follows. The detected set of pixels from before is $L$. The area of a given gesture can therefore be calculated thus:

$$a = \sum_{x \in L} 1$$

A training sequence of $n$ gestures can then be given and manually labelled. We denote a single (gesture, label) pair by $(a_i, l_i)$, e.g. $(a_1, 'A')$, $(a_2, 'B')$. Define this training set as:

$$G = \{(a_i, l_i)\}_{i=1}^{n}$$

Given a test image with signature $a_{new}$, choose the label $l_{i_{\min}}$ where

$$i_{\min} = \arg\min_{i=1..n} \left\| a_{new} - a_i \right\|_2^2$$

C.2 Radial length calculation


A formal description of radial length calculation is as follows. Examine a typical radial at angle $\theta$. The score for that radial is:

$$\mathrm{radscore}(\theta) = \sum_{x \in R_\theta} S(x)$$

where $S(\cdot)$ is the skin pixel predicate defined earlier and where

$$R_\theta = \left\{ x = (x, y) = \begin{pmatrix} x_c \\ y_c \end{pmatrix} + r \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} \;\middle|\; r > 0 \right\}$$

The signature for a given gesture $g$ could then be calculated as:

$$g = \left\{ \frac{\mathrm{radscore}(\theta)}{\max_\theta \mathrm{radscore}(\theta)} \;\middle|\; 0 \le \theta < 2\pi \right\}$$

C.3 Radial signature comparison


A formal description of the radial signature comparison is as follows. From before, the signature for a given gesture $g$ could be calculated as:

$$g = \left\{ \frac{\mathrm{radscore}(\theta)}{\max_\theta \mathrm{radscore}(\theta)} \;\middle|\; 0 \le \theta < 2\pi \right\}$$

A training sequence of $n$ gestures can then be given and manually labelled. We denote a single (gesture, label) pair by $(g_i, l_i)$, e.g. $(g_1, 'A')$, $(g_2, 'B')$. Define this training set as:

$$G = \{(g_i, l_i)\}_{i=1}^{n}$$

Given a test image with signature $g_{new}$, choose the label $l_{i_{\min}}$ where

$$i_{\min} = \arg\min_{i=1..n} \left\| g_{new} - g_i \right\|_2^2$$

C.4 Transformation into the canonical frame


A pseudocode description of the process of transformation into the canonical frame is as follows. Define the new hand and band centroids after refinement as $c_{hand}$ and $c_{band}$. The vector joining the two centroids is

$$v_{dif} = (x_{dif}, y_{dif}) = c_{hand} - c_{band}$$

The radius scaling factor and angle shift to be used in canonicalisation can then be defined as

$$r_{canonicalscalefactor} = \left| v_{dif} \right|, \qquad \theta_{canonicalshift} = \tan^{-1}\!\left( \frac{y_{dif}}{x_{dif}} \right)$$

Define the anchor of the canonical frame as $x_{canonicalanchor}$, say $(160, 120)$. The set of all remaining skin pixel locations after refinement is $L^{*}$.

For each $x \in L^{*}$:

$$v_{pixel} = (x_{pixel}, y_{pixel}) = x - c_{hand}, \qquad r_{pixel} = \left| v_{pixel} \right|, \qquad \theta_{pixel} = \tan^{-1}\!\left( \frac{y_{pixel}}{x_{pixel}} \right)$$

The transformation into the canonical frame then proceeds as follows:

Pixel distance scaling: $r_{scaledpixel} = r_{pixel} \cdot \dfrac{100}{r_{canonicalscalefactor}}$

Pixel angle rotation: $\theta_{scaledpixel} = (\theta_{pixel} + \theta_{canonicalshift}) \bmod 2\pi$

The equivalent pixel in the canonical frame is then:

$$x_{canonical} = x_{canonicalanchor} + r_{scaledpixel} \begin{pmatrix} \cos\theta_{scaledpixel} \\ \sin\theta_{scaledpixel} \end{pmatrix}$$

C.5 Pixel pull from the canonical frame


A pseudocode description of the pixel pull from the canonical frame is as follows. For all pixels $x_{canonical}$ within the canonical frame:

$$v_{canonical} = (x_{canonical}, y_{canonical}) = x_{canonical} - x_{canonicalanchor}, \qquad r_{canonical} = \left| v_{canonical} \right|, \qquad \theta_{canonical} = \tan^{-1}\!\left( \frac{y_{canonical}}{x_{canonical}} \right)$$

The pixel pull from the original frame then proceeds as follows:

Inverse pixel distance scaling: $r_{invscaledpixel} = r_{canonical} \cdot \dfrac{r_{canonicalscalefactor}}{100}$

Inverse pixel angle rotation: $\theta_{invscaledpixel} = (\theta_{canonical} - \theta_{canonicalshift}) \bmod 2\pi$

The equivalent pixel in the original frame is then:

$$x = c_{hand} + r_{invscaledpixel} \begin{pmatrix} \cos\theta_{invscaledpixel} \\ \sin\theta_{invscaledpixel} \end{pmatrix}$$

If $x \in L^{*}$ then mark the pixel in the canonical frame ($x_{canonical}$) as skin, otherwise mark it as background.

C.6 Creation of jitter maps


A pseudocode description of the process by which the jitter maps are created is as follows.
Each of the $n$ images is defined as a mask (0 for background, 1 for skin) $M_{j=0..n,\,i}$.
Define the number of skin pixels across the set as $n_{skin}$ and the number of background pixels across the set as $n_{background}$.
Define an array to store the variation (or jitter) of each pixel: $V_i$.
For each pixel $i$:
    $n_{skin} = 0$, $n_{background} = 0$
    For each image $j$: if $M_{j,i}$ is skin then increment $n_{skin}$, else increment $n_{background}$
    The variation (0-1) for pixel $i$ is then:
    if $n_{background} < n$ then $V_i = 1 - \left| n_{skin} - n_{background} \right| / n$ else $V_i = -1$
The jitter map can then be generated by colouring each pixel: black if $V_i = -1$; otherwise blue if $V_i = 0$, red if $V_i = 1$, and colours in between.

C.7 Creation of skin concentration maps (SCM)


A pseudocode description of the process by which the skin concentration maps are created is as follows.
Define an array to store the skin concentration of each pixel: $C_i$.
For each pixel $i$:
    $n_{skin} = 0$, $n_{background} = 0$
    For each image $j$: if $M_{j,i}$ is skin then increment $n_{skin}$, else increment $n_{background}$
    The skin concentration (0-1) for pixel $i$ is then:
    if $n_{background} < n$ then $C_i = n_{skin} / n$ else $C_i = -1$
The skin concentration map can then be generated by colouring each pixel: black if $C_i = -1$; otherwise blue if $C_i = 1$, red if $C_i = 0$, and colours in between.

C.8 Creation of skin concentration difference map


A pseudocode description of the process by which the skin concentration difference map is created is as follows.
The two skin concentration maps are stored in the form of an array, $CA_i$ and $CB_i$.
Define an array to store the difference of each pixel: $D_i$.
For each pixel $i$:

$$D_i = \left| CA_i - CB_i \right|$$

The skin concentration difference map can then be generated by colouring each pixel: black if $D_i = 0$, red if $D_i = 1$, and colours in between.

C.9 Creation of quantized skin concentration map (QSCM)


A pseudocode description of the creation of the quantized skin concentration maps (QSCM) is as follows.
Define the upper skin concentration threshold as $t_U$ (say 0.8) and the lower skin concentration threshold as $t_L$ (say 0.2).
Define a quantized map $Q_i$ based upon a skin concentration map $C_i$ using the following rule. For each pixel $i$:

$$Q_i = \begin{cases} 2 & C_i \ge t_U \\ 0 & C_i \le t_L \\ 1 & \text{otherwise} \end{cases}$$

C.10 Comparison of a test gesture mask and set of QSCMs


A pseudocode description of the comparison of a test gesture mask and set of QSCMs is as follows. Given a set of $n$ quantized skin concentration maps $Q_{j=0..n,\,i}$ that have been manually labelled, we can denote a single (gesture, label) pair by $(Q_j, l_j)$, e.g. $(Q_1, 'A')$, $(Q_2, 'B')$. Define this training set as:

$$G = \{(Q_j, l_j)\}_{j=1}^{n}$$

Given a test image with mask $M_i$, calculate the score for each concentration map thus. Define an array of scores $s_j$ where $s_j = 0$ for $j = 0..n$.
For each QSCM $j$:
    For each pixel $i$:

$$s_j = s_j + \begin{cases} +1 & (M_i = 1) \text{ and } (Q_{j,i} = 2) \\ +1 & (M_i = 0) \text{ and } (Q_{j,i} = 0) \\ -1 & (M_i = 1) \text{ and } (Q_{j,i} = 0) \\ -1 & (M_i = 0) \text{ and } (Q_{j,i} = 2) \\ 0 & \text{otherwise} \end{cases}$$

C.11 Scaling the hand using the average radial distance


A pseudocode description of scaling using the average radial distance is as follows. The set of all remaining skin pixel locations after refinement is $L^{*}$. Define the total radius as $r_{tot} = 0$.
For each $x \in L^{*}$:

$$v_{pixel} = (x_{pixel}, y_{pixel}) = x - c_{hand}, \qquad r_{tot} = r_{tot} + \left| v_{pixel} \right|$$

The average radius is then defined as $r_{tot} / \left| L^{*} \right|$.

C.12 Comparison of two examples of a gesture


A pseudocode description of the comparison process between two examples of a single gesture (A and B) is as follows.
Define the number of pixels different as $n_{different}$. Each of the two examples is defined as a mask (0 for background, 1 for skin): $MA$ and $MB$, each with $320 \times 240 = 76{,}800$ pixels.
For each pixel $i$ of $MA$: if $MA_i \ne MB_i$ then increment $n_{different}$.
The maximum difference threshold can be defined as $t_{max}$ (say, 2500 pixels). The two masks are then sufficiently similar for clustering if $n_{different} \le t_{max}$.

C.13 Process by which a set of gesture images is clustered


A pseudocode description of the process by which the set of gesture images is clustered is as follows.
Each of the $n$ gesture images is defined as a mask (0 for background, 1 for skin) $M_{j=0..n,\,i}$. Place each of the masks within an initial set $SInit = \{M_0, M_1, \dots, M_n\}$. Define a set of $m$ exemplars $S_{l=0..m} = \{\{\}_0, \{\}_1, \dots, \{\}_m\}$. Define the minimum number of masks permitted in an exemplar as $t_{min}$ (say four). Set $l = 0$.
Perform the clustering as follows:
For each mask $j = 0$ to $j = (n - 2)$:
    For each mask $k = (j + 1)$ to $k = (n - 1)$:
        If $M_j$ is sufficiently similar to $M_k$ (see algorithm above) then remove $M_k$ from $SInit$ and add it to $S_l$
    If the number of elements in $S_l \ge t_{min}$ then remove $M_j$ from $SInit$, add it to $S_l$ and increment $l$
    Else remove all elements from $S_l$ and replace them in $SInit$

C.14 Finding the pixels that split the set of exemplars


A pseudocode description of the process by which pixels are found to split the set of possible exemplars is as follows.
Define the number of ones across the set as $n_{ones}$. Define a set containing the "twos" exemplar labels, $STwos = \{\}$, and a set containing the "zeros" exemplar labels, $SZeros = \{\}$. Define a set $SPolarised = \{(x, y, \{\}, \{\})\}$ that contains, for each polarised pixel (all zeros and twos): its location, a set containing the twos exemplar labels for that pixel, and a set containing the zeros exemplar labels for that pixel. Given a set of $n$ quantized skin concentration maps $Q_{j=0..n,\,i}$ from before:
For each pixel $i$:
    $n_{ones} = 0$, $STwos = \{\}$, $SZeros = \{\}$
    For each QSCM $j$:
        If $Q_{j,i} = 0$ then add exemplar label $j$ to $SZeros$
        If $Q_{j,i} = 1$ then increment $n_{ones}$
        If $Q_{j,i} = 2$ then add exemplar label $j$ to $STwos$
    If $n_{ones} = 0$ then add the location of pixel $i$, the set $SZeros$ and the set $STwos$ to $SPolarised$
Now take a pixel $k$ of a test mask $M_k$. If pixel $k$ is skin then that suggests the mask is an example of one of the $STwos$ exemplars; if pixel $k$ is not skin then that suggests the mask is an example of one of the $SZeros$ exemplars.

C.15 The compromise between splitting the set into two halves and finding enough pixels to accurately do so
A formal description of this compromise is as follows: SPolarised can be scanned to find the sets of pixels for which: SZeros and STwos are identical or SZeros and STwos are identically opposite (because this pixel split the set in the same way)


A compromise then has to be found between finding a large set of pixels and a set that splits the set as accurately in two as possible (a set for which SZeros and STwos are roughly of the same size). Store the eventual pixels decided upon in set SSplit

C.16 Filling the tree structure


A pseudocode description of filling the tree structure is as follows:
    Define a stack with two procedures, push() to add an element to the top of the stack and pop() to remove the topmost stack element
    Define a pointer p to point to a given tree node
    Take the set S_Polarised and find the best compromise between splitting the set exactly in two and finding sufficient pixels to do so, giving a set of pixels S_Split and two sets of exemplar labels, S_Zeros and S_Twos
    The root of the tree is simply S_Split. The left branch of each node deals with the exemplars within the S_Zeros set and the right branch with the S_Twos set
    First set p to point to the root of the tree
    Filling the tree then proceeds as follows:
    Start:
        Fill node p with S_Split
        If S_Zeros contains more than one exemplar label:
            If S_Twos contains more than one exemplar label:
                Push the S_Twos exemplar labels and the right node of p onto the stack
            Else
                Fill the rightmost node of p with the single exemplar label in S_Twos
            Set p to point to the leftmost node
            Repeat the operation to find the pixels that split the set of exemplars, but with the reduced set of exemplars labelled within S_Zeros
            Goto Start
        Else
            Fill the leftmost node of p with the single exemplar label in S_Zeros
            If S_Twos contains more than one exemplar label:
                Set p to point to the rightmost node
                Repeat the operation to find the pixels that split the set of exemplars, but with the reduced set of exemplars labelled within S_Twos
                Goto Start
            Else
                Fill the rightmost node of p with the single exemplar label in S_Twos
                If the stack contains any elements then
                    Pop an element off the stack and set p to point to the node popped
                    Repeat the operation to find the pixels that split the set of exemplars, but with the reduced set of exemplars labelled within the element popped off the stack
                    Goto Start
                Else
                    Finished!
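A recursive Python sketch of this tree-filling step is given below; the recursion stands in for the explicit stack of the pseudocode, the Node class is a hypothetical structure, the helpers from the previous sketches are reused, and a non-degenerate split is assumed to exist for every subset of exemplars.

class Node:
    """A node of the exemplar decision tree (hypothetical structure)."""
    def __init__(self):
        self.split_pixels = None   # S_Split stored at internal nodes
        self.label = None          # single exemplar label at leaf nodes
        self.left = None           # branch for the S_Zeros exemplars
        self.right = None          # branch for the S_Twos exemplars

def build_tree(qscms, exemplar_labels):
    """Recursively fill the tree; recursion replaces the explicit stack.

    qscms: (n, H, W) array of quantized SCMs, one per exemplar.
    exemplar_labels: list of the exemplar indices still to be separated.
    """
    node = Node()
    if len(exemplar_labels) == 1:
        node.label = exemplar_labels[0]          # leaf: nothing left to split
        return node

    # Find the splitting pixels for the reduced set of exemplars
    # (assumes both sides of the chosen split are non-empty).
    subset = qscms[exemplar_labels]
    polarised = find_polarised_pixels(subset)
    pixels, zeros_idx, twos_idx = choose_split_pixels(polarised)

    node.split_pixels = pixels
    # Map subset indices back to the original exemplar labels.
    node.left = build_tree(qscms, [exemplar_labels[i] for i in sorted(zeros_idx)])
    node.right = build_tree(qscms, [exemplar_labels[i] for i in sorted(twos_idx)])
    return node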

C.17 Template scoring method (with quantization)


A pseudocode description of the template scoring method (as shown before) is as follows:
    Given a test image with mask M_i, calculate the score for each of the n quantized skin concentration maps thus:
    Define an array of scores s_j
    For each template j:
        For each pixel i:
            If (M_i = 1) and (Q_j,i = 2) then s_j = s_j + 1
            Else if (M_i = 0) and (Q_j,i = 0) then s_j = s_j + 1
            Else if (M_i = 1) and (Q_j,i = 0) then s_j = s_j - 1
            Else if (M_i = 0) and (Q_j,i = 2) then s_j = s_j - 1
            (otherwise s_j is unchanged)
            If Q_j,i = 2 then increment s_max,j
    Recognition of the top-scoring gesture is then performed by choosing the label l_jmax where:
        j_max = argmax over j = 1..n of (s_j / s_max,j)
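A NumPy sketch of this scoring rule is shown below; the disagreement cases are taken as -1 penalties as above, the normalisation term counts only the Q = 2 pixels as in the pseudocode, and the function name is an assumption.

import numpy as np

def score_quantized_templates(mask: np.ndarray, qscms: np.ndarray) -> int:
    """Score a test mask against quantized templates and pick the best.

    mask:  (H, W) binary array (1 = skin, 0 = background).
    qscms: (n, H, W) quantized skin concentration maps with values 0, 1, 2.
    Returns the index j_max of the top-scoring template.
    """
    n = qscms.shape[0]
    scores = np.zeros(n)
    for j in range(n):
        q = qscms[j]
        s = (np.count_nonzero((mask == 1) & (q == 2))    # skin agrees with template skin
             + np.count_nonzero((mask == 0) & (q == 0))  # background agrees
             - np.count_nonzero((mask == 1) & (q == 0))  # skin where template says background
             - np.count_nonzero((mask == 0) & (q == 2))) # background where template says skin
        s_max = np.count_nonzero(q == 2)                 # normalisation term s_max,j
        scores[j] = s / s_max if s_max else 0.0
    return int(np.argmax(scores))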

C.18 Template scoring method (with no quantization)


A pseudocode description of the scoring method with no quantization is as follows:
    Given a set of n skin concentration maps (0-1) C_j,i (j = 0..n) which have been manually labelled, we can denote a single (gesture, label) pair by (C_j, l_j), e.g. (C_1, 'A'), (C_2, 'B')
    Define this training set as G = {(C_j, l_j)} for j = 1..n
    Given a test image with mask M_i, calculate the score for each concentration map thus:
    Define an array of scores s_j
    For each SCM j:
        For each pixel i:
            If M_i = 1 then
                s_j = s_j + (C_j,i - 0.5)
            Else
                s_j = s_j - (C_j,i - 0.5)
    Then choose the label l_jmax where j_max = argmax over j = 1..n of s_j
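The unquantized scoring can be vectorised as in the sketch below, assuming the skin concentration maps are stacked into one NumPy array; the names are illustrative.

import numpy as np

def score_concentration_maps(mask: np.ndarray, scms: np.ndarray, labels):
    """Score a test mask against real-valued skin concentration maps.

    mask:   (H, W) binary array (1 = skin, 0 = background).
    scms:   (n, H, W) array of skin concentration maps in the range 0-1.
    labels: sequence of n gesture labels, e.g. ['A', 'B', ...].
    Returns the label of the top-scoring map.
    """
    # Skin pixels add (C - 0.5); background pixels subtract (C - 0.5),
    # so confident agreement raises the score and disagreement lowers it.
    signs = np.where(mask == 1, 1.0, -1.0)
    scores = np.sum(signs * (scms - 0.5), axis=(1, 2))
    return labels[int(np.argmax(scores))]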

C.19 Creation of the pixel identification strings


A pseudocode description of the creation of the pixel identification strings is as follows:
    Define the identification strings as an array of characters ID_j,i
    Define the upper skin concentration threshold as t_U (say 0.8)
    Define the lower skin concentration threshold as t_L (say 0.2)
    Given a set of n skin concentration maps (0-1) C_j,i (j = 0..n)
    Create the identification string as follows:
    For each pixel i:
        For each SCM j:
            ID_j,i = '1' if C_j,i ≥ t_U
            ID_j,i = '0' if C_j,i ≤ t_L
            ID_j,i = 'X' otherwise
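A Python sketch of the identification-string construction follows, assuming the SCMs are stacked into one NumPy array; the per-pixel string layout and the names are assumptions.

import numpy as np

def make_id_strings(scms: np.ndarray, t_u: float = 0.8, t_l: float = 0.2):
    """Build a per-pixel identification string across the SCMs.

    scms: (n, H, W) array of skin concentration maps in the range 0-1.
    Returns an (H, W) array of strings of length n, one character per SCM:
    '1' if C >= t_u, '0' if C <= t_l, 'X' otherwise.
    """
    n, height, width = scms.shape
    ids = np.empty((height, width), dtype=object)
    for y in range(height):
        for x in range(width):
            chars = ['1' if c >= t_u else '0' if c <= t_l else 'X'
                     for c in scms[:, y, x]]
            ids[y, x] = ''.join(chars)
    return ids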

C.20 String comparison process


A pseudocode description of the string comparison process is as follows:
    Define the number of bits different as n_bits
    Define the maximum number of bits different below which two strings are considered equal as t_max (say 2)
    Define the number of Xs in string A as n_XsA
    Define the number of Xs in string B as n_XsB
    Given two identification strings IDA_j and IDB_j, the comparison is as follows:
    For each SCM j:
        If IDA_j = 'X' increment n_XsA
        If IDB_j = 'X' increment n_XsB
        If ((IDA_j = '1') and (IDB_j = '0')) or ((IDA_j = '0') and (IDB_j = '1')) increment n_bits
    If n_bits ≤ t_max then
        If n_XsA ≤ n_XsB then
            Strings are equal; use the pixel corresponding to set A
        Else
            Strings are equal; use the pixel corresponding to set B
    Else
        Strings are not equal, so do not discard either
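The comparison might be sketched in Python as follows; the tie-break assumes the string with fewer uncertain 'X' characters supplies the pixel, as in the pseudocode above, and the return convention is an assumption.

def compare_id_strings(id_a: str, id_b: str, t_max: int = 2):
    """Compare two per-pixel identification strings.

    Returns 'A' or 'B' to indicate which pixel to keep when the strings are
    considered equal (the one with fewer uncertain 'X' characters), or None
    when the strings differ in more than t_max definite bits.
    """
    n_bits = sum(1 for a, b in zip(id_a, id_b)
                 if {a, b} == {'0', '1'})        # only definite disagreements count
    if n_bits > t_max:
        return None                              # strings not equal: keep both pixels
    n_xs_a = id_a.count('X')
    n_xs_b = id_b.count('X')
    return 'A' if n_xs_a <= n_xs_b else 'B'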
