Project Hand Gesture
1 Contents
2 Introduction
   2.1 Report overview
   2.2 Project summary
   2.3 Existing systems
3 Detection
   3.1 Choice of sensors
   3.2 Hardware setup
   3.3 Choice of visual data format
   3.4 Colour calibration
   3.5 Method of colour detection
   3.6 Conclusion
4 Refinement
   4.1 Analysis of distortion
   4.2 Removal of skin pixels detected as wrist band pixels
   4.3 Removal of skin pixels detected from forearm
   4.4 Conclusion
5 Recognition
   5.1 Choice of recognition strategy
   5.2 Selection of test gesture set
   5.3 Analysis of recognition problem
   5.4 Recognition method 1: area metric
   5.5 Recognition method 2: radial length signature
   5.6 Recognition method 3: template matching in the canonical frame
   5.7 Refinement of the canonical frame
   5.8 Refinement of the training data
   5.9 Method of differentiation (in canonical frame)
   5.10 Refinement of template score method (no quantization)
   5.11 Conclusion
6 Gesture driven windows interface
   6.1 Setup
   6.2 Demonstration
7 Conclusion
   7.1 Project goals
   7.2 Further work
8 References
9 Appendix
   9.1 Appendix A - Glossary
   9.2 Appendix B - Entire gesture set
   9.3 Appendix C - Algorithms
2 Introduction
This project will design and build a man-machine interface using a video camera to interpret the American one-handed sign language alphabet and number gestures (plus others for additional keyboard and mouse control). The keyboard and mouse are currently the main interfaces between man and computer. In other areas where 3D information is required, such as computer games, robotics and design, mechanical devices such as roller-balls, joysticks and data-gloves are used. Humans communicate mainly by vision and sound; a man-machine interface would therefore be more intuitive if it made greater use of vision and audio recognition. A further advantage is that the user not only can communicate from a distance, but need have no physical contact with the computer. Unlike audio commands, a visual system would also be usable in noisy environments or in situations where sound would cause a disturbance.

The visual system chosen was the recognition of hand gestures. The amount of computation required to process hand gestures is much greater than that for mechanical devices; however, standard desktop computers are now quick enough to make hand gesture recognition using computer vision a viable proposition. A gesture recognition system could be used in any of the following areas:

Man-machine interface: using hand gestures to control the computer mouse and/or keyboard functions. An example of this, which has been implemented in this project, controls various keyboard and mouse functions using gestures alone.

3D animation: rapid and simple conversion of hand movements into 3D computer space for the purposes of computer animation.

Visualisation: just as objects can be visually examined by rotating them with the hand, so it would be advantageous if virtual 3D objects (displayed on the computer screen) could be manipulated by rotating the hand in space [Bretzner & Lindeberg, 1998].

Computer games: using the hand to interact with computer games would be more natural for many applications.

Control of mechanical systems (such as robotics): using the hand to remotely control a manipulator.
Figure 2 Picture of system in use (note wrist band and neutral coloured background)
The refined shape information will then be compared with a set of predefined training data (in the form of templates) to recognise which gesture is being signed. In particular, the contribution of this project is a novel way of speeding up the comparison process. A label corresponding to the recognised gesture will be displayed on the monitor screen. Figure 1 (front cover) shows the successful recognition of a series of gestures. The design process for the recognition will be discussed in Chapter 5. Chapter 6 describes an application of the system: a gesture-driven windows interface. Finally, Chapter 7 describes how the project has achieved the goals set and further work that could be carried out.
An optical method has been chosen, since this is more practical (many modern computers come with a camera attached), cost effective and has no moving parts, so is less likely to be damaged through use. The first step in any recognition system is collection of relevant data. In this case the raw image information will have to be processed to differentiate the skin of the hand (and various markers) from the background. Chapter 3 deals with this step. Once the data has been collected it is then possible to use prior information about the hand (for example, the fingers are always separated from the wrist by the palm) to refine the data and remove as much noise as possible. This step is important because as the number of gestures to be distinguished increases, the data collected has to be more and more accurate and noise free in order to permit recognition. Chapter 4 deals with this step. The next step will be to take the refined data and determine what gesture it represents. Any recognition system will have to simplify the data to allow calculation in a reasonable amount of time (the target recognition rate for a set of 36 gestures is 25 frames per second). Obvious ways to simplify the data include translating, rotating and scaling the hand so that it is always presented with the same position, orientation and effective hand-camera distance to the recognition system. Chapter 5 deals with this step.
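The three processing stages just outlined can be sketched as a simple pipeline. This is an illustrative sketch only: the function names, the toy dictionary frame representation and the example colour ranges are assumptions, not the project's actual code.

```python
# Illustrative pipeline sketch: a frame is a dict mapping (x, y) -> (r, g, b).
# The colour ranges below stand in for the calibrated ranges of Chapter 3.

SKIN_LO, SKIN_HI = (95, 40, 20), (255, 220, 180)  # assumed example ranges

def detect(frame):
    """Chapter 3: keep pixel locations whose RGB values fall within the ranges."""
    return {loc for loc, rgb in frame.items()
            if all(lo <= c <= hi for lo, c, hi in zip(SKIN_LO, rgb, SKIN_HI))}

def refine(skin_pixels):
    """Chapter 4: remove noise such as forearm pixels (placeholder here)."""
    return skin_pixels

def recognise(skin_pixels):
    """Chapter 5: normalise position/yaw/scale and compare with templates
    (reduced here to a trivial stand-in)."""
    return "open_hand" if skin_pixels else None

def process_frame(frame):
    return recognise(refine(detect(frame)))

print(process_frame({(0, 0): (200, 120, 90)}))  # → open_hand
```

Each stage only consumes the previous stage's output, which is why the chapters below can treat detection, refinement and recognition separately.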
[Table: comparison of existing gesture recognition systems found during research — [Bauer & Hienz, 2000], [Starner, Weaver & Pentland, 1998] and [Bowden & Sarhadi, 2000] — against this project. Columns cover recognition method (e.g. linear approximation to non-linear point distribution models; finite state machine / model matching), gesture set size (10 to 46 gestures; 46 for this project), reported accuracy (91.7% to 99.1%), gesture type (static or dynamic), markers used (markers on a glove; a wrist band for this project) and background (blue screen; general).]
Figure 3 Table showing existing gesture recognition systems found during research.
3 Detection
In order to recognise hand gestures it is first necessary to collect information about the hand from raw data provided by any sensors used. This section deals with the selection of suitable sensors and compares various methods of returning only the data that pertains to the hand.
Figure 4 The effect of self shadowing (A) and cast shadowing (B). The top three images were lit by a single light source situated off to the left. A self-shadowing effect can be seen on all three, especially marked on the right image where the hand is angled away from the source. The bottom three images are more uniformly lit, with little self-shadowing. Cast shadows do not affect the skin for any of the images and therefore should not degrade detection. Note how an increase of illumination in the bottom three images results in a greater contrast between skin and background.
However, since this system is intended to be used by the consumer, it would be a disadvantage if special lighting equipment were required. It was decided to attempt to extract the hand and marker information using standard room lighting (in this case a 100 watt bulb and shade mounted on the ceiling). This would permit the system to be used in a non-specialist environment.

Camera orientation: It is important to choose carefully the direction in which the camera points to permit an easy choice of background. The two realistic options are to point the camera towards a wall or towards the floor (or desktop). Since the lighting was a single overhead bulb, light intensity would be highest and shadowing effects least if the camera was pointed downwards.
Background: In order to maximise differentiation it is important that the colour of the background differs as much as possible from that of the skin. The floor colour in the project room was a dull brown. It was decided that this colour would suffice initially.
Figure 5 Results of detection using individual ranges of hue (left), saturation (centre) and luminosity (right) as well as histograms showing the number of pixels detected for each value of skin (top) and background (bottom). Images and graphs show that hue is a poor variable to use to detect skin as the range of values for skin hue and background hue demonstrate significant overlap (although this may have been due to the choice of hue of the background). Saturation is slightly better and luminosity is the best variable. However, a combination of saturation and luminosity would provide the best skin detection in this case.
The histogram test was repeated using the RGB colour space. The results are shown in Figure 6.
Figure 6 Histograms showing the number of pixels detected for each value of red (left), green (centre) and blue (right) colour components for skin pixels (top) and background pixels (bottom). The ranges for each of the colour components are well separated. This, combined with the fact that using the RGB colour space is considerably quicker than using HSL suggests that RGB is the best colour space to use.
Figure 7 shows recognition using red, green and blue colour ranges in combination:
Figure 7 Skin detection using red, green and blue colour ranges in combination. Detection is adequate and the frame rate is over twice that of the HSL option.
Hue, when compared with saturation and luminosity, is surprisingly bad at skin differentiation (with the chosen background) and thus HSL shows no significant advantage over RGB. Moreover, since conversion of the colour data from RGB to HSL took considerable processor time it was decided to use RGB.
3.4.1 Initial Calibration
The method of skin and marker detection selected above (Section 3.3) involves checking the RGB values of the pixels to see if they fall within red, green and blue ranges (these ranges are different for skin and marker). The choice of how to calculate these ranges is an important one. Not only does the calibration have to result in the detection of all hand and marker pixels at varying light levels, but also the detection of erroneous background pixels has to be reduced to a minimum. In order to automatically calculate the colour ranges, an area of the screen was demarcated for calibration. It was then a simple matter to position the hand or marker (in this case a wrist band) within this area and then scan it to find the maximum and minimum RGB values of the ranges (see Figure 8). A formal description of the initial calibration method is as follows: The image is a 2D array of pixels:
    x = (x, y), with colour components r(x), g(x), b(x)

The calibration area is a set of 2D points:

    A = {x | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max}

The colour ranges can then be defined over this area:

    r_max = max_{x ∈ A} r(x),   r_min = min_{x ∈ A} r(x)
    g_max = max_{x ∈ A} g(x),   g_min = min_{x ∈ A} g(x)
    b_max = max_{x ∈ A} b(x),   b_min = min_{x ∈ A} b(x)

A formal description of skin detection is then as follows. A pixel with colour components (r, g, b) is classified as skin, S(r, g, b) = 1, if:

    ((r ≥ r_min) & (r ≤ r_max)) & ((g ≥ g_min) & (g ≤ g_max)) & ((b ≥ b_min) & (b ≤ b_max))

The set of all skin pixel locations is then:

    L = {x | S(r(x), g(x), b(x)) = 1}

Using this method skin pixels were detected at a rate of 15 fps on a 600 MHz laptop (see Figure 9).
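The calibration and per-frame detection just described can be sketched with NumPy, assuming a frame stored as an H×W×3 RGB array; the function names and the synthetic test frame are illustrative, not the project's code.

```python
import numpy as np

def calibrate(frame, y0, y1, x0, x1):
    """Scan the calibration area and return per-channel (min, max) ranges."""
    area = frame[y0:y1, x0:x1].reshape(-1, 3)
    return area.min(axis=0), area.max(axis=0)

def detect_skin(frame, lo, hi):
    """Boolean mask: True where every colour channel lies within its range."""
    return np.all((frame >= lo) & (frame <= hi), axis=-1)

# Usage on a synthetic 4x4 frame with a uniform "skin" patch:
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[1:3, 1:3] = (200, 120, 90)       # skin-coloured block
lo, hi = calibrate(frame, 1, 3, 1, 3)  # calibrate over the block
mask = detect_skin(frame, lo, hi)
print(mask.sum())  # 4 skin pixels detected
```

The vectorised range test is why this runs at frame rate: it is one comparison per channel per pixel, with no per-pixel Python loop.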
Figure 8 Image A shows the colour calibration areas for wrist band (green) and skin (orange). Calibration is performed by positioning the wrist band under the green calibration area and the hand under the orange calibration area (image B shows a partially positioned hand). The calibration algorithm then reads the colour values of both areas and calculates the ranges by repeatedly updating maximum and minimum RGB values for each pixel. Images C and D show the pixel colour values for the skin and wrist band areas. The colour ranges calculated for each colour component are indicated by double headed arrows.
Figure 9 After calibration of skin and wrist band pixels, colour ranges are used to detect all subsequent frames. A detected frame is shown here with skin pixel detection indicated in white and wrist band pixel detection indicated in red.
It was decided to use the calibration routine, discussed in Section 3.4.1, to find the initial values. However, the ranges returned by this method were less than perfect for the reasons below:

1. The calibration was carried out using a single frame; hence pixel colour variations over time, due to camera noise, would not be accounted for.

2. To ensure that the sampled area only contained skin pixels, by necessity it had to be smaller than the hand itself. The extremities of the hand (where, due to self-shadowing, the colour variation is the greatest) were therefore not included in the calibration.
3.4.2 Improving calibration
In order to improve the calibration, four further methods were considered: 1. Multiple frame calibration: If the calibration was repeated over several frames and the overall maximum and minimum colour values calculated, then the variation over time due to camera noise would be included in those ranges and its effect thus negated. The method would require the hand to be held stationary during the calibration process. The routine was thus modified to perform the calibration over 10 frames instead of one. Figure 10 shows the results.
Figure 10 Results of multiple frame calibration. Stage A is the result of the initial calibration. Stage B is the result of calibration over 10 frames. There is no discernible difference in the skin fit.
Calibration of several frames does little to improve skin detection. Therefore this method was not retained. 2. Region-Growing: A second method would be to query pixels close to the detected skin pixels found using the initial method. If the colour components of these fell just outside the calibration ranges then the ranges could be increased to include them. This process could then be repeated a number of times until the skin detection was adequate. Figure 11 shows how the process works (simplified).
Figure 11 A simplified illustration of how region-growing works. Image A shows the initial captured hand. Image B shows the result of initial calibration; detected pixels are shown in white. For simplicity's sake the pixels that fall within the initial colour ranges have been drawn as a square. In practice, all pixels within the ranges will have been identified (these pixels would be scattered throughout the hand area). Next, any pixels in the neighbourhood of those already detected are scanned (the area within the black box of image C). If their colour values lie just outside the current colour ranges, the ranges are increased to include them. The result is shown in image D (again simplified). Although the pixels between the index and middle fingers fell within the boundary, their values did not fall close to the ranges, so they were ignored. The process is then repeated (images E and F) until, in theory, the ranges are such that all skin pixels are detected.
A program was written to repeat the region-growing process a number of times on a single frame. The results are shown in Figure 12.
Figure 12 Results of region-growing. Stage A is the result of the initial calibration. Stage B is the result of 50 repetitions of the region-growing algorithm (the fit is better still but a single erroneous pixel, circled and arrowed, has been detected in the background). Stage C is the result of 100 repetitions. The background noise is growing even though the shadowed areas of the hand are still not detected adequately. Finally, by Stage D with 200 repetitions there is a considerable amount of background noise.
The results show that performing the region-growing process a small number of times results in slightly better detection, but the process becomes noisy if the number of repetitions is too high (>100). It was decided to keep this method but restrict its growth to a maximum of 50 repetitions.

3. Background subtraction: With this method an image of the background is stored. This information is then subtracted from any subsequent frames. In theory this would negate the background, leaving only the hand and marker information and making the detection process much easier. However, although background subtraction worked well with a black and white system, doing the same with colour proved much more difficult, as a simple subtraction of the colour components made the remaining hand and marker colour information uneven over the frame. This method also made the system considerably slower and was adversely influenced by the automatic aperture adjustment of the camera. As the current system worked adequately it was decided not to proceed with this calibration step.

4. Removal of persistent aberrant pixels: Although it is a valid design choice to select a background that differs greatly in hue from both the skin and wrist band colours, it is possible that imperfections in the background colour or camera function could result in aberrant pixels falling within the calibrated ranges and therefore being repeatedly misinterpreted as skin or wrist band pixels. It would be possible to scan the image when the hand is not in the frame and store any (aberrant) pixels detected. Simply ignoring these pixels would affect the recognition, depending on where the hand was in the frame. It would therefore be necessary to choose the correct value for each aberrant pixel based on the values of those surrounding it (if all surrounding pixels are skin then detect as skin, else background). However, neither the camera nor the background exhibited such pixels when a hand was in frame, so it was decided not to program this calibration step.
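The region-growing range expansion of method 2 might be sketched as follows, assuming a NumPy frame and per-channel ranges. The tolerance `tol` (how far outside the current ranges still counts as "just outside") is an assumed parameter, not a value given in this report.

```python
import numpy as np

def grow_ranges(frame, lo, hi, tol=8, max_reps=50):
    """Repeatedly widen the calibrated ranges to admit pixels that neighbour
    already-detected pixels and whose colours lie just outside the ranges.
    Capped at max_reps repetitions, as decided above."""
    lo, hi = np.asarray(lo, int), np.asarray(hi, int)
    for _ in range(max_reps):
        mask = np.all((frame >= lo) & (frame <= hi), axis=-1)
        # Neighbourhood: dilate the detected mask by one pixel (4-connected).
        nb = mask.copy()
        nb[1:, :] |= mask[:-1, :]; nb[:-1, :] |= mask[1:, :]
        nb[:, 1:] |= mask[:, :-1]; nb[:, :-1] |= mask[:, 1:]
        cand = frame[nb & ~mask].reshape(-1, 3).astype(int)
        # Keep only candidates within `tol` of the current ranges.
        near = np.all((cand >= lo - tol) & (cand <= hi + tol), axis=-1)
        if not near.any():
            break
        lo = np.minimum(lo, cand[near].min(axis=0))
        hi = np.maximum(hi, cand[near].max(axis=0))
    return lo, hi

# Usage: a 1x3 frame; the middle pixel is "just outside" the initial range,
# the right pixel is far outside and is never absorbed.
f = np.array([[[100] * 3, [105] * 3, [200] * 3]], dtype=np.uint8)
lo, hi = grow_ranges(f, [100] * 3, [100] * 3)
print(lo, hi)  # ranges grow to [100..105] per channel
```

The `tol` cut-off is what keeps growth bounded: without it, the ranges would eventually absorb the background, which is exactly the noise behaviour seen above beyond ~100 repetitions.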
Figure 13 Plots of different combinations of skin pixel colour values (green) and background pixel colour values (red). The skin pixels are well separated from the background pixels in all three colour components but lie within an ellipsoid as opposed to a cuboid. The values are well enough separated, however, for a cuboid colour range system to work adequately.
In order to improve accuracy it would be necessary to check if the colour components of the skin and wrist band pixels fell within this ellipsoid. However, this was considered computationally intensive and given that the current cuboid system works adequately it was not implemented.
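Although the ellipsoid test was not implemented, one way it could be done is with a Mahalanobis distance check against the calibration samples. Everything below (function names, the `threshold` parameter, the sample values) is hypothetical.

```python
import numpy as np

def fit_ellipsoid(samples):
    """Fit mean and inverse covariance to calibration samples (N x 3 RGB)."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(samples, rowvar=False))
    return mean, inv_cov

def in_ellipsoid(pixel, mean, inv_cov, threshold=3.0):
    """True if the pixel lies within `threshold` Mahalanobis units of the mean."""
    d = np.asarray(pixel, dtype=float) - mean
    return float(d @ inv_cov @ d) <= threshold ** 2

# Usage with hypothetical calibration samples:
mean, icov = fit_ellipsoid([[100, 50, 40], [110, 55, 42], [95, 60, 35],
                            [105, 48, 44], [98, 52, 39]])
print(in_ellipsoid(mean, mean, icov))           # True: the mean is inside
print(in_ellipsoid([255, 0, 255], mean, icov))  # False: far from the cluster
```

The matrix-vector product per pixel is the computational cost referred to above; the cuboid test needs only six comparisons per pixel.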
3.6 Conclusion
This chapter has described the choice and setup of hardware and the methods of calibration and detection used to detect as many of the skin and marker pixels within the frame as possible. The hardware chosen was a single colour camera pointing down towards a desk (or floor) surface of a constant colour with no special lighting. Calibration is performed by scanning the RGB colour values of pixels within a preset area of the frame, and is improved using a limited amount of region-growing. Detection is performed by comparing each RGB pixel value with the ranges found during calibration. Figure 12 Stage B shows the successful detection of the majority of the hand area.
4 Refinement
Using the methods discussed in the previous chapter it is possible to detect the majority of the skin and band pixels in the frame whilst detecting very few aberrant pixels in the background. However, some complications were noticed which could reduce the accuracy of recognition at a later stage. These are:

1. Image distortion: If the camera's visual axis is not perpendicular to the floor plane, a given gesture would appear different depending on the position and yaw of the hand (a given length in one area of the frame would appear longer or shorter in another area of the frame). This is termed projective distortion. Also, if the camera lens is of poor quality then the straight sides of a true square in the frame would appear curved. This is termed radial distortion.

2. Skin pixels detected as wrist band pixels: If the wrist band colour ranges are increased sufficiently for all band pixels to be detected, then areas of skin that are more reflective (such as the knuckles) start to be incorrectly identified as band pixels. This is disadvantageous as it leads to inaccurate recognition information.

3. Skin pixels of the arm being detected: Any skin pixels above the wrist band will also be detected as skin. It would be preferable if these pixels could be ignored, as they play no part in the gesture. Wearing a long-sleeved top helps solve the problem, but forearm pixels are still detected between the wrist band and the sleeve (which has a tendency to move up and down the arm as different gestures are made, leading to variations in the amount of skin detected).

It was decided to reduce the effects of these complications as much as possible.
4.1.1 Radial distortion
In order to assess whether radial distortion was present, a rectangular piece of card was placed in the frame. It was then a simple matter to check the edges of the frame against the edges of the card (see Figure 14).
Figure 14 A4 card placed in the frame. If the camera had significant radial distortion, the straight edges of the paper would appear as curves. This is not the case so radial distortion is not significant.
The straight sides of the paper are imaged not as curves, but as straight lines, therefore radial distortion is not present.
4.1.2 Projective distortion
To check for projective distortion a strip of paper was placed in the frame at various positions. By measuring its length (in pixels) at each location, any vertical or horizontal distortion could be found (see Figure 15).
Figure 15 Paper strip placed in the frame at different positions (with superimposed lines to aid measurement). From the measured strip lengths it can be seen that there is only a small amount of projective distortion present. Overall, there is only 6% deviation in apparent strip length anywhere in the frame, therefore it was considered unnecessary to correct for projective distortion.
There is slight image distortion present, but its effect is limited to only 6% and it was therefore not considered serious enough to remove (removal would involve transforming the distorted image to a regular rectangle, which would be processor intensive).
4.3.1 Centroid calculation
By averaging the position of the pixels detected it is possible to calculate the centroid of both the hand and the wrist band. A formal description of centroid calculation is as follows. From before, the set of all skin pixel locations was defined as:

    L = {x | S(r(x), g(x), b(x)) = 1}

with the set of wrist band pixel locations L_band defined similarly. Denote the number of elements of L by |L|. This gives the hand centroid as:

    c_hand = (1 / |L|) Σ_{x ∈ L} x

and the wrist band centroid as:

    c_band = (1 / |L_band|) Σ_{x ∈ L_band} x
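The centroid computation amounts to a mean over pixel locations; a minimal sketch (function name illustrative):

```python
import numpy as np

def centroid(pixel_locations):
    """Mean position of a set of detected pixel locations."""
    return np.asarray(pixel_locations, dtype=float).mean(axis=0)

# Usage: the same function gives c_hand from skin pixels and c_band from
# wrist band pixels.
c_hand = centroid([(10, 20), (30, 40), (50, 60)])
print(c_hand)  # [30. 40.]
```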
Figure 16 shows an original image and the image with the detected skin pixels, wrist band pixels and centroids visible.
Figure 16 Original image before skin and wrist band pixel detection (A) and after (B). Detected skin pixels are shown in blue and wrist band pixels in red. Centroids are displayed as black dots.
Notice how even with priority given to skin pixels over wrist band pixels, a number of wrist band pixels are erroneously detected near the knuckles (where skin has not been detected due to the higher reflectivity of those areas).
4.3.2
It was considered that if the distance and angle of the edges of the wrist band relative to the hand centroid could be found, the forearm skin pixels could be removed by comparing their distances and angles with them. The edges of the wrist band can be found by scanning lines parallel to the line joining the two centroids.

Define the vector joining the two centroids as:

    v = c_hand − c_band

The yaw angle of the hand is therefore:

    θ = tan⁻¹(v_y / v_x)

The edges of the band are then found as follows. Let v̂ be the unit vector along v and n̂ the unit vector perpendicular to it. For each point p1(s1) along the line through c_band perpendicular to v:

    p1(s1) = c_band + s1 n̂,   where −50 ≤ s1 ≤ 50

count the number of wrist band pixels n(s1) along the line through p1(s1) parallel to v:

    p2(s2) = p1(s1) + s2 v̂,   where −50 ≤ s2 ≤ 50

The two points defining the edges of the band, b_left = (x_left, y_left) and b_right = (x_right, y_right), are the outermost points p1(s1) for which n(s1) remains above a threshold.
Figure 17 shows a number of the lines scanned (reduced for clarity) along with a graph showing the thresholds used in the program to detect the band edges.
[Graph: number of wrist band pixels detected against distance along the line perpendicular to the line joining the centroids (pixels)]
Figure 17 The left image shows the lines scanned to detect the edges of the wrist band. The number of wrist band pixels detected along each line is counted. The edges have been detected when the number falls below a certain threshold. The graph on the right shows the number of pixels detected along each of the lines with the detected edges marked in red.
Using these thresholds it is then possible to utilize only those wrist band pixels that are within the band's width. This removes any remaining erroneous wrist band pixels detected near the knuckles. The radius of the band is:

    r_band = ½ |b_left − b_right| = ½ √((x_left − x_right)² + (y_left − y_right)²)

Any band pixels further than r_band from c_band can then be disqualified. The wrist band centroid can then be recalculated. Figure 18 shows the wrist band pixels that have passed this radius test and the recalculated centroid (passed pixels shown in yellow, radius indicated by black circle).
Figure 18 Radius test applied to wrist band pixels. Any pixels that are further from the wrist band centroid than the band radius (black circle) previously calculated can be ignored (pixels that pass shown in yellow, those that fail in red)
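The radius test could be sketched as follows, assuming band pixels and band edges as 2D points; the function name and the toy pixel set are illustrative.

```python
import numpy as np

def radius_test(band_pixels, b_left, b_right):
    """Disqualify band pixels further than the band radius from the band
    centroid, then recompute the centroid (the test shown in Figure 18)."""
    pts = np.asarray(band_pixels, dtype=float)
    r_band = 0.5 * np.linalg.norm(np.asarray(b_left, float) - np.asarray(b_right, float))
    c_band = pts.mean(axis=0)
    keep = np.linalg.norm(pts - c_band, axis=1) <= r_band
    return pts[keep], pts[keep].mean(axis=0)

# Usage: four pixels on the band plus one misdetection out near the knuckles.
band = [(0, 0), (2, 0), (-2, 0), (0, 2), (20, 0)]
kept, new_c = radius_test(band, (-10, 0), (10, 0))  # r_band = 10
print(len(kept), new_c)  # 4 pixels kept; centroid recomputed without the outlier
```

Note the outlier skews the first centroid estimate, but as long as it stays outside the band radius the filtered centroid recovers.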
4.3.3
Finally, using the angle and distance from the hand centroid to the wrist band edges it is possible to differentiate the skin pixels of the forearm and remove them. The minimum distance between the hand centroid and the edges of the band is:

    r_hand = min( |b_left − c_hand|, |b_right − c_hand| )

The maximum and minimum angles of the band (θ_band_max and θ_band_min) relative to c_hand = (x_hand, y_hand) are:

    θ_band_max = max( tan⁻¹((y_left − y_hand) / (x_left − x_hand)), tan⁻¹((y_right − y_hand) / (x_right − x_hand)) )
    θ_band_min = min( tan⁻¹((y_left − y_hand) / (x_left − x_hand)), tan⁻¹((y_right − y_hand) / (x_right − x_hand)) )

Any hand pixels further than r_hand from c_hand and at an angle between θ_band_min and θ_band_max relative to c_hand can then be disqualified (a case statement deals with the situation that occurs when the band angles lie either side of 0 radians). Figure 19 shows the angle and distance criterion being applied, with skin pixels that fail highlighted in green.
Figure 19 Distance and angle criterion applied to skin pixels. The two straight black lines show the angle in which the radius criterion is applied. The curved black line shows the radius beyond which skin pixels are disqualified. In this example failed skin pixels are shown in green.
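The distance-and-angle criterion might be sketched as below. This simplified version ignores the 0-radian wrap-around case that the report handles with a case statement; all names are assumptions.

```python
import math

def passes_forearm_test(p, c_hand, r_hand, th_band_min, th_band_max):
    """True if skin pixel p survives: a pixel is disqualified when it is both
    further than r_hand from the hand centroid AND inside the band's angular
    sector. (The 0-radian wrap-around case is not handled in this sketch.)"""
    dx, dy = p[0] - c_hand[0], p[1] - c_hand[1]
    outside_radius = math.hypot(dx, dy) > r_hand
    in_sector = th_band_min <= math.atan2(dy, dx) <= th_band_max
    return not (outside_radius and in_sector)

# Usage: hand centroid at the origin, wrist sector assumed at angles [-0.5, 0.5]:
print(passes_forearm_test((100, 0), (0, 0), 50, -0.5, 0.5))  # False: forearm pixel
print(passes_forearm_test((0, 100), (0, 0), 50, -0.5, 0.5))  # True: outside sector
```

Requiring both conditions is what preserves fingertips: they can exceed r_hand but lie outside the band's angular sector.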
Finally the hand centroid can be recalculated. This is shown in Figure 20.
Figure 20 Image showing recalculated hand and wrist band centroids. Invalid wrist band pixels have been ignored (passed pixels shown in yellow, failed pixels in red) and skin pixels up the forearm have also been ignored.
4.4 Conclusion
This chapter has described several techniques to improve the hand detection. A combination of pixel position and priority based information was used to remove any erroneous detected pixels. Figure 21 shows that the process was very successful.
Figure 21 Detected pixels before and after refinement. The detected wrist band pixels are shown in red. Notice how after refinement the erroneous wrist band pixels detected on the knuckles have been ignored, with a corresponding shift in wrist band centroid. The detected skin pixels are shown in blue. All of the hand pixels are detected except those in areas of higher reflectivity (near the knuckles), which naturally show up as white. Notice how after refinement all skin pixels detected up the forearm have been ignored, with a corresponding shift in hand centroid.
5 Recognition
In the previous two chapters, methods were devised to obtain accurate information about the position of skin and wrist band pixels. This information can then be used to calculate the hand and wrist band centroids with subsequent data pertaining to hand rotation and scaling. The next step is to use all of this information to recognise the gesture within the frame.
occluding the other. This is outside the project remit. However, there is an American one-handed sign language alphabet which, with slight modification, can be used (see Appendix B).
[Chart: comparison of area for test gesture 'c' with pairs of trained letters 'a' through to 'i'; y-axis: area difference]
Figure 22 Comparison of test letter 'c' with pairs of trained examples from 'a' through to 'i'. Although the score is low for the letter c the scores for several of the other gestures is also low. Any of the gestures below the broken line could be misinterpreted as the letter c. This suggests, as predicted, that area is not a good comparison metric to use (although the letters a, e, g and i are well differentiated from c).
As predicted, area is not a good comparison metric as several other trained gestures (b, d and h) also exhibited a similar area to the test letter c.
Figure 23 Example gesture with radials marked. The black radial lengths can easily be measured (length in pixels shown). However, the red radials present a problem in that they either cross between fingers or palm and finger.
However, a problem (as shown in Figure 23) is how to measure when the radial crosses a gap between fingers or between the palm and a finger. To remedy this it was decided to count the total number of skin pixels along a given radial. This is shown in Figure 24.
Figure 24 One of the problem radials with outlined solution. If only the skin pixels along any given radial are counted then the sum is the effective length of that radial. In this case the radial length is 46 + 21 = 67.
All of the radial measurements could then be scaled so that the longest radial was of constant length. By doing this, any alteration in the hand camera distance would not affect the radial length signature generated. See Appendix C Section 2 for a formal description of the radial length calculation.
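The radial length calculation described above can be sketched as follows. This is a minimal illustration rather than the project's implementation; the mask representation, the maximum ray length and the number of radials are all assumptions.

```python
import math

def radial_signature(mask, centroid, num_radials, max_len=200):
    """Compute the radial length signature of a binary skin mask.

    mask[y][x] is True for skin pixels; centroid is (cx, cy).
    Each radial's 'length' is the count of skin pixels stepped over,
    so gaps between fingers do not terminate the measurement.
    """
    cx, cy = centroid
    h, w = len(mask), len(mask[0])
    signature = []
    for i in range(num_radials):
        angle = 2 * math.pi * i / num_radials
        dx, dy = math.cos(angle), math.sin(angle)
        count = 0
        for r in range(max_len):
            x, y = int(cx + r * dx), int(cy + r * dy)
            if 0 <= x < w and 0 <= y < h and mask[y][x]:
                count += 1
        signature.append(count)
    # Scale so the longest radial has unit length: this gives
    # invariance to the hand-to-camera distance.
    longest = max(signature) or 1
    return [c / longest for c in signature]
```

Normalising by the longest radial (rather than the frame size) is what makes the signature independent of hand scale.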
5.5.1
To evaluate this method a program was written to calculate the radial length signature of a given gesture and display it in the form of a histogram. Figure 25 shows the skin count of the radials from 0 to 2 radians for an open hand gesture in several different yaw angles and distances from the camera.
Figure 25 Open hand gesture in several different positions and yaw angles. The histogram for each gesture is largely the same shape but shifted dependent on the yaw of the hand.
The measurement is not affected by hand-to-camera distance. The measurement is affected by the yaw of the hand, but this only shifts the readings to the left or right and does not affect their shape. Figure 26, however, shows that the measurements are considerably different for different gestures.
Figure 26 Images showing the histogram for two different gestures. The two histograms are sufficiently different to permit differentiation.
5.5.2
In order to counter the shifting effect of hand yaw, a wrist marker was used. The angle between the centroid of this marker and the centroid of the hand was then used as the initial
radial direction. This, along with the maximum radial length scaling makes the system robust against changes in hand position, yaw and distance from camera. Figure 27 shows the same open hand gesture (as in Figure 25) in a variety of positions and yaw angles.
Figure 27 The same open hand gesture as before in a variety of different positions and yaw angles, but with hand yaw independence. The histograms for all the gestures are similar so it should be possible to recognise this gesture from a set of different gestures.
The radial measurements are very similar no matter how the hand is positioned.
5.5.3
Now that an invariant signature exists for each gesture it is possible to compare the signature of a test gesture with those of a set of trained gestures. A match score for each trained gesture was then calculated by adding up the differences between corresponding radial lengths. The trained signature with the smallest difference could then be presented as the match. See Appendix C Section 3 for a formal description of the radial signature comparison. A program was written to display an image of the trained gesture with the best score at the top left of the image window. Figure 28 shows the successful recognition of several gestures.
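The signature comparison amounts to a nearest-neighbour search under a sum-of-absolute-differences score. A minimal sketch (the dictionary-of-signatures layout is an assumption):

```python
def match_gesture(test_sig, trained):
    """Return the name of the trained signature with the smallest
    sum of absolute differences to the test signature.

    trained maps gesture name -> signature (same length and radial
    ordering as test_sig, both already yaw- and scale-normalised).
    """
    best_name, best_score = None, float("inf")
    for name, sig in trained.items():
        score = sum(abs(a - b) for a, b in zip(test_sig, sig))
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score
```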
Figure 28 Successful recognition of several different gestures. Gesture recognised is shown at the top left of the frame. The gestures are recognised correctly even though the yaw of the test hand is different from that taught.
5.5.4
During tests it was noticed that the quality of recognition depended on the number of radials used (in the example in Figure 28 only 100 radials were used where previously the number was 200). It was also noticed that most of the significant data was concentrated around the fingers, thus it would be more efficient to group radials in these areas. Figure 29 shows the radials in their original grouping and after reorganisation.
Figure 29 The left image shows 100 radials in their original pattern. However, this pattern does not give the necessary concentration bias towards the fingers. The image on the right shows 200 radials reorganised so that the concentration over the fingers is three times that over the rest of the hand (150 over the fingers and 50 elsewhere).
5.5.5
Using this improved system the sign language letters a through to o were taught to the system. This enabled a very limited sign language word processor to be made (see Figure 30).
Figure 30 Successful implementation of a simple sign language word processor. Clicking a button whilst gesturing in the frame added the highest scoring gesture to the output window.
The graph in Figure 31 shows that the radial length metric is considerably better than the area metric at differentiating this series of gestures. However, c and i have very similar low scores even though the signs are physically different.
[Chart: comparison of radial length signature for test gesture 'c' with trained letters 'a' through to 'i'; the y-axis shows the total of differences in the number of pixels along radials, from 0 to 4000.]
Figure 31 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The score is low for the letter c and high for most of the other gestures. However, one example of the letter i also gets a good comparison score even though the gesture corresponding to the letter i is dissimilar to that of the letter c. Nevertheless, the range of scores is considerably better than that of the area recognition method discussed earlier.
5.5.6
To examine why the scores were so similar for the physically different gestures c and i (see Figure 31), the recognition program was altered so that only a single pixel was displayed along a given radial at a distance proportional to the number of pixels detected (along that radial). This provided a good illustration of the information presented to the recognition process (see Figure 32).
Figure 32 On the left is the original image and on the right is a representation of the data provided by the radial length recognition system. The amount of information provided about individual fingers is dependent on the angle of the radial covering that finger which means that gestures involving the poorly represented fingers will not be well differentiated.
Due to the organisation of the radials, the amount of information provided about individual fingers is dependent on the relative angle of the radial and the long axis of the finger (the shallower the angle the more information is provided). This is obviously an inadequate situation as gestures involving the parts of the hand that are not well covered would be hard to differentiate.
5.5.7
The effects of the problem highlighted in Section 5.5.6 are further illustrated by the recognition statistics in Figure 33, for a considerably larger gesture set involving all the sign language letters and numbers as well as five mouse commands (left click (lc), right click (rc), open hand (op), closed hand (cl) and double click (dc); see Figure 50) and space (sp). The test procedure involved signing all of the gestures as well as transition gestures interleaved between them. For a perfect score the system would not only have to correctly recognise all the gestures but also provide a blank return for the transition gestures. A false positive is where the system returns a gesture label even though the input was a transition gesture. A false negative is where the system returns a blank even though the input was a valid gesture.

Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP R K 3 5 SP Q U I C K SP B R O U T SP F O X E N SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E U SP T H E SP 6 7 8 9 J SP L A Z Y SP D O G OP CL LC RC DC
Figure 33 Results from a test of the radial length recognition method. Several of the test gestures were incorrectly recognised. There were also a number of false positives and two false negatives (the number of false positives and negatives is dependent on a threshold above which a score is considered to have been caused by a valid gesture).
5.6.1
A program was written to perform the transformation. The results are shown in Figure 34.
Figure 34 On the left is the original image and on the right is the image after transformation into the canonical frame. However, after scaling up from the original frame, gaps appear between the pixels which would make the recognition comparison unreliable.
The problem is that scaling up from the original frame to the canonical frame results in gaps between pixels. This would be disadvantageous in recognition as a specific pixel in the trained set may not match up with a corresponding pixel in the test gesture and as such would not score.
5.6.2
A solution to the problem highlighted in Section 5.6.1 would be to change the algorithm from using a pixel push from the original to the canonical to using a pixel pull. With this method the distance and angle between every pixel in the canonical frame and some anchor point (such as the centre of the screen) is calculated. The inverse scaling and angle rotation is then performed and the corresponding pixel in the original frame, relative to the hand centroid, queried. If this pixel is skin then the pixel in the canonical frame is coloured blue. If it is not skin it is coloured black. A disadvantage is that any given pixel in the original frame may be queried several times, reducing efficiency. See Appendix C Section 5 for a pseudocode description of the pixel pull from the canonical frame.
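The pixel pull described above can be sketched as follows. This is an illustrative version only: the frame size, the anchor at the frame centre, and the `is_skin` query function are assumptions, and booleans stand in for the blue/black colouring of the report.

```python
import math

def pull_to_canonical(is_skin, hand_centroid, scale, rotation, size=200):
    """Build the canonical-frame image by a 'pixel pull'.

    For every pixel in the canonical frame, take its offset from the
    frame centre (the anchor point), apply the inverse rotation and
    scaling, and query the corresponding pixel in the original frame
    relative to the hand centroid.  is_skin(x, y) returns True if that
    original pixel is skin.  Returns a size x size boolean grid
    (True = skin, i.e. the blue colouring; False = black).
    """
    cx, cy = hand_centroid
    half = size // 2
    cos_a, sin_a = math.cos(-rotation), math.sin(-rotation)
    canonical = [[False] * size for _ in range(size)]
    for v in range(size):
        for u in range(size):
            du, dv = u - half, v - half
            # Inverse rotation followed by inverse scaling.
            ox = (du * cos_a - dv * sin_a) / scale
            oy = (du * sin_a + dv * cos_a) / scale
            canonical[v][u] = is_skin(int(cx + ox), int(cy + oy))
    return canonical
```

Because every canonical pixel is assigned exactly once, no gaps can appear, at the cost of possibly querying some original pixels several times.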
5.6.3
A program was written to perform the modified transformation. Figure 35 illustrates how a given gesture in two different positions in the original frame looks very similar in the canonical frame. Notice that shadowing still affects the gesture similarity.
Figure 35 The left two images show two different examples of the same gesture at different positions and rotations. The right two images show the corresponding images in the canonical frame. Performing a pixel pull rather than a pixel push means that the problem of gaps between pixels no longer occurs. The two gestures look similar in the canonical frame, most of the differences being caused by shadowing.
5.6.4
The question is now how to compare training data with a test gesture in the canonical frame. Unlike the radial length metric the amount of data to be compared for each gesture is large (>40,000 pixels). Therefore, although it would be possible to directly compare the canonical frame information of a test gesture with all of those trained, this process would be inefficient and slow. It is evident that some pixels are better at differentiating a given set of gestures than others (pixels near the wrist band are likely to be skin for the entirety of the gesture set and those far from it never). It is also the case that some pixels are not reliable in identifying a given gesture (such as pixels near the edge of the hand or those intermittently affected by shadowing). To address this problem a program was written to take a number of example images of a given gesture and compare every pixel over the set. The value for the amount of variation of each pixel was then calculated and displayed by a colour from blue (small amount of variation) to red (large amount of variation). These images were termed jitter maps. See Appendix C Section 6 for a pseudocode description of the creation of these jitter maps.
Figure 36 shows the jitter maps for the one handed sign language letters m, n and l (40 examples of each gesture were used).
Figure 36 Jitter maps for the letters m, n and l respectively (40 examples of each gesture were used). The most variation (most red) occurs near the edges of the hand. Greater influence should therefore be given to the bluer pixels for the purposes of recognition.
As expected, the largest amount of variation occurs near the edges of the hand. Therefore, in the recognition of these gestures, greater weight should be given to the bluer pixels. It would also be advantageous to combine the information given by maps such as those in Figure 36 to find the pixels that best differentiate them. In order to facilitate this a program was first written to create a map where the value of each pixel is dictated by the proportion that the corresponding pixels across the training set were skin. These images were termed skin concentration maps (SCMs). See Appendix C Section 7 for a pseudocode description of the creation of these skin concentration maps. A simple subtraction of the SCMs for two sets of gestures could then be performed to find the pixels that best differentiate the two (the best pixels being those that are mostly background on one set and mostly skin on the other). See Appendix C Section 8 for a pseudocode description of the creation of a skin concentration difference map. Figure 37 shows the skin concentration maps for the letters m and n and the result of the subtraction of the two.
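A skin concentration map and the difference of two such maps can be sketched as follows. The list-of-lists mask representation is an assumption; real implementations would use image arrays.

```python
def skin_concentration_map(examples):
    """Average a set of binary gesture masks into a skin concentration
    map: each value is the proportion of examples in which that pixel
    was skin (1.0 = always skin, 0.0 = always background)."""
    n = len(examples)
    h, w = len(examples[0]), len(examples[0][0])
    return [[sum(ex[y][x] for ex in examples) / n for x in range(w)]
            for y in range(h)]

def scm_difference(scm_a, scm_b):
    """Per-pixel difference of two SCMs.  Values near +1 or -1 mark
    the pixels that best differentiate the two gestures (mostly skin
    in one and mostly background in the other)."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(scm_a, scm_b)]
```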
Figure 37 The top two images are skin concentration maps for the letters m and n respectively. As expected the skin is most concentrated at the centre of the hand (blue areas) and least concentrated near the edges (red areas). The bottom image is the result of an image subtraction of the top two. The best pixel areas to differentiate these two gestures lie just beyond the knuckles of the letter n and in the shadowed area of the letter m (coloured red).
The best pixels to differentiate the letters m and n (coloured red) lie just beyond the knuckles of the letter n and in the shadowed area of the letter m. Both jitter and skin concentration maps are a compact way of representing the large amount of data created during training. However, skin concentration maps proved more useful for the purposes of gesture comparison and so were chosen.
5.6.5
Now that a skin concentration map could be formed for any gesture trained, a method had to be found to compare a test gesture mask with each of them. Fundamentally, a trained and test gesture are a good match if all the areas of skin and background match up. However, a skin concentration map has no skin or background but rather a value between these two limits. Therefore, in order to evaluate this recognition method a program was written to quantize the skin concentration maps so that all areas above a certain threshold were considered skin, all those below a second threshold considered background and all other pixels ignored. A direct skin to skin and background to background comparison then became possible. See
Appendix C Section 9 for a pseudocode description of the creation of the quantized skin concentration maps. Figure 38 shows an example skin concentration map before and after quantization.
Figure 38 An example SCM of the letter e before and after quantization (left and right respectively). Any areas where the skin concentration is above one threshold are considered skin (coloured blue), those where it is below a second threshold are considered background (coloured red), and all other areas are ignored (coloured white).
A score was then calculated by comparing the test gesture mask with each quantized skin concentration map (QSCM). A point was awarded if the test mask skin pixel coincided with a skin pixel of the QSCM and a point subtracted if the test mask skin pixel coincided with a background pixel. Similarly a point was awarded if the test mask background coincided with the background of the QSCM and vice versa. See Appendix C Section 10 for a pseudocode description of the comparison of a test gesture mask and set of QSCMs. Figure 39 shows the comparison of the QSCM for the letter e (Figure 38 right) with example masks of the letters c and e.
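The quantization and the QSCM scoring rule can be sketched together. The threshold values here are illustrative, not the project's, and the three-level encoding (2 = mostly skin, 0 = mostly background, 1 = ignored) is an assumption carried over from Section 5.9.

```python
SKIN, BACKGROUND, IGNORED = 2, 0, 1

def quantize_scm(scm, hot=0.8, cold=0.2):
    """Quantize an SCM: concentrations above the hot threshold count
    as skin, below the cold threshold as background; anything between
    the two is ignored."""
    def q(v):
        if v >= hot:
            return SKIN
        if v <= cold:
            return BACKGROUND
        return IGNORED
    return [[q(v) for v in row] for row in scm]

def qscm_score(test_mask, qscm):
    """Score a binary test mask against a quantized SCM: +1 for each
    skin/skin or background/background agreement, -1 for each
    disagreement; ignored QSCM pixels contribute nothing."""
    score = 0
    for mrow, qrow in zip(test_mask, qscm):
        for is_skin, q in zip(mrow, qrow):
            if q == IGNORED:
                continue
            match = (q == SKIN) == bool(is_skin)
            score += 1 if match else -1
    return score
```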
Figure 39 The comparison of the QSCM for the letter e (Figure 38 right) with example masks of the letters c and e. Areas that achieve positive scores (background to background or skin to skin match) are shown in green and those with negative scores (background to skin or skin to background) are shown in yellow. The mask for the letter e has many more areas of positive score and fewer areas of negative score than the mask for the letter c.
The graph in Figure 40 shows the scores of a test gesture c compared with the QSCMs of gestures from a through to i.
[Chart: comparison of QSCM match score for test gesture 'c' with trained letters 'a' through to 'i'; the y-axis shows the match score, from 0 to 30000.]
Figure 40 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The examples of the letter c achieve the top two comparison scores and none of the others achieve similar scores except the letter d which, although close, is still at least 1,400 points lower. This suggests that the template matching in the canonical frame recognition method is better than both the area and radial length recognition methods.
Both stored examples of the letter c matched the test gesture better than any of the others. Based on the results obtained for the three metrics it was decided to use the template matching in the canonical frame recognition method, as it was the only method that provided sufficient information to differentiate similar gestures reliably and because it was the easiest to adapt to using multiple training examples of each gesture.
5.7.1
With this method the scaling factor is obtained using the average distance from the hand centroid to every skin pixel detected. This is more robust than the hand centroid to wrist band centroid distance scaling factor as it does not involve the use of the wrist band centroid (which is less robust as it is calculated using a smaller number of pixels). See Appendix C Section 11 for a pseudocode description of scaling using the average radial distance.
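The average-radial-distance scaling factor can be sketched as follows; the target distance in the canonical frame is an illustrative value.

```python
import math

def average_radial_scale(skin_pixels, centroid, target=60.0):
    """Compute a scale factor from the average distance between the
    hand centroid and every detected skin pixel.  Averaging over
    thousands of pixels makes this more robust than a factor based on
    the wrist band centroid alone.  'target' is the average radial
    distance the hand should have in the canonical frame."""
    cx, cy = centroid
    total = sum(math.hypot(x - cx, y - cy) for x, y in skin_pixels)
    average = total / len(skin_pixels)
    return target / average
```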
5.7.2
This method translates the hand in the canonical frame based upon simple rules (e.g. shift up until there are at least 40 skin pixels in the uppermost row). Once again this method makes the
canonical frame method more robust as it reduces the reliance on the hand centroid as an anchor point. Several rules were considered, but the one that produced the best results involved shifting the image in the canonical frame to the right until the wrist band was just off the edge of the screen. This was performed by scanning columns of the canonical frame from the right until the number of wrist band pixels detected fell to zero. The positioning in the y-direction was calculated using the hand centroid as before. Figure 41 shows a gesture in the canonical frame before and after translation.
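The x-axis shift can be sketched as below. Rather than shifting and re-scanning iteratively, this version computes the required shift directly from the leftmost wrist band column, which gives the same end position; the mask representation is an assumption.

```python
def x_shift_to_edge(band_mask):
    """Find the horizontal shift (in columns) that pushes the wrist
    band just off the right edge of the canonical frame.

    band_mask[y][x] is True for wrist band pixels.  Shifting right by
    (width - leftmost band column) moves every band column past the
    right edge while keeping the hand as far right as possible.
    """
    width = len(band_mask[0])
    for x in range(width):
        if any(row[x] for row in band_mask):
            return width - x
    return 0  # no wrist band pixels: no shift required
```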
Figure 41 Images showing the canonical frame before (left) and after (right) x-axis shift. The y-axis position of the hand is dictated by the hand centroid as before.
Figure 42 A simplified example of how clustering improves recognition. In this case several examples of each of three valid representations of the letter c have been taught to the system. An example of each of the three representations is shown (column A). The resultant SCM (column B) has a large amount of red area. Any comparison method should give these areas less weight, so this gesture would be at a disadvantage relative to those with less variation. Column C shows the SCMs produced after clustering. The three types of gesture input have been split into three separate SCMs, each with much less red area.
A greedy algorithm was devised to take the first gesture image in the training group and compare it pixel by pixel with all other members of the group. See Appendix C Section 12 for a pseudocode description of the comparison. Any gesture images whose compared difference (in pixels) fell below a set threshold, t_max, were then added to a sub-group and removed from the main group. Once all the gesture images in the main group had been compared, the new first member of the main group was compared with all the remaining images, and so on. A threshold was also set to define the minimum number of gesture images permitted in an exemplar. If the number of images in an exemplar fell below this threshold, the first member of the main group was simply removed entirely, the logic being that if it was so dissimilar from all the rest then it must be an outlier and as such could be safely removed without greatly affecting recognition quality. The process continued until no gesture images remained in the main group. See Appendix C Section 13 for a pseudocode description of the clustering process. Figure 43 shows the result of running the algorithm on sets of 100 gesture images of the sign language letters a through to e. The value of t_max in this case was 2,500 pixels different and a minimum of four gesture images were allowed in an exemplar.
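The greedy clustering can be sketched as follows; this is a simplified illustration, and the pixel-difference helper stands in for the comparison of Appendix C Section 12.

```python
def cluster_gestures(images, t_max, min_size):
    """Greedily cluster training gesture masks into exemplars.

    Take the first image in the group, gather every image whose
    pixel-difference to it is below t_max into a sub-group, and repeat
    on the remainder.  If fewer than min_size images match the seed,
    discard only the seed as an outlier.
    """
    def difference(a, b):
        # Count of pixels that disagree between two binary masks.
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    remaining = list(images)
    exemplars, outliers = [], 0
    while remaining:
        seed = remaining[0]
        group = [im for im in remaining if difference(seed, im) < t_max]
        if len(group) >= min_size:
            exemplars.append(group)
            remaining = [im for im in remaining if difference(seed, im) >= t_max]
        else:
            outliers += 1
            remaining = remaining[1:]  # drop only the seed as an outlier
    return exemplars, outliers
```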
Letter   Number of exemplars   Number of gesture images in each exemplar
a        6                     48, 12, 11, 5, 14, 5 (5 outliers)
b        3                     81, 12, 6 (1 outlier)
c        3                     41, 53, 6 (0 outliers)
d        3                     73, 15, 12 (0 outliers)
e        13                    11, 12, 10, 7, 12, 8, 6, 5, 7, 5, 4, 4, 4 (5 outliers)
Figure 43 Table showing the result of applying the segmentation algorithm to sets of 100 gesture images of the sign language letters a through to e. The gestures with the greatest amount of shadowing are a (due to the fingers resting against the palm) and e (due to the suspended fingers above the palm). Notice also how each of these gestures has five outliers. However, this is only 5% of the total number of gesture images in the set so was not considered too large. The gestures with no shadowing (c and d) are still clustered into more than one exemplar. This is due to the range of positions the fingers can occupy and still present a valid version of this gesture.
All of the training gesture image sets are clustered into at least three exemplars. As expected, the gestures with the largest number of exemplars are those with the most shadowing (letters a and e). Those with no shadowing (c and d) are also clustered into a small number of exemplars as they involve a range of possible finger positions that still present a valid gesture. A problem with clustering the training gesture image sets in this way is that it increases the number of SCMs that need to be compared per frame in order to recognise a test gesture. For instance, with no clustering, a set of 24 gestures would produce 24 SCMs to compare per frame. If clustering produces 10 exemplars per gesture, then the number of SCMs increases to 240, with subsequent decrease in recognition frame rate. The choice of how much clustering to perform is a trade-off between speed (less clustering) and accuracy (more clustering) and should be chosen depending on the application. A compromise between the two was chosen here.
5.9.1
In Section 5.6.4 a method was discussed whereby a series of images of a given gesture can be combined to form a skin concentration map (SCM). By subtracting two SCMs it is possible to score each pixel on how effective it is at differentiating one gesture from the other (see Figure 37). This method cannot be easily extended to more than two gestures. However, if a set of skin concentration maps are quantized into three values, say two for mostly skin, zero for mostly background and one if neither, then the equivalent pixel in each of the maps can be examined and that pixel added to a list if the quantized values over all the maps consisted entirely of twos and zeros. The same pixel of a test gesture can then be queried. If it is skin, then that would suggest that it is one of the gestures with mostly skin in that position, if not, then one with mostly background. See Figure 44 for a simplified example of this process and see Appendix C Section 14 for a pseudocode description.
Figure 44 Simplified example of how the pixels that split the set can be found. The four tables on the left represent skin concentration maps. After quantization, the value of each pixel in the quantized skin concentration map is either 0, 1 or 2. The pixels that are either 0 or 2 across the set can then be found.
Although the process of quantization means that there is no strict guarantee provided by the analysis of each individual pixel, the combined influence of the many pixels in the list provides a better estimate. With the tree method a group of pixels that split the set of exemplars roughly in two is found. The greater the number of pixels the better the accuracy of the decision, so a compromise has to be found between splitting the set into two halves and finding enough pixels to accurately do so. See Appendix C Section 15 for a formal description of this compromise. Once the set is split the two subsets can be stored in the left and right branch of a tree structure. The same process (of finding pixels that split the set in two) can be applied to both subsets. The process continues until all subsets consist of a single gesture. A program was written to perform the quantization and then scan all the pixels from all the QSCMs for those that split the set roughly in two. Priority was given to finding sufficient pixels so if on a given pass insufficient were found then the process was repeated but with less emphasis on splitting the set exactly in two. After each split the location and value of all the qualifying pixels was stored and a node of a tree structure filled. Both reduced sets of gestures were then passed back into the splitting algorithm. The process was repeated until all the bottom nodes of the tree consisted of a single gesture. See Appendix C Section 16 for a pseudocode description of filling the tree structure. Figure 45 shows the output of the algorithm for a set of five gestures from the one handed sign language set (letters l, b, o, n and m).
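The search for pixels that split the set can be sketched as below. The encoding (2 = mostly skin, 0 = mostly background, 1 = neither) follows the text; the balance tolerance parameter is an assumption standing in for the compromise of Appendix C Section 15.

```python
def splitting_pixels(qscms, balance_tolerance):
    """Find pixels usable to split a set of quantized SCMs in two.

    A pixel qualifies if its quantized value is 0 (mostly background)
    or 2 (mostly skin) in *every* QSCM, i.e. never the 'neither' value
    1, and if the number of 2s lies within balance_tolerance of half
    the set, so the split is roughly even.  Returns (x, y, values)
    tuples, where values records which QSCMs are skin at that pixel.
    """
    n = len(qscms)
    h, w = len(qscms[0]), len(qscms[0][0])
    pixels = []
    for y in range(h):
        for x in range(w):
            values = [q[y][x] for q in qscms]
            if 1 in values:
                continue  # pixel is ambiguous in at least one QSCM
            skin_count = sum(1 for v in values if v == 2)
            if abs(skin_count - n / 2) <= balance_tolerance:
                pixels.append((x, y, values))
    return pixels
```

Relaxing `balance_tolerance` on a second pass corresponds to the report's fallback of prioritising a sufficient pixel count over an exactly even split.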
Figure 45 An example of how the tree method works. At each level of the tree the number of skin pixels under the green and yellow masks is counted. If the number under the green mask is larger than that under the yellow mask the green branch is chosen. Alternatively the yellow branch is chosen. The process is repeated until the bottom of the tree is reached.
The advantage of this system is that after the tree structure is filled, only a small number of pixels need be analysed before the descent to the next tree level. As, at each stage, the number of possible exemplars is split roughly in two, this method is very quick to execute. The disadvantage is that at the levels of the tree near the root, when the number of exemplars is large, the number of pixels that split the set (even to split off a single exemplar) is very small. During testing it was found that for a set of just 16 exemplars only 200 pixels could be found to split off a single exemplar at the first level of the tree, greatly increasing the possibility of error at this level. Another problem is that the tree can only be traversed downwards: once it is decided to travel down one side of the tree, the exemplars represented on the other side cannot be compared, even if they would provide a better match at a later stage. For example, if the probability of correct branch traversal at each node is 0.98 (a 2% probability of failure) and the tree has 10 levels (all of which must be traversed correctly), then the probability of success at the bottom is 0.98^10 ≈ 0.82 (a failure probability of 18%). This was reflected in the fact that, for a set of more than eight different exemplars, the correct one was rarely recognised.
5.9.2
With this method the quantization of the SCMs is performed as with the previous method. In order to recognise the test gesture a score is calculated for each QSCM by looking at each pixel in turn. Every pixel is scored as follows (see Appendix C Section 17 for a pseudocode description):
- If the test gesture pixel is skin, a point is awarded to each QSCM in which that pixel is mostly skin.
- If the test gesture pixel is skin, a point is subtracted from each QSCM in which that pixel is mostly background.
- If the test gesture pixel is background, a point is awarded to each QSCM in which that pixel is mostly background.
- If the test gesture pixel is background, a point is subtracted from each QSCM in which that pixel is mostly skin.
- Otherwise no change is performed.
The final score for each QSCM can then be calculated by dividing the total score by the maximum score possible (equal to the number of pixels over the template which are either mostly skin or mostly background). An advantage of this system is that each exemplar is judged separately, so unlike the tree method errors do not accumulate. A disadvantage is that a very large number of pixels have to be examined for each of the QSCMs for a match to be made. Also, if a given training gesture has a large amount of variation then there will be a large number of pixels in the QSCM which are neither mostly skin nor mostly background (equivalent to a large amount of white area in Figure 38 right), leaving large areas where no score can be awarded and increasing the possibility that two exemplars will be difficult to differentiate. To test the system, the same training and test gesture sets that were used with the radial length metric were fed to the system. Figure 46 shows the results:
Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP 1 2 3 4 5 SP Q U I C K SP B U O W N SP F O X E S SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Figure 46 Results from a test of the template score method with quantization. All but one of the test gestures was correctly identified and there were no false negatives. However, there were a considerable number of false positives. This is due to the fact that the recognition score for a couple of the gestures was low even though the correct gesture obtained the highest score. This meant that the recognition threshold had to be set low and as such a number of intermediary frames were incorrectly recognised as gestures.
5.9.3
With this method no quantization is performed. Instead, the amount of skin present over the set of images within the exemplar is represented for each pixel by a floating point number between -0.5 and 0.5 (-0.5 representing all background over the set and 0.5 representing all skin). The score is then calculated as follows (see Appendix C Section 18 for a pseudocode description):
- Add this floating point number when the corresponding test gesture pixel is skin.
- Subtract this floating point number when the corresponding test gesture pixel is background.
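The unquantized template score can be sketched in a few lines; the mask and SCM representations are assumptions.

```python
def template_score(test_mask, scm):
    """Score a binary test mask against an unquantized SCM whose
    values run from -0.5 (always background over the training set) to
    +0.5 (always skin).  The value is added where the test pixel is
    skin and subtracted where it is background, so high-variation
    pixels (values near zero) barely influence the score."""
    score = 0.0
    for mrow, srow in zip(test_mask, scm):
        for is_skin, v in zip(mrow, srow):
            score += v if is_skin else -v
    return score
```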
Pixels that have a large amount of variation do not affect the score by a significant amount as their value is close to zero. The advantage of this method is that no pixels are ignored, so even exemplars with a large amount of gesture image variation are fully considered. A disadvantage is that many pixels have to be considered for each SCM (as with the quantization method). This method
will also be slower than the previous method as many floating point calculations have to be performed (rather than integer ones). Once again the system was tested using the same gesture sets as before. The results are shown in Figure 47:

Gesture:    T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P
Recognised: T H E SP 1 2 3 4 5 SP Q U I C K SP B R O W N SP F O X E S SP J U M P

Gesture:    E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Recognised: E D SP O V E R SP T H E SP 6 7 8 9 0 SP L A Z Y SP D O G S OP CL LC RC DC
Figure 47 Results from a test of the template score method with no quantization. All of the test gestures were correctly identified this time and once again there were no false negatives. There were a considerable number of false positives. This is for the same reason as with the previous figure.
Looking at the results of each of the recognition methods, it was clear that the method with the best recognition score was the template score method with no quantization. This method was therefore chosen.
Figure 48 Images showing the pixels queried in order to detect one of the exemplars for the letter a before and after removal of duplicates. The duplicate pixels are mostly evenly spread over the recognition area. Notice how the pixels near the wrist band are less concentrated, as the pixels in this area are skin for almost all the trained gestures.
The results show that, after removal of duplicate pixels, the remaining pixels are evenly spread over the recognition area except near the wrist band, where a larger number of duplicates exist. This is because most of the pixels near the wrist band are skin for all of the trained gestures.
[Chart for Figure 49: each test gesture (THE 12345 QUICK BROWN FOXES JUMPED OVER THE 67890 LAZY DOGS, followed by OP, CL, LC, RC, DC) plotted against the gestures recognised.]
Figure 49 Results of a test of the template score method with no quantization after sorting and removal of duplicate pixels. All of the test gestures were correctly identified and there were no false negatives. There were a considerable number of false positives, for the same reason as before.
5.11 Conclusion
In this section, three methods of recognition have been discussed. Firstly, area comparison was considered. Although this was an unsuitable metric, it was used in order to focus attention on the comparison architecture of any future system and on the testing methodology. The second method involved the comparison of radial length signatures. This was more suitable, but it was found that the amount of information provided about individual fingers depended on the relative angle of the radial and the long axis of the finger, making some gestures hard to differentiate. Finally, template matching in the canonical frame was considered and chosen, as it provided the best results. Various refinements were then made to increase recognition speed. Using the chosen methods, a set of 42 gestures were all correctly recognised at a frame rate of 12.5 fps.
As a demonstration of the capabilities of the system, a standard Microsoft Windows computer was modified so that the only input device necessary was the hand.
6.1 Setup
The system was set up as in Figure 2. The template score (with no quantization) recognition method was modified so that each recognised gesture generated mouse and keyboard events, as shown in Figure 50.

Gesture Label  Event          Gesture Label  Event
A              Press key A    X              Press key X
B              Press key B    Y              Press key Y
C              Press key C    Z              Press key Z
D              Press key D    0              Press key 0
E              Press key E    1              Press key 1
F              Press key F    2              Press key 2
G              Press key G    3              Press key 3
H              Press key H    4              Press key 4
I              Press key I    5              Press key 5
J              Press key J    6              Press key 6
K              Press key K    7              Press key 7
L              Press key L    8              Press key 8
M              Press key M    9              Press key 9
N              Press key N    CA             Press caps-lock key
O              Press key O    RE             Press return key
P              Press key P    DO             Press key .
Q              Press key Q    SP             Press spacebar
R              Press key R    BS             Press backspace key
S              Press key S    LC             Left mouse click
T              Press key T    RC             Right mouse click
U              Press key U    DC             Left double mouse click
V              Press key V    OP             Move mouse pointer relative to hand centroid position
W              Press key W    CL             Left mouse button hold and move mouse pointer relative to hand centroid position

Figure 50 Table showing the gesture labels and corresponding mouse or keyboard event.
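The label-to-event dispatch of Figure 50 can be sketched as a lookup table (a minimal sketch; the event strings are illustrative stand-ins for the real Windows keyboard and mouse events the system posted):

```python
# Sketch of the gesture-label-to-event dispatch in Figure 50. Event names
# are hypothetical placeholders, not the project's actual event API.

KEY_LABELS = {c: f"press {c}" for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"}
SPECIAL = {
    "CA": "press caps-lock", "RE": "press return", "DO": "press .",
    "SP": "press spacebar", "BS": "press backspace",
    "LC": "left click", "RC": "right click", "DC": "double click",
    "OP": "move pointer to hand centroid", "CL": "drag with left button held",
}

def event_for(label):
    """Map a recognised gesture label to its mouse or keyboard event."""
    return KEY_LABELS.get(label) or SPECIAL.get(label)

print(event_for("A"))   # press A
print(event_for("SP"))  # press spacebar
```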
In order to ignore transition movements of the hand, an event was only queued if five identical contiguous gestures were recognised. Thereafter, further events were only processed if the gesture changed (therefore, to type two identical letters a brief gesture change would need to be interleaved).
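The transition filter above can be sketched as follows (a minimal sketch under assumed names; five identical contiguous recognitions fire one event, and a gesture change is required before the same run can fire again):

```python
# Sketch of the transition filter: an event fires only after five identical
# contiguous recognised frames, and at most once per run of that gesture.

class GestureDebouncer:
    def __init__(self, required=5):
        self.required = required
        self.current = None   # gesture of the current run
        self.count = 0        # length of the current run
        self.done = False     # event already fired for this run

    def update(self, gesture):
        """Feed one recognised frame; return the gesture when an event fires."""
        if gesture != self.current:
            self.current, self.count, self.done = gesture, 0, False
        self.count += 1
        if self.count >= self.required and not self.done:
            self.done = True
            return gesture
        return None

d = GestureDebouncer()
print([d.update("A") for _ in range(6)])  # fires once, on the fifth frame
```

To type a double letter, a brief run of any other gesture resets the filter, matching the behaviour described above.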
6.2 Demonstration
To demonstrate the system in use, the following sequence of actions was performed using the hand alone: The Explorer icon on the task bar was clicked in order to restore it.
The floppy drive was selected. A right click brought up a menu and a new text document was created.
A right click brought up a menu and a new folder was created. This folder was renamed demo folder.
The folder was double clicked to open it. The text document was then double clicked to edit it.
The following text was then typed into the document: This is a demo of my 4th year project. I CAN TURN CAPS LOCK ON and off. I can also use the space and backspace keys. Finally I can control the mouse. ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890. The document was then closed and the changes saved.
Finally, the folder was closed and dragged to the top left of the directory window.
During the demonstration six letter errors were made, two of which were due to operator error. An AVI movie file of a similar sequence is available at: http://users.ox.ac.uk/~ball0622/index_files/demo.avi (Recognition frame rate in the video example is slightly reduced due to the effect of the screen capture software.)
7 Conclusion
of the final system developed. However, if detection of hand gestures for computer animation is required (for instance), then the number of trained gestures would need to be in the thousands. A system which relies on both training and comparison of all gestures used would not be sufficient for this task. Further work, therefore, could involve the implementation of a gesture recognition system which does not require training. An example of this is the direct method based on hand geometry considered in Section 5.1.

Multi-stage gestures: It would be possible to represent a much larger number of labels if each label consisted of two or more gestures combined with hand position changes. For instance, the wave hello label could correspond to the open hand gesture with an alternating increase and decrease of hand yaw angle, and the thumbs-up label could correspond to the letter m followed by the space gesture.

Two-handed sign language: It would be possible, using two different coloured gloves and two different coloured wrist bands, to detect the gesture signed by both hands whilst both are in the frame. A method would have to be devised to detect a gesture (or range of gestures) that is represented by a partially occluded hand. This method would be considerably harder to implement. It is important to note, however, that although the gesture of both hands could be recognised, this would not permit the recognition of full American Sign Language, as this involves recognising many other features including facial expression and arm position.
9 Appendix
9.1 Appendix A - Glossary
Hand roll The rotation of the hand about an axis defined by the wrist. The following three images show the same gesture with increasing roll.
Hand yaw
The rotation of the hand about an axis defined by the camera view direction. The following three images show the same gesture with increasing yaw.
Silhouette information
Detection of all skin within the hand without any feature detection (the same information that would be contained in a silhouette of the hand).

HSL colour space
Colour space defined by hue, saturation and luminosity. Also called HSV (hue, saturation and value).

Jitter map
A map created using a number of examples of the same gesture. The colour of each pixel in the map is defined by the amount of variation exhibited by the corresponding pixel across all of the examples (the greatest variation is where the pixel is skin for half of the examples and background for the other half).

Skin concentration map
A map created using a number of examples of the same gesture. The colour of each pixel in the map is defined by the amount the corresponding pixel across all of the examples was skin (the greatest skin concentration is where the pixel is skin for all of the examples).
[Images showing the special gestures: DO, RE, BS, SP, CA, LC, RC, DC, OP, CL.]
The area signature of a gesture is the number of skin pixel locations:

a = Σ_{x∈L} 1

A training sequence of n gestures can then be given and manually labelled. We denote a single (gesture, label) pair by (a_i, l_i), e.g. (a_1, 'A'), (a_2, 'B'). Define this training set as:

G = {(a_i, l_i)}, i = 1..n

Given a test image with signature a_new, choose the label l_imin where

i_min = arg min_i ||a_new - a_i||^2
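The nearest-neighbour labelling rule can be sketched as follows (a minimal sketch; signatures are assumed to be numeric vectors, which also covers the scalar area case):

```python
# Minimal nearest-neighbour labelling over a training set G = {(a_i, l_i)}.
# Each signature is assumed to be a list of numbers.

def classify(train, new_sig):
    """Return the label of the training signature closest to new_sig."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_label, _ = min(((label, sqdist(sig, new_sig)) for sig, label in train),
                        key=lambda pair: pair[1])
    return best_label

# Hypothetical scalar area signatures for three trained gestures.
train = [([10.0], "A"), ([25.0], "B"), ([40.0], "C")]
print(classify(train, [27.0]))  # B
```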
radscore(θ) = Σ_{x∈R_θ} S(x)

where R_θ is the set of pixel locations along the radial at angle θ and S(x) is 1 for a skin pixel and 0 for a background pixel.
We denote a single (signature, label) pair by (g_i, l_i). Define this training set as:

G = {(g_i, l_i)}, i = 1..n

Given a test image with signature g_new, choose the label l_imin where

i_min = arg min_i ||g_new - g_i||^2
The radius scaling factor and angle shift to be used in canonicalisation can then be defined as

canonicalshift = tan^-1(y_dif / x_dif)

Define the anchor of the canonical frame as x_canonicalanchor, say (160, 120). The set of all remaining skin pixel locations after refinement is L.
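The angle shift can be sketched as below (a minimal sketch; the assumption here is that (x_dif, y_dif) is the vector between the two reference points used for canonicalisation). `math.atan2` is preferable to a raw `tan^-1(y/x)` because it handles x_dif = 0 and resolves the quadrant:

```python
# Sketch of the canonicalisation angle shift using atan2 rather than a
# plain arctangent, so x_dif = 0 and negative quadrants are handled.
import math

def canonical_shift(x_dif, y_dif):
    return math.atan2(y_dif, x_dif)

print(math.degrees(canonical_shift(1.0, 1.0)))
```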
For each x ∈ L, the transformation into the canonical frame is defined by:

Pixel distance scaling: r_scaledpixel = r_pixel x canonicalscale

Pixel angle rotation: θ_scaledpixel = θ_pixel + canonicalshift

In practice the canonical frame is filled by working backwards from each canonical pixel (with radius r_canonical and angle θ_canonical about the anchor):

Inverse pixel distance scaling: r_invscaledpixel = r_canonical / canonicalscale

Inverse pixel angle rotation: θ_invscaledpixel = θ_canonical - canonicalshift

x = c_hand + r_invscaledpixel (cos θ_invscaledpixel, sin θ_invscaledpixel)

If x ∈ L then mark the pixel in the canonical frame (x_canonical) as skin, otherwise mark it as background.
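The inverse-mapping fill can be sketched as follows (a minimal sketch under assumed names: `scale` and `shift` stand for canonicalscale and canonicalshift, `c_hand` for the hand centroid, and L for the set of refined skin pixel locations):

```python
# Sketch of filling the canonical frame by inverse mapping: each canonical
# pixel is mapped back into the source image and tested against the skin
# set L. Names are assumptions based on the description above.
import math

def fill_canonical(L, c_hand, scale, shift, anchor, size):
    """Return the set of canonical-frame pixels marked as skin."""
    ax, ay = anchor
    skin = set()
    for cx in range(size[0]):
        for cy in range(size[1]):
            r = math.hypot(cx - ax, cy - ay) / scale      # inverse distance scaling
            theta = math.atan2(cy - ay, cx - ax) - shift  # inverse angle rotation
            x = (round(c_hand[0] + r * math.cos(theta)),
                 round(c_hand[1] + r * math.sin(theta)))
            if x in L:                                    # skin test in source image
                skin.add((cx, cy))
    return skin
```

Working backwards like this guarantees every canonical pixel gets exactly one value, avoiding the holes a forward mapping would leave.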
For each pixel i, define counters n_skin and n_background:

n_skin = 0, n_background = 0

For each image j: if M_{j,i} is skin then increment n_skin, else increment n_background.

The variation (0-1) for pixel i is then:

If n_background < n then V_i = |n_skin - n_background| / n, else V_i = -1

The jitter map can then be generated by colouring each pixel: black if V_i = -1, else blue if V_i = 0, red if V_i = 1, and colours in between.
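The per-pixel variation value can be sketched as follows (a minimal sketch, assuming masks[j][i] is 1 where image j has skin at pixel i):

```python
# Sketch of the per-pixel variation value behind the jitter map.
# 0 = maximal variation (skin in half the images), 1 = perfectly stable,
# -1 = background in every image (coloured black in the map).

def variation(masks, i):
    n = len(masks)
    n_skin = sum(m[i] for m in masks)
    n_background = n - n_skin
    if n_background < n:
        return abs(n_skin - n_background) / n
    return -1

masks = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
print([variation(masks, i) for i in range(3)])  # [1.0, 0.0, -1]
```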
n_skin = 0, n_background = 0

For each image j: if M_{j,i} is skin then increment n_skin, else increment n_background.

The skin concentration (0-1) for pixel i is then:

If n_background < n then C_i = n_skin / n, else C_i = -1

The skin concentration map can then be generated by colouring each pixel. The quantized value for each pixel is:

Q_i = 2 if C_i ≥ t_U
Q_i = 0 if C_i ≤ t_L
Q_i = 1 otherwise
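The quantization step can be sketched as below (a minimal sketch; the threshold values t_L and t_U are illustrative assumptions):

```python
# Sketch of quantizing a skin-concentration value with the thresholds
# t_L and t_U. The defaults here are illustrative, not the project's values.

def quantize(c, t_lower=0.25, t_upper=0.75):
    if c >= t_upper:
        return 2   # reliably skin
    if c <= t_lower:
        return 0   # reliably background
    return 1       # too variable: ignored during scoring

print([quantize(c) for c in (0.9, 0.1, 0.5)])  # [2, 0, 1]
```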
We denote a single (quantized map, label) pair by (Q_j, l_j), e.g. (Q_1, 'A'), (Q_2, 'B'). Define this training set as:

G = {(Q_j, l_j)}, j = 1..n

Given a test image with mask M, calculate the score for each quantized concentration map thus:

Define an array of scores s_j where s_j = 0 for j = 0..n
For each QSCM j:
  For each pixel i:
    If M_i = 1 and Q_{j,i} = 2 then increment s_j
    If M_i = 0 and Q_{j,i} = 0 then increment s_j
    If M_i = 1 and Q_{j,i} = 0 then decrement s_j
    If M_i = 0 and Q_{j,i} = 2 then decrement s_j
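The quantized scoring loop can be sketched as follows (a minimal sketch; the reward/penalty pairing is taken from the surrounding description, with Q = 1 pixels skipped):

```python
# Sketch of the quantized template score: agreement with a confident
# exemplar pixel (Q = 2 skin, Q = 0 background) raises the score,
# disagreement lowers it, and high-variation pixels (Q = 1) are skipped.

def qscm_score(q_map, test_mask):
    score = 0
    for q, is_skin in zip(q_map, test_mask):
        if q == 2:
            score += 1 if is_skin else -1
        elif q == 0:
            score += -1 if is_skin else 1
        # q == 1: high-variation pixel, ignored
    return score

print(qscm_score([2, 0, 1, 2], [1, 0, 1, 0]))  # +1 +1 +0 -1 = 1
```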
Define the minimum number of masks permitted in an exemplar as t_min (say four). Perform the clustering as follows:

l = 0
For each mask j = 0 to j = (n - 2):
  For each mask k = (j + 1) to k = (n - 1):
    If M_j is sufficiently similar to M_k (see algorithm above) then
      Remove M_k from S_Init and add to S_l
  If the number of elements in S_l ≥ t_min then
    Remove M_j from S_Init and add to S_l
    Increment l
  Else
    Remove all elements from S_l and replace them in S_Init
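The clustering loop can be sketched as below (a minimal sketch; the similarity test here is an assumption, treating two masks as "sufficiently similar" when they differ in at most `tol` pixels):

```python
# Sketch of the exemplar clustering: gather masks similar to a seed mask,
# keep the group as an exemplar only if it reaches t_min members,
# otherwise return the members to the unclustered pool.

def cluster(masks, t_min=2, tol=1):
    remaining = list(range(len(masks)))   # indices still unclustered
    clusters = []
    j_idx = 0
    while j_idx < len(remaining):
        j = remaining[j_idx]
        similar = [k for k in remaining[j_idx + 1:]
                   if sum(a != b for a, b in zip(masks[j], masks[k])) <= tol]
        if len(similar) + 1 >= t_min:     # big enough: form an exemplar
            clusters.append([j] + similar)
            remaining = [r for r in remaining if r != j and r not in similar]
        else:                             # too small: leave members in the pool
            j_idx += 1
    return clusters, remaining
```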
S_Twos = { }, S_Zeros = { }

If n_ones = 0 then add the location of pixel i, the set S_Zeros and the set S_Twos to S_Polarised.

Now take a pixel k of a test mask M_k:
If pixel k is skin then that suggests the mask is an example of one of the S_Twos exemplars.
If pixel k is not skin then that suggests the mask is an example of one of the S_Zeros exemplars.
C.15 The compromise between splitting the set into two halves and finding enough pixels to accurately do so
A formal description of this compromise is as follows: SPolarised can be scanned to find the sets of pixels for which: SZeros and STwos are identical or SZeros and STwos are identically opposite (because this pixel split the set in the same way)
A compromise then has to be found between finding a large set of pixels and a set that splits the set as accurately in two as possible (a set for which SZeros and STwos are roughly of the same size). Store the eventual pixels decided upon in set SSplit
The split pixels are then scored in the same way as before: s_j is incremented when the test pixel agrees with the exemplar (skin where Q_{j,i} = 2, background where Q_{j,i} = 0) and decremented when it disagrees.

Recognition of the top scoring gesture is then performed by choosing the label l_jmax where

j_max = arg max_j s_j
We denote a single (concentration map, label) pair by (C_j, l_j). Define this training set as:

G = {(C_j, l_j)}, j = 1..n

Given a test image with mask M, calculate the score for each concentration map thus:

Define an array of scores s_j where s_j = 0 for j = 0..n
For each SCM j:
  For each pixel i:
    If M_i = 1 then s_j = s_j + (C_{j,i} - 0.5)
    Else s_j = s_j - (C_{j,i} - 0.5)

Then choose the label l_jmax where

j_max = arg max_j s_j
Denote the ID string for pixel i of exemplar j as ID_{j,i}. To compare the ID strings IDA_j and IDB_j of two pixels, count the number of differing bits n_bits:

If n_bits ≤ t_max then
  The strings are equal: if IDA_j ≥ IDB_j, use the pixel corresponding to set A,
  else use the pixel corresponding to set B
Else
  The strings are not equal, so do not discard either
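The duplicate test on ID strings can be sketched as follows (a minimal sketch, assuming each ID is a bit string and two pixels count as duplicates when their IDs differ in at most t_max positions):

```python
# Sketch of the duplicate-pixel test: count differing bits between two
# ID strings and compare against the tolerance t_max.

def is_duplicate(id_a, id_b, t_max=0):
    n_bits = sum(a != b for a, b in zip(id_a, id_b))
    return n_bits <= t_max

print(is_duplicate("0110", "0110"))           # True
print(is_duplicate("0110", "0111"))           # False (1 differing bit)
print(is_duplicate("0110", "0111", t_max=1))  # True
```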