
GESTURE VIRTUAL MOUSE

USING AI

A PROJECT REPORT

Submitted by

MADHAN KUMAR M ( 714018104034 )

NIVEDHA M ( 714018104044 )
NANDHINI RS ( 714018104042 )

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

COMPUTER SCIENCE AND

ENGINEERING

SRI SHAKTHI INSTITUTE OF ENGINEERING

AND TECHNOLOGY (AUTONOMOUS),

COIMBATORE 641062

Autonomous Institution, Accredited by NAAC with “A” Grade

ANNA UNIVERSITY : CHENNAI 600025

APRIL 2022
i
ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this Report titled “GESTURE VIRTUAL MOUSE USING


AI” is the bonafide work of “MADHAN KUMAR.M (714018104034),
NIVEDHA.M (714018104044 ) AND NANDHINI RS (714018104042 )”, who
carried out the work under my supervision. Certified further that to the best of my
knowledge the work reported herein does not form part of any other thesis or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.

SIGNATURE SIGNATURE

Dr.K.E.Kannammal Mrs Kavya.S.P

HEAD OF THE DEPARTMENT SUPERVISOR

Professor and Head, Assistant Professor,

Department of CSE, Department of CSE,

Sri Shakthi Institute of Engineering Sri Shakthi Institute of Engineering

and Technology, and Technology,


Coimbatore- 641 062. Coimbatore- 641 062.

Submitted for the project work viva voce Examination held on…………….

INTERNAL EXAMINER EXTERNAL EXAMINER

ii
ACKNOWLEDGEMENT

First and foremost, we would like to thank God Almighty for giving us the strength. Without
His blessings, this achievement would not have been possible.
We express our deepest gratitude to our Chairman Dr.S.Thangavelu for his
continuous encouragement and support throughout our course of study.
We are thankful to our Secretary Er.T.Dheepan for his unwavering support during
the entire course of this project work.
We are also thankful to our Joint Secretary Mr.T.Sheelan for his support during the
entire course of this project work.
We are highly indebted to Principal Dr.A.R.Ravi Kumar and Director
Dr.Poornachandra S, for their support during the tenure of the project.
We are deeply indebted to our Head of the Department, Computer Science and
Engineering, Dr.K.E.Kannammal, for providing us with the necessary facilities.
It’s a great pleasure to thank our Project Guide and Coordinator
Mrs.Kavya.S.P for her valuable technical suggestions and continuous guidance throughout
this project work.
We solemnly extend our thanks to all the teachers and non-teaching staff of our
department, family and friends for their valuable support.

NIVEDHA.M
NANDHINI.R.S
MADHAN KUMAR.M

iii
TABLE OF CONTENTS

ACKNOWLEDGEMENT iii
TABLE OF CONTENTS iv
LIST OF TABLES vii
LIST OF FIGURES viii
LIST OF ABBREVIATION x
ABSTRACT xi

CHAPTER 1 INTRODUCTION
1.1 Background 1
1.2 Problem statement and motivation 3
1.3 Objectives of the thesis 4
1.4 Thesis scopes and approach 5
1.5 Thesis outline 9
1.6 Summary 10

CHAPTER 2 : LITERATURE REVIEW


2.1 Introduction 11
2.2 Hand Gestures 11
2.2.1 Temporal Hand gesture 12
2.2.2 Static Hand Postures 15
2.2.2.1 Appearance based approach 15
2.2.2.2 Model-based approach 17
2.3 Classification 18
2.3.1 Neural Networks 19
2.3.2 AdaBoost 21
2.4 Hand segmentation methodology 24
2.4.1 Active Contour 24
2.4.2 Colour Segmentation 25
2.5 Methodology 26

iv
CHAPTER 3 Vision based hand gesture recognition techniques
3.1 Introduction 28
3.2 Detection 31
3.2.1 Color 31
3.2.2 Shape 35
3.2.3 Pixel values 38
3.2.4 Motion 40

3.3 Tracking 43
3.3.1 Template based 43
3.3.2 Optimal estimation 44
3.3.3 Particle filtering 46
3.3.4 Camshift 47

3.4 Recognition 51
3.4.1 K-Means 52
3.4.2 K-Nearest Neighbour 53
3.4.3 Mean Shift clustering 55
3.4.4 Support vector Machine 56
3.4.5 Hidden Markov model 57
3.4.6 Dynamic time warping 59
3.4.7 Time Delay Neural Networks 61
3.4.8 Finite State Machine 64

CHAPTER 4 Software Platform / Frame works


4.1 Introduction 68
4.2 Open CV 68
4.3 MATLAB 69
4.4 iGesture 70
4.5 A Forge .NET 71

v
CHAPTER 5 Vision based hand gesture recognition analysis
for future perspectives

5.1 Introduction 74
5.2 Recognition techniques limitations 76
5.3 Application domain constraints 77
5.4 Real – time challenges 77
5.5 Robustness 78

CHAPTER 6 SYSTEM EVALUATION


6.1 Introduction 79
6.2 Experiments 79
6.2.1 User variables 79
6.2.2 Background Robustness 82
6.2.3 Lighting Influence 84
6.2.4 Hand Orientation 87

CHAPTER 7 CONCLUSION AND FUTURE WORKS


7.1 Summary 96
7.2 Suggestions for future works 97
7.3 Appendix 98

REFERENCES 113

vi
LIST OF TABLES
Page.no

Table 3.1 Set of research papers that have used skin color detection for hand gesture 36
and finger counting application.

Table 3.2 A set of research papers that have used appearance-based detection for 37
hand gesture application.

Table 3.3 Set of research papers that have used skeleton-based recognition for hand 40
gesture application.

Table 3.4 A set of research papers that have used motion-based detection for hand 42
gesture application.

Table 4.1 Analysis of some vital literature related to vision based hand gesture 72
recognition systems 1.

Table 4.2 Analysis of some vital literature related to vision based hand gesture 72
recognition systems 2.

Table 4.3 Analysis of some vital literature related to vision based hand gesture 73
recognition systems 3.

Table 4.4 Analysis of some vital literature related to vision based hand gesture 73
recognition systems 4.

vii
LIST OF FIGURES
Page no
Figure 1.1 Flow chart of the overall work stages. 7

Figure 2.1 Some examples of temporal hand gesture movement. 12

Figure 2.2(a) A typical Markov Chain with 5 states (labeled from S1 to S5) and aij 14
represents the state transition probability. (Lawrence, 1989).
Figure 2.2(b) A Hidden Markov Model where Xn are hidden states and Yn are observable 14
states. (http://en.wikipedia.org/wiki/Hidden_Markov_model).

Figure 2.3 Three common types of hand model. 18

Figure 2.4 Feed forward and recurrent neural network 20

Figure 2.5 AdaBoost Pseudocode 22

Figure 3.1 The color-based glove marker 32

Figure 3.2 skin color detection using YUV color space 33

Figure 3.3 Example of appearance recognition using foreground extraction in order 36
to segment only the ROI.

Figure 3.4 Example of skeleton recognition using pixel values to represent the hand skeleton 39
model.
Figure 3.5 Example of motion recognition using frame difference subtraction to extract the 41
hand feature, where the moving object such as the hand is extracted from the fixed
background.

Figure 3.6 Example of motion recognition using frame difference subtraction to extract the 41
hand feature, where the moving object such as the hand is extracted from the fixed
background.
Figure 3.7 Block diagram of CAMSHIFT. 50

Figure 3.8 K-Means recognition. 53

Figure 3.9 Recognition by Support Vector Machine. 56

Figure 3.10 Recognition by Hidden Markov Model. 59

Figure 3.11 Architecture of Time Delay Neural Network. 63

viii
Page.no
Figure 3.12 Structure of Time Delay Neural Network. 64

Figure 6.1 Relationship between the independent and dependent variable. 80

Figure 6.2 Effect of extraneous variables on the relationship between the independent and 81
dependent variables

Figure 6.3 Flow chart of tripling temporal difference method. 83

Figure 6.4 Influences of luminance contrast and ambient lighting on visual context 87
learning and retrieval

Figure 6.5 A schematic of the experimental apparatus 91

Figure 6.6 Signed error of perceived hand orientation 95

Figure 6.7 Precision of perceived hand orientation estimates 95

ix
LIST OF ABBREVIATION

ROC Receiver Operating Characteristic

SVM Support Vector Machine

HCI Human Computer Interaction

PC Personal Computer

LUT Look-up Table

HSV Hue, Saturation, Value

ROI Region of Interest

CNN Convolutional Neural Network

RGB Red Green Blue

HMM Hidden Markov Model

x
ABSTRACT

The computer mouse has been an efficient input device, but its usage limits the user's freedom.
Besides, such devices are easily contaminated with bacteria and spread disease among users. Contactless
vision-based hand gesture recognition is one solution to the freedom and hygiene problems, but it faces
usability challenges in terms of cost and environmental variation such as lighting.

This thesis proposes and implements hand gesture recognition methods in an image browsing
application, allowing users to view pictures in real time without contacting an input device. The lower level
of the approach implements posture recognition with the Viola-Jones object detection method, which
utilizes Haar-like features and the AdaBoost learning algorithm.

With this algorithm, real-time performance and high recognition accuracy, up to a 94%
detection rate, can be obtained. The application system yields an average of 89% successful input
commands in a series of evaluations. Moreover, the application requires only a common PC and
webcam, addressing the concern of deployment cost. To further enhance the speed of hand detection
in real-time applications, this thesis proposes reducing the area of the search window by incorporating
skin colour segmentation. A 19% reduction in processing time is achieved with the proposed method,
compared to the processing time without skin colour segmentation. In addition, the re-training feature
in the application enables users to update the classifier easily whenever needed.
Still, many people find it difficult to interact with computers and hardware, and computer use
should fit our natural modes of communication. Many intelligent systems are now being developed that
enable friendlier Human-Computer Interaction (HCI). Our project is a Gesture Virtual Mouse using AI
technology. It deals with controlling mouse movements through hand gestures made with the fingers.
Through this project we can perform the various operations that involve a mouse: left click, right click,
double click, drag and drop, and other operations through hand gestures. The aim is to control the full
mouse movement and cursor with the help of a simple web camera rather than an additional mouse
device.

xi
CHAPTER 1

INTRODUCTION

1.1 Background

The rapid growth of computerization has made human-computer interaction

(HCI) an essential part of daily life. Nowadays, it has become so important that it is

deeply embedded in modern human life, ranging from shopping, banking, to

entertainment and medication. According to Jenny (1994), HCI is the study of how

people interact with computers and to what extent computers are or are not developed for

successful interaction with human beings. The study of HCI considers a large number of

factors, including the environmental factors, comfort, user’s interface and system

functionality.

For the case of personal computer (PC), the input method of human-computer

interaction has evolved from the primitive keyboard to the high-precision laser mouse and

today's advanced multi-touch screen panel. However, there is a drawback as these

devices are easily contaminated with bacteria as user’s physical contact is required

especially on public computers such as those in hospitals (Ciragil et al., 2003). The study of

Schultz et al. (2003) reports that 95% of keyboards in clinical areas are contaminated with

harmful microorganisms. As a result, the input devices have become a medium for

spreading disease from one user to others.

Besides the hygiene concerns, the commonly used human hand gestures are

expected to be part of HCI to serve the users better, offering a higher degree of

freedom and a more natural way of interaction compared to device-based input methods (Mathias et al.,

2004). However, recognizing human hand gesture is a highly complex task which

involves many fields of study, including motion analysis, modeling, pattern recognition

and gesture interpretation (Ying et al., 1999). As computational power grows

exponentially, making real-time recognition more feasible, the integration of hand

gesture recognition into HCI has obtained attention from researchers in recent years.

Basically, there are three major categories of hand gesture HCI which are active

infrared (IR), glove-based and vision-based gesture interfaces (Moeslund et al., 2003).

Active IR employs an IR camera for detection but is sensitive to sunlight, which is a

major drawback. Glove-based interface refers to the HCI where users are required to

wear certain type of equipment to track the fingers position and hand motion (LaViola,

1999). The glove-based input interface has existed since the 1980s. The glove technologies

for hand gesture recognition are relatively mature compared to vision-based recognition

methods, and numerous glove-based input devices are available in the marketplace, for

example: the Sayre glove, MIT LED glove and Data glove (Sturman et al., 1994). The

glove-based hand gesture recognition is widely used in virtual reality applications and

sign language recognition. As an example, local researchers have developed a wireless

Bluetooth data glove to successfully recognize the signing of 25 common words in Bahasa Isyarat

Malaysia (BIM) (Tan et al., 2007). Besides glove-like equipment, some

researchers have even developed a sensor array that can be worn at the wrist to detect muscle

contraction to predict finger movement and recognize hand gestures (Honda et al.,

2007).

Even though glove-based input does allow the user to apply hand gestures in HCI,

an input device attached to the hand or another part of the body is required to make it work.

Therefore, it still poses a certain limit to the freedom of usage (Quek, 1994) and can be a

medium of disease spreading. On the other hand, the vision-based gesture recognition

method recognizes hand gesture in real time without any invasive devices attached to

the user's hand. The vision-based hand tracking is done using image acquisition

and processing with single or multiple cameras. Hence, there is no physical contact

needed by users in this HCI method. There are many successful integrations of the

vision-based gesture recognition HCI into application such as replacing TV remote

control with finger tracking, or interpretation of American Sign Language (Pavlovic et

al., 1997). Vision-based gesture recognition HCI is one of the HCI methods that offers

the highest degree of freedom and naturalness (Moeslund et al., 2003), compared to the

commonly used QWERTY keyboard, glove-based gesture and active infrared sensor

recognition. However, it also has the toughest technical challenges among them all.

1.2 Problem statement and motivation

As stated earlier, to tackle the problem of disease spreading through input devices,

the contactless vision-based gesture recognition input is one of the solutions since users

do not need to touch or hold any device with this method. Meanwhile, it is a free and

natural way of HCI. But, it is challenging to promote the growth of hand gesture-based

input application. First of all, the vision-based gesture recognition implementation cost

has to be comparable to or lower than the cost of normal input devices like keyboards, in

order to encourage the gesture-based input application deployment. Besides, the vision-

based hand gesture's recognition accuracy is lower and varies under different

environments, compared to device-based input which is consistent over different

conditions (Moeslund et al., 2003).

Noticeably, much work has been done to make vision-based gesture recognition

systems feasible. But better usability of a recognition system always comes with a higher

cost of deployment. For example, Chen et al. (2007) and Hongo et al. (2000) employ

multiple cameras to achieve desirable accuracy but indirectly increase the cost of

implementation when more hardware and processing power are required. On the other

hand, there are numerous works that successfully implement the gesture recognition HCI

at a lower cost but usability is compromised, where users are bound to certain

limitations that defeat the purpose of hand gesture recognition HCI. For instance, Gupta

and Ma (2001) employ fast feature extraction with a single camera but require users to

rigidly align the hand to the camera. On the other hand, the gesture

recognition work of Yuanxin Zhu et al. (2000) is limited to users who wear long-sleeved shirts.

Therefore, an HCI system needs to be developed to integrate suitable hand gesture

recognition techniques together to achieve better usability and lower cost at the same

time. Hence, in order to promote vision-based hand gesture recognition, this thesis’s

main motivation is to develop a gesture HCI system that balances usability and

cost.

1.3 Objectives of the thesis

The goal of this thesis is to develop a real time vision-based hand gesture

recognition user input interface which is low cost but has acceptable accuracy. The real-time

system is expected to run smoothly on a PC without noticeable slowdown. The

HCI system comprises many parts, from image processing, segmentation, hand

detection, feature extraction to hand gesture classification. Compared to previous

similar works, the system should have extra features and improvement to meet the

goal. To validate the performance of the HCI, a self-developed image browsing

application with vision-based gesture recognition input is chosen as the test vehicle because

image browsing is one of the most common tasks on a PC.

In a nutshell, the thesis objectives are:

 To develop a low cost vision-based hand gesture recognition interface

system with comparable accuracy and tolerance under different conditions.

 To enhance existing hand gesture recognition techniques to achieve

desirable usability.

 To evaluate the performance of vision-based hand gesture recognition

interface using a self-developed image browsing application.

1.4 Thesis scopes and approach

The scope of this thesis focuses on the development of a real-time vision-based

hand gesture recognition HCI that requires only modest computing power and a

webcam. This thesis does not intend to cover software compatibility issues like

variations in camera drivers and operating systems. In general, the aim of this thesis is to study,

develop, enhance and evaluate the hand gesture HCI system with the self-developed

image browsing application. Since the thesis does not attempt to study human hand

gesture behaviour, the hand gesture recognized in the HCI should be a non-intuitive and

non-standard hand gesture, rather than a standard set like American Sign Language (Wikipedia, 2008).

In order to develop the HCI system and meet the thesis objectives, a step-by-

step methodology is outlined. Basically, there are four stages in the development cycles:

Planning, Implementation, Optimization and Evaluation.

In the planning stage, the methodology, technique selection and system architecture

are drafted. The application mainly requires a gesture recognition methodology that

works fast enough for real-time implementation. So, to meet this requirement, a literature

review in the fields of hand gesture recognition, tracking and segmentation is carried

out to identify any suitable methodology for this thesis, including papers on appearance-

and model-based recognition, e.g. Kalman filter, HMM and boosting methods, as well as

active contour and colour segmentation. The architecture of the system and functionality

of each component are drafted.

Fig 1.1: Flow chart of the overall work stages

When the overall architecture is well defined and methodology is selected, each

component or module in the system is built independently; the modules are then combined

later, based on system architecture planning. The core of the system is the gesture

recognition based input which is also the major focus of this thesis. So, to build the

recognition system, first of all, numerous hand gesture images from one

individual are collected under an optimal lighting condition. Then, the feature extraction

coding and classifier trainer are developed with this set of data. In this project, four

types of Haar-like features are implemented with hundreds of instances.

The resulting classifier from training is verified to be free of syntax errors and bugs in real time

under similar lighting conditions by the same individual whose hand gestures form

the training sample set.

In the Optimization stage, the skeleton of the system is basically ready but

performance enhancement is required. The methodology and coding are then further

optimized to reduce the processing time as much as possible, including the skin colour

segmentation which accelerates the hand localization process.

When the training and classifier coding are confirmed working, more training

samples are collected from 5 individuals in an uncontrolled lighting condition to create

variation and simulate the data randomness. The classifier is then re-trained with more

samples. The accuracy of each instance of the feature is evaluated and weight is

assigned based on the AdaBoost methodology. As a result of including more training

samples, the classifier is more robust and accurate, compared to the initial classifier

which was meant for code debugging.

The completed application is tested under different conditions based on

the experiment requirements. The results are reviewed and compared to similar projects. The

reasons for recognition failure under certain conditions are also studied, understood

and fixed if possible, for instance, going back to Optimization stage for more

comprehensive training sample.

Finally, the conclusion of the project is summarized and documented in this thesis.

The overall project work flow is illustrated in Figure 1.1.

1.5 Thesis Outline

This thesis is organized as follows. Chapter 1 introduces the background of

HCI and the motivation to apply vision-based hand gesture recognition for HCI. Besides,

the problem statement, application overview and scope of this thesis are described as

well in this chapter.

Chapter 2 discusses the two hand gesture types: static and temporal. A review

of currently available approaches to recognizing the hand gesture is presented. Besides,

several hand detection techniques with segmentation methods like colour, contour and

differencing segmentation are presented. Lastly, the recognition algorithm – Viola-

Jones with AdaBoost – that is chosen to be implemented in this thesis is

explained.

Chapter 3 describes how each classifier in the system is built,

through the sample collection, classifier training and evaluation process. It starts with

description of Haar-like feature, integral image construction and feature extraction

components. Then the way of implementing AdaBoost training is explained and

evaluated with control data. A Receiver Operating Characteristic (ROC) curve is plotted

to show the performance of each type of classifier after training. In addition, the skin colour

segmentation technique that is used to localize the hand is explained as well at the end

of the chapter.

Chapter 4 introduces the image viewing application in this thesis and

explains the system architecture behind the application. First, the system requirements

and setup are defined. Then, the application layout, features and hand gesture commands

are presented. As the system architecture is based on state machines, the flow and

connection of each state are described in detail as well.

In Chapter 5, the completed application is tested for usability with different

users and environmental backgrounds. The hand orientation and lighting are also

artificially changed to simulate different environments. Then, the success rate in each

experiment is tabulated and discussed. At the end, the responsiveness of the system is

evaluated in terms of the number of classifications executed per second.

Finally, Chapter 6 draws a conclusion and summarizes the major contributions of this project.

Then, based on the findings throughout the project, future improvements are suggested

and discussed.

1.6 Summary

In Chapter 1, the problem of HCI and possible solutions are reviewed.

Then the objectives, approach and scope of this project are discussed. The goal of this

thesis is to develop a real-time vision-based hand gesture recognition user input

interface which is low cost but has acceptable accuracy. At the end of the chapter, the

outline of this thesis is described.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Implementation of vision-based hand gesture recognition as HCI is a very wide

field of study. The main focus in this chapter is to review the hand gestures,

hand detection and recognition methodology, and recent related works. Through the

study, we will be able to understand and hence identify the suitable techniques for the

implementation in the thesis. This chapter is organized as follows. Firstly, the

definition and category of hand gesture are explained. Then, techniques of recognition

for the hand gesture are reviewed and suitable methods are chosen for implementation.

Finally, the recent works on vision-based hand gesture recognition systems are reviewed

before the chapter ends.

2.2 Hand Gestures

A hand gesture is a form of non-verbal communication made using the hand.

According to Ying and Thomas (2001), the hand gestures can be classified into several

categories: controlling gestures, conversational gestures, manipulative gestures and

communicative gestures. Controlling gestures are navigating gestures which use hand

orientation and movement direction to navigate and point in a virtual environment or

some display control applications. An example of a controlling gesture is the virtual

mouse interface, which enables users to use hand gestures to navigate the mouse cursor

instead of using a physical mouse on the desk (Tsang et al., 2005). Conversational

gestures are part of human interaction, for example emphasizing certain part of

conversation with hand gesture. Manipulative gesture is a

way to interact with virtual objects such as tele-operation (Hasegawa et al., 1995) and

virtual assembly.

Basically, a meaningful hand gesture can be represented by both temporal hand

movements and static hand postures (Ying et al., 2001). Further explanation of temporal

and static hand gestures follows.

2.2.1 Temporal Hand gesture

The temporal hand movements or dynamic hand gestures represent certain

actions by hand movements. For example, the conductor in an orchestra uses temporal

hand movement gestures to communicate the music tempo to the team.

Fig 2.1: Some examples of temporal hand gesture movement

To recognize the temporal hand gesture, HCI researchers need to track the hand

movement in video sequences with sets of parameters like coordinates and direction. For

simple hand gestures, a Kalman filter is often employed to estimate, interpolate and predict

the hand motion parameters for modeling and recognition (Aditya et al., 2002). Kalman

Filter is a feedback control system that consists of time update equations and

measurement update equations (Greg et al., 2004). Both sets of equations form an

ongoing cycle where the time update projects the state ahead in time and the measurement update

adjusts the projection using the actual current measurement (Ying et al.,

1999). Besides, Quek (1994) implements a vector flow field method to find the velocity

of the moving edges of the hand. The vector field computation result correlates to the hand

movement, but the method is able to detect the direction and velocity only.
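As an illustration of the time-update and measurement-update cycle described above, a minimal sketch using OpenCV's KalmanFilter is given below; the constant-velocity state layout and the noise settings are illustrative assumptions, not values taken from the cited works.

import cv2
import numpy as np

# Constant-velocity Kalman filter for a tracked hand centroid.
# State vector: [x, y, vx, vy]; measurement: the detected (x, y) position.
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2      # time-update uncertainty
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1  # measurement-update uncertainty

def track(detected_xy):
    # One predict/correct cycle: project the state ahead in time, then
    # adjust the projection with the actual measurement (if the hand was found).
    prediction = kf.predict()
    if detected_xy is not None:
        kf.correct(np.array(detected_xy, dtype=np.float32).reshape(2, 1))
    return prediction[:2].ravel()   # smoothed (x, y) estimate for this frame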

Hence, these methods are insufficient when dynamic hand gestures grow

complex. So, the Hidden Markov Model (HMM) technique is utilized to model and

recognize a large variation of temporal hand movement gestures (Stoll et al., 1995).

Basically, an HMM is a statistical model in which the system is assumed to be a

Markov process with unknown parameters. The challenge of HMM is to determine

the hidden parameters from the observable parameters. Different from a typical Markov

Chain, which is shown in Figure 2.2(a), the hidden state in the hidden Markov model

is not directly visible to the observer. However, the variables which are influenced by the

state are visible as shown in Figure 2.2(b). HMM has been widely applied in speech

recognition for years (Lawrence, 1989). Due to the similarity between speech recognition

and temporal gesture recognition, HMM is also employed to recognize human

motion in recent years. For instance, the HMM technique is implemented in a real-time

gesture recognition system by Ozer et al. (2005) and Byung et al. (1997), where body

parts are tracked and analyzed by HMM, with over 90% of the activities correctly

classified in their system.

Fig 2.2(a): A typical Markov Chain with 5 states (labeled from S1 to S5) and aij
represents the state transition probability. (Lawrence, 1989)

Fig 2.2(b): A Hidden Markov Model where Xn are hidden states and Yn are
observable states. (http://en.wikipedia.org/wiki/Hidden_Markov_model)

For higher complexity temporal hand gestures, Wang et al. (2008) proposed

implementation of hierarchical dynamic Bayesian networks through low-level image

processing instead of HMM. Their experiment shows a slight accuracy improvement

and the dynamic Bayesian network is recommended over HMM for complex hand gestures.

However, the experiment is done offline and further improvement on accuracy is

suggested to enhance the method.
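To make the HMM-based temporal gesture recognition scheme above concrete, the following is a minimal sketch that trains one Gaussian HMM per gesture class and classifies a new observation sequence by maximum log-likelihood; it assumes the hmmlearn package and hypothetical helper names, and is not the implementation used in the cited works.

import numpy as np
from hmmlearn import hmm   # assumed third-party package

def train_gesture_hmm(sequences, n_states=4):
    # Fit one Gaussian HMM to a list of observation sequences, e.g.
    # per-frame hand-centroid trajectories belonging to a single gesture.
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)
    return model

def classify_gesture(sequence, models):
    # One HMM is trained per gesture class; the class whose model assigns
    # the highest log-likelihood to the observed sequence is chosen.
    return max(models, key=lambda label: models[label].score(sequence))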


2.2.2 Static Hand Postures

Different from temporal hand gestures, static hand postures express certain

thoughts through hand configuration, instead of movement. Some common examples of

static hand postures are “thumb up” to express good feeling or “pointing extended

finger” to show direction. In general, the static hand posture detection methods are

categorized into two categories of approach, which are appearance based and model

based (Lee J. et al., 1995).

2.2.2.1 Appearance based approach

Appearance-based approaches use image features to model the visual appearance

of the hand and compare these parameters with the extracted image features from the

input video. The features can be a wavelet, intensity gradient or a brightness difference

between two areas like Haar-like feature (Viola et al., 2001). Kolsch M. and Turk M.

(2004a) propose a frequency analysis method which instantaneously estimates the posture

appearance's suitability for classification, enabling researchers to predict the classification

rate of the hand posture upfront.

An appearance-based detection method, the so-called Viola-Jones detection method, which is

an extremely fast and almost arbitrarily accurate approach (Viola et al., 2001), has been

popular in the face and hand detection field, especially in real-time application

implementation. The method requires less computing power (Viola et al., 2001) and

is even feasible to implement on mobile platforms like cameras and handphones,

which lack processing speed (Jianfeng et al., 2008). Proposed by Viola and Jones,

this method uses Haar-like feature extracted from image as the input to the classifier.

Haar-like feature is a very simple feature based on intensity comparisons between


rectangular image areas. The method proposed a new image representation called

Integral Image that allows very fast feature extraction. The integral image can be

constructed from an image using a few operations per pixel. Once the integral image is

computed, these Haar-like features can be computed at any scale or location in constant

time. The Haar-like feature instances with various sizes at different locations are used as

weak classifiers to separate the two classes. A weak classifier or weak learner means

a feature with an unclear parameter boundary between the two classes. Overlap between

the two classes of parameters makes the feature weak in distinguishing one from the other. However,

under certain special conditions, some of the instances within the pool have a

better ability to separate the two classes. Hence, AdaBoost is suggested by Viola and Jones

to be implemented as part of the Viola-Jones method to combine many weak

classifiers into a strong classifier.
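A short sketch of the two ideas just described, the integral image and a single two-rectangle Haar-like feature, is given below. It assumes a grayscale image stored as a numpy array and is only an illustration, not the exact feature set used by Viola and Jones.

import numpy as np

def integral_image(gray):
    # Summed-area table: each entry holds the sum of all pixels above and
    # to the left, so any rectangle sum needs only four look-ups.
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    # Sum of the rectangle with top-left corner (x, y) and size w x h,
    # computed in constant time from the integral image ii.
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
    B = ii[y - 1, x + w - 1] if y > 0 else 0
    C = ii[y + h - 1, x - 1] if x > 0 else 0
    D = ii[y + h - 1, x + w - 1]
    return D - B - C + A

def haar_two_rect(ii, x, y, w, h):
    # Brightness difference between the left and right halves of a window:
    # one simple Haar-like feature used as a weak classifier input.
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)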

There is one type of feature called Eigenpicture proposed by Kirby and Sirovich

(1987, 1990), which is able to represent images with a smaller feature dimension for

classification. Turk, M. and A. Pentland (1991) implemented the idea in an automatic face

recognition system, which is well known as the Eigenfaces method. The Eigenface approach is

derived by applying Principal Component Analysis (PCA) on the covariance matrix of an

image dataset to find the vectors that best account for the representation of the images. These

vectors are called the eigenvectors of the covariance matrix and correspond to the original

face images.
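As a rough illustration of the Eigenpicture idea (eigenvectors of the image covariance matrix giving a compact representation), the sketch below applies scikit-learn's PCA to a placeholder image matrix; it is not the original Eigenfaces implementation.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for a real image set: 100 images of
# 32x32 pixels, each flattened into a row vector.
rng = np.random.default_rng(1)
images = rng.random((100, 32 * 32))

# PCA computes the eigenvectors of the data covariance matrix; keeping the
# leading ones ("eigenpictures") gives a low-dimensional feature space.
pca = PCA(n_components=20)
features = pca.fit_transform(images)              # compact features for a classifier
approximation = pca.inverse_transform(features)   # reconstruction from the eigenvectors
print(features.shape, round(pca.explained_variance_ratio_.sum(), 3))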

Besides, there are some other appearance-based features implemented

successfully in hand gesture recognition. For instance, the experimental

study of Elena et al. (2003) on a template-based detection system using the Hausdorff distance


shows a recognition rate of up to 90% and is fast enough for real-time implementation. The

system captures an image of the user's hand posture with a webcam and segments the hand

blob using a colour segmentation technique. Then the segmented image is compared to a

pre-processed template by calculating the bidirectional partial Hausdorff distance.

However, the real-time implementation is limited to only four reference templates per

posture, which means robustness to variations like hand rotation angle is limited

as well.

2.2.2.2 Model-based approach

The hand model-based approach depends on a 3D hand model to estimate

the hand parameters by comparing the input images to the possible 2D appearance

projected from the 3D hand model. Some researchers attempt to create a highly detailed 3D

computerized hand model that simulates the articulation of the hand (Huan Du et al.,

2007; Pavlovic, et al., 1997). Figure 2.3 shows some examples of hand model

commonly used. The hand pose is represented by a set of parameters which are usually

acquired by recovering the user's palm, fingers, joints and fingertips from input images.

The parameters can be the angles between joints, the orientation of the fingers, etc. Neural

networks are often implemented to recognize hand gesture with the set of parameters

that represent the hand. For example, Berci et al. (2007) implement a skeleton hand

model to recognize hand postures. Their algorithm performs skeletonization on a hand

silhouette to obtain the model of the hand posture.

Although 3D methods provide a more accurate modeling of a human hand, their

deployment in augmented environments is challenging as the method is highly sensitive

to image noise and hand segmentation errors. Also, this type of method
usually consumes higher computational power and therefore limits its implementation

in real-time applications (Siu, 2005).

Fig 2.3: Three common types of hand model. (a) Cardboard, (b) Wireframe,

and (c) Contour. (Ying et al., 2001)

However, leveraging the computational power growth of mainstream PCs, there

are some successful model-based hand gesture recognition approaches running in

real time. For example, a project by Vámossy et al. (2007) implements neural network

classification on the skeleton hand model input parameters. Their experimental project

achieves an 81% recognition rate at up to 22 frames per second with a 320x240 pixel camera and a

simple background.

2.3 Classification

With the features extracted, classification is needed to differentiate or recognize

the class of the input image; either it is model based or appearance based. Here, neural

network and AdaBoost classification are discussed because these methods are

extensively implemented in gesture classification.


2.3.1 Neural Networks

A neural network is a non-linear statistical data modeling tool for complex

relationships between the input and output of a model. Basically, the architecture of a neural

network can be classified into two main categories, feed-forward and recurrent neural

networks, as shown in Figure 2.4. Different from the feed-forward neural network, the

recurrent neural network propagates data from later processing stages to earlier stages, making

it able to recognize time-varying patterns (Samir, 2000). The neural network is often

implemented in pattern recognition for classification due to its advantages in noise

immunity. To train a feed-forward neural network, the back-propagation learning algorithm

is one of the commonly known methods. There are other learning algorithms like Delta

Rule and Perceptron available for training neural networks (Alsmadi et al., 2009).
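For illustration only, the sketch below trains a single-hidden-layer feed-forward network with scikit-learn on placeholder data; the feature vectors, labels and layer size are assumptions rather than configurations reported in the works reviewed here.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 200 hand-feature vectors of length 50 with 4 gesture labels.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 4, size=200)

# A feed-forward network with one hidden layer, trained by a gradient-based
# solver (back-propagation style weight updates).
clf = MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                    solver="adam", max_iter=500)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))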

Neural networks are widely implemented in temporal hand gesture recognition. For

example, Vafadar et al. (2008) examine neural network classification with back-

propagation training on temporal hand gestures under a simple background. Colour

segmentation in the HSV colour space and image morphology operators are implemented

to extract the hand contour. The classification of the test data yields a 99.98% detection rate

on the noiseless data set and 92.08% on the noisy data. The experiment is done offline, where

data is captured upfront and processed.



Fig 2.4: Feed forward and recurrent neural network

Deyou (2006) developed a DataGlove hand gesture recognition system that

enables users to perform driving tasks through a virtual reality concept. The system

recognition core is based on a single hidden layer neural network which is trained with

a supervised learning algorithm. A test set of 100 hand gestures from trained users is

tested on the trained neural network, yielding a 98% recognition rate. The recognition

rate drops to 92% when the system is tested with new users whose data is not included

during training.

Mu-Chun et al. (1998) utilize a neural network to classify spatial-temporal signals

extracted from hand gestures. In the experiment, 51 hand gestures are collected from 4

persons. The experimental result shows a correct recognition rate of up to 92.9%. Ho-Joon

Kim et al. (2008) combine a convolutional neural network with a weighted fuzzy min-

max neural network to perform feature analysis. Then the feature data is
processed with a modified convolutional neural network. Six different temporal hand

gestures are tested. The experimental result shows the lowest recognition rate at 80% for the

“thumb up” gesture and the highest at 97.5% for a “wave up” gesture. Then, the weighted

fuzzy min-max neural network is applied to select significant features to reduce the

number of features. After reducing the number of features by 50%, the recognition rate is

still comparable to the initial condition.

Paulraj et al. (2009) present a method based on a neural network to translate

“Kod Tangan Bahasa Melayu” into voice. The hand gesture is recorded using a webcam

under a simple background. Then segmentation is done to extract the hand movement.

Then discrete cosine transform is applied for feature extraction from the video sequence.

A double hidden layer neural network is employed to classify the gesture. Their

experimental results show an 81% recognition rate out of 140 samples.

2.3.2 AdaBoost

Boosting is a method to improve the accuracy of a learning algorithm. The

AdaBoost algorithm is introduced in 1995 (Yoav et al., 1999) to solve the problems of

the boosting algorithms. The AdaBoost algorithm takes a training set (X1; Y1) to (Xm;

Ym) as input, where X is the data set and Y is the class label. Here m is the sample

size of the training set. In this thesis, we assume the label Y to be 1 or -1, which

represents two classes of data. Pseudo code for AdaBoost is shown in Fig 2.5.

Fig 2.5: AdaBoost Pseudo code (Yoav and Robert E, 1999)

In the AdaBoost learning algorithm, the training data is applied repeatedly to a

base learning algorithm for a given number of iterations. Initially, all weights are set

equally and are updated in each training iteration. On each round of

training, the weights of incorrectly classified examples are increased so that they will

be focused on in the next round of training. A weak hypothesis or learner, ht, is selected in each

training iteration, where the goodness of a weak learner is measured by its error, et.

The error is measured with respect to the distribution Dt on which the weak learner

was trained. Alpha, αt, measures the importance of ht, where αt gets larger when et is

smaller. The distribution Dt is next updated using the rule shown in Figure 2.5. The

effect of this rule is to increase the weight of examples misclassified by the weak learner,

ht and to decrease the weight of correctly classified examples. The final outcome of

the training is the classifier with a combination of weak learners from each training

iteration.
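The weight-update loop just described can be written compactly as follows. This is a generic sketch of AdaBoost over a pool of pre-built weak learners, intended only to mirror the steps above rather than reproduce the exact pseudo code of Figure 2.5.

import numpy as np

def adaboost(X, y, weak_learners, T):
    # X: (m, d) training data, y: labels in {-1, +1},
    # weak_learners: callables h(X) -> predictions in {-1, +1}, T: rounds.
    m = len(y)
    D = np.full(m, 1.0 / m)                 # weights start out equal
    ensemble = []                           # list of (alpha_t, h_t)
    for _ in range(T):
        # choose the weak learner with the smallest weighted error e_t
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        h, e = weak_learners[best], max(errors[best], 1e-10)
        alpha = 0.5 * np.log((1 - e) / e)   # alpha_t grows as e_t shrinks
        # increase the weight of misclassified examples, decrease the rest
        D *= np.exp(-alpha * y * h(X))
        D /= D.sum()
        ensemble.append((alpha, h))
    # final classifier: sign of the weighted vote of the selected weak learners
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))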

The AdaBoost algorithm has been implemented widely in numerous pattern

recognition projects (Mathias et al., 2004), (Juan et al., 2005), (Qing et al., 2007), (Qing

et al., 2008), and it has been tested empirically with success by many researchers (Harris et al.,

1996), (Jeffrey et al., 1996), (Richard et al., 1997). Besides, it is a key ingredient of

the Viola-Jones detection method, as this boosting method helps to determine the strongest

features within a very large pool of data.

There is another learning algorithm, often compared with AdaBoost, which

is called the Support Vector Machine (SVM). SVM was proposed by Vapnik (1998) to solve

general pattern recognition problems. When given a set of points belonging to two

different classes, SVM finds a hyperplane that separates the largest possible fraction of

points of the same class. Yen-Ting et al. (2007) implemented SVM in their multiple-

angle hand gesture recognition system, which achieves over a 95% detection rate.

However, there is a difference in the computational requirements of SVM and

AdaBoost. SVM corresponds to quadratic programming, while AdaBoost corresponds

only to linear programming. As quadratic programming is more computationally

demanding, this makes SVM less feasible in real-time applications compared to AdaBoost

(Yoav et al., 1999). A facial expression recognition experiment is carried out (Yubo et

al., 2004) to compare AdaBoost and SVM processing time. Testing on a face sample

database with a Pentium IV 2.53 GHz processor, the AdaBoost method is 300 times faster

than SVM method.

2.4 Hand segmentation methodology

The aim of hand detection is to detect and localize the hand regions in image

sequences. Artificial object detection, such as detection of specifically coloured objects as described

in Wilson et al. (2003), can achieve very high detection rates while maintaining low false positive

rates. Yet, the same is not true for faces and even less for hands because users are

naturally reluctant to colour their hands. So, segmentation has become a crucial

part of ensuring the success of hand detection in vision-based recognition. Hand detection

has attracted a great amount of interest and many methods relying on shape, texture, or

temporal information have been thoroughly investigated over the years. Besides the

traditional edge-based segmentation, the segmentation techniques like active contour,

colour segmentation and differencing are discussed here.

2.4.1 Active Contour

Active contours, or so-called “Snakes”, are commonly used for segmenting objects

and deformable contour tracking in an image. The segmentation with active contours is

done by minimizing the three energies in the active contour equation, which are the

internal energy, image energy and external energy. Usually, the active contour is

initialized near the object of interest and attracted toward the contour of the object by

the intensity gradient in each iteration (Kass et al., 1987). However, the classic active

contour algorithm will not operate well if there are large differences in the position or

form of the object between successive images (Yuliang, 2003).
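A brief sketch of snake-based segmentation with scikit-image's active_contour function is shown below; the circular initialisation and the alpha, beta and gamma weights are illustrative choices, not values taken from the cited works.

import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def snake_segment(gray, centre, radius, n_points=200):
    # Initialise the snake as a circle near the object of interest; the
    # internal energy (alpha: elasticity, beta: rigidity) keeps the contour
    # smooth while the gradient of the blurred image attracts it to edges.
    s = np.linspace(0, 2 * np.pi, n_points)
    init = np.column_stack([centre[0] + radius * np.sin(s),   # row coordinates
                            centre[1] + radius * np.cos(s)])  # column coordinates
    return active_contour(gaussian(gray, sigma=3), init,
                          alpha=0.015, beta=10, gamma=0.001)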

In the work of Kim et al. (2001), the tracker utilizes the image flow, which gives

rough information on the direction and magnitude of the moving objects. The

correlation process between two images makes the snake track the object of interest.

The success of tracking is largely based on the calculation of the image flow. Unfortunately, it

could become complicated in active vision, for example, the situation with moving cameras.

The whole image is moving, including both the foreground and background. It is hard to

distinguish the motion of the object of interest when it moves at a similar speed to the camera

(Yuliang, 2003).

2.4.2 Colour Segmentation

Some researchers have used human skin colour information to extract face and hand

regions. Compelling results can be achieved merely from skin colour properties; for example,

Schiele B. and Waibel A. (2005) used it in combination with a neural network to estimate

gaze direction. Kjeldsen R. and Kender J. (1996) demonstrate interface-quality hand gesture

recognition solely with colour segmentation. A colour space can be mathematically represented by a

three-dimensional coordinate system. The colour spaces that are used for segmentation are RGB,

HSV, CIELAB and YIQ as shown in Figure 2.6.

In the RGB colour space, the three mutually perpendicular axes represent red, green and

blue. HSV stands for Hue, Saturation and Value. YIQ is based on luminance and chrominance

where Y is the luminance or brightness component, I and Q are the decoupled component of

chrominance. The CIELAB colour space has three components as well, which are lightness (L*)

and two colour components that are positioned between green/red (a*) and yellow/blue (b*). The

colour segmentation method that uses an HSV colour space is arguably beneficial for skin colour

identification. The appearance of skin colour varies mostly in intensity while the chrominance

remains fairly consistent according to Saxe D. and Foulds R. (1996).
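An illustrative HSV skin segmentation in the spirit of this discussion is sketched below with OpenCV; the hue, saturation and value bounds are rough assumptions and normally need tuning for the camera and lighting.

import cv2
import numpy as np

def skin_mask_hsv(frame_bgr):
    # Keep pixels whose hue/saturation/value fall inside a skin-like range.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological opening removes small noise blobs from the mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)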

2.5 Methodology

This system consists of four main modules, described below; a code sketch of the overall pipeline follows the list:

i) Image Acquisition:
We need a sensor (camera) for the system to detect the user's hand movements. The
computer's camera is used as the sensor. The webcam captures real-time video at a fixed frame rate
and resolution determined by the camera's hardware. If necessary, the system allows the frame rate and
resolution to be changed. The pointer image moves to the right, and vice versa if we move the colour
pointer to the left; this is similar to the picture we get when we stand before a mirror, and it is used to
avoid flickering of the picture.

ii) Color Filter:

Using the colour filter, the image is converted into a grey image. Once the image has been
converted to greyscale, all necessary operations are performed on it: a noise filter, smoothing and
thresholding are then applied.
iii) Color detection:
This is the most important step. The coloured object is detected by subtracting the
colour-suppressed channel from the grey image. This creates a picture where the detected object
appears as a grey patch enclosed in black space.

iv) Hand Gesture:


After finishing the previous modules, mouse movement, left-click, right-click,
drag/select, scroll up and scroll down can be carried out with our fingers.
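A minimal end-to-end sketch of the four modules is given below. It assumes OpenCV for capture and segmentation and the pyautogui package for the operating-system mouse events; the Otsu-threshold detection and the direct centroid-to-cursor mapping are simplifications, and the click gestures are omitted.

import cv2
import numpy as np
import pyautogui   # assumed package for OS-level mouse control

SCREEN_W, SCREEN_H = pyautogui.size()

cap = cv2.VideoCapture(0)                                 # i) image acquisition
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.flip(frame, 1)                            # mirror, as described above
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)        # ii) colour filter -> grey image
    gray = cv2.GaussianBlur(gray, (7, 7), 0)              # noise filter / smoothing
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)   # threshold
    # iii) detection: take the largest blob as the pointer object
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)   # OpenCV 4.x signature assumed
    if contours:
        c = max(contours, key=cv2.contourArea)
        M = cv2.moments(c)
        if M["m00"] > 0:
            cx, cy = M["m10"] / M["m00"], M["m01"] / M["m00"]
            # iv) map the blob centre to screen coordinates and move the cursor
            pyautogui.moveTo(cx / frame.shape[1] * SCREEN_W,
                             cy / frame.shape[0] * SCREEN_H)
    cv2.imshow("mask", mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()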

APPLICATION

The hand gesture virtual mouse is used in many applications. Gesture recognition

serves as an alternative user interface that provides real-time data to a computer. It is
best suited for places where we cannot make use of a physical mouse, and, most
importantly, it is space efficient. In recent times hand gesture technology has started to
penetrate various industries, driven by advances in machine learning, deep learning and
artificial intelligence.

★ Without the use of any additional devices, the system can be used to control robots and
automation systems.

★ The proposed model has a high efficiency of 99%, which is greater than other proposed
Virtual Mouse systems.

★ HVI can be used for controlling robots in the field of robotics.

★ The proposed system can be used for virtual prototyping in design and architecture.

★ Persons who have certain problems with their hands can use this virtual mouse system
efficiently to control the mouse functions.

★ Virtual reality and augmented reality games, whether wired or wireless, can be played
using the virtual mouse system.

These technologies are commonly applied in our daily routine life, as follows:
 Automated homes
 Healthcare
 Virtual reality
 Consumer electronics

CHAPTER 3

VISION BASED HAND GESTURE RECOGNITION TECHNIQUES

3.1 INTRODUCTION :
With the development of information technology in our society, we can expect that computer
systems to a larger extent will be embedded into our environment. These environments will impose needs for
new types of human-computer interaction, with interfaces that are natural and easy to use. The user interface
(UI) of the personal computer has evolved from a text-based command line to a graphical interface with
keyboard and mouse inputs. However, they are inconvenient and unnatural. The use of hand gestures
provides an attractive alternative to these cumbersome interface devices for human-computer interaction
(HCI). Users generally use hand gestures for expressing their feelings and notifying others of their thoughts.
In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for
HCI. Vision has the potential of carrying a wealth of information in a nonintrusive manner and at a low cost,
therefore it constitutes a very attractive sensing modality for developing hand gesture recognition. Recent
research [1, 2] in computer vision has established the importance of gesture recognition systems for the
purpose of human computer interaction. The primary goal of gesture recognition research is to create a
system which can identify specific human gestures and use them to convey information or for device control.
A gesture may be defined as a physical movement of the hands, arms, face, and body with the intent to
convey information or meaning. Gesture recognition, then, consists

not only of the tracking of human movement, but also the interpretation of that movement as semantically
meaningful commands. Two approaches are commonly used to interpret gestures for Human Computer
interaction. They are

(a) Methods Which Use Data Gloves: This method employs sensors (mechanical or optical) attached to
a glove that transduces finger flexions into electrical signals for determining the hand posture. This
approach forces the user to carry a load of cables which are connected to the computer and hinders the ease
and naturalness of the user interaction.

(b) Methods Which are Vision Based: Computer vision based techniques are non-invasive and based on
the way human beings perceive information about their surroundings. Although it is difficult to design a
vision based interface for generic usage, yet it is feasible to design such an interface for a controlled
environment [3].

Features for Gesture Recognition :
Selecting features is crucial to gesture recognition, since hand gestures are very rich in shape
variation, motion and textures. For static hand posture recognition, although it is possible to recognize hand
posture by extracting some geometric features such as fingertips, finger directions and hand contours, such
features are not always available and reliable due to self-occlusion and lighting conditions. There are also
many other non-geometric features such as color, silhouette and texture; however, they are inadequate for
recognition. Since it is not easy to specify features explicitly, the whole image or transformed image is taken
as the input and features are selected implicitly and automatically by the recognizer. Hand features can be
derived using the following three approaches:

(a) Model based Approaches (Kinematic Model):


Model based approaches attempt to infer the pose of the palm and the joint angles [18, 19]. Such an
approach would be ideal for realistic interactions in virtual environments. Generally, the approach consists
of searching for the kinematic parameters that bring the 2D projection of a 3D model of hand into
correspondence with an edge-based image of a hand [18]. A common problem with the model based
approaches is the problem of feature extraction (i.e. edges). The human hand itself is rather textureless
and does not provide many reliable edges internally. The edges that are extracted are usually extracted from
the occluding boundaries. In order to facilitate extraction and unambiguous correspondence of edges with
model edges the approaches require homogeneous backgrounds and high contrast backgrounds relative to
the hand.

(b) View based Approaches:


Due to the above-mentioned fitting difficulties associated with kinematic model based approaches,
many have sought alternative representations of the hand. An alternative approach that has garnered
significant focus in recent years is the view-based approach [20]. View-based approaches, also referred to as
appearance based approaches, model the hand by a collection of 2D intensity images. In turn, gestures are
modeled as a sequence of views.

(c) Low Level Features based Approaches:


In many gesture applications, though, all that is required is a mapping between input video and
gesture. Therefore, many have argued that the full reconstruction of the hand is not essential for

gesture recognition. Instead many approaches have utilized the extraction of low-level image measurements
that are fairly robust to noise and can be extracted quickly. Low-level features that have been proposed in the
literature include: the centroid of the hand region [21], principal axes defining an elliptical bounding region
of the hand [14], and the optical flow/affine flow [22] of the hand region in a scene.
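For illustration, the sketch below computes the three kinds of low-level measurement mentioned above (region centroid, principal axes from an elliptical fit, and dense optical flow) with OpenCV, assuming a binary hand mask and two consecutive grayscale frames are already available.

import cv2
import numpy as np

def low_level_features(mask, prev_gray, gray):
    # Centroid and principal axes of the segmented hand region from image
    # moments and an ellipse fit (the contour needs at least 5 points).
    M = cv2.moments(mask, binaryImage=True)
    cx, cy = M["m10"] / M["m00"], M["m01"] / M["m00"]
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    (ex, ey), (major, minor), angle = cv2.fitEllipse(max(contours, key=cv2.contourArea))
    # Dense optical flow between consecutive frames; averaging the flow over
    # the hand pixels gives a coarse (dx, dy) motion estimate of the region.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_motion = flow[mask > 0].mean(axis=0)
    return (cx, cy), (major, minor, angle), mean_motion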

Application Domains:
In this section, since gesture recognition can be used in many areas, we present an overview
of some of the application domains that employ gesture interactions.

(a) Virtual Reality:


Gestures for virtual and augmented reality applications have experienced one of the greatest levels
of uptake in computing. Virtual reality interactions use gestures to enable realistic manipulations of virtual
objects using one's hands, for 3D display interactions [4] or 2D displays that simulate 3D interactions [5].

(b) Robotics and Telepresence:


Telepresence and telerobotic applications are typically situated within the domain of space
exploration and military-based research projects. The gestures used to interact with and control robots are
similar to fully-immersed virtual reality interactions, however the worlds are often real, presenting the
operator with video feed from cameras located on the robot [6]. Here, gestures can control a robot's hand and
arm movements to reach for and manipulate actual objects, as well as its movement through the world.

(c) Desktop and Tablet PC Applications:


In desktop computing applications, gestures can provide an alternative interaction to the mouse and
keyboard [7]. Many gestures for desktop computing tasks involve manipulating graphics, or annotating and
editing documents using pen-based gestures [8].

(d) Games:
We now look at gestures for computer games. Freeman et al. [9] tracked a player's hand or body
position to control movement and orientation of interactive game objects such as cars. Konrad et al. [10]
used gestures to control the movement of avatars in a virtual world, and PlayStation 2 has introduced the
Eye Toy, a camera that tracks hand movements for interactive games [11].

(e) Sign Language:


Sign language is an important case of communicative gestures. Since sign languages are highly
structural, they are very suitable as testbeds for vision algorithms [12]. At the same time, they can also be a
good way to help the disabled to interact with computers. Sign language for the deaf (e.g. American Sign
Language) is an example that has received significant attention in the gesture literature [13, 14, 15 and 16].

VISION BASED GESTURE RECOGNITION
Vision-based interaction is a challenging interdisciplinary research area, which involves computer
vision and graphics, image processing, machine learning, bioinformatics, and psychology.
To make a successful working system, there are some requirements which the system should have:

(a) Robustness:
In the real world, visual information can be very rich, noisy and incomplete, due to changing
illumination, clutter and dynamic backgrounds, occlusion, etc. Vision-based systems should be user
independent and robust against all these factors.

(b) Computational Efficiency:


Generally, vision-based interaction often requires real-time systems. The vision and learning
techniques/algorithms used in Vision-based interaction should be effective as well as cost efficient.
(c) User’s Tolerance:
Malfunctions or mistakes of vision-based interaction should be tolerated gracefully. When a mistake is made, it should not incur much loss; users can be asked to repeat an action rather than letting the computer make further wrong decisions.
(d) Scalability:
The vision-based interaction system should be easily adapted to different scales of application. For example, the core of vision-based interaction should be the same for desktop environments, sign language recognition, robot navigation and virtual environments. Most of the systems reviewed rely on the simple idea of detecting and segmenting the gesturing hand from the background using motion detection or skin color. According to Wachs et al. [17], the proper selection of features or cues, and their combination with sophisticated recognition algorithms, can determine the success or failure of any existing and future work in the field of human-computer interaction using hand gestures.

3.2 DETECTION
3.2.1 COLOR

Color-based recognition using a glove marker: this method uses a camera to track the movement of the hand via a glove with different color marks, as shown in Figure 3.1. It has been used for interaction with 3D models, permitting operations such as zooming, moving, drawing and writing using a virtual keyboard with good flexibility [9].

The colors on the glove enable the camera sensor to track and detect the location of the palm and
fingers, which allows for the extraction of a geometric model of the shape of the hand [13,25]. The
advantages of this method are its simplicity of use and low price compared with a sensor data glove [9]. However, it still requires the wearing of colored gloves and limits the degree of natural and spontaneous interaction with the HCI [25]. The color-based glove marker is shown in Figure 3.1.

Fig 3.1 - The color-based glove marker

COLOR BASED RECOGNITION OF SKIN COLOR:


Skin color detection is one of the most popular methods for hand segmentation and is used in a wide
range of applications, such as object classification, degraded photograph recovery, person movement
tracking, video observation, HCI applications, facial recognition, hand segmentation and gesture
identification. Skin color detection has been achieved using two methods. The first method is pixel-based skin detection, in which each pixel in an image is classified as skin or non-skin individually, independently of its neighbors.

The second method is region-based skin detection, in which skin pixels are processed spatially based on information such as intensity and texture. A color space is a mathematical model used to represent image color information. Several color spaces can be used according to the application type, such as digital graphics, image processing applications, TV transmission and computer vision techniques [26,27]. Figure 3.2 shows an example of skin color detection using the YUV color space.

Fig 3.2 - Example of skin color detection using YUV color space. (a) A threshold is applied to the channels of the YUV color space in order to extract only skin color, assigning 1 to skin pixels and 0 to non-skin pixels; (b) the hand is detected and tracked using the resulting binary image.
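A minimal sketch of such pixel-based skin detection is given below in Python/OpenCV; it thresholds the chrominance channels of the Y-Cb-Cr color space (a close relative of YUV, discussed below). The input file name is a placeholder and the threshold values are commonly quoted approximations rather than values taken from the cited studies, so they may need tuning.

import cv2
import numpy as np

frame = cv2.imread("frame.jpg")                       # hypothetical input frame
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)

# Approximate skin range on the Cr and Cb channels; Y is left wide so the
# result is less sensitive to brightness changes.
lower = np.array([0, 133, 77], dtype=np.uint8)
upper = np.array([255, 173, 127], dtype=np.uint8)

# Pixels inside the range become white (skin) and the rest black (non-skin),
# mirroring the 1/0 labeling described in the caption of Figure 3.2.
skin_mask = cv2.inRange(ycrcb, lower, upper)
skin_mask = cv2.medianBlur(skin_mask, 5)              # simple noise removal

skin_only = cv2.bitwise_and(frame, frame, mask=skin_mask)
cv2.imwrite("skin_mask.png", skin_mask)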

Several color space formats can be used for skin segmentation, as itemized below:
• red, green, blue (R–G–B and RGB-normalized);
• hue and saturation (H–S–V, H–S–I and H–S–L);
• luminance (YIQ, Y–Cb–Cr and YUV).

A more detailed discussion of skin color detection based on RGB channels can be found in [28,29]. However, raw RGB is not preferred for skin segmentation because the color channels are mixed with the intensity information of the image, which gives irregular characteristics [26]. Skin color can be detected by thresholding the three channels (red, green and blue). In the case of normalized RGB, the color information is simply separated from the luminance; however, under lighting variation it cannot be relied on for segmentation or detection purposes, as shown in the studies [30,31].

Color spaces of the hue/saturation family and the luminance family behave well under lighting variations. The transformation from RGB to HSI or HSV takes time when there is substantial variation in the color values (hue and saturation), so pixels within a chosen intensity range are typically used. The RGB to HSV transformation can also be time consuming because of the conversion from Cartesian to polar coordinates; thus, HSV space is mainly useful for detection in simple images.

Transforming and splitting the channels of the Y–Cb–Cr color space is simple compared with the HSV color family with regard to skin color detection and segmentation, as illustrated in [32,33]. Skin tone detection based on Y–Cb–Cr is demonstrated in detail in [34,35].

The image is converted from RGB to another color space in order to detect the region of interest, normally a hand. This method can detect the region through the range of possible skin colors, such as red, orange, pink and brown. A training sample of skin regions is studied to obtain the likely range of skin pixels in terms of band values for the R, G and B channels. To detect skin regions, the colors of a candidate region are compared with the predetermined sample colors; if they are similar, the region can be labeled as skin [36]. Table 3.1 presents a set of research papers that use different techniques to detect skin color.

The skin color method faces various challenges, such as illumination variation, background issues and other types of noise. A study by Perimal et al. [37] evaluated 14 gestures under controlled room lighting using an HD camera at short distance (0.15 to 0.20 m); the gestures were tested against three parameters, noise, light intensity and hand size, which directly affect the recognition rate. Another study by Sulyman et al. [38] observed that using the Y–Cb–Cr color space is beneficial for eliminating illumination effects, although bright light during capture reduces the accuracy. A study by Pansare et al. [11] used normalized RGB to detect skin and applied a median filter to the red channel to reduce noise in the captured image; the Euclidean distance algorithm was used for feature matching against a comprehensive dataset. A study by Rajesh et al. [15] used HSI to segment the skin color region under controlled environmental conditions, to ensure proper illumination and reduce the error.

Another challenge of the skin color method is that the background must not contain elements that match skin color. Choudhury et al. [39] suggested a novel hand segmentation approach that combines the frame differencing technique with skin color segmentation, which recorded good results, but the method is still sensitive to scenes that contain moving objects in the background, such as moving curtains and waving trees. Stergiopoulou et al. [40] combined motion-based segmentation (a hybrid of image differencing and background subtraction) with skin color and morphology features to obtain a robust result that overcomes illumination and complex background problems. Another study by Khandade et al. [41] used a cross-correlation method to match hand segmentations with a dataset to achieve better recognition. Karabasi et al.

3.2.2 SHAPE
This method extracts image features in order to model the visual appearance of the hand and compares these parameters with features extracted from the input image frames. The features are calculated directly from pixel intensities, without a prior segmentation process. The method runs in real time because the 2D image features are easy to extract, and it is considered easier to implement than the 3D model method. In addition, it can handle various skin tones. Using the AdaBoost learning algorithm, which learns a fixed set of features such as key points for portions of the hand, the occlusion issue can be mitigated [47,48]; the approach can be separated into two models, a motion model and a 2D static model. Table 3.2 presents a set of research papers that use different segmentation techniques based on appearance recognition to detect regions of interest (ROI).

SHAPE BASED DETECTION


The first approach was based on posture recognition using Haar-like features, which can describe the hand posture pattern effectively, with the AdaBoost learning algorithm used to speed up performance and thus the rate of classification. The second approach focused on gesture recognition using a context-free grammar to analyze the syntactic structure based on the detected postures. Another study by Kulkarni and Lokhande [50] used three feature extraction methods, including a histogram technique, to segment and observe images containing a large number of gestures, and then applied edge detection with Canny, Sobel and Prewitt operators at different thresholds. Gesture classification was performed using a feed-forward back-propagation artificial neural network with supervised learning. One limitation reported by the authors is that the histogram technique leads to misclassification, because histograms can only be used for a small number of gestures that are completely different from each other. Fang et al. [51] used an extended AdaBoost method for hand detection and combined optical flow with a color cue for tracking.
They also collected hand color from the neighborhood of the features' mean position, using a single Gaussian model to describe hand color in HSV color space. Multi-feature extraction and gesture recognition using palm and finger decomposition, combined with scale-space feature detection, were integrated into gesture recognition to address the aspect-ratio limitation faced by most learning-based hand gesture methods. Lisa R. et al. [52] used a simple background subtraction method for hand segmentation and extended it to handle background changes, in order to address challenges such as skin-like colors and complex, dynamic backgrounds, and then used a boundary-based method to classify hand gestures. Finally, Zhou et al. [53] proposed a novel method to extract the fingers directly: edges are extracted from the gesture images, the finger central area is obtained from these edges, and the fingers are then obtained from the parallel edges.
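The following Python/OpenCV sketch illustrates the Haar-like features plus AdaBoost idea in the form of a boosted cascade detector. The cascade file name is hypothetical: OpenCV does not ship a hand posture cascade, so one would have to be trained or obtained separately.

import cv2

# "hand_cascade.xml" is a hypothetical cascade of Haar-like features trained
# with AdaBoost for a specific hand posture (for example an open palm).
hand_cascade = cv2.CascadeClassifier("hand_cascade.xml")

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Each detection is the bounding box of a region whose Haar-like feature
    # responses pass every stage of the boosted cascade.
    hands = hand_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in hands:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("Haar-based hand detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()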

Fig 3.3 - Example on appearance recognition using foreground extraction in order to segment only ROI.

Table 3.1 - Set of research papers that have used skin color detection for hand gesture and finger
counting application.

Table 3.2 - A set of research papers that have used appearance-based detection for hand gesture
application.

According to the information in Table 3.2, the first row uses Haar-like features, which are well suited to analyzing ROI patterns efficiently. Haar-like features analyze the contrast between dark and bright regions within a kernel, which can be computed faster than in pixel-based systems. In addition, they are relatively immune to noise and lighting variation, because they measure the gray-value difference between the white and black rectangles. The recognition rate in the first row is 90%, whereas the single Gaussian model used to describe hand color in HSV color space in the third row achieves a recognition rate of 93%, although both proposed systems use the AdaBoost algorithm to speed up the system and the classification.

3.2.3 PIXEL VALUES

Skeleton-based recognition specifies model parameters that can improve the detection of complex features [16]. Various representations of skeleton data for the hand model can be used for classification; they describe geometric attributes and constraints and easily translate features and correlations of the data, focusing on geometric and statistical features. The most common features used are the joint orientation, the spacing between joints, the skeletal joint locations, the angles between joints, and the trajectories and curvature of the joints. Table 3.3 presents a set of research papers that use different segmentation techniques based on skeletal recognition to detect the ROI.
Hand segmentation using the depth sensor of the Kinect camera, followed by location of the fingertips using 3D connections, Euclidean distance, and geodesic distance over hand skeleton pixels to provide increased accuracy, was proposed in [58]. A 3D hand gesture recognition approach based on a deep learning model using parallel convolutional neural networks (CNN) to process hand skeleton joint positions was introduced in [59]; the proposed system has the limitation that it works only with complete sequences. In [60], the optimal viewpoint was estimated and the point cloud of the gesture was transformed using a curve skeleton to specify the topology, after which Laplacian-based contraction was applied to specify the skeleton points.

The Hungarian algorithm was applied to calculate the match scores of the skeleton point sets, but the joint tracking information acquired by the Kinect is not accurate enough, which results in constant vibration. A novel method based on skeletal features extracted from RGB-recorded video of sign language, where occlusions make it difficult to extract accurate skeletal data, was offered in [61].

A dynamic hand gesture approach using a depth and skeletal dataset for skeleton-based recognition was presented in [62], where a support vector machine (SVM) with a linear kernel was used for supervised classification. Another dynamic hand gesture recognition system, proposed in [63], used Kinect depth metadata for acquisition and segmentation and extracted orientation features; SVM and HMM classifiers were compared for recognition, with the SVM giving better results than the HMM in terms of elapsed time and average recognition rate. A hybrid method for hand segmentation based on depth and color data acquired by the Kinect sensor, with the help of skeletal data, was proposed in [64]. In this method, an image threshold is applied to the depth frame and a super-pixel segmentation method is used to extract the hand from the color frame; the two results are then combined for robust segmentation.
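As a rough sketch of how skeletal data can be turned into the geometric features listed above, the snippet below assumes that the hand joints have already been obtained (for example from a depth sensor or a landmark detector) as an array of 3D coordinates, and computes a joint-to-joint distance and a joint angle. The joint indices used are purely illustrative and depend on the tracker's joint layout.

import numpy as np

def joint_distance(joints, i, j):
    # Euclidean distance between two skeletal joints.
    return float(np.linalg.norm(joints[i] - joints[j]))

def joint_angle(joints, i, j, k):
    # Angle (radians) at joint j formed by the segments j->i and j->k.
    v1 = joints[i] - joints[j]
    v2 = joints[k] - joints[j]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical (N, 3) array of joint positions from a hand skeleton tracker.
joints = np.random.rand(21, 3)

# Illustrative features: a wrist-to-fingertip distance and a finger bend angle,
# assuming index 0 is the wrist, 8 a fingertip and 5-6-7 one finger's joints.
features = [joint_distance(joints, 0, 8), joint_angle(joints, 5, 6, 7)]
print(features)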

Fig 3.4 - Example of skeleton recognition using pixel values to represent the hand skeleton model.

DETECTION OF HAND GESTURE USING PIXEL VALUES


According to the information in Table 3.3, the depth camera provides good accuracy for segmentation, because it is not affected by lighting variations or cluttered backgrounds. However, the main issue is the range of detection. The Kinect V1 sensor has an embedded system that provides the depth-sensor feedback as metadata describing human body joint coordinates; it can track up to 20 skeletal joints, which helps to model the hand skeleton. The Kinect V2 sensor can track 25 joints, and up to six people simultaneously with full joint tracking, with a detection range of 0.5-4.5 m.

Table 3.3 - Set of research papers that have used skeleton-based recognition for hand gesture application.

3.2.4 MOTION
Motion-based recognition can be used for detection purposes; it extracts the object through a series of image frames. The AdaBoost algorithm is used for object detection, characterization, movement modeling and pattern recognition of the gesture [16]. The main issue with motion recognition arises when more than one gesture is active during the recognition process; a dynamic background also has a negative effect. In addition, a gesture may be lost because of occlusion among tracked hand gestures, errors in region extraction from the tracked gesture, or the effect of long distance on the region's appearance. Table 3.4 presents a set of research papers that use different segmentation techniques based on motion recognition to detect the ROI.
DETECTION OF MOTION BY HAND GESTURE
Two stages for efficient hand detection were proposed in [54]. First, the hand is detected in each frame and its center point is used for tracking. Then, in the second stage, a matching model is applied to each type of gesture using a set of features extracted from the motion tracking, in order to provide better classification; the main drawback is that skin color is affected by lighting variations, which leads to the detection of non-skin colors. A standard face detection algorithm and optical flow computation were used in [55] to give a user-centric coordinate frame in which motion features were used to recognize gestures, with a multiclass boosting algorithm for classification. A real-time dynamic hand gesture recognition system based on a TOF (time-of-flight) camera was offered in [56], in which motion patterns were detected from hand gestures received as input depth images.

These motion patterns were compared with hand motion classes computed from real dataset videos, without requiring a segmentation algorithm; the system provides good results, except for the depth-range limitation of TOF cameras. In [57], the YUV color space was used, with the help of the CAMShift algorithm, to distinguish between background and skin color, and a naive Bayes classifier was used for gesture recognition. The proposed system faces some challenges, such as illumination variation, where light changes affect the result of the skin segmentation. Another challenge is the degree of gesture freedom, where rotation changes directly affect the output result. Finally, there is the hand position capture problem: if the hand appears in a corner of the frame and the tracking dots that should cover the hand do not lie on it, the user's gesture may fail to be captured.

Fig 3.5 - Example on motion recognition using frame difference subtraction to extract hand feature, where
the moving object such as hand is extracted from the fixed background.

Fig 3.6 - Example on motion recognition using frame difference subtraction to extract hand feature, where
the moving object such as hand is extracted from the fixed background.
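A minimal frame-difference sketch of the kind illustrated in Figures 3.5 and 3.6 is given below: consecutive grayscale frames from a fixed camera are subtracted and the difference is thresholded to expose the moving hand. All parameter values are indicative only.

import cv2

cap = cv2.VideoCapture(0)                   # a fixed camera is assumed
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # The absolute difference between consecutive frames highlights motion;
    # the static background largely cancels out.
    diff = cv2.absdiff(gray, prev_gray)
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    motion_mask = cv2.dilate(motion_mask, None, iterations=2)

    cv2.imshow("motion mask", motion_mask)
    prev_gray = gray
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()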

According to the information in Table 3.4, the recognition rate of the system in the first row is 97%; its hybrid approach based on skin detection and motion detection is more reliable for gesture recognition, since the moving hand can be tracked using multiple track candidates based on standard deviation calculations for both the skin and motion cues. Every single gesture is encoded as a chain code, which is a simple model compared with an HMM, and gestures are classified using a model of the histogram distribution. The proposed system in the third row uses a TOF-based depth camera, where the motion pattern of the human arm model is used to define motion patterns; the authors confirm that using depth information for hand trajectory estimation improves the gesture recognition rate. Moreover, the proposed system needs no segmentation algorithm; it was evaluated using 2D and 2.5D approaches, where 2.5D performs better than 2D and gives a recognition rate of 95%.

Table 3.4 - A set of research papers that have used motion-based detection for hand gesture application.

3.3 TRACKING
Tracking, or the frame-to-frame correspondence of the segmented hand regions or features, is the
second step in the process towards understanding the observed hand movements. The importance of robust
tracking is twofold. First, it provides the inter-frame linking of hand/finger appearances, giving rise to
trajectories of features in time. These trajectories convey essential information regarding the gesture and
might be used either in a raw form (e.g. in certain control applications, such as virtual drawing, the tracked hand
trajectory directly guides the drawing operation) or after further analysis (e.g. recognition of a certain type of
hand gesture). Second, in model-based methods, tracking also provides a way to maintain estimates of model
parameters, variables and features that are not directly observable at a certain moment in time.

3.3.1 TEMPLATE BASED TRACKING


This class of methods exhibits great similarity to methods for hand detection. Members of this class invoke the hand detector in the spatial vicinity of the hand's location in the previous frame, so as to drastically restrict the image search space. The implicit assumption for this method to succeed is that images are acquired frequently enough.

Correlation-based feature tracking is directly derived from the above approach. In [CBC95, OZ97], correlation-based template matching is utilized to track hand features across frames. Once the hand(s) have been detected in a frame, the image regions in which they appear are utilized as the prototype to detect the hand in the next frame. Again, the assumption is that hands will appear in the same spatial neighborhood. This technique is employed for a static camera in [DEP96] to obtain characteristic patterns (or "signatures") of gestures, as seen from a particular view. The work in [HB96] also deals with variable illumination: a target is viewed under various lighting conditions, and a set of basis images is constructed that can approximate the appearance of the object under those conditions. Tracking then simultaneously solves for the affine motion of the object and the illumination. Real-time performance is achieved by pre-computing "motion templates", which are the product of the spatial derivatives of the reference image to be tracked and a set of motion fields.

Some approaches detect hands as image blobs in each frame and temporally correspond blobs that occur in proximate locations across frames. Approaches that utilize this type of blob tracking are mainly the ones that detect hands based on skin color, the blob being the correspondingly segmented image region (e.g. [BMM97, AL04b]). Blob-based approaches are able to retain tracking of hands even when there are great variations from frame to frame.

Extending the above approach, deformable contours, or "snakes", have been utilized to track hand regions in successive image frames [CJ92]. Typically, the boundary of the region is determined by an intensity or color gradient, though other types of image features (e.g. texture) can be considered. The technique is initialized by placing a contour near the region of interest.

The contour is then iteratively deformed towards nearby edges to better fit the actual hand region. This deformation is performed through the optimization of an "energy" functional that sums up the gradient at the locations of the snake while, at the same time, favoring the smoothness of the contour. When snakes are used for tracking, an active shape model is applied to each frame and the convergence of the snake in that frame is used as a starting point for the next frame. Snakes allow for real-time tracking and can handle multiple targets as well as complex hand postures. They exhibit better performance when there is sufficient contrast between the background and the object [CJHG95]. On the contrary, their performance is compromised in cluttered backgrounds, because the snake algorithm is sensitive to local optima of the energy function, often due to poor foreground/background separation or large object displacements and/or shape deformations between successive images.

Tracking local features on the hand has been employed in specific contexts only, probably because tracking local features does not guarantee the segmentation of the hands from the rest of the image. The methods in [MDC98, BH94] track hands in image sequences by combining two motion estimation processes, both based on image differencing. The first process computes differences between successive images; the second computes differences from a previously acquired background image. The purpose of this combination is increased robustness near shadows.
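A small sketch of the correlation-based template tracking idea described above is given below: the hand region detected in one frame is used as the template and is matched only within a window around its previous position in the next frame. The search margin and acceptance threshold are illustrative, and in practice the template would be refreshed from the newly tracked region so that gradual appearance changes are absorbed.

import cv2

def track_template(next_gray, template, prev_box, margin=40):
    # Search for the template only in the spatial vicinity of its previous
    # location, under the assumption that frames are acquired frequently.
    x, y, w, h = prev_box
    H, W = next_gray.shape
    x0, y0 = max(0, x - margin), max(0, y - margin)
    x1, y1 = min(W, x + w + margin), min(H, y + h + margin)
    search = next_gray[y0:y1, x0:x1]

    # Normalized cross-correlation between the template and the search window.
    result = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, loc = cv2.minMaxLoc(result)
    if score < 0.5:                          # correlation too weak: track lost
        return None
    return (x0 + loc[0], y0 + loc[1], w, h)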

3.3.2 OPTIMAL ESTIMATION


Feature tracking has been extensively studied in computer vision. In this context, the optimal
estimation framework provided by the Kalman filter [Kal60] has been widely employed in turning
observations (feature detection) into estimations (extracted trajectory). The reasons for its popularity are
real-time performance, treatment of uncertainty, and the provision of predictions for the successive
frames.
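A minimal sketch of this idea, assuming the hand centroid is measured in each frame by one of the detectors from Section 3.2, is a constant-velocity Kalman filter over the centroid position; the noise covariances below are illustrative and would need tuning for a real system.

import cv2
import numpy as np

# Constant-velocity model: state = [x, y, vx, vy], measurement = [x, y].
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def kalman_track(measured_xy):
    # Predict the hand centroid for the current frame, then correct the
    # estimate with the detected position when the detector succeeds.
    prediction = kf.predict()
    if measured_xy is not None:
        z = np.array([[measured_xy[0]], [measured_xy[1]]], dtype=np.float32)
        kf.correct(z)
    return float(prediction[0, 0]), float(prediction[1, 0])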

OPTIMAL ESTIMATION TRACKING


In [AL04b], the target is retained against cases where hands occlude each other, or appear as a single
blob in the image, based on a hypothesis formulation and validation/rejection scheme. The problem of
multiple blob tracking was investigated in [AL04a], where blob tracking is performed in both images of a
stereo pair and blobs are corresponded, not only across frames, but also across cameras.
The obtained stereo information not only provides the 3D locations of the hands, but also accommodates potential motion of the observing stereo pair, which could thus be mounted on a robot that follows the user. In [BK98, Koh97], the orientation of the user's hand was continuously estimated with the Kalman filter
to localize the point in space that the user indicates by extending the arm and pointing with the index finger.
In [UO99], hands are tracked from multiple cameras, with a Kalman filter in each image, to
estimate the 3D hand postures. Snakes integrated with the Kalman filtering framework (see below) have
been used for tracking hands [DS92]. Robustness against background clutter is achieved in [Pet99], where
the conventional image gradient is combined with optical flow to separate the foreground from the
background.
In order to provide accurate initialization for the snake in the next frame, the work in [KL01],
utilizes the optical flow to obtain estimations of the direction and magnitude of the target’s motion. The
success of combining optical flow is based on the accuracy of its computation and, thus, the approach is best
suited for the case of static cameras.
Treating the tracking of image features within a Bayesian framework has been long known to
provide improved estimation results. The works in [FB02, IB98b, VPGB02, HLCP02, IM01, KMA01]
investigate the topic within the context of hand and body motion. In [WAP97], a system tracks a single
person by color-segmentation of the image into blobs and then uses prior information about skin color and
topology of a person’s body to interpret the set of blobs as a human figure.
In [Bre97], a method is proposed for tracking human motion by grouping pixels into blobs based on
coherent motion, color and temporal support using an expectation-maximization (EM) algorithm. Each blob
is subsequently tracked using a Kalman filter. Finally, in [MB99, MI00], the contours of blobs are tracked
across frames by a combination of the Iterative Closed Point (ICP) algorithm and a factorization method to
determine global hand pose.
The approaches in [BJ96, BJ98c] reformulate the eigenspace reconstruction problem (reviewed in
Section 2.3.2) as a problem of robust estimation. The goal is to utilize the above framework to track the
gestures of a moving hand. To account for large affine transformations between the eigenspace and the
image, a multi-scale eigenspace representation is defined and a coarse-to-fine matching strategy is adopted.
In [LB96], a similar approach was proposed which uses a hypothesize-and-test approach instead of a
continuous formulation. Although this approach does not address parameterized transformations and
tracking, it exhibits robustness against occlusions. In [GMR+02], a real-time extension of the work in
[BJ96], based on EigenTracking [IB98a] is proposed.
Eigenspace representations have been utilized in a different way in [BH94] to track articulated
objects by tracking a silhouette of the object, which was obtained via image differencing.
A spline was fit to the object’s outline and the knot points of the spline form the representation of
the current view. Tracking an object amounts to projecting the knot points of a particular view onto the
eigenspace. Thus, this work uses the shape (silhouette) information instead of the photometric one (image
intensity values).
In [UO99], the 3D positions and postures of both hands are tracked using multiple cameras. Each
hand position is tracked with a Kalman filter and 3D hand postures are estimated using image features. This
work deals with the mutual hand-to-hand occlusion inherent in tracking both hands, by selecting the views in
which there are no such occlusions.

3.3.3 PARTICLE FILTERING
Particle filters have been utilized to track the position of hands and the configuration of fingers in
dense visual clutter. In this approach, the belief of the system regarding the location of a hand is modeled
with a set of particles. The approach exhibits advantages over Kalman filtering, because it is not limited by
the unimodal nature of Gaussian densities that cannot represent simultaneous alternative hypotheses. A
disadvantage of particle filters is that for complex models (such as the human hand) many particles are
required, a fact which makes the problem intractable especially for high-dimensional models. Therefore,
other assumptions are often utilized to reduce the number of particles. For example in [IB98a],
dimensionality is reduced by modeling commonly known constraints due to the anatomy of the hand.
Additionally, motion capture data is integrated in the model. In [MB99] a simplified and application-specific
model of the human hand is utilized.
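The sketch below is a generic, simplified particle filter for a 2D hand position, not the CONDENSATION algorithm itself; the likelihood function (for example, the skin-color probability at each particle's image location) is assumed to be supplied by the caller.

import numpy as np

class ParticleFilter2D:
    # Minimal particle filter for a 2D hand position (illustrative only).

    def __init__(self, n_particles, frame_size):
        w, h = frame_size
        # The belief about the hand location is a set of weighted particles.
        self.particles = np.random.uniform([0, 0], [w, h], size=(n_particles, 2))
        self.weights = np.full(n_particles, 1.0 / n_particles)

    def predict(self, motion_std=10.0):
        # Diffuse the particles with a simple random-walk motion model.
        self.particles += np.random.normal(0.0, motion_std, self.particles.shape)

    def update(self, likelihood_fn):
        # Re-weight each particle by how well it explains the observation.
        self.weights = np.array([likelihood_fn(p) for p in self.particles])
        self.weights += 1e-12
        self.weights /= self.weights.sum()

    def resample(self):
        # Draw particles in proportion to their weights (factored sampling).
        idx = np.random.choice(len(self.particles), len(self.particles),
                               p=self.weights)
        self.particles = self.particles[idx]
        self.weights.fill(1.0 / len(self.particles))

    def estimate(self):
        # The weighted mean of the particles is the tracked hand position.
        return (self.weights[:, None] * self.particles).sum(axis=0)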

The CONDENSATION algorithm [IB98a] which has been used to learn to track curves against cluttered
backgrounds, exhibits better performance than Kalman filters, and operates in real-time. It uses “factored
sampling”, previously applied to the interpretation of static images, in which the probability distribution of
possible interpretations is represented by a randomly generated set. Condensation uses learned dynamical
models, together with visual observations, to propagate this random set over time. The result is highly robust
tracking of agile motion.

In [MI00] the “partitioned sampling” technique is employed to avoid the high computational cost that
particle filters exhibit when tracking more than one object. In [LL01], the state space is limited to 2D
translation, planar rotation, scaling and the number of outstretched fingers.

Extending the CONDENSATION algorithm the work in [MCA01], detects occlusions with some
uncertainty. In [PHVG02], the same algorithm is integrated with color information; the approach is based on
the principle of color histogram distance, but within a probabilistic framework, the work introduces a new
Monte Carlo tracking technique. In general, contour tracking techniques, typically, allow only a small subset
of possible movements to maintain continuous deformation of contours.

This limitation was overcome to some extent in [HH96b], who describe an adaptation of the
CONDENSATION algorithm for tracking across discontinuities in contour shapes.

3.3.4 CAMSHIFT

Object detection has grown rapidly over the past few years, and the use of deep learning techniques such as convolutional neural networks has also been increasing. However, before CNNs became mainstream, other techniques were commonly used. This section covers one such technique, which uses histograms as object descriptors to track objects.

TRACKING BY CAMSHIFTING

The CAMShift (Continuously Adaptive Mean Shift) algorithm is a color-based object tracking method introduced by Gary Bradski in 1998 to reduce the computational complexity of the methods used at that time and to deal with problems such as image noise, distractors (e.g. other identical objects in the scene), irregular object motion due to perspective, and lighting variations. CAMShift is derived from the mean shift algorithm, which finds the center of the probability distribution of the object to track; the main difference is that CAMShift adapts its search window size, for example when objects change in size as they move closer to or farther from the camera. The algorithm can be used directly through the implementation available in OpenCV. The steps of the algorithm are summarized in the block diagram from Bradski's original publication (Figure 3.7), and the discussion below is organized into the following parts:

1. Hue Color Histogram


2. Probability Distribution
3. Image Moments
4. CAMShift
5. Final Considerations

1. HUE COLOR HISTOGRAM:

After the desired location has been found, the color histogram of the object is created. To achieve this, the Hue Saturation Value (HSV) color system is used to obtain the hue value. The reason is simple: HSV separates the hue from the saturation and brightness. With this method, color can be represented better than in the RGB model, since the descriptor of the object can be represented in a single histogram containing 360 values. As described in the original paper, with the HSV model it is possible to track people of different skin tones with the same model; human skin tones largely share the same hue value, and the main difference lies in the color saturation, which is separated out in the HSV model.

2. PROBABILITY DISTRIBUTION:
During the execution, the stored histogram is used as a lookup table to convert each frame to its
corresponding probability image. To do this, we normalize the values of the histogram in a range between 0
and 1 and we interpret these values as a probability.As you can see in the figure below, when S and V are
low, the small number of discrete hue pixels in the spectrum cannot adequately represent the changes in
RGB. For this reason, we have to set all pixels with a saturation value below a certain threshold to 0.The
threshold for these two values is dependent on the type of scenario. To trace the car, I chose a threshold of S
= 0.2 and V = 0.5, but feel free to test out other values. Furthermore, I decided to exclude all pixels that have
a probability under 0.21.

3. IMAGE MOMENTS:
Once the frame has been converted, the search window is moved to the area of maximum pixel density of the probability distribution. To do this, the mean shift algorithm is used.

4. CAMSHIFT ALGORITHM:

The CAMShift algorithm is calculated using the following steps:

1. Choose the initial location of the search window


2. Execute the mean shift (one or more iterations):
   2.1) Compute the mean location in the search window.
   2.2) Centre the search window at the mean location computed in the previous step.
   2.3) Repeat steps 2.1 and 2.2 until convergence (or until the mean location moves by less than a preset threshold).
3. Set the search window size equal to a function of the zeroth moment found in Step 2.

For simplicity, a histogram of the first frame's region of interest (ROI) can be created and the object then tracked in the subsequent frames using this histogram; in the original algorithm, however, the histogram should be adapted in each frame. Furthermore, the new size of the ROI is set only at the end of the mean shift iterations, although it can also be set after each iteration. After finding the new ROI of the object, its orientation angle can be computed from the second-order central image moments, using the standard formula θ = ½·arctan(2μ11 / (μ20 − μ02)).
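A hedged usage sketch of the OpenCV implementation mentioned above is given below; the initial search window coordinates and the histogram/back-projection thresholds are placeholders that depend on where the hand appears in the first frame.

import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, frame = cap.read()

# Hypothetical initial search window (x, y, w, h) placed around the hand.
track_window = (200, 150, 100, 100)
x, y, w, h = track_window
roi = frame[y:y + h, x:x + w]

# Hue histogram of the ROI, ignoring dark or unsaturated pixels.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array([0, 60, 32]), np.array([180, 255, 255]))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 mean shift iterations or when the window moves less than 1 px.
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back-project the stored hue histogram to obtain the probability image.
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)

    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term_crit)
    pts = np.int32(cv2.boxPoints(rot_rect))
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)

    cv2.imshow("CAMShift tracking", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()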
5. FINAL CONSIDERATIONS:

CAMShift is a color-based object tracking algorithm and therefore has some limitations. One of the major drawbacks of the standard variant is that it cannot track the desired target when the background (or a nearby object) is of the same color. The mean shift moves the search window to the area of maximum pixel density of the probability distribution, which is created from the histogram of the target to track. In such cases it would be necessary to encode structural image features, which is not possible with color-based tracking.
Color can also be irrelevant in many scenarios (for example, the color of different dogs for object classification), and the target can change its appearance at any time. The algorithm has been improved over the years to try to cope with this limitation, but to take the structural features of the image into account, other approaches, such as HOG (Histogram of Oriented Gradients), have been developed. Nowadays, however, better results can be achieved using deep learning.

Fig 3.7 - Block diagram of CAMSHIFT

3.4 RECOGNITION

The overall goal of hand gesture recognition is the interpretation of the semantics that the hand(s) location, posture, or gesture conveys. Basically, there have been two types of interaction in which hands are employed in the user's communication with a computer. The first is control applications such as drawing, where the user sketches a curve while the computer renders this curve on a 2D canvas [LWH02, WLH01]. Methods that relate to hand-driven control focus on the detection and tracking of some feature (e.g. the fingertip or the centroid of the hand in the image) and can be handled with the information extracted through the tracking of these features. The second type of interaction involves the recognition of hand postures, or signs, and gestures. Naturally, the vocabulary of signs or gestures is largely application dependent. Typically, the larger the vocabulary is, the harder the recognition task becomes.

Two early systems indicate the difference between recognition [BMM97] and control [MM95]: the first recognizes 25 postures from the International Hand Alphabet, while the second was used to support interaction in a virtual workspace. The recognition of postures is a topic of great interest in its own right, because of sign language communication.

Moreover, it also forms the basis of numerous gesture-recognition methods that treat gestures as a series of
hand postures. Besides the recognition of hand postures from images, recognition of gestures includes an
additional level of complexity, which involves the parsing, or segmentation, of the continuous signal into
constituent elements. In a wide variety of methods (e.g. [TVdM98]), the temporal instances at which hand
velocity (or optical flow) is minimized are considered as observed postures, while video frames that portray
a hand in motion are sometimes disregarded (e.g. [BMM97]). However, the problem of simultaneous
segmentation and recognition of gestures without being confused with inter-gesture hand motions remains a
rather challenging one. Another requirement for this segmentation process is to cope with the shape and time
variability that the same gesture may exhibit, e.g. when performed by different persons or by the same
person at different speeds.

The fact that even hand posture recognition exhibits considerable levels of uncertainty renders the above
processing computationally complex or error prone. Several of the reviewed works indicate that a lack of
robustness in gesture recognition can be compensated by addressing the temporal context of detected
gestures. This can be established by letting the gesture detector know of the grammatical or physical rules
that the observed gestures are supposed to express. Based on these rules, certain candidate gestures may be
improbable. In turn, this information may disambiguate candidate gestures, by selecting to recognize the
most likely candidate.

3.4.1 K-MEAN RECOGNITION
K-means clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science. It groups an unlabeled dataset into different clusters, where K defines the number of predefined clusters to be created: if K = 2 there will be two clusters, for K = 3 three clusters, and so on. The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats the process until it finds the best clusters; the value of K must be predetermined.
The k-means clustering algorithm mainly performs two tasks:

• Determines the best values for the K center points (centroids) by an iterative process.
• Assigns each data point to its closest centroid; the data points that are near a particular centroid form a cluster.

Hence, each cluster contains data points with some commonalities and is far from the other clusters.
The steps of the K-means clustering algorithm (illustrated in Figure 3.8) are:
Step 1: Select the number K to decide the number of clusters.
Step 2: Select K random points as centroids (these need not come from the input dataset).
Step 3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step 4: Calculate the variance and place a new centroid for each cluster.
Step 5: Repeat step 3, i.e. reassign each data point to the new closest centroid of its cluster.
Step 6: If any reassignment occurred, go to step 4; otherwise finish.
Step 7: The model is ready.

• Take the number of clusters K = 2, to identify the dataset and put the points into different clusters; that is, the dataset is grouped into two different clusters.
• Choose K random points or centroids to form the clusters. These points can be either points from the dataset or any other points; here, two points that are not part of the dataset are selected as the initial centroids, as illustrated in Figure 3.8.

Fig 3.8 - K-MEAN Recognition.

Each data point in the scatter plot is then assigned to its closest centroid, computed using the distance between two points; a median line can be drawn between the two centroids to visualize the assignment. The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a significant task; there are several ways to find the optimal number of clusters, i.e. the value of K.
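A compact sketch that follows the steps listed above is given below in plain Python/NumPy; the random 2D data and K = 2 are only illustrative.

import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    # Minimal K-means following the steps listed above (illustrative only).
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])

        # Steps 5-6: stop when the centroids (and hence assignments) settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example with K = 2 on random 2D data.
data = np.random.default_rng(1).random((100, 2))
labels, centers = kmeans(data, k=2)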

3.4.2 K-NEAREST NEIGHBOR RECOGNITION

Face classification is the stage in which testing data are matched against training data from face datasets. KNN is one of the simplest algorithms that can be used for classification; despite its simplicity, it is quite effective. The method was first proposed by T. M. Cover and P. E. Hart in 1967 [13] and has since been modified to improve its performance. The basic concept of KNN is to classify a testing sample based on a number of training samples: if k = 1, the testing sample is assigned to the class of its single nearest neighbor. However, finding the right value of k for a particular problem is itself a problem that affects the performance of KNN [14].
K-nearest-neighbor (kNN) classification is one of the most fundamental and simple classification methods and should be one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. It was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. In an unpublished US Air Force School of Aviation Medicine report in 1951, Fix and Hodges introduced a non-parametric method for pattern classification that has since become known as the k-nearest-neighbor rule (Fix & Hodges, 1951). In 1967, some of the formal properties of the rule were worked out; for instance, it was shown that for k = 1 and n → ∞ the k-nearest-neighbor classification error is bounded above by twice the Bayes error rate (Cover & Hart, 1967). Once such formal properties were established, a long line of investigation ensued, including new rejection approaches (Hellman, 1970), refinements with respect to the Bayes error rate (Fukunaga & Hostetler, 1975), distance-weighted approaches (Dudani, 1976; Bailey & Jain, 1978), soft computing methods (Bermejo & Cabestany, 2000) and fuzzy methods (Jozwik, 1983; Keller et al., 1985).

CHARACTERISTICS OF KNN:

There is a strong linear relationship between 10-fold cross-validation accuracy for the nine data sets considered and the ratio of the feature sum[-log(p)] to the number of features. The liver data set resulted in the lowest accuracy, while the Fisher Iris data resulted in the greatest accuracy. The low value of sum[-log(p-value)] for features in the liver data set will on average result in lower classification accuracy, whereas the greater level of sum[-log(p-value)] for the Fisher Iris and cancer data sets yields much greater accuracy. Regarding k-nearest neighbor performance (k = 5, feature standardization) for various cross-validation methods on each data set, 2- and 5-fold cross-validation ("CV2" and "CV5") performed worse than 10-fold ("CV10") and leave-one-out cross-validation ("CV-1"), while 10-fold cross-validation was approximately the same as leave-one-out; bootstrapping resulted in slightly lower performance than CV10 and CV-1. When averaging performance over all data sets (k = 5), both feature standardization and feature fuzzification resulted in greater accuracy than no feature transformation. Considering CV10 accuracy for each data set as a function of k with no transformation, standardization, and fuzzification, feature standardization and fuzzification greatly improved the accuracy on the dermatology and wine data sets, while fuzzification slightly reduced the performance on the Fisher Iris data. Interestingly, performance on the soybean data set did not improve with increasing values of k, suggesting overlearning or overfitting.

The performance of the k-nearest neighbor classification method can be assessed using several data sets, cross-validation, and bootstrapping. All methods involve an initial distance matrix and the construction of a confusion matrix during sample testing, from which classification accuracy is determined. With regard to accuracy calculation, for cross-validation it is recommended that the confusion matrix be filled incrementally with the results for all input samples partitioned into the various groups, and that accuracy then be calculated, rather than calculating accuracy and averaging after each partition of training samples is used for testing. In other words, for 5-fold cross-validation, it is not recommended to calculate accuracy after the first 4/5 of the samples are used for training and the first 1/5 for testing. Instead, it is better to determine accuracy after all 5 partitions have been used for testing, filling in the confusion matrix for each input sample along the way, and then to re-partition the samples into 5 groups and repeat training and testing on each of the partitions. As another example, consider an analysis with 100 input samples and 10-fold cross-validation: the suggestion is not to calculate average accuracy every time 10 of the samples are used for testing, but rather to go through the 10 partitions in order to fill in the confusion matrix for the entire set of 100 samples, and then calculate accuracy. This should be repeated, e.g. 10 times, with re-partitioning each time.
The hold-out method of accuracy determination is another approach to assess the performance of k-nearest neighbor. Here, input samples are randomly split into two groups, with about 2/3 (~66%) of the input samples assigned to the training set and the remaining 1/3 (~33%) assigned to testing; the training results are used to classify the test samples. A major criticism of the hold-out method compared with cross-validation is that it makes inefficient use of the data set, since the data are split once and used once in this configuration to assess classification accuracy. It is important to recognize that the hold-out method is not the same as predicting class membership for an independent set of supplemental experimental validation samples. Validation sets are used when the goal is to confirm the predictive capabilities of a classification scheme based on an independent set of supplemental samples not previously used for training and testing. Laboratory investigations involving molecular biology and genomics commonly use validation sets raised independently from the original training/testing samples. By using an independent set of validation samples, the ability of a set of pre-selected features (e.g. mRNA or microRNA transcripts, or proteins) to correctly classify new samples can be better evaluated. The attempt to validate a set of features using a new set of samples should be done carefully, since processing new samples at a later date using different lab protocols, buffers, and technicians can introduce significant systematic error into the investigation. As a precaution, a laboratory should plan on processing the independent validation set of samples in the same laboratory, using the same protocol and buffer solutions, the same technician(s), and preferably at the same time the original samples are processed.
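The following scikit-learn sketch mirrors the evaluation procedure described above: k = 5 with feature standardization, 10-fold cross-validation in which one confusion matrix is filled for all samples before accuracy is computed, and a roughly 66%/33% hold-out split. The Fisher Iris data is used only as a stand-in dataset.

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# k = 5 with feature standardization, as in the discussion above.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 10-fold cross-validation ("CV10"): predictions for every sample are gathered
# first, the confusion matrix is filled once, and accuracy is computed last.
y_pred = cross_val_predict(knn, X, y, cv=10)
print(confusion_matrix(y, y_pred))
print("CV10 accuracy: %.3f" % accuracy_score(y, y_pred))

# Hold-out method: ~66% of the samples for training, ~33% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
print("hold-out accuracy: %.3f" % knn.fit(X_tr, y_tr).score(X_te, y_te))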

3.4.3 MEAN SHIFT CLUSTERING

Mean shift falls into the category of clustering algorithms; it is an unsupervised learning method that assigns data points to clusters iteratively by shifting points towards the mode (the mode being the region of highest density of data points, in the context of mean shift). As such, it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in image processing and computer vision. Unlike the popular K-means clustering algorithm, mean shift does not require the number of clusters to be specified in advance; the number of clusters is determined by the algorithm from the data.

Kernel Density Estimation - The first step when applying the mean shift clustering algorithm is representing the data mathematically, i.e. as a set of points. Mean shift builds upon the concept of kernel density estimation (KDE). Imagine that the data were sampled from a probability distribution; KDE is a method to estimate the underlying distribution, also called the probability density function, for a set of data. It works by placing a kernel on each point in the data set. A kernel is a weighting function generally used in convolution; there are many different types of kernels, but the most popular is the Gaussian kernel. Adding up all of the individual kernels generates a probability surface, i.e. an estimated density function. Depending on the kernel bandwidth parameter used, the resultant density function will vary; for example, a Gaussian kernel with a bandwidth of 2 produces one particular KDE surface for a given set of points.
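A short scikit-learn sketch of mean shift clustering is given below; the two Gaussian blobs are synthetic stand-in data, and the bandwidth is estimated from the data rather than fixed to a value such as 2.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Two illustrative Gaussian blobs of 2D points.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                    rng.normal(6.0, 1.0, (100, 2))])

# The kernel bandwidth plays the role described above for the KDE surface.
bandwidth = estimate_bandwidth(points, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(points)
print("number of clusters found:", len(ms.cluster_centers_))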

3.4.4 SUPPORT VECTOR MACHINE


Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for both classification and regression problems; however, it is primarily used for classification in machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate the n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane; these extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Figure 3.9 shows two different categories that are classified using a decision boundary, or hyperplane.

Fig 3.9 - RECOGNITION BY SUPPORT VECTOR MACHINE

Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train the model with many images of cats and dogs so that it can learn their different features, and then test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each class; on the basis of the support vectors, it will classify the new example as a cat.
SVM can be of two types:

• Linear SVM: used for linearly separable data, i.e. data that can be classified into two classes by a single straight line; the classifier used is called a linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data, i.e. data that cannot be classified by a straight line; the classifier used is called a non-linear SVM classifier. Both cases are illustrated in the sketch below.
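The sketch below contrasts the two cases on synthetic scikit-learn datasets: a linear kernel on linearly separable blobs, and an RBF kernel on concentric circles that no straight line can separate. The datasets and parameters are illustrative only.

from sklearn.datasets import make_blobs, make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Linearly separable data -> linear SVM classifier.
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_lin, y_lin, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
print("linear SVM accuracy:", linear_svm.score(X_te, y_te))

# Non-linearly separable data (concentric circles) -> kernel (non-linear) SVM.
X_nl, y_nl = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_nl, y_nl, random_state=0)
rbf_svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print("RBF SVM accuracy:", rbf_svm.score(X_te, y_te))

# The support vectors are the training points closest to the decision boundary.
print("number of support vectors:", rbf_svm.support_vectors_.shape[0])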

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane:
There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points; this best boundary is known as the hyperplane of the SVM. The dimensionality of the hyperplane depends on the number of features in the dataset: with 2 features the hyperplane is a straight line, and with 3 features it is a two-dimensional plane. We always create the hyperplane with maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of the
hyperplane are termed as Support Vectors. These vectors support the hyperplane, hence called a Support
vector.

3.4.5 HIDDEN MARKOV MODEL


A Hidden Markov Model (HMM) is a statistical model in which a set of hidden parameters is
determined from a set of related, observable parameters. In a HMM, the state is not directly observable, but
instead, variables influenced by the state are. Each state has a probability distribution over the possible
output tokens. Therefore, the sequence of tokens generated by an HMM provides information about the
sequence of states. In the context of gesture recognition, the observable parameters are estimated by
recognizing postures (tokens) in images. For this reason and because gestures can be recognized as a

sequence of postures, HMMs have been widely utilized for gesture recognition. In this context, it is typical
that each gesture is handled by a different HMM.

The recognition problem is transformed to the problem of selecting the HMM that matches best the observed
data, given the possibility of a state being observed with respect to context. This context may be spelling or
grammar rules, the previous gestures, cross-modal information (e.g. audio) and others. An excellent
introduction and further analysis on the approach, for the case of gesture recognition, can be found in
[WB95]. Early versions of this approach can be found in [YOI92, SHJ94a, RKS96], where the HMMs operate directly on the intensity values of the images acquired by a static camera. In [ML97], the edge image combined with intensity information is used to create a static posture representation or a search pattern.
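A small NumPy sketch of this selection scheme is given below: each gesture is represented by its own discrete HMM over posture symbols, the scaled forward algorithm computes the log-likelihood of an observed posture sequence under each model, and the gesture whose HMM explains the sequence best is selected. The two-state models and their probabilities are made-up toy values, not parameters from the cited works.

import numpy as np

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    # Scaled forward algorithm: log P(obs | HMM) for a sequence of posture
    # indices, given start, transition and emission probabilities.
    alpha = start_p * emit_p[:, obs[0]]
    c = alpha.sum() + 1e-300
    log_lik = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]
        c = alpha.sum() + 1e-300
        log_lik += np.log(c)
        alpha = alpha / c
    return log_lik

def recognize(obs, gesture_hmms):
    # Select the gesture whose HMM best explains the observed posture sequence.
    scores = {name: forward_log_likelihood(obs, *params)
              for name, params in gesture_hmms.items()}
    return max(scores, key=scores.get)

# Hypothetical two-state HMMs over three posture symbols for two gestures.
hmms = {
    "wave":  (np.array([0.6, 0.4]),
              np.array([[0.7, 0.3], [0.3, 0.7]]),
              np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])),
    "point": (np.array([0.5, 0.5]),
              np.array([[0.9, 0.1], [0.2, 0.8]]),
              np.array([[0.1, 0.1, 0.8], [0.2, 0.6, 0.2]])),
}
print(recognize([0, 1, 1, 0], hmms))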

The work in [RKE98] includes the temporal component in an approach similar to that of [BD96] and HMMs
are trained on a 2D “motion image”. The method operates on coarse body motions and visually distinct
gestures executed on a plane that is fronto-parallel to the camera. Images are acquired in a controlled setting,
where image differencing is utilized to construct the required motion image. Incremental improvements of
this work have been reported in [EKR+98]. The work in [VM98], proposes a posture recognition system
whose inputs are 3D reconstructions of the hand (and body) articulation. In this work, HMMs are coupled
with 3D reconstruction methods to increase robustness. In particular, moving limbs are extracted from
images, using the segmentation of [KMB94] and, subsequently, joint locations are recovered by inferring the
articulated motion from the silhouettes of segments.
The process is performed simultaneously from multiple views and the stereo combination of these
segmentations provides the 3D models of these limbs which are, in turn, utilized for recognition. In
[SWP98], the utilized features are the moments of skin-color based blob extraction for two observed hands.
Grammar rules are integrated in the HMM to increase robustness in the comprehension of gestures. This
way, posture combinations can be characterized as erroneous or improbable depending on previous gestures.
In turn, this information can be utilized as feedback to increase the robustness of the posture recognition
task and, thus, produce overall more accurate recognition results.
The approach in [LK99], introduces the concept of a threshold model that calculates the likelihood threshold
of an input (moments of blob detection). The threshold model is a weak model for the superset of all gestures
in the vocabulary and its likelihood is smaller than that of the correct gesture model for a given gesture, but
larger than for a non-gesture motion. This can be utilized to detect if some motion is part of a gesture or not.
To reduce the states model, states with similar probability distributions are merged, based on a relative
entropy measure. In [WP97], the 3D locations that result from stereo multiple-blob tracking are input to a
HMM that integrates a skeletal model of the human body. Based on the 3D observations, the approach
attempts to infer the posture of the body.
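The one-HMM-per-gesture scheme described above can be sketched in a few lines (an illustrative simplification, not the method of any cited paper; it assumes the hmmlearn package is available, and the training sequences are hypothetical per-frame posture features):

    # Sketch: train one Gaussian HMM per gesture; recognition selects the model
    # with the highest log-likelihood for an observed posture sequence.
    import numpy as np
    from hmmlearn import hmm

    def train_gesture_hmm(sequences, n_states=4):
        # sequences: list of (T_i x D) arrays of per-frame posture features
        X = np.concatenate(sequences)
        lengths = [len(s) for s in sequences]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    def recognize(models, observation):
        # models: dict mapping gesture name -> trained HMM; observation: (T x D) array
        scores = {name: m.score(observation) for name, m in models.items()}
        return max(scores, key=scores.get)

A threshold model, as in [LK99], could be added as one more entry in the dictionary so that non-gesture motions are rejected when no gesture model scores above it.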

Conceptually similar to conditional based reasoning is the “causal analysis” approach. This approach stems
from work in scene analysis [BBC93], which was developed for rigid objects of simple shape (blocks, cubes
etc). The approach uses knowledge about body kinematics and dynamics to identify gestures based on
human motor plans, based on measurements of shoulder, elbow and wrist joint positions in the image plane.
From these positions, the system extracts a feature set that includes wrist acceleration and deceleration,
effort to lift the hand against gravity, size of gesture, area between arms, angle between forearms, nearness
to body etc. Gesture filters use this information, along with causal knowledge on humans interaction with
objects in the physical world, to recognize gestures such as opening, lifting and pushing.

Fig 3.10 - RECOGNITION BY HIDDEN MARKOV MODEL

3.4.6 DYNAMIC TIME WARPING


The DTW algorithm has earned its popularity by being an extremely efficient time-series similarity measure which minimizes the effects of shifting and distortion in time by allowing an “elastic” transformation of the time series in order to detect similar shapes with different phases. Given two time series X = (x1, x2, ..., xN), N ∈ ℕ, and Y = (y1, y2, ..., yM), M ∈ ℕ, represented by sequences of values (or curves represented by sequences of vertices), DTW yields the optimal solution in O(MN) time, which can be improved further through different techniques such as multi-scaling [17] [19]. The only restriction placed on the data sequences is that they should be sampled at equidistant points in time (this problem can be resolved by re-sampling). If the sequences take values from some feature space Φ, then in order to compare two different sequences X, Y ∈ Φ one needs a local distance measure, defined as a function d: Φ × Φ → R≥0. Intuitively, d has a small value when sequences are similar and a large value when they are different. Since a Dynamic Programming algorithm lies at the core of DTW, it is common to call this distance function the “cost function”, and the task of optimal alignment of the sequences becomes the task of arranging all sequence points so as to minimize the cost function (or distance). The algorithm starts by building the distance matrix C ∈ R^(N×M), representing all pairwise distances between X and Y.
Dynamic Time Warping (DTW) is a way to compare two -usually temporal- sequences that do not
sync up perfectly. It is a method to calculate the optimal matching between two sequences. DTW is
useful in many domains such as speech recognition, data mining, financial markets, etc. It’s
commonly used in data mining to measure the distance between two time-series.
Formulation
Let’s assume we have two sequences like the following:
X = x[1], x[2], ..., x[i], ..., x[n]
Y = y[1], y[2], ..., y[j], ..., y[m]
The sequences X and Y can be arranged to form an n-by-m grid, where each point (i, j) is the alignment between x[i] and y[j]. A warping path W maps the elements of X and Y to minimize the distance between them. W is a sequence of grid points (i, j). We will see an example of the warping path later.
Warping Path and DTW distance
The optimal (accumulated) cost of reaching grid point (i_k, j_k) can be computed by the standard DTW recurrence
D(i, j) = d(x[i], y[j]) + min( D(i-1, j), D(i, j-1), D(i-1, j-1) ),
where d is the Euclidean distance between the aligned samples. Then, the overall path cost is the sum of the local distances d(x[i_k], y[j_k]) over all points (i_k, j_k) of the warping path W, i.e., the value D(n, m) accumulated at the end point of the path.
RESTRICTIONS ON WARPING FUNCTIONS


The warping path is found using a dynamic programming approach to align two sequences. Going through
all possible paths is “combinatorially explosive” [1]. Therefore, for efficiency purposes, it’s important to
limit the number of possible warping paths, and hence the following constraints are outlined:

● Boundary condition: This constraint ensures that the warping path begins at the start points of both signals and terminates at their endpoints.
● Monotonicity condition: This constraint preserves the time-order of points (no going back in time).
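A compact dynamic-programming sketch of the DTW distance described above is given below (illustrative only; it assumes Python with NumPy and a one-dimensional Euclidean local cost, and the boundary and monotonicity constraints are enforced by the recurrence itself):

    # Dynamic Time Warping: fills the accumulated cost matrix D and returns D[N, M].
    import numpy as np

    def dtw_distance(x, y):
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])          # local (Euclidean) distance for 1-D samples
                D[i, j] = cost + min(D[i - 1, j],        # insertion
                                     D[i, j - 1],        # deletion
                                     D[i - 1, j - 1])    # match
        return D[n, m]

    # Example: two similar shapes with different phases
    print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))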

3.4.7 TIME DELAY NEURAL NETWORK


The time-delay neural network (TDNN) is widely used in speech recognition software for the acoustic model, which converts the acoustic signal into a phonetic representation. The papers describing the TDNN can be dense, so this section briefly demystifies the architecture.

The TDNN was originally designed by Waibel ([4], [5]) and later popularized by Peddinti et al ([3]), who
used it as part of an acoustic model. It is still widely used for acoustic models in modern speech recognition
software (such as Kaldi) in order to convert an acoustic speech signal into a sequence of phonetic units
(phones). The inputs to the network are the frames of acoustic features. The outputs of the TDNN are a
probability distribution over each of the phones defined for the target language. That is, the goal is to read
the audio one frame at a time, and to classify each frame into the most likely phone. In one layer of the
TDNN, each input frame is a column vector representing a single time step in the signal, with the rows
representing the feature values. The network uses a smaller matrix of weights (the kernel or filter), which
slides over this signal and transforms it into an output using the convolution operation, which we will see in
a moment.
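One layer of this sliding-kernel operation can be sketched as a 1-D convolution over the frame axis (an illustrative approximation, assuming PyTorch; the feature count of 40 and phone count of 10 are hypothetical):

    # One TDNN layer as a 1-D convolution over time: each output frame sees a
    # small context window (here 3 frames, i.e. time delays 0, 1, 2) of the input.
    import torch
    import torch.nn as nn

    n_features, n_phones = 40, 10
    layer1 = nn.Conv1d(in_channels=n_features, out_channels=8, kernel_size=3)  # delays 0-2
    layer2 = nn.Conv1d(in_channels=8, out_channels=n_phones, kernel_size=5)    # delays 0-4

    frames = torch.randn(1, n_features, 100)          # one utterance of 100 frames
    hidden = torch.relu(layer1(frames))
    phone_scores = layer2(hidden)                     # per-frame scores over phones
    phone_probs = torch.softmax(phone_scores, dim=1)  # probability distribution per frame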
The architecture of our best BDEV network was originally formulated in terms of replicated units trained
under constraints which ensured that the copies of a given unit applied the same weight pattern to successive
portions of the input (Lang, 1987). Because the constrained training procedure for this network is similar to
the standard technique for recurrent back-propagation training (Rumelhart, Hinton, & Williams, 1986), it is
natural to re-interpret the network in iterative terms (Hinton, 1987b). According to this viewpoint, the 3-
layer network described in section 3.6 has only 16 input units, 8 hidden units, and 4 output units. Each input
unit is connected to each hidden unit by 3 different links having time delays of 0, 1, and 2. Each hidden unit
is connected to each output unit by 5 different links having time delays of 0, 1, 2, 3, and 4. The input
spectrogram is scanned one frame at a time, and activation is iteratively clocked upwards through the
network.
The time-delay nomenclature associated with this iterative viewpoint was employed in describing the
experiments at the Advanced Telecommunications Research Institute in Japan which confirmed the power of
the replicated network of section 3.6 by showing that it performed better than all previously tried techniques
on a set of Japanese consonants extracted from continuous speech.

The idea of replicating network hardware to achieve position independence is an old one (Fukushima, 1980).
Replication is especially common in connectionist vision algorithms where local operators are
simultaneously applied to all parts of an image (Marr & Poggio, 1976). The inspiration for the external time
integration step of our time-delay neural network (TDNN) was Michael Jordan's work on back propagating
errors through other post-processing functions (Jordan, 1986). Waibel (1989) describes a modular training
technique that made it possible to scale the TDNN technology up to a network which performs speaker
dependent recognition of all Japanese consonants with an accuracy of 96.7%. The technique consists of
training smaller networks to discriminate between subsets of the consonants, such as bdg and ptk, and then
freezing and combining these networks along with "glue" connections that are further trained to provide
interclass discrimination.
Networks similar to the TDNN have been independently designed by other researchers. The time
concentration network of Tank and Hopfield (1987) was motivated by properties of the auditory system of
bats, and was conceived in terms of signal processing components such as delay lines and tuned filters. This
network is interesting because variable length time delays are learned to model words with different
temporal properties, and because it is one of the few connectionist speech recognition systems actually to be
implemented with parallel hardware instead of being simulated by a serial computer. An interesting
performance comparison between a TDNN and a similarly structured version of Kohonen's LVQ2 classifier
on the ATR bdg task is reported in McDermott and Katagiri (1989). The same 15 × 16 input spectrograms
were used for both networks. In the LVQ2 network, a 7-step window (which is the amount of the input
visible to a single output unit copy in the TDNN) was passed over the input, and the nearest of 150 LVQ2
codebook entries was determined for each input window position. These codebook entries were then
summed to provide the overall answer for a word. The replicated LVQ2 network achieved nearly identical
performance to the TDNN with less training cost, although recognition was more expensive. A comprehensive survey of the field of connectionist speech recognition can be found in Lippmann.
In order to facilitate a multiresolution training procedure, the time-delay network of section 3.6 was
modified slightly so that the widths of its receptive fields would be divisible by 2. While the network had
previously utilized hidden unit receptive fields that were 3 time steps wide and output unit receptive fields
that were 5 time steps wide, its connection pattern was adjusted to make all of its receptive fields 4 time
steps wide (see Figure 7(b)). Because this modification would have increased the total number of weights in
the network, the number of hidden units was decreased from 8 to 6. After these changes, the network
contained 490 unique weights. The half-resolution version of the network shown in Figure 7(a) was also
constructed. This network covered the input patterns using six 24-ms frames rather than the twelve 12-ms
frames of the full-resolution network. In the half-resolution version of the network, the receptive fields were
all 2 frames wide. Multiresolution training is conducted in two stages. In the first stage, the half-resolution network is trained from small random weights on half-resolution versions of the training patterns until its training set accuracy reaches a specified level. Then, the network's weights are used to initialize the full-
resolution network, which is further trained on full-resolution versions of the training patterns. Figure 8
illustrates this two-stage training procedure, which saves time because the half-resolution network can be
simulated with only one-fourth as many connections as the full-resolution network.

Fig 3.11 - ARCHITECTURE OF TIME DELAY NEURAL NETWORK

Fig 3.12 - STRUCTURE OF TIME DELAY NEURAL NETWORK

3.4.8 FINITE STATE MACHINE

A finite state machine (sometimes called a finite state automaton) is a computation model that can be
implemented with hardware or software and can be used to simulate sequential logic and some computer
programs. Finite state automata generate regular languages. Finite state machines can be used to model
problems in many fields including mathematics, artificial intelligence, and games.

There are two types of finite state machines (FSMs): deterministic finite state machines, often called
deterministic finite automata, and non-deterministic finite state machines, often called non-deterministic
finite automata. There are slight variations in ways that state machines are represented visually, but the ideas
behind them stem from the same computational ideas. By definition, deterministic finite automata recognize,
or accept, regular languages, and a language is regular if a deterministic finite automaton accepts it. FSMs
are usually taught using languages made up of binary strings that follow a particular pattern. Both regular
and non-regular languages can be made out of binary strings. An example of a binary string language is: the
language of all strings that have a 0 as the first character. In this language, 001, 010, 0, and 01111 are valid strings (along with many others), but strings like 111, 10000, 1, and 11001100 (along with many others) are
not in this language. You can walk through the finite state machine diagram to see what kinds of strings the
machine will produce, or you can feed it a given input string and verify whether or not there exists a set of
transitions you can take to make the string (ending in an accepting state).
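The example language above ("all binary strings whose first character is 0") corresponds to a small deterministic finite automaton; the following minimal Python sketch (illustrative only, not part of the report's system) walks an input string through its transition table:

    # DFA for the language "all binary strings whose first character is 0".
    # States: 'start', 'accept', 'reject'; 'accept' is the only accepting state.
    TRANSITIONS = {
        ("start", "0"): "accept",
        ("start", "1"): "reject",
        ("accept", "0"): "accept",
        ("accept", "1"): "accept",
        ("reject", "0"): "reject",
        ("reject", "1"): "reject",
    }

    def accepts(string):
        state = "start"
        for symbol in string:
            state = TRANSITIONS[(state, symbol)]
        return state == "accept"

    print(accepts("01111"))  # True  - in the language
    print(accepts("10000"))  # False - not in the language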
Mealy State Machine

A Finite State Machine is said to be a Mealy state machine, if outputs depend on both present inputs &
present states. The block diagram of the Mealy state machine is shown in the following figure.

As shown in the figure, there are two parts present in the Mealy state machine: combinational logic and memory. Memory is useful to provide some or part of the previous outputs (present states) as inputs to the combinational logic. So, based on the present inputs and present states, the Mealy state machine produces outputs. Therefore, the outputs will be valid only at the positive or negative transition of the clock signal. The state diagram of the Mealy state machine is shown in the following figure.

In the above figure, there are three states, namely A, B & C. These states are labeled inside the circles &
each circle corresponds to one state. Transitions between these states are represented with directed lines.

Here, 0 / 0, 1 / 0 & 1 / 1 denotes input / output. In the above figure, there are two transitions from each state
based on the value of the input, x. In general, the number of states required in the Mealy state machine is less
than or equal to the number of states required in the Moore state machine. There is an equivalent Moore
state machine for each Mealy state machine.

MOORE STATE MACHINE


A Finite State Machine is said to be a Moore state machine, if outputs depend only on present states. The
block diagram of the Moore state machine is shown in the following figure.

As shown in the figure, there are two parts present in the Moore state machine: combinational logic and memory. In this case, the present inputs and present states determine the next states. So, based on the next states, the Moore state machine produces the outputs. Therefore, the outputs will be valid only after a transition of the state. The state diagram of the Moore state machine is shown in the following figure.

In the above figure, there are four states, namely A, B, C & D. These states and the respective outputs are labelled inside the circles. Here, only the input value is labeled on each transition. In the above figure, there are two transitions from each state based on the value of the input, x. In general, the number of states required in the Moore state machine is more than or equal to the number of states required in the Mealy state machine. There is an equivalent Mealy state machine for each Moore state machine. So, based on the requirement, we can use one of them.

CHAPTER - 4
Software Platform / Frame works

4.1 INTRODUCTION

When implementing a technique/algorithm for developing an application which detects, tracks and recognizes hand gestures, the main thing to consider is the methodology used to recognize the gestures. This section discusses platforms which support gesture recognition through various methods and aid the development of small to medium scale software applications.

Basically, a framework works as a kind of support structure for something to be built on top of. A software framework is an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thereby producing application-specific software.

4.2 OPEN CV
It is an open source computer vision programming functions library aimed at developing
applications based on real time computer vision technologies. This framework has BSD license which
enables usage of the framework for both commercial and research purposes.

OpenCV (Bradski and Kaehler, 2008) was originally developed in C but now provides full libraries for C++ and Python and support for the Android platform.

It also runs on Linux, MacOS X and Windows, providing extensive cross-platform compatibility. Support for the Eclipse IDE, C++ Builder and Dev-C++ IDE gives developers easy access to build applications with any of these IDEs.

Some example applications of the OpenCV library are object identification, segmentation and recognition, face recognition, gesture recognition, motion tracking, mobile robotics, etc.

Despite these extensive features, the framework also requires strong knowledge of both development and integration methods to create a software application.
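As a flavour of the kind of application OpenCV supports, the short sketch below (assumptions: the Python bindings of OpenCV are installed, a webcam is available at index 0, and the HSV skin-colour range is only a rough starting point) captures frames and produces a skin-colour mask of the sort used in colour-based hand detection:

    # Sketch: webcam capture + HSV skin-colour segmentation with OpenCV (cv2).
    # The HSV thresholds below are assumptions and usually need tuning.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)                 # default webcam
    lower_skin = np.array([0, 30, 60], dtype=np.uint8)
    upper_skin = np.array([20, 150, 255], dtype=np.uint8)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower_skin, upper_skin)   # binary skin mask
        cv2.imshow("frame", frame)
        cv2.imshow("skin mask", mask)
        if cv2.waitKey(1) & 0xFF == ord("q"):             # press 'q' to quit
            break

    cap.release()
    cv2.destroyAllWindows()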

4.3 MATLAB
Matrix Laboratory (MATLAB) is a numerical computing environment and fourth-generation programming language developed by MathWorks. MATLAB allows implementation of algorithms, matrix manipulations, plotting of functions and data, creation of user interfaces and interfacing with programs written in other languages including C, C++, Java and Fortran.

It also runs on Linux, MacOS X and Windows platforms, providing extensive cross-platform compatibility. MATLAB provides the Image Processing Toolbox, which offers a comprehensive set of reference-standard algorithms and graphical tools for image processing, analysis, visualization, and algorithm development.

Different tasks like image enhancement, image deblurring, feature detection, noise reduction, image segmentation, geometric transformations, image registration, object tracking, recognition, etc. can be performed using the toolbox. Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor computers.

Although MATLAB is intended primarily for numeric computing, an optional toolbox uses the MuPAD symbolic engine, allowing access to symbolic computing abilities. An additional package, Simulink, adds graphical multi-domain simulation and model-based design for dynamic and embedded systems.

4.4 i GESTURE

A well-reputed and older gesture recognition framework is iGesture. This framework is Java based and focused on extensibility and cross-platform reusability. The distinctive feature of the iGesture framework is that it supports both developers and designers in developing new hand gesture recognition algorithms.

The integrated iGesture framework includes the gesture recognition framework and the iGesture tool component for creating custom gesture sets. This makes it better than other frameworks, as the others have predefined gestures and developers are limited to those gestures.

The iGesture tools also provide the ability to evaluate the usability, performance and effectiveness of new and existing hand gesture recognition algorithms. The main disadvantage of this framework is its long learning period: because of the extensive features the framework offers, developers must have a good understanding of the principles and methods of using the framework in software applications.

While there exists a variety of gesture recognition frameworks, none of them addresses the
issues of supporting both application developers as well as the designers of new recognition
algorithms. iGesture supports application developers who would like to add new gesture
recognition functionality to their application as well as designers of new gesture recognition
algorithms.

The iGesture framework can easily be configured to use any of the existing recognition
algorithms (e.g. Rubine, SiGeR) or customised gesture sets can be defined. Furthermore, our test
bench provides tools to verify new gesture recognition algorithms and to evaluate their
performance.

4.5 AForge.NET
The AForge.NET Framework is an open source framework based on C#/.NET designed to provide developers with a platform for developing applications in Computer Vision and Artificial Intelligence. The framework supports image processing, neural networks, genetic algorithms, machine learning and robotics.

The framework is published under the LGPL v3 license. It has gained popularity because of its ease of use, effectiveness and short learning period, and it is compatible with .NET Framework 2.0 and above.

The framework can be easily integrated with the Microsoft Visual Studio IDE for development. It consists of a set of libraries and sample applications that can be used as a basis to develop a gesture recognition application.

AForge.NET is a computer vision and artificial intelligence library originally developed by Andrew Kirillov for the .NET Framework. The source code and binaries of the project are available under the terms of the Lesser GPL and the GPL (GNU General Public License). Another (unaffiliated) project called Accord.NET was created to extend the features of the original AForge.NET library.

Table 4.1 - Analysis of some vital literature related to vision based hand gesture recognition systems 1

Table 4.2 - Analysis of some vital literature related to vision based hand gesture recognition systems 2

Table 4.3 - Analysis of some vital literature related to vision based hand gesture recognition systems 3

Table 4.4 - Analysis of some vital literature related to vision based hand gesture recognition systems 4

CHAPTER - 5

Vision based hand gesture recognition analysis for future perspectives

5.1 INTRODUCTION
This section provides an analysis of the previous sections discussed in the paper, along with an insight into the future perspectives of vision based hand gesture recognition for human computer interaction. Hand gesture recognition techniques rely heavily on core image processing techniques for detection, tracking and recognition.

As computers become more pervasive in society, facilitating natural human–


computer interaction (HCI) will have a positive impact on their use. Hence, there has been
growing interest in the development of new approaches and technologies for bridging the
human–computer barrier.

The ultimate aim is to bring HCI to a regime where interactions with computers
will be as natural as an interaction between humans, and to this end, incorporating
gestures in HCI is an important research area. Gestures have long been considered as an
interaction technique that can potentially deliver more natural, creative and intuitive
methods for communicating with our computers.

This paper provides an analysis of comparative surveys done in this area. The use of hand gestures as a natural interface serves as a motivating force for research in gesture taxonomies, their representations and recognition techniques, and software platforms and frameworks, which are discussed briefly in this paper. It focuses on the three main phases of hand gesture recognition, i.e. detection, tracking and recognition.

Different applications which employ hand gestures for efficient interaction have been discussed under core and advanced application domains. This paper also provides an analysis of existing literature related to gesture recognition systems for human computer interaction by categorizing it under different key parameters.

It further discusses the advances that are needed to improve present hand gesture recognition systems from a future perspective so that they can be widely used for efficient human computer interaction.

The main goal of this survey is to provide researchers in the field of gesture based HCI with a summary.

Most complete hand-interactive systems can be considered to be comprised of three layers: detection, tracking and recognition. The detection layer is responsible for defining and extracting visual features that can be attributed to the presence of hands in the field of view of the camera(s). The tracking layer is responsible for performing temporal data association between successive image frames, so that, at each moment in time, the system may be aware of “what is where”. Moreover, in model-based methods, tracking also provides a way to maintain estimates of model parameters, variables and features that are not directly observable at a certain moment in time. Last, the recognition layer is responsible for grouping the spatiotemporal data extracted in the previous layers and assigning the resulting groups labels associated with particular classes of gestures. In this section, research on these three identified subproblems of vision-based gesture recognition is reviewed.

5.2 Recognition techniques limitations

There are many limiting factors influencing the use of these core technologies in real time systems, as discussed in the previous sections, with the key issues highlighted here. Although color based techniques are used for segmentation in the detection phase, color based segmentation can generally be confused by the presence of objects with a color distribution similar to that of the hand.

Within the shape based core technologies of hand gesture recognition, information is obtained by extracting the contours of objects, but weak classifier implementations can generate faulty results. 3D hand models used for hand detection achieve view-independent detection, but the fitting is guided by forces that pull the detected characteristic points away from goal positions on the hand models.

Motion based hand detection is not much favored because of the underlying assumption that movement in the image is due only to hand movements. If the core technologies of hand detection are improved to operate at high frame-rate acquisition, they become effective for the tracking phase as well. Correlation based feature tracking uses template matching, which is not very effective under varying illumination conditions. Contour based tracking also suffers from the limitation of requiring smooth contours.

Optimal tracking operates on observations to transform them into estimates, but it is not robust enough to operate against cluttered backgrounds. Similarly, the advanced technologies for hand gesture recognition have their own sets of limitations and overheads as they evolve. The overall performance of any gesture recognition system depends heavily on the set of techniques used for its implementation.

Hence, it is necessary to find the optimal set of techniques at the different phases, a choice that is very much dependent on the application for which the system has been developed. The limitations of the varied application domains of hand gesture recognition systems are discussed as follows:

5.3 Application domain constraints

It is often assumed that the application domains for which hand gesture recognition systems are implemented are restricted to a single application only.

Research in HCI to date has concentrated on the design and development of application-specific gesture recognition systems in general, and hand gesture recognition systems in particular.

The core limitation because of which most developed hand gesture recognition systems are application specific is the cognitive mapping of gestures to commands operating within the application.

These cognitive mappings of gesture to command are easier when a system is developed for a single application. The complexity associated with converting this cognitive mapping of gestures to commands for different applications has hindered the evolution of application-independent hand gesture recognition systems. Hence, one of the prime concerns for the future is for gesture recognition systems to be application independent.

5.4 Real - time challenges

The present survey has found, across the literature, a tendency of the developed hand gesture recognition systems to aim for a specific performance accuracy against the various real time challenges faced during the design and implementation of these systems.

These real time challenges range from variations in illumination conditions to occlusion problems, to real time performance requirements, along with forward and backward compatibility among the technologies implemented.

Nevertheless, while some of these real time challenges have been addressed to a certain extent by some authors, no robust framework solving all of these real time challenges has yet been proposed.

Efforts need to be organized for the design and development of a framework that generates a hand gesture recognition system satisfying all the real time challenges posed by these systems.

Without a detailed level of performance defined within the framework, it would be really difficult to develop an optimal solution for the various real time challenges. The static and dynamic backgrounds from which the hand gestures need to be segmented are also among the prime real time challenges that need to be addressed for the wide applicability of these systems.

5.5 Robustness

Evaluating the robustness of a hand gesture recognition system is a complicated task, as there are no standard baseline algorithms that could accurately define the quantitative or qualitative robustness of any gesture recognition system, nor is there any formal performance comparison framework for recognition systems. Still, based on the typical problems faced, the robustness of a hand gesture recognition system can be defined under the three major verticals of user, gesture and application, with specification of the conditional assumptions taken during development of the system.

Being user adaptive is one of the prime requirements of any hand gesture recognition system for its wide acceptability. This includes the system being independent of the type of user, the user's experience with such systems and the user's compatibility with the system. Secondly, the gestures used by the system need to be user friendly, with high intuitiveness and low stress and fatigue. The system also needs to be gesture independent in terms of its cognitive mapping to the set of commands. This means the system must be capable of switching the cognitive mapping of the same gesture to a different set of commands and vice versa.

CHAPTER 6
SYSTEM EVALUATION

6.1 INTRODUCTION

Depending on the requirements for system evaluation, a system can be evaluated at the application
level, functional level or the technological level. The application determines the system's boundaries, the
degree of abstraction and the suitable methods. The focus of the system evaluation can be the motor
vehicle as a whole, the engine control unit, as well as the single bond connection.

Fraunhofer IZM is bridging the gap between “small” and “large” systems. A system can be
optimized using both top-down and bottom-up approaches. As a first step weaknesses are identified on the
basis of system modeling. Using significance analysis, the effects of critical parameters on the target
function are investigated. Therefore the range extends from fault tree analysis to comprehensive state
models.

Existing models can also be analyzed on the basis of experienced faults and then optimized
accordingly. The available laboratory equipment can be used to provide product qualification.

6.2 EXPERIMENTS

6.2.1 EXPERIMENT ON USER VARIABLES

Variables are an important part of an eye tracking experiment. A variable is anything that can
change or be changed. In other words, it is any factor that can be manipulated, controlled for, or
measured in an experiment. Experiments contain different types of variables. We will present you with
some of the main types of experimental variables, their definitions and give you examples containing all
variable types.

Types of experimental variables:

● Independent variables (IV): These are the factors or conditions that you manipulate in
an experiment. Your hypothesis is that this variable causes a direct effect on the
dependent variable.
● Dependent variables (DV): These are the factors that you observe or measure. As you vary your independent variable, you watch what happens to your dependent variable.

Fig 6.1 - Relationship between the independent and dependent variable

Extraneous variable:
An extraneous variable is any extra factor that may influence the outcome of an experiment, even though it is not the focus of the experiment. Ideally, these variables won't affect the conclusions drawn from the results, as a careful experimental design should spread their influence equally across your test conditions and stimuli.

Nevertheless, extraneous variables should always be considered and controlled when possible, as they may introduce unwanted variation in your data. In this case, you need to tweak your design and procedure to be able to keep the variation constant or find a strategy to monitor its influence (constant or controlled variables).

All experiments have extraneous variables. Here are some examples of different types of
extraneous variables:

● aspects of the environment where the data collection will take place, e.g., room temperature, background noise level, light levels;
● differences in participant characteristics (participant variables); and
● test operator or experimenter behavior during the test, i.e., their instructions to the test participants are not consistent or they give unintentional clues of the goal of the experiment to the participants.

Fig 6.2 - Effect of extraneous variables on the relationship between the independent and dependent variables

● Controlled (or constant) variables: Are extraneous variables that you manage to
keep constant or controlled for during the course of the experiment, as they
may have an effect on your dependent variables as well.

● Participant variables: Participant variables can be defined as the differing


individual characteristics that may impact how a participant responds in an
experiment. Examples of participant variables include gender, age,
ethnicity, socioeconomic status, literacy status, mood, clinical diagnosis etc.

● Stimulus variables: These are specific features of your stimulus or group of


stimuli that are part of the context in which the behavior occurs. These are often
an expression of or a subset of your independent variables and covariates.
Examples include the number of items, item category, stimulus crowdedness,
color, brightness, contrast, etc.

6.2.2 BACKGROUND ROBUSTNESS

With the development of information technology, recognition of the moving body has become an
important research area in military affairs, national defense and other domains. Usually, we recognize
moving bodies continuously in video supervisory control. The most commonly used methods are background
subtraction method [1], optical flow method [2] and temporal difference method [3]. But all of them
have problems.

A pixel model of the background must be constructed when using the background subtraction method. It then obtains the moving object by comparing every frame to the background. Though it is simple and quick, its accuracy is affected by changes in illumination, noise and so on. The main problems are voids, shadows and the elongation of the moving object.

The main task of the optical flow method is the computation of the optic flow field. It contains two steps: one is to estimate the motion field from the time-space gradient of the image sequence under a smoothness constraint; the other is to separate the moving object and the background from changes in the motion field. The main problem is the high computational complexity, so it cannot be used in the real-time domain because of its high computational time and weak anti-noise performance [4].

The temporal difference method uses the difference between neighboring frames for recognition. It recognizes moving objects when the difference result is larger than a given threshold. This method is robust to illumination changes and shadows [5]. So we first obtain the foreground with an improved triple temporal difference method. Then we smooth edges and eliminate noise in these images by mathematical morphology.

Finally, we eliminate the voids and connect disconnected areas with the quadruple directions connection method. The experimental results validate that the new method is fast and accurate. Moreover, this method shows its robustness to the background.

Recognition method of the moving body (foreground extraction): the temporal difference method subtracts all pixels of neighboring frames. A region is considered static when the difference is small under similar illumination. On the other hand, moving objects produce a large difference. We then mark these areas with large differences and find the moving objects in the frames [6].

The temporal difference method shows good performance when the movement is uniform, but it is not so good when the movement is not uniform. Another problem is that the moving object covers a large number of pixels when it is recognized by the temporal difference method.

So we improved it to a triple temporal difference method to recognize the moving body. This method not only improves the recognition rate, but also improves the integrality of moving objects. Let I_n(x) be the gray value of pixel x in frame n and T_n(x) be the corresponding threshold.
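A simplified sketch of plain two-frame temporal differencing with OpenCV follows; it only illustrates the basic idea of comparing |I_n(x) - I_{n-1}(x)| against a threshold T, not the improved triple-difference method of the cited work (the threshold value 25 and the webcam index are assumptions):

    # Two-frame temporal differencing: pixels whose gray-level change exceeds a
    # threshold are marked as moving foreground.
    import cv2

    cap = cv2.VideoCapture(0)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev_gray)                          # |I_n(x) - I_{n-1}(x)|
        _, moving = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # compare to T
        cv2.imshow("moving regions", moving)
        prev_gray = gray
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()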

Quadruple directions connection process

After dilation, the edge of the foreground object is clearer and the voids in the object are partly filled. Then we use the quadruple directions connection method to connect the remaining voids. The original image is shown in figure 2 with the foreground coloured black.

Figure 6.3 shows the target area of connection obtained using the connection model. In this paper, the connection distance is 5 pixels. Quadruple directions means left, upper left, upper and upper right. We do not connect the other four directions because they are symmetrical; in fact, the quadruple directions method thereby halves the computation.

Fig 6.3 - Flow chart of tripling temporal difference method
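The edge-smoothing, noise-elimination and void-filling steps can be approximated with standard morphological operators; the sketch below is an analogy using OpenCV's built-in operations, not an implementation of the quadruple directions connection method itself (the 5 x 5 structuring element is an assumed size):

    # Morphological clean-up of a binary foreground mask: opening removes small
    # noise, closing fills small voids, and dilation connects nearby regions.
    import cv2
    import numpy as np

    def clean_foreground(mask):
        kernel = np.ones((5, 5), np.uint8)                          # assumed structuring element
        opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)     # remove isolated noise
        closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)  # fill small voids
        return cv2.dilate(closed, kernel, iterations=1)             # connect nearby areas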

6.2.3 LIGHTING INFLUENCE

Ambient light intensity changes substantially from day to night. Our eyes adapt to this broad
luminance range by changing the pupil size and switching between the photopic and
mesopic/scotopic systems. Most of us have experienced the transition between the two systems
when entering a dark place, such as the cinema, from bright daylight outside or when coming out of
it. The underlying cause of this is that our retina is composed of rod and cone receptors that operate
differently in bright and dark environments (e.g., Pokorny & Smith, 1997; Zele & Cao, 2015).

Cones, which have their highest concentration in the fovea centralis, are responsible for color
vision and function best in a bright, photopic environment. Rods, by contrast, are denser in the
extrafoveal parts of the retina and support peripheral vision; they are entirely responsible for
scotopic vision (Várady & Bodrogi, 2006; Zele & Cao, 2015), at luminance levels below 10−3 cd/m2,
such as in a moonless night. Between photopic and scotopic vision, there is a transitional range of
luminance from about 10−3 to 3 cd/m2, known as mesopic vision, in which both cones and rods
contribute to the visual response.

The shift of spectral sensitivity from photopic to mesopic-scotopic vision alters information
processing in visual discrimination and identification tasks, given that less foveal and relatively
more peripheral information is available in mesopic-scotopic vision (Pokorny & Smith, 1997; Zele
& Cao, 2015).

In addition, we face considerable changes and deficits in our perceptual ability (e.g., the
Purkinje shift of the peak luminance sensitivity toward the blue end of the color spectrum; see
Barlow, 1957). While performance degrades with increasing eccentricity in both photopic (Lee,
Legge, & Ortiz, 2003) and mesopic vision (Paulun, Schütz, Michel, Geisler, & Gegenfurtner, 2015),
the degradation is much weaker in the mesopic range.

For instance, target detection is relatively unaffected by varying target eccentricity under
mesopic vision conditions (Hunter, Godde, & Olk, 2017); but search for, say, a Gabor patch appearing
in a peripheral region (e.g., at an eccentricity of 7.5° of visual angle) within a noisy background
requires fewer saccades and is more efficient under scotopic compared with photopic conditions
(Paulun et al., 2015).

Findings such as these suggest that the visual system extracts useful information from a larger region of the visual field during each eye fixation (i.e., extending the visual span; Rayner, 1998) to compensate for the degradation of visual information in mesopic-scotopic vision, as compared with photopic vision.

While light intensity and display contrast greatly influence visual search, it is unclear
whether statistical learning of spatial target–distractor regularities within the search arrays would
work in the same way under different luminance and item-to-background contrast conditions.

Research on contextual learning and retrieval has revealed the availability of invariant local
context relations within the viewing span to be crucial for contextual cueing. For instance,
Geringswald, Baumgartner, and Pollmann (2012) observed that the loss of central-foveal vision
(by computer simulation) in visual search eliminates contextual cueing.

A further study of a group of participants with age-related macular degeneration (AMD),


who suffer from impaired foveal vision, showed that they profit less from contextual cues compared
with a control group of unimpaired observers (Geringswald, Herbik, Hoffmann, & Pollmann,
2013).

Similarly, when the viewing span was limited (e.g., two to three items) within each fixation
by means of a gaze-contingent display manipulation, Zang et al. (2015) also found barely any effect
of contextual cueing. But when the whole display was made available unrestricted, contextual
cueing manifested immediately—indicating that limiting the visual span effectively blocks the
retrieval of already learnt contexts.

Moreover, when the whole spatial configuration of the search items (but not their identity)
was briefly presented (for 150 ms) prior to gaze-contingent search, the limited local invariant
context was able to facilitate search—indicating that a certain amount of global context is required
for successful contextual retrieval. Thus, the extant studies using gaze-contingent viewing manipulations and, respectively, AMD patients point to separate roles of local and global spatial-relational information for contextual cueing, and studies with gaze-contingent viewing manipulations reveal differential contributions of foveal and peripheral information to the cueing effect.

Differential contributions of the global and local context have also been confirmed in studies
of search in naturalistic scenes (Brockmole, Castelhano, & Henderson, 2006; Brooks, Rasmussen, &
Hollingworth, 2010). Brockmole et al. (2006) devised a local context (e.g., a table) containing a
search target (a letter) embedded in a global context (e.g., a library scene).

Their findings revealed contextual cueing to be biased towards global-context associations: in a transfer block (following initial learning), when the target appeared at the same location within the same global context (e.g., the library scene), contextual cueing was preserved even when the local context was changed (e.g., changing the table); but changing the global context abolished contextual cueing.

Varying the association of a global scene with a local search array, Brooks et al. (2010) further demonstrated that, under certain conditions, the global–local context representations may be organized hierarchically in determining contextual learning and retrieval: when a (predictive) global scene was uniquely associated with a local (repeated) search array, changing the global scene disrupted contextual cueing—consistent with Brockmole et al. (2006).

However, no nesting of the local within the global representation was evident when a (repeated) search array was not consistently paired with a global scene during initial learning; in this case, contextual cueing remained robust despite changes of the global background.

Collectively, these studies—whether using just an abstract search array or a more scene-based search scenario—point to the important roles of local and global context in learning and retrieval of context cues. However, all of these studies were conducted under photopic, high-contrast lighting and stimulus conditions, so it remains unclear whether the changes of the visual span brought about by switching between photopic and mesopic vision (see above) would exert the same influences on contextual cueing.

This is not to say that the findings necessarily extend one-to-one to naturalistic scenes, which, qua being meaningful, provide additional cues deriving from “scene grammar” (e.g., Võ & Wolfe, 2013; Wolfe et al., 2011). However, it is reasonable to assume that changes of the lighting conditions engender similar adjustments of basic visual information processing (balance of rod/cone system, size of visual span), regardless of whether the scene is artificial or naturalistic.

Fig 6.4 - Influences of luminance contrast and ambient lighting on visual context learning and retrieval

6.2.4 HAND ORIENTATION


Many tasks require us to keep track of our limbs with partial or no visual feedback. For
example, a driver may operate the steering wheel of their vehicle, change gear, and activate the
turn signal all while their gaze is directed at the road. A proprioceptive sense of limb position is
generated by integrating peripheral signals from stretch receptors in the muscles, skin and joints,
as well as central signals such as motor commands and perceived effort (Winter et al. 2005;
Proske and Gandevia 2009; Smith et al. 2009; Medina et al. 2010; for a review, see Proske and
Gandevia 2012).

Behavioural research has found systematic biases in participant-reported hand position


when visual feedback is absent. Recently, we found that errors in perceived index finger
orientation were biased towards specific angles that varied as a function of plane of operation
(frontoparallel or horizontal) and hand tested (left or right) in right-handed individuals (Fraser
and Harris 2016).

It has been suggested that perceived hand location is biased towards relevant or likely
manual workspaces (Ghilardi et al. 1995; Haggard et al. 2000; Rincon-Gonzalez et al. 2011); our
findings extended this theory by suggesting perceived finger orientation similarly deviates
towards common functional postures of the hands (Fraser and Harris 2016). In the present study,
we compared perceived finger orientation of left- and right-handed individuals in various hand
locations in order to further test this possibility.

Several behavioural measures have been used to examine the accuracy and precision of proprioceptive hand and arm position sense. For example, Haggard and colleagues asked participants to mark the location of their unseen hand on the underside of a table using a pen held in the fingers of the hand; perceived hand location was biased further to the left for the left hand, and further to the right for the right hand (Haggard et al. 2000). Other studies using similar reaching tasks have found similar biases, as have tasks asking participants to report static arm position with respect to a visual or proprioceptive target (Van Beers et al. 1998; Jones et al. 2010).

In contrast, Schmidt and colleagues measured arm position sense using a paradigm where the participant's forearm was slowly rotated about the elbow and the participant indicated when their arm passed under an LED. In this paradigm, perceived location of the arm was biased inwards (towards the body) in right-handers (Schmidt et al. 2013). Ghilardi et al. used a reaching paradigm where participants moved their unseen hand to a visual target and also found reach biases consistent with a bias of initial hand position inwards, towards the body midline (Ghilardi et al. 1995).

The inconsistency in the direction of reported hand location biases may be due in part to differences in hand posture and task demands. For instance, Jones et al. (2012) tested participants' ability to reach to a target, reach to the remembered location of a reach target (reach reproduction), or judge the location of a remembered reach target with respect to a visual stimulus. They found errors similar to Haggard et al. (2000) in the reaching and location estimation tasks, but not in the reach reproduction task. These results suggest that conscious proprioceptive limb position sense may be biased while movement reproduction mechanisms remain accurate (e.g., Vindras et al. 1998).

The evidence described above suggests that there may be a systematic error in the
proprioceptive mapping of the hands in space. It is known that training and experience can lead
to shifts in proprioceptive localization (Cressman and Henriques 2009) and proprioceptive acuity
(Wong et al. 2011). The study by Ghilardi et al. found that following training in a novel
workspace, participants’ limb position estimates shifted towards this space (Ghilardi et al. 1995),
indicating the location of relevant manual tasks and workspaces can influence proprioceptive
maps in a dynamic way.

Recently we reported that in right-handers, the perceived orientation of the index finger
of the left hand was biased towards ~25° inwards, while the right hand was biased towards only
~2° inwards, when the hand was held pronate in the horizontal plane (i.e., palm-down; Fraser and
Harris 2016). These angles, which we termed the “axes of least error”, reflect hand positions
commonly adopted by right-handers when doing bimanual tasks in this plane (Sainburg 2002,
2005).

That is, right-handers tend to stabilize an item with their left hand (a piece of paper, a loaf of bread), and manipulate tools with their right (a pen, a bread knife). We suggested that these common hand behaviours lead to a priori assumptions about the likely orientation of the fingers of that hand, which in turn affect proprioceptively judged finger orientation estimates. Proprioceptive mapping of hand position in space does appear to be idiosyncratic, with individual participants' localization errors remaining stable. However, our previous data (Fraser and Harris 2016) suggest that, at least for perceived finger orientation, significant group-level trends do exist. These group-level characteristics may be driven by shared experience with common everyday manual tasks, such as writing.

If perceived orientation of the fingers were biased towards common functional hand
postures, we would expect to see a reversal of the errors found in right-handers (Fraser and
Harris 2016) for a group with reversed functional roles of the hands, i.e., left-handers. That is,
if the dominant and non-dominant hands were biased towards unique positions based on their
common roles, these biases should be reversed in groups with reversed dominant and non-
dominant hands.

However, there is evidence that, compared with right-handers, left-handers are actually more accurate in judging the length of their arms (Linkenauger et al. 2009), represent the space around their body more evenly (Hach and Schütz-Bosbach 2010) and are more accurate in sensing their arm position (Schmidt et al. 2013). These differences may be due to reduced hemispheric lateralization in left-handers (Przybyla et al. 2012; Vingerhoets et al. 2012), and/or to greater flexibility in switching between dominant and non-dominant hands to accommodate a “right-handed world”.

Some authors have argued that left-handers may use different strategies for estimating limb position, such as a pictorial representation of the body (Gentilucci et al. 1998; Schmidt et al. 2013). Therefore it is possible that left-handers may not show the opposite pattern to right-handers in their perception of orientation. In order to address this, we compared the perceived orientation of the fingers in right- and left-handed individuals with the goal of better understanding the factors contributing to proprioceptive finger orientation sense.

We predicted that right-handers would show a pattern of responses similar to those


reported in Fraser and Harris (2016), while left-handers would show a different,
potentially reversed pattern of errors.

Additionally, some studies have reported that hand localization errors are reduced for locations closer to the body midline (Wilson et al. 2010; Rincon-Gonzalez et al. 2011). We tested right- and left-handed individuals' perceived finger orientation when the hand was located directly in front of the body midline, aligned with the shoulder, or displaced to the ipsilateral side, to determine whether errors in reported finger orientation increased with the whole hand's distance from the body midline for both groups.

Fig 6.5 - A schematic of the experimental apparatus

The apparatus used in this experiment is the same as the “horizontal configuration”
described in Fraser and Harris (2016). Participants were seated in front of a table on which rested
a motor, a monitor and a mirror (Fig. 1a). The motor (Applied Motion Products 23Q-3AE
Integrated Stepper Motor, 20,000 steps per revolution) was positioned 35 cm away from the
participant with the shaft sitting 15 cm above the table surface, pointing towards the ceiling. A 5
cm wooden dowel was fastened to the motor shaft orthogonal to the axis of rotation.

During the experiment the participant’s index finger was attached to this dowel with two
lengths of flexible wire such that the axis of rotation passed through the proximal
interphalangeal joint (PIP) of the finger, with the hand pronate (palm facing downward) (Fig.
1b). The motor was connected via a serial port to a laptop that controlled the rotation of the
motor by means of a custom written program running in MATLAB. The motor was pre-set to
accelerate and decelerate at 0.5 revolutions/s2 to a maximum velocity of 0.5 revolutions/s.

This apparatus served to passively rotate the participant’s finger, and by extension the rest
of the hand, about an axis orthogonal to the index finger’s PIP joint, to orientations specified by
the test program (–30° to 30° in 10° steps, with 0° corresponding to straight ahead with respect to
the body; see convention below). Pilot testing found that finger rotation to angles greater than
30° was painful and introduced large changes in elbow position to avoid discomfort, which is why
test orientations were restricted to this range in this study.

A mirror was mounted horizontally halfway between a horizontally mounted monitor
and the shaft of the motor, obscuring the participant’s hand from view (Fig. 1a). The monitor
(ASUS VS247H-P 23.6″ widescreen LCD) faced downwards such that images presented on the
screen were reflected in the mirror and were seen at the depth of the participant’s hand. During
the experiment participants were instructed to look at the images in the mirror. The participants’
elbow was not fixed, allowing them to adopt natural arm and wrist positions to accommodate the
motion of their finger.

Hand locations tested

Perceived index finger orientation was tested in six experimental conditions (2 hands × 3
hand locations) presented in a blocked design. Participants sat with their right or left hand located
(1) directly in front of the body (“midline”), (2) aligned with the shoulder (“shoulder”), or (3) at
twice the distance between the midline and the shoulder, to the ipsilateral side (“outside”) (Fig.
1c).

Measuring perceived finger orientation

Perceived finger orientation was measured by having participants align a visual line with
their unseen finger. The visual line was presented on the monitor and reflected in the mirror in
which participants viewed the image (Fig. 1a; 8 cm × 1.5 cm on the monitor; viewing distances
approximately 15 cm in the midline condition, 20 cm in the shoulder condition, and 30 cm in the
outside condition).

There were seven test finger orientations for each hand location, ranging from 30°
counterclockwise of straight ahead (with respect to the body) to 30° clockwise in 10° steps. The
visual line initially appeared at various orientations, randomly selected from within the range of
possible test orientations. Each test orientation was repeated 8 times in a block, yielding 56
trials per block. Blocks were presented in randomized order.
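As an illustration of this block structure, the following minimal Python sketch (ours, not part of the original MATLAB-controlled experiment; all names are illustrative) builds one randomized block of 56 trials from the seven test orientations and eight repetitions, and assigns each trial a random starting orientation for the visual line.

import random

# Seven test orientations in degrees; 7 orientations x 8 repetitions = 56 trials per block.
TEST_ORIENTATIONS = list(range(-30, 31, 10))
REPETITIONS_PER_BLOCK = 8

def build_block():
    """Return one randomized block of 56 test trials."""
    trials = TEST_ORIENTATIONS * REPETITIONS_PER_BLOCK
    random.shuffle(trials)
    # The visual line starts at a random orientation drawn from the test range.
    return [{"test_deg": t, "line_start_deg": random.uniform(-30, 30)} for t in trials]

block = build_block()
assert len(block) == 56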

Each trial took 10–15 s to complete and each block took 10–12 min. Prior to each test
orientation, the motor rotated through three “distractor” orientations randomly sampled
from a normal distribution with the test orientation as the mean and a standard deviation of 10°.
This was done to reduce hysteresis (e.g., the effect of always rotating to the extreme orientations
from the same direction).
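The distractor procedure can be sketched the same way: the hypothetical helper below simply draws three angles from a normal distribution centred on the test orientation with a 10° standard deviation, as described above.

import random

def distractor_orientations(test_deg, n=3, sd=10.0):
    """Three 'distractor' angles the motor visits before settling on the test angle."""
    return [random.gauss(test_deg, sd) for _ in range(n)]

# Example: distractor angles preceding a 20 degree test orientation.
print(distractor_orientations(20))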

Procedure

Participants first signed an informed consent form and completed an online version of the
original Edinburgh Handedness Inventory (developed by Oldfield 1971; adapted for online use
by Cohen 2008). The experimenter then used a meter stick to measure the distance between
the spine and the outer edge of the shoulder in order to determine the hand position for the
“Outside” condition.

Participants sat with their right or left index finger attached to the wooden dowel
positioned to hold the hand in one of the three hand positions (midline, shoulder or outside). The
mirror reflected a dark screen. The motor then rotated through three distractor orientations,
followed by the test orientation. At this point participants were prompted with a 400 Hz beep to
click a mouse held in their free hand and the visual line appeared onscreen optically
superimposed over the location of their finger. Participants rotated the white line clockwise or
counterclockwise using the left and right mouse buttons respectively until the orientation of the
line matched that of their unseen finger. Participants submitted their answer by pressing the scroll
wheel on the mouse, which immediately started the next trial. Participants were asked to report
when they submitted an answer in error (e.g., by accidentally pressing the scroll wheel too soon)
so that these trials could be removed from the data analysis. The six test blocks were conducted
in a randomized order; after the 3rd block participants were offered a short break. The entire
experiment took roughly 1.5 h to complete.

Convention
The hand used was coded as right or left. Hand position was coded with respect to the
body as either in front (midline), aligned with the shoulder (shoulder), or outside the shoulder
(outside). Finger orientation was coded in hand-centric coordinates. “Straight ahead” with respect
to the body was set as 0°, with outward deviations of the hand labeled as negative and inward
deviations as positive. This means that negative values reflect a clockwise deviation from straight
ahead for the right hand and a counterclockwise deviation for the left hand, and vice versa for
positive values.

Data analysis
Each participant yielded eight responses for each tested finger orientation in
each hand position. Scores were subjected to an initial outlier analysis, in which
responses more than 2 standard deviations from the mean of that participant’s responses
for that test orientation/hand position combination were removed from subsequent analysis.
This resulted in between 6 and 10 scores out of 336 being removed for each participant.

Angular means and standard deviations of responses were calculated using the
CircStat Toolbox for MATLAB (Berens 2009). Responses for a given test orientation
were averaged and then subtracted from the value of the actual test orientation, yielding a
signed orientation error in which positive scores corresponded to an inward error and
negative scores to an outward error. We subjected these signed orientation errors to an
omnibus 2 × 2 × 3 × 7 mixed-model ANOVA comparing handedness (right- or left-handed),
hand used (right or left), hand position (midline, shoulder, outside), and test angle (−30°,
−20°, −10°, 0°, 10°, 20°, 30°). Planned contrasts compared the overall accuracy of the left
vs. right hands within each group.
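For readers working in Python rather than MATLAB, the per-cell computation can be approximated as below; the original analysis used the CircStat Toolbox, so the use of scipy.stats.circmean and the function name are our own illustrative choices.

import numpy as np
from scipy.stats import circmean

def signed_error_deg(responses_deg, test_deg):
    """Signed orientation error for one test-orientation/hand-position cell."""
    r = np.asarray(responses_deg, dtype=float)
    # Outlier trimming: drop responses more than 2 SD from the cell mean.
    r = r[np.abs(r - r.mean()) <= 2 * r.std()]
    # Angular mean of the remaining responses, converted back to degrees.
    mean_resp = np.degrees(circmean(np.radians(r), high=np.pi, low=-np.pi))
    # Mean response subtracted from the actual test orientation, as described above.
    return test_deg - mean_resp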

Additionally, we calculated the standard deviations of participants’ responses for
a given test angle per condition, yielding a measure of the precision of orientation
estimates. We conducted a 2 × 2 × 3 × 7 mixed-model ANOVA on these standard
deviations comparing the same factors (handedness, hand used, hand position, and test
angle). Planned contrasts compared the precision of orientation estimates at each of the
three hand positions. Where assumptions of sphericity were violated, Greenhouse–Geisser
corrected degrees of freedom and p-values are reported.
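A matching sketch for the precision measure, again with scipy standing in for CircStat (the omnibus ANOVA itself would be run in a dedicated statistics package):

import numpy as np
from scipy.stats import circstd

def precision_deg(responses_deg):
    """Circular standard deviation of the responses within one condition cell, in degrees."""
    r = np.radians(np.asarray(responses_deg, dtype=float))
    return np.degrees(circstd(r, high=np.pi, low=-np.pi))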

Fig 6.6 - Signed error of perceived hand orientation

Fig 6.7 - Precision of perceived hand orientation estimates

CHAPTER 7

CONCLUSION AND FUTURE WORKS

7.1 Summary
The main objective of the AI virtual mouse system is to control the
mouse cursor functions by using hand gestures instead of a physical mouse.
The proposed system uses a webcam or a built-in camera to detect hand
gestures and fingertips, and processes these frames to perform the
corresponding mouse functions.
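The core of this pipeline, condensed from the full listing in Section 7.3, is a simple capture-detect-act loop; the sketch below omits the gesture classification and control logic for brevity.

import cv2
import mediapipe as mp

cap = cv2.VideoCapture(0)  # webcam or built-in camera
with mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(cv2.flip(frame, 1), cv2.COLOR_BGR2RGB)
        results = hands.process(rgb)  # hand landmarks for this frame
        if results.multi_hand_landmarks:
            pass  # classify the gesture and drive the mouse (see Section 7.3)
        cv2.imshow('Gesture Virtual Mouse', frame)
        if cv2.waitKey(5) & 0xFF == 13:  # Enter key exits
            break
cap.release()
cv2.destroyAllWindows()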

A virtual gesture-controlled mouse is a system that guides the mouse
cursor and executes its tasks using a real-time camera. We implemented
cursor navigation, selection of icons, and operations such as left click, right
click, double click, and scrolling. The system relies on image comparison and
motion detection to move the pointer and select icons. Analyzing the results,
we can anticipate that, given adequate lighting and a decent camera, the
algorithms will work in almost any environment, making the system more
dependable.

From the results, we conclude that the proposed AI virtual mouse
system performs very well, achieves greater accuracy than the existing
models, and overcomes most of the limitations of existing systems.

Because of this accuracy, the AI virtual mouse is suitable for real-world
applications. It can also help reduce the spread of COVID-19, since the mouse
is operated virtually through hand gestures without touching a shared physical
device.

7.2 Suggestions for future works

In the future, we plan to add more features, such as interacting with multiple windows,
enlarging, shrinking and closing windows, and similar operations using the palm and multiple
fingers. The current model has some limitations: a slight decrease in accuracy for the right-click
operation, and some difficulty with click-and-drag text selection. Our next step is to overcome
these limitations by improving the fingertip detection algorithm to produce more accurate results.

Beyond this, the method can be extended to handle keyboard functionality virtually
alongside the mouse functionality, which is a further avenue for Human-Computer Interaction
(HCI). This would open new levels of interaction that require no physical contact with the device.
The proposed system can already perform all standard mouse tasks, which is helpful for people
who cannot or prefer not to use a touchpad, and its architecture could substantially change how
people interact with computers by eliminating the need for a physical mouse entirely.

Free cursor movement, right click, left click, scroll up, scroll down, drag and drop, and
selection can all be performed using only gestures in this multi-functional system. Most
comparable applications require additional hardware, which can be quite costly; our goal was to
develop this technology as cheaply as possible on a standard operating system.

7.3 Appendices
# Imports
import cv2
import mediapipe as mp
import pyautogui
import math
from enum import IntEnum
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume
from google.protobuf.json_format import MessageToDict
import screen_brightness_control as sbcontrol

pyautogui.FAILSAFE = False
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

# Gesture Encodings
class Gest(IntEnum):
    # Binary encoded finger states
    FIST = 0
    PINKY = 1
    RING = 2
    MID = 4
    LAST3 = 7
    INDEX = 8
    FIRST2 = 12
    LAST4 = 15
    THUMB = 16
    PALM = 31

    # Extra mappings for derived gestures
    V_GEST = 33
    TWO_FINGER_CLOSED = 34
    PINCH_MAJOR = 35
    PINCH_MINOR = 36
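# Note added for readability: the plain values above form a bit mask with one bit
# per finger (thumb = 16, index = 8, middle = 4, ring = 2, pinky = 1), so a set of
# open fingers maps directly onto a named state, e.g.
# (Gest.INDEX | Gest.MID) == Gest.FIRST2 == 12 is the state tested for the V-gesture,
# while the values from 33 upwards are synthetic codes for derived gestures.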

# Multi-handedness Labels
class HLabel(IntEnum):
    MINOR = 0
    MAJOR = 1

# Convert Mediapipe landmarks to recognizable gestures
class HandRecog:

    def __init__(self, hand_label):
        self.finger = 0
        self.ori_gesture = Gest.PALM
        self.prev_gesture = Gest.PALM
        self.frame_count = 0
        self.hand_result = None
        self.hand_label = hand_label

    def update_hand_result(self, hand_result):
        self.hand_result = hand_result

    def get_signed_dist(self, point):
        sign = -1
        if self.hand_result.landmark[point[0]].y < self.hand_result.landmark[point[1]].y:
            sign = 1
        dist = (self.hand_result.landmark[point[0]].x - self.hand_result.landmark[point[1]].x)**2
        dist += (self.hand_result.landmark[point[0]].y - self.hand_result.landmark[point[1]].y)**2
        dist = math.sqrt(dist)
        return dist * sign

    def get_dist(self, point):
        dist = (self.hand_result.landmark[point[0]].x - self.hand_result.landmark[point[1]].x)**2
        dist += (self.hand_result.landmark[point[0]].y - self.hand_result.landmark[point[1]].y)**2
        dist = math.sqrt(dist)
        return dist

    def get_dz(self, point):
        return abs(self.hand_result.landmark[point[0]].z - self.hand_result.landmark[point[1]].z)

    # Function to find the gesture encoding using the current finger_state.
    # Finger_state: 1 if a finger is open, else 0
    def set_finger_state(self):
        if self.hand_result is None:
            return

        points = [[8, 5, 0], [12, 9, 0], [16, 13, 0], [20, 17, 0]]
        self.finger = 0
        self.finger = self.finger | 0  # thumb
        for idx, point in enumerate(points):
            dist = self.get_signed_dist(point[:2])
            dist2 = self.get_signed_dist(point[1:])
            try:
                ratio = round(dist / dist2, 1)
            except ZeroDivisionError:
                ratio = round(dist / 0.01, 1)

            self.finger = self.finger << 1
            if ratio > 0.5:
                self.finger = self.finger | 1

    # Handling fluctuations due to noise
    def get_gesture(self):
        if self.hand_result is None:
            return Gest.PALM

        current_gesture = Gest.PALM
        if self.finger in [Gest.LAST3, Gest.LAST4] and self.get_dist([8, 4]) < 0.05:
            if self.hand_label == HLabel.MINOR:
                current_gesture = Gest.PINCH_MINOR
            else:
                current_gesture = Gest.PINCH_MAJOR
        elif Gest.FIRST2 == self.finger:
            point = [[8, 12], [5, 9]]
            dist1 = self.get_dist(point[0])
            dist2 = self.get_dist(point[1])
            ratio = dist1 / dist2
            if ratio > 1.7:
                current_gesture = Gest.V_GEST
            else:
                if self.get_dz([8, 12]) < 0.1:
                    current_gesture = Gest.TWO_FINGER_CLOSED
                else:
                    current_gesture = Gest.MID
        else:
            current_gesture = self.finger

        if current_gesture == self.prev_gesture:
            self.frame_count += 1
        else:
            self.frame_count = 0

        self.prev_gesture = current_gesture
        if self.frame_count > 4:
            self.ori_gesture = current_gesture
        return self.ori_gesture

# Executes commands according to detected gestures
class Controller:
    tx_old = 0
    ty_old = 0
    trial = True
    flag = False
    grabflag = False
    pinchmajorflag = False
    pinchminorflag = False
    pinchstartxcoord = None
    pinchstartycoord = None
    pinchdirectionflag = None
    prevpinchlv = 0
    pinchlv = 0
    framecount = 0
    prev_hand = None
    pinch_threshold = 0.3

    def getpinchylv(hand_result):
        dist = round((Controller.pinchstartycoord - hand_result.landmark[8].y) * 10, 1)
        return dist

    def getpinchxlv(hand_result):
        dist = round((hand_result.landmark[8].x - Controller.pinchstartxcoord) * 10, 1)
        return dist

    def changesystembrightness():
        currentBrightnessLv = sbcontrol.get_brightness() / 100.0
        currentBrightnessLv += Controller.pinchlv / 50.0
        if currentBrightnessLv > 1.0:
            currentBrightnessLv = 1.0
        elif currentBrightnessLv < 0.0:
            currentBrightnessLv = 0.0
        sbcontrol.fade_brightness(int(100 * currentBrightnessLv), start=sbcontrol.get_brightness())

    def changesystemvolume():
        devices = AudioUtilities.GetSpeakers()
        interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
        volume = cast(interface, POINTER(IAudioEndpointVolume))
        currentVolumeLv = volume.GetMasterVolumeLevelScalar()
        currentVolumeLv += Controller.pinchlv / 50.0
        if currentVolumeLv > 1.0:
            currentVolumeLv = 1.0
        elif currentVolumeLv < 0.0:
            currentVolumeLv = 0.0
        volume.SetMasterVolumeLevelScalar(currentVolumeLv, None)

    def scrollVertical():
        pyautogui.scroll(120 if Controller.pinchlv > 0.0 else -120)

    def scrollHorizontal():
        pyautogui.keyDown('shift')
        pyautogui.keyDown('ctrl')
        pyautogui.scroll(-120 if Controller.pinchlv > 0.0 else 120)
        pyautogui.keyUp('ctrl')
        pyautogui.keyUp('shift')

    # Locate hand to get cursor position.
    # Stabilize the cursor by dampening small movements.
    def get_position(hand_result):
        point = 9
        position = [hand_result.landmark[point].x, hand_result.landmark[point].y]
        sx, sy = pyautogui.size()
        x_old, y_old = pyautogui.position()
        x = int(position[0] * sx)
        y = int(position[1] * sy)
        if Controller.prev_hand is None:
            Controller.prev_hand = x, y
        delta_x = x - Controller.prev_hand[0]
        delta_y = y - Controller.prev_hand[1]

        distsq = delta_x**2 + delta_y**2
        ratio = 1
        Controller.prev_hand = [x, y]

        if distsq <= 25:
            ratio = 0
        elif distsq <= 900:
            ratio = 0.07 * (distsq ** (1 / 2))
        else:
            ratio = 2.1
        x, y = x_old + delta_x * ratio, y_old + delta_y * ratio
        return (x, y)

    def pinch_control_init(hand_result):
        Controller.pinchstartxcoord = hand_result.landmark[8].x
        Controller.pinchstartycoord = hand_result.landmark[8].y
        Controller.pinchlv = 0
        Controller.prevpinchlv = 0
        Controller.framecount = 0

    # Hold the final position for 5 frames to change status
    def pinch_control(hand_result, controlHorizontal, controlVertical):
        if Controller.framecount == 5:
            Controller.framecount = 0
            Controller.pinchlv = Controller.prevpinchlv
            if Controller.pinchdirectionflag == True:
                controlHorizontal()  # x
            elif Controller.pinchdirectionflag == False:
                controlVertical()  # y

        lvx = Controller.getpinchxlv(hand_result)
        lvy = Controller.getpinchylv(hand_result)

        if abs(lvy) > abs(lvx) and abs(lvy) > Controller.pinch_threshold:
            Controller.pinchdirectionflag = False
            if abs(Controller.prevpinchlv - lvy) < Controller.pinch_threshold:
                Controller.framecount += 1
            else:
                Controller.prevpinchlv = lvy
                Controller.framecount = 0
        elif abs(lvx) > Controller.pinch_threshold:
            Controller.pinchdirectionflag = True
            if abs(Controller.prevpinchlv - lvx) < Controller.pinch_threshold:
                Controller.framecount += 1
            else:
                Controller.prevpinchlv = lvx
                Controller.framecount = 0

    def handle_controls(gesture, hand_result):
        x, y = None, None
        if gesture != Gest.PALM:
            x, y = Controller.get_position(hand_result)

        # flag reset
        if gesture != Gest.FIST and Controller.grabflag:
            Controller.grabflag = False
            pyautogui.mouseUp(button="left")

        if gesture != Gest.PINCH_MAJOR and Controller.pinchmajorflag:
            Controller.pinchmajorflag = False

        if gesture != Gest.PINCH_MINOR and Controller.pinchminorflag:
            Controller.pinchminorflag = False

        # implementation
        if gesture == Gest.V_GEST:
            Controller.flag = True
            pyautogui.moveTo(x, y, duration=0.1)

        elif gesture == Gest.FIST:
            if not Controller.grabflag:
                Controller.grabflag = True
                pyautogui.mouseDown(button="left")
            pyautogui.moveTo(x, y, duration=0.1)

        elif gesture == Gest.MID and Controller.flag:
            pyautogui.click()
            Controller.flag = False

        elif gesture == Gest.INDEX and Controller.flag:
            pyautogui.click(button='right')
            Controller.flag = False

        elif gesture == Gest.TWO_FINGER_CLOSED and Controller.flag:
            pyautogui.doubleClick()
            Controller.flag = False

        elif gesture == Gest.PINCH_MINOR:
            if Controller.pinchminorflag == False:
                Controller.pinch_control_init(hand_result)
                Controller.pinchminorflag = True
            Controller.pinch_control(hand_result, Controller.scrollHorizontal, Controller.scrollVertical)

        elif gesture == Gest.PINCH_MAJOR:
            if Controller.pinchmajorflag == False:
                Controller.pinch_control_init(hand_result)
                Controller.pinchmajorflag = True
            Controller.pinch_control(hand_result, Controller.changesystembrightness, Controller.changesystemvolume)

'''
Main Class
Entry point of Gesture Controller
'''
class GestureController:
    gc_mode = 0
    cap = None
    CAM_HEIGHT = None
    CAM_WIDTH = None
    hr_major = None  # Right hand by default
    hr_minor = None  # Left hand by default
    dom_hand = True

    def __init__(self):
        GestureController.gc_mode = 1
        GestureController.cap = cv2.VideoCapture(0)
        GestureController.CAM_HEIGHT = GestureController.cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
        GestureController.CAM_WIDTH = GestureController.cap.get(cv2.CAP_PROP_FRAME_WIDTH)

    def classify_hands(results):
        left, right = None, None
        try:
            handedness_dict = MessageToDict(results.multi_handedness[0])
            if handedness_dict['classification'][0]['label'] == 'Right':
                right = results.multi_hand_landmarks[0]
            else:
                left = results.multi_hand_landmarks[0]
        except:
            pass

        try:
            handedness_dict = MessageToDict(results.multi_handedness[1])
            if handedness_dict['classification'][0]['label'] == 'Right':
                right = results.multi_hand_landmarks[1]
            else:
                left = results.multi_hand_landmarks[1]
        except:
            pass

        if GestureController.dom_hand == True:
            GestureController.hr_major = right
            GestureController.hr_minor = left
        else:
            GestureController.hr_major = left
            GestureController.hr_minor = right

    def start(self):
        handmajor = HandRecog(HLabel.MAJOR)
        handminor = HandRecog(HLabel.MINOR)

        with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5, min_tracking_confidence=0.5) as hands:
            while GestureController.cap.isOpened() and GestureController.gc_mode:
                success, image = GestureController.cap.read()
                if not success:
                    print("Ignoring empty camera frame.")
                    continue

                image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
                image.flags.writeable = False
                results = hands.process(image)
                image.flags.writeable = True
                image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

                if results.multi_hand_landmarks:
                    GestureController.classify_hands(results)
                    handmajor.update_hand_result(GestureController.hr_major)
                    handminor.update_hand_result(GestureController.hr_minor)

                    handmajor.set_finger_state()
                    handminor.set_finger_state()
                    gest_name = handminor.get_gesture()

                    if gest_name == Gest.PINCH_MINOR:
                        Controller.handle_controls(gest_name, handminor.hand_result)
                    else:
                        gest_name = handmajor.get_gesture()
                        Controller.handle_controls(gest_name, handmajor.hand_result)

                    for hand_landmarks in results.multi_hand_landmarks:
                        mp_drawing.draw_landmarks(image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
                else:
                    Controller.prev_hand = None

                cv2.imshow('Gesture Controller', image)
                if cv2.waitKey(5) & 0xFF == 13:
                    break

        GestureController.cap.release()
        cv2.destroyAllWindows()

# Run the gesture controller directly
gc1 = GestureController()
gc1.start()

REFERENCES
[1] Amardip Ghodichor, Binitha Chirakattu, “Virtual Mouse using Hand Gesture and Color Detection”, Volume 128, No. 11, October 2015.
[2] Chhoriya P., Paliwal G., Badhan P., 2013, “Image Processing Based Color Detection”, International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 4, pp. 410-415.
[3] Rhitivij Parasher, Preksha Pareek, “Event Triggering Using Hand Gesture Using OpenCV”, Volume 02, February 2016, pp. 15673-15676.
[4] Ahemad Siddique, Abhishek Kommera, Divya Varma, “Simulation of Mouse using Image Processing via Convex Hull Method”, Vol. 4, Issue 3, March 2016.
[5] Student, Department of Information Technology, PSG College of Technology, Coimbatore, Tamil Nadu, India, “Virtual Mouse Using Hand Gesture Recognition”, Volume 5, Issue VII, July 7.
[6] Kalyani Pendke, Prasanna Khuje, Smita Narnaware, Shweta Thool, Sachin Nimje, International Journal of Computer Science and Mobile Computing (IJCSMC), Vol. 4, Issue 3, March 2015.
[7] Abhilash S. S., Lisho Thomas, Naveen Wilson, Chaithanya C., “Virtual Mouse Using Hand Gesture”, Volume 05, Issue 04, April 2018.
[8] Abdul Khaliq and A. Shahid Khan, “Virtual Mouse Implementation Using Color Pointer Detection”, International Journal of Electrical Electronics & Computer Science Engineering, Volume 2, Issue 4, August 2015, pp. 63-66.
[9] Erdem, E., Yardimci, Y., Atalay, V., Cetin, A. E., “Computer vision based mouse”, Acoustics, Speech, and Signal Processing, Proceedings (ICASSP), IEEE International Conference, 2002.
[10] Chu-Feng Lien, “Portable Vision-Based HCI – A Realtime Hand Mouse System on Handheld Devices”, National Taiwan University, Computer Science and Information Engineering Department.
[11] Hojoon Park, “A Method for Controlling the Mouse Movement using a Real Time Camera”, Brown University, Providence, RI, USA, Department of Computer Science, 2008.
[12] Asanterabi Malima, Erol Ozgur, and Mujdat Cetin, “A Fast Algorithm for Vision-Based Hand Gesture Recognition for Robot Control”.
[13] “…using Hand Gesture Recognition”, International Journal of Engineering Sciences & Research Technology, ISSN: 2277-9655, March 2014.
[14] Shany Jophin, Sheethal M. S., Priya Philip, T. M. Bhruguram, “Gesture Based Interface Using Motion and Image Comparison”, International Journal of Advanced Information Technology (IJAIT), Vol. 2, No. 3, June 2012.
[15] Abhik Banerjee, Abhirup Ghosh, Koustuvmoni Bharadwaj, Hemanta Saikia, “Mouse Control using a Web Camera based on Colour Detection”, International Journal of Computer Trends and Technology (IJCTT), Volume 9, Number 1, ISSN: 2231-2803, March 2014.
[16] S. Sadhana Rao, “Sixth Sense Technology”, Proceedings of the International Conference on Communication and Computational Intelligence, 2010, pp. 336-339.
[17] Game P. M., Mahajan A. R., “A gestural user interface to interact with computer system”, International Journal on Science and Technology (IJSAT), Volume II, Issue I, (Jan.-Mar.) 2011, pp. 018-027.
[18] Abhik Banerjee, Abhirup Ghosh, Koustuvmoni Bharadwaj, “Mouse Control using a Web Camera based on Color Detection”, IJCTT, Vol. 9, March 2014.
[19] Angel, Neethu P. S., “Real Time Static & Dynamic Hand Gesture Recognition”, International Journal of Scientific & Engineering Research, Volume 4, Issue 3, March 2013.
[20] Q. Y. Zhang, F. Chen and X. W. Liu, “Hand Gesture Detection and Segmentation Based on Difference Background Image with Complex Background”, Proceedings of the 2008 International Conference on Embedded Software and Systems, Sichuan, 29-31 July 2008, pp. 338-343.
[21] A. Erdem, E. Yardimci, Y. Atalay, V. Cetin, A. E., “Computer vision based mouse”, Acoustics, Speech, and Signal Processing, Proceedings (ICASSP), IEEE International Conference, 2002.
[22] Hojoon Park, “A Method for Controlling the Mouse Movement using a Real Time Camera”, Brown University, Providence, RI, USA, Department of Computer Science, 2008.
[23] Chu-Feng Lien, “Portable Vision-Based HCI – A Real-time Hand Mouse System on Handheld Devices”, National Taiwan University, Computer Science and Information Engineering Department.
[24] Kamran Niyazi, Vikram Kumar, Swapnil Mahe, Swapnil Vyawahare, “Mouse Simulation Using Two Coloured Tapes”, Department of Computer Science, University of Pune, India, International Journal of Information Sciences and Techniques (IJIST), Vol. 2, No. 2, March 2012.
[25] K. N. Shah, K. R. Rathod and S. J. Agravat, “A Survey on Human Computer Interaction Mechanism Using Finger Tracking”.
[26] K. H. Shibly, S. Kumar Dey, M. A. Islam, and S. Iftekhar Showrav, “Design and development of hand gesture based virtual mouse,” in Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1-5, Dhaka, Bangladesh, May 2019.
[27] A. Haria, A. Subramanian, N. Asokkumar, S. Poddar, and J. S. Nayak, “Hand gesture recognition for human computer interaction,” Procedia Computer Science, vol. 115, pp. 367-374, 2017.
[28] D.-S. Tran, N.-H. Ho, H.-J. Yang, S.-H. Kim, and G. S. Lee, “Real-time virtual mouse system using RGB-D images and fingertip detection,” Multimedia Tools and Applications, vol. 80, no. 7, pp. 10473-10490, 2021.
[29] https://www.tutorialspoint.com/opencv/
[30] K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov, “Realtime computer vision with OpenCV,” Queue, vol. 10, no. 4, pp. 40-56, 2012.
