
Journal of Imaging, Review
J. Imaging 2020, 6, 73; doi:10.3390/jimaging6080073

Hand Gesture Recognition Based on Computer Vision: A Review of Techniques

Munir Oudah 1, Ali Al-Naji 1,2,* and Javaan Chahl 2
1 Electrical Engineering Technical College, Middle Technical University, Baghdad 10022, Iraq;
Munir_aliraqi@yahoo.com
2 School of Engineering, University of South Australia, Mawson Lakes SA 5095, Australia;
Javaan.Chahl@unisa.edu.au
* Correspondence: ali_al_naji@mtu.edu.iq; Tel.: +96-477-1030-4768

Received: 23 May 2020; Accepted: 21 July 2020; Published: 23 July 2020

Abstract: Hand gestures are a form of nonverbal communication that can be used in several fields
such as communication between deaf-mute people, robot control, human–computer interaction (HCI),
home automation and medical applications. Research papers based on hand gestures have adopted
many different techniques, including those based on instrumented sensor technology and computer
vision. In other words, the hand sign can be classified under many headings, such as posture and
gesture, as well as dynamic and static, or a hybrid of the two. This paper focuses on a review of the
literature on hand gesture techniques and introduces their merits and limitations under different
circumstances. In addition, it tabulates the performance of these methods, focusing on computer
vision techniques that deal with the similarity and difference points, technique of hand segmentation
used, classification algorithms and drawbacks, number and types of gestures, dataset used, detection
range (distance) and type of camera used. This paper is a thorough general overview of hand gesture
methods with a brief discussion of some possible applications.

Keywords: hand gesture; hand posture; computer vision; human–computer interaction (HCI)

1. Introduction
Hand gestures are an aspect of body language that can be conveyed through the center of the
palm, the finger position and the shape constructed by the hand. Hand gestures can be classified
into static and dynamic. As its name implies, the static gesture refers to the stable shape of the hand,
whereas the dynamic gesture comprises a series of hand movements such as waving. There are a
variety of hand movements within a gesture; for example, a handshake varies from one person to
another and changes according to time and place. The main difference between posture and gesture is
that posture focuses more on the shape of the hand whereas gesture focuses on the hand movement.
The main approaches to hand gesture research can be classified into the wearable glove-based sensor
approach and the camera vision-based sensor approach [1,2].
Hand gestures offer an inspiring field of research because they can facilitate communication and
provide a natural means of interaction that can be used across a variety of applications. Previously,
hand gesture recognition was achieved with wearable sensors attached directly to the hand with gloves.
These sensors detected a physical response according to hand movements or finger bending. The data
collected were then processed using a computer connected to the glove with wire. This system of
glove-based sensor could be made portable by using a sensor attached to a microcontroller.
As illustrated in Figure 1, hand gestures for human–computer interaction (HCI) started with the
invention of the data glove sensor. It offered simple commands for a computer interface. The gloves
used different sensor types to capture hand motion and position by detecting the correct coordinates of

the location of the palm and fingers [3]. Various sensors using the same technique based on the angle of bending were the curvature sensor [4], angular displacement sensor [5], optical fiber transducer [6], flex sensors [7] and accelerometer sensor [8]. These sensors exploit different physical principles according to their type.

Figure 1. Different techniques for hand gestures. (a) Glove-based attached sensor either connected to the computer or portable; (b) computer vision–based camera using a marked glove or just a naked hand.
Although the techniques mentioned above have provided good outcomes, they have various limitations that make them unsuitable for the elderly, who may experience discomfort and confusion due to wire connection problems. In addition, elderly people suffering from chronic disease conditions that result in loss of muscle function may be unable to wear and take off gloves, causing them discomfort and constraining them if used for long periods. These sensors may also cause skin damage, infection or adverse reactions in people with sensitive skin or those suffering burns. Moreover, some sensors are quite expensive. Some of these problems were addressed in a study by Lamberti and Camastra [9], who developed a computer vision system based on colored marked gloves. Although this study did not require the attachment of sensors, it still required colored gloves to be worn.

These drawbacks led to the development of promising and cost-effective techniques that did not require cumbersome gloves to be worn. These techniques are called camera vision-based sensor technologies. With the evolution of open-source software libraries, it is easier than ever to detect
hand gestures that can be used under a wide range of applications like clinical operations [10], sign
language [11], robot control [12], virtual environments [13], home automation [14], personal computer
and tablet [15], gaming [16]. These techniques essentially involve replacement of the instrumented
glove with a camera. Different types of camera are used for this purpose, such as RGB camera, time of
flight (TOF) camera, thermal cameras or night vision cameras.
Algorithms have been developed based on computer vision methods to detect hands using these
different types of cameras. The algorithms attempt to segment and detect hand features such as skin
color, appearance, motion, skeleton, depth, 3D model, deep learning detection and more. These methods
involve several challenges, which are discussed in this paper in the following sections.
Several studies based on computer vision techniques were published in the past decade. A study
by Murthy et al. [17] covered the role and fundamental technique of HCI in terms of the recognition
approach, classification and applications, describing computer vision limitations under various
conditions. Another study by Khan et al. [18] presented a recognition system concerned with the
issue of feature extraction, gesture classification, and considered the application area of the studies.
Suriya et al. [19] provided a specific survey on hand gesture recognition for mouse control applications,
including methodologies and algorithms used for human–machine interaction. In addition, they
provided a brief review of the hidden Markov model (HMM). A study by Sonkusare et al. [20]
reported various techniques and made comparisons between them according to hand segmentation
methodology, tracking, feature extraction, recognition techniques, and concluded that the recognition
rate was a tradeoff with temporal rate limited by computing power. Finally, Kaur et al. [16] reviewed

several methods, both sensor-based and vision-based, for hand gesture recognition to improve the precision of algorithms through integrating current techniques.
The studies above give insight into some gesture recognition systems under various scenarios, and address issues such as scene background limitations, illumination conditions, algorithm accuracy for feature extraction, dataset type, classification algorithm used and application. However, no review paper mentions camera type, distance limitations or recognition rate. Therefore, the objective of this study is to provide a comparative review of recent studies concerning computer vision techniques with regard to hand gesture detection and classification supported by different technologies. The current paper discusses the seven most reported approaches to the problem: skin color, appearance, motion, skeleton, depth, 3D model and deep learning. This paper also discusses these approaches in detail and summarizes some modern research under different considerations (type of camera used, resolution of the processed image or video, type of segmentation technique, classification algorithm used, recognition rate, type of region of interest processing, number of gestures, application area, limitation or invariant factor, detection range achieved and, in some cases, dataset used, runtime speed, hardware used and type of error). In addition, the review presents the most popular applications associated with this topic.

The remainder of this paper is organized as follows. Section 2 explains hand gesture methods with a focus on computer vision techniques, describing the seven most common approaches (skin color, appearance, motion, skeleton, depth, 3D model and deep learning) and supporting them with tables. Section 3 illustrates in detail seven application areas that deal with hand gesture recognition systems. Section 4 briefly discusses research gaps and challenges. Finally, Section 5 presents our conclusions. Figure 2 below clarifies the classification of methods conducted by this review.

Figure 2. Classification of methods conducted by this review.

2. Hand Gesture Methods
The primary goal in studying gesture recognition is to introduce a system that can detect specific human gestures and use them to convey information or for command and control purposes. Therefore, it includes not only tracking of human movement, but also the interpretation of that movement as significant commands. Two approaches are generally used to interpret gestures for HCI applications. The first approach is based on data gloves (wearable or direct contact) and the second approach is based on computer vision without the need to wear any sensors.

2.1. Hand Gestures Based on Instrumented Glove Approach

The wearable glove-based sensors can be used to capture hand motion and position. In addition, they can easily provide the exact coordinates of palm and finger locations, orientation and configurations by using sensors attached to the gloves [21–23]. However, this approach requires the user to be connected to the computer physically [23], which blocks the ease of interaction between user and computer. In addition, the price of these devices is quite high [23,24]. The modern glove-based approach, however, uses touch technology, which is more promising and is considered industrial-grade haptic technology: the glove gives haptic feedback that lets the user sense the shape, texture, movement and weight of a virtual object by using microfluidic technology. Figure 3 shows an example of a sensor glove used in sign language.

Figure 3. Sensor-based data glove (adapted from website: https://physicsworld.com/a/smart-glove-translates-sign-language-into-digital-text/).
2.2. Hand Gestures Based on Computer Vision Approach
2.2. Hand Gestures Based on Computer Vision Approach

The camera vision-based sensor is a common, suitable and applicable technique because it provides contactless communication between humans and computers [16]. Different configurations of cameras can be utilized, such as monocular, fisheye, TOF and IR [20]. However, this technique involves several challenges, including lighting variation, background issues, the effect of occlusions, complex backgrounds, processing time traded against resolution and frame rate, and foreground or background objects presenting the same skin color tone or otherwise appearing as hands [17,21]. These challenges will be discussed in the following sections. A simple diagram of the camera vision-based sensor for extracting and identifying hand gestures is presented in Figure 4.

Figure 4. Using computer vision techniques to identify gestures, where the user performs a specific gesture with one or both hands in front of a camera connected to a system framework that involves different possible techniques to extract features and classify the hand gesture in order to control some possible application.
2.2.1. Color-Based Recognition

Color-Based Recognition Using Glove Marker
This method uses a camera to track the movement of the hand using a glove with different color marks, as shown in Figure 4. This method has been used for interaction with 3D models, permitting some processing, such as zooming, moving, drawing and writing using a virtual keyboard with good flexibility [9]. The colors on the glove enable the camera sensor to track and detect the location of the palm and fingers, which allows for the extraction of a geometric model of the shape of the hand [13,25]. The advantages of this method are its simplicity of use and low price compared with the sensor data glove [9]. However, it still requires the wearing of colored gloves and limits the degree of natural and spontaneous interaction with the HCI [25]. The color-based glove marker is shown in Figure 5 [13].

Figure 5. Color-based recognition using glove marker [13].

Color-Based Recognition of Skin Color

Skin color detection is one of the most popular methods for hand segmentation and is used in a wide range of applications, such as object classification, degraded photograph recovery, person movement tracking, video observation, HCI applications, facial recognition, hand segmentation and gesture identification. Skin color detection has been achieved using two methods. The first method is pixel-based skin detection, in which each pixel in an image is classified as skin or non-skin individually from its neighbors. The second method is region-based skin detection, in which the skin pixels are spatially processed based on information such as intensity and texture.

Color space can be used as a mathematical model to represent image color information. Several color spaces can be used according to the application type, such as digital graphics, image processing applications, TV transmission and computer vision techniques [26,27]. Figure 6 shows an example of skin color detection using YUV color space.

Figure 6. Example of skin color detection. (a) Apply a threshold to the channels of YUV color space in order to extract only skin color, then assign a value of 1 to skin and 0 to non-skin pixels; (b) detected and tracked hand using the resulting binary image.

Several formats of color space are available for skin segmentation, as itemized below:
• red, green, blue (R–G–B and RGB-normalized);
• hue and saturation (H–S–V, H–S–I and H–S–L);
• luminance (YIQ, Y–Cb–Cr and YUV).

More detailed discussion of skin color detection based on RGB channels can be found in [28,29].
However, it is not preferred for skin segmentation purposes because the mixture of the color channel
and intensity information of an image has irregular characteristics [26]. Skin color can be detected by thresholding the values of the three channels (red, green and blue). In the case of normalized-RGB, the color
information is simply separated from the luminance. However, under lighting variation, it cannot be
relied on for segmentation or detection purposes, as shown in the studies [30,31].
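To make the normalized-RGB idea above concrete, the chromaticity components are obtained by dividing each channel by the per-pixel intensity sum, which removes most of the luminance. The snippet below is a minimal sketch, not taken from the cited studies; the small epsilon guarding against division by zero is an implementation choice.

```python
import numpy as np

def normalized_rgb(image_rgb):
    """Convert an H x W x 3 RGB array to normalized rg chromaticity.

    Dividing by the per-pixel intensity sum removes most of the
    luminance component, leaving mainly color information.
    """
    img = image_rgb.astype(np.float32)
    intensity = img.sum(axis=2, keepdims=True) + 1e-6  # avoid division by zero
    norm = img / intensity                              # r + g + b = 1 for every pixel
    return norm[..., 0], norm[..., 1]                   # r and g suffice; b = 1 - r - g
```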
The characteristics of color space such as hue/saturation family and luminance family are good
under lighting variations. The transformation of format RGB to HSI or HSV takes time in case of
substantial variation in color value (hue and saturation). Therefore, a pixel within a range of intensity
is chosen. The RGB to HSV transformation may consume time because of the transformation from
Cartesian to polar coordinates. Thus, HSV space is useful for detection in simple images.
Transforming and splitting the channels of the Y–Cb–Cr color space is simple compared with the HSV color family in regard to skin color detection and segmentation, as illustrated in [32,33]. Skin tone detection based on Y–Cb–Cr is demonstrated in detail in [34,35].
The image is processed to convert RGB color space to another color space in order to detect the
region of interest, normally a hand. This method can be used to detect the region through the range of
possible colors, such as red, orange, pink and brown. The training sample of skin regions is studied to
obtain the likely range of skin pixels with the band values for R, G and B pixels. To detect skin regions,
the pixel color should compare the colors in the region with the predetermined sample color. If similar,
then the region can be labeled as skin [36]. Table 1 presents a set of research papers that use different
techniques to detect skin color.
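As an illustration of the pixel-comparison idea described above, the sketch below thresholds the Cr and Cb channels to label skin pixels and keeps the largest connected blob as the hand, in the spirit of Figure 6. The numeric bounds are commonly used heuristics, not values taken from the cited papers, and would normally be tuned per dataset and lighting.

```python
import cv2
import numpy as np

def detect_skin_region(frame_bgr):
    """Label probable skin pixels by thresholding Y-Cb-Cr and return the largest blob."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Illustrative chromaticity bounds; real systems tune these per dataset/lighting.
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)                  # 255 = skin, 0 = non-skin
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove small noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    # OpenCV 4.x returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hand = max(contours, key=cv2.contourArea) if contours else None
    return mask, hand                                        # binary mask and largest skin blob
```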
The skin color method involves various challenges, such as illumination variation, background
issues and other types of noise. A study by Perimal et al. [37] provided 14 gestures under
controlled-conditions room lighting using an HD camera at short distance (0.15 to 0.20 m) and,
the gestures were tested with three parameters, noise, light intensity and size of hand, which directly
affect recognition rate. Another study by Sulyman et al. [38] observed that using Y–Cb–Cr color space is
beneficial for eliminating illumination effects, although bright light during capture reduces the accuracy.
A study by Pansare et al. [11] used RGB to normalize and detect skin and applied a median filter to the
red channel to reduce noise on the captured image. The Euclidian distance algorithm was used for
feature matching based on a comprehensive dataset. A study by Rajesh et al. [15] used HSI to segment
the skin color region under controlled environmental conditions, to enable proper illumination and
reduce the error.
Another challenge with the skin color method is that the background must not contain any
elements that match skin color. Choudhury et al. [39] suggested a novel hand segmentation based on
combining the frame differencing technique and skin color segmentation, which recorded good results,
but this method is still sensitive to scenes that contain moving objects in the background, such as moving
curtains and waving trees. Stergiopoulou et al. [40] combined motion-based segmentation (a hybrid of
image differencing and background subtraction) with skin color and morphology features to obtain
a robust result that overcomes illumination and complex background problems. Another study by
Khandade et al. [41] used a cross-correlation method to match hand segmentation with a dataset to
achieve better recognition. Karabasi et al. [42] proposed hand gestures for deaf-mute communication
based on mobile phones, which can translate sign language using HSV color space. Zeng et al. [43]
presented a hand gesture method to assist wheelchair users indoors and outdoors using red channel
thresholding with a fixed background to overcome the illumination change. A study by Hsieh et al. [44]
used face skin detection to define skin color. This system can correctly detect skin pixels under low
lighting conditions, and even when the face color is not in the normal range of skin chromaticity.
Another study, by Bergh et al. [45], proposed a hybrid method based on a combination of the histogram
and a pre-trained Gaussian mixture model to overcome lighting conditions. Pansare et al. [46] aligned
J. Imaging 2020, 6, 73 7 of 29

two
J. cameras
Imaging 2020, 6,(RGB
x FOR PEERand TOF)
REVIEW together to improve skin color detection with the help of the depth 7 of 29
property of the TOF camera to enhance detection and face background limitations.
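A minimal sketch of the hybrid idea used by [39,40], combining frame differencing with a skin mask so that skin-colored but static background is suppressed, is given below. It is a simplified illustration, not the authors' exact algorithms, and the difference threshold is an assumed value.

```python
import cv2

def moving_skin_mask(prev_bgr, curr_bgr, skin_mask, diff_thresh=25):
    """Keep only skin-colored pixels that also changed between two frames.

    Combining the two cues suppresses skin-like static background,
    which is the main weakness of color-only segmentation.
    """
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)                     # frame differencing
    _, motion = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(motion, skin_mask)                    # intersection of both cues
```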
2.2.2. Appearance-Based Recognition

This method depends on extracting the image features in order to model the visual appearance of the hand and comparing these parameters with the features extracted from the input image frames. The features are calculated directly from the pixel intensities without a previous segmentation process. The method runs in real time thanks to the easily extracted 2D image features and is considered easier to implement than the 3D model method. In addition, this method can detect various skin tones. The AdaBoost learning algorithm can be utilized to maintain fixed features, such as key points for a portion of the hand, which can solve the occlusion issue [47,48]; the approach can be separated into two models: a motion model and a 2D static model. Table 2 presents a set of research papers that use different segmentation techniques based on appearance recognition to detect the region of interest (ROI).

A study by Chen et al. [49] proposed two approaches for hand recognition. The first approach focused on posture recognition using Haar-like features, which can describe the hand posture pattern effectively, and used the AdaBoost learning algorithm to speed up the performance and thus the rate of classification. The second approach focused on gesture recognition using a context-free grammar to analyze the syntactic structure based on the detected postures. Another study by Kulkarni and Lokhande [50] used three feature extraction methods, such as a histogram technique, to segment and observe images that contained a large number of gestures, then suggested using edge detection such as the Canny, Sobel and Prewitt operators to detect the edges with different thresholds. The gesture classification was performed using a feed-forward back-propagation artificial neural network with supervised learning. Among the limitations reported by the authors, the system produced misclassified results when using the histogram technique, because the histogram can only be used for a small number of gestures that are completely different from each other. Fang et al. [51] used an extended AdaBoost method for hand detection and combined optical flow with the color cue for tracking. They also collected hand color from the neighborhood of the features' mean position using a single Gaussian model to describe hand color in HSV color space, where multiple features were extracted and gesture recognition was performed using palm and finger decomposition; scale-space feature detection was then integrated into gesture recognition in order to counter the aspect-ratio limitation facing most learning-based hand gesture methods. Licsár et al. [52] used a simple background subtraction method for hand segmentation and extended it to handle background changes, in order to face challenges such as skin-like colors and complex, dynamic backgrounds, then used a boundary-based method to classify hand gestures. Finally, Zhou et al. [53] proposed a novel method to directly extract the fingers: the edges were extracted from the gesture images, the finger central area was obtained from those edges, and the fingers were then obtained from the parallel edge characteristics. The proposed system cannot recognize the side view of a hand pose. Figure 7 below shows a simple example of appearance recognition.

Figure 7. Example of appearance recognition using foreground extraction in order to segment only the ROI, where the object features can be extracted using different techniques such as pattern or image subtraction and foreground and background segmentation algorithms.
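A hedged sketch of appearance-based detection with Haar-like features and AdaBoost is shown below using OpenCV's cascade classifier. The cascade file name is hypothetical; such a detector would have to be trained offline on positive and negative hand samples (as in [49]) before this code could run.

```python
import cv2

# Hypothetical cascade file: a hand-posture detector trained offline with
# Haar-like features and AdaBoost (e.g., via OpenCV's cascade training tools).
hand_cascade = cv2.CascadeClassifier("hand_posture_cascade.xml")

def detect_hand_postures(frame_bgr):
    """Return bounding boxes of detected hand postures in one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)                    # reduce sensitivity to lighting
    boxes = hand_cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
    return boxes                                     # each box is (x, y, w, h)
```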

Table 1. Set of research papers that have used skin color detection for hand gesture and finger counting application.
Type of Techniques/Methods Feature Classify Recognition No. of Application Invariant Distance
Author Resolution
Camera for Segmentation Extract Type Algorithm Rate Gestures Area Factor from Camera
maximum
off-the-shelf distance of light intensity,
[37] 16 Mp Y–Cb–Cr finger count 70% to 100% 14 gestures HCI 150 to 200 mm
HD webcam centroid two size, noise
fingers
heavy light
computer 320 × 250 6 deaf-mute
[38] Y–Cb–Cr finger count expert system 98% during –
camera pixels gestures people
capturing
Fron-Tech RGB threshold & feature matching (ASL)
A–Z alphabet 26 static
[11] E-cam 10 Mp edge detection Sobel (Euclidian 90.19% American sign – 1000 mm
hand gesture gestures
(web camera) method distance) language
distance transform 100% > control the
640 × 480 HIS & distance 6 location of
[15] webcam finger count method & circular according slide during a –
pixels transform gestures hand
profiling limitation presentation
HIS & frame contour matching sensitive to
dynamic hand
[39] webcam – difference & Haar difference – hand segment HCI moving –
gestures
classifier with the previous background
HSV & motion (SPM)
640 × 480
[40] webcam detection hand gestures classification 98.75% hand segment HCI – –
pixels
(hybrid technique) technique
man–machine
640 × 480 HSV & 15
[41] video camera hand gestures Euclidian distance 82.67% interface – –
pixels cross-correlation gestures
(MMI)
objects have
digital or Malaysian
768 × 576 the same skin
[42] cellphone HSV hand gestures division by shape – hand segment sign –
pixels color some &
camera language
hard edges
combine
red channel
information from HCI
320 × 240 threshold 5 hand
[43] web camera hand postures multiple cures of 100% wheelchair – –
pixels segmentation postures
the motion, color control
method
and shape
(< 1) mm
Haar-like
Logitech 93.13 static 2 static man–machine (1000–1500)
320 × 240 normalized R, G, directional
[44] portable hand gestures 95.07 dynamic 4 dynamic interface – mm
pixels original red patterns & motion
webcam C905 Percent gestures (MMI) (1500–2000)
history image
mm
J. Imaging 2020, 6, 73 9 of 29

Table 1. Cont.
Type of Techniques/Methods Feature Classify Recognition No. of Application Invariant Distance
Author Resolution
Camera for Segmentation Extract Type Algorithm Rate Gestures Area Factor from Camera
manipulating
HIS & Gaussian
high 98.24% correct 3D objects &
640 × 480 mixture Haarlet-based 10 changes in
[45] resolution hand postures classification navigating –
pixels model (GMM) hand gesture postures illumination
cameras rate through a 3D
& second histogram
model
histogram-based real-time
ToF camera & 176 × 144 &
skin color hand gesture
[46] AVT Marlin 640 × 480 hand gestures 2D Haarlets 99.54% hand segment – 1000 mm
probability & interaction
color camera pixels
depth threshold system
Table footer: –: none.

Table 2. A set of research papers that have used appearance-based detection for hand gesture application.
Techniques/ Distance
Type of Feature Classify RECOGNITION No. of
Author Resolution Methods for Application Area Dataset Type Invariant Factor from
Camera Extract Type Algorithm RATE Gestures
Segmentation Camera
real-time Positive and
Logitech Haar -like features parallel 4
320 × 240 vision-based hand negative hand
[49] Quick Cam & AdaBoost hand posture cascade above 90% hand – –
pixels gesture sample collected
web camera learning algorithm structure postures
classification by author
feed-forward
OTSU & canny
80 × 64 back
edge detection 26 static American Sign Dataset created different
[50] webcam-1.3 resize image hand sign propagation 92.33% low differentiation
technique for gray signs Language by author distances
for train neural
scale image
network
Gaussian model
describes hand 6 real-time hand
camera 320 × 240 palm–finger
[51] color in HSV & hand gesture 93% hand gesture recognition – – –
video pixels configuration
AdaBoost gestures method
algorithm
point coordinates
background 9 ground truth
camera–projector 384 × 288 Fourier-based user-independent geometrically
[52] subtraction hand gesture 87.7% hand data set collected –
system pixels classification application distorted & skin
method gestures manually
color
combine Y–Cb–Cr hand The test data are variation in
14
Monocular 320 × 240 & edge extraction posture based substantial collected from lightness would
[53] finger model – static ≤ 500 mm
web camera pixels & parallel finger on finger applications videos captured result in edge
gestures
edge appearance gesture by web-camera extraction failure
Table footer: –: none.

According to the information in Table 2, the first row indicates the Haar-like feature, which is well suited to analyzing ROI patterns efficiently. Haar-like features can efficiently analyze the contrast between dark and bright objects within a kernel and can operate faster than a pixel-based system. In addition, they are relatively insensitive to noise and lighting variation because they are computed from the gray-value difference between the white and black rectangles. The recognition rate in the first row is 90%; compared with the single Gaussian model used to describe hand color in HSV color space in the third row, which reaches 93%, it is lower, although both proposed systems used the AdaBoost algorithm to speed up the system and the classification.
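The gray-value difference between white and black rectangles mentioned above can be computed cheaply with an integral image, so that every rectangle sum costs only four lookups regardless of its size. The sketch below shows a simple two-rectangle feature; the specific rectangle layout is illustrative and not taken from the reviewed systems.

```python
import numpy as np

def rect_sum(integral, x, y, w, h):
    """Sum of pixels in the w x h rectangle at (x, y), via 4 integral-image lookups."""
    return (integral[y + h, x + w] - integral[y, x + w]
            - integral[y + h, x] + integral[y, x])

def haar_two_rect_feature(gray, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half of a window."""
    gray = gray.astype(np.float64)
    integral = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1))
    integral[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)   # summed-area table
    half = w // 2
    return rect_sum(integral, x, y, half, h) - rect_sum(integral, x + half, y, half, h)
```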

2.2.3. Motion-Based Recognition

Motion-based recognition can be utilized for detection purposes: the object is extracted through a series of image frames. The AdaBoost algorithm, utilized for object detection, characterization, movement modeling and pattern recognition, is needed to recognize the gesture [16]. The main issues encountered in motion recognition arise when more than one gesture is active during the recognition process, and a dynamic background also has a negative effect. In addition, loss of the gesture may be caused by occlusion among tracked hand gestures, by errors in region extraction from the tracked gesture, or by the effect of long distance on the region appearance. Table 3 presents a set of research papers that used different segmentation techniques based on motion recognition to detect the ROI.

Two stages for efficient hand detection were proposed in [54]. First, the hand is detected in each frame and its center point is used for tracking. Then, in the second stage, a matching model is applied to each type of gesture using a set of features extracted from the motion tracking in order to provide better classification; the main drawback is that skin color is affected by lighting variations, which leads to detection of non-skin color. A standard face detection algorithm and optical flow computation were used in [55] to give a user-centric coordinate frame in which motion features were used to recognize gestures for classification purposes using the multiclass boosting algorithm. A real-time dynamic hand gesture recognition system based on TOF was offered in [56], in which motion patterns were detected based on hand gestures received as input depth images. These motion patterns were compared with the hand motion classifications computed from real dataset videos, which does not require the use of a segmentation algorithm. The system provides good results except for the depth range limitation of the TOF camera. In [57], YUV color space was used, with the help of the CAMShift algorithm, to distinguish between background and skin color, and the naïve Bayes classifier was implemented to assist with gesture recognition. The proposed system faces some challenges, such as illumination variation, where light changes affect the result of the skin segmentation. Other challenges are the degrees of freedom of the gesture, which directly affect the output through rotation changes, and the hand position capture problem: if the hand appears in the corner of the frame and the dots that must cover the hand do not lie on the hand, the user gesture may fail to be captured. In addition, hand size differs considerably between humans and may cause a problem for the interaction system. However, the major remaining challenge is skin-like color, which affects the overall system and can abort the result. Figure 8 gives a simple example of hand motion recognition.

Figure 8. Example of motion recognition using frame difference subtraction to extract the hand feature, where the moving object, such as the hand, is extracted from the fixed background.
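As a rough illustration of color-probability tracking of a moving hand, the sketch below uses OpenCV's CAMShift with a hue-histogram back-projection. This is the standard OpenCV recipe rather than the YUV naïve Bayes probability map of [57], and the initial hand window is assumed to be known, for example from a skin or motion segmentation step.

```python
import cv2

def track_hand_camshift(cap, init_window):
    """Track a hand with CAMShift from an initial bounding box (x, y, w, h)."""
    ok, frame = cap.read()
    if not ok:
        return
    x, y, w, h = init_window
    roi = frame[y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])    # hue histogram of the hand
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = init_window
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)  # color probability map
        rot_box, window = cv2.CamShift(backproj, window, term)         # updated search window
        pts = cv2.boxPoints(rot_box).astype("int32")
        cv2.polylines(frame, [pts], True, (0, 255, 0), 2)              # draw tracked hand
        cv2.imshow("tracking", frame)
        if cv2.waitKey(30) == 27:                                      # Esc to stop
            break
```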

Table 3. A set of research papers that have used motion-based detection for hand gesture application.
Techniques/ Distance
Type of Feature Classify Recognition No. of Application Dataset Invariant
Author Resolution Methods for from
Camera Extract Type Algorithm Rate Gestures Area Type Factor
Segmentation Camera
other object
RGB, HSV, histogram Data set
off-the-shelf human–computer moving and
[54] – Y–Cb–Cr & hand gesture distribution 97.33% 10 gestures created by –
cameras interface background
motion tracking model author
issue
gesture Data set
Canon GL2 720 × 480 face detection & motion leave-one-out 7
[55] – recognition created by – –
camera pixels optical flow gesture cross-validation gestures
system author
depth motion interaction cardinal
time of flight 176 × 144 motion depth range
[56] information, patterns 95% 26 gestures with virtual directions 3000 mm
(TOF) SR4000 pixels gesture limitation
motion patterns compared environments dataset
changed
illumination,
YUV & human and Data set
naïve Bayes rotation
[57] digital camera – CAMShift hand gesture high unlimited machine created by –
classifier problem,
algorithm system author
position
problem
Table footer: –: none.

According to the information in Table 3, the system in the first row achieves a recognition rate of 97%; its hybrid approach based on skin and motion detection is more reliable for gesture recognition because the moving hand can be tracked using multiple track candidates that depend on standard deviation calculations for both the skin and the motion cues. Every single gesture is encoded as a chain code in order to model each gesture, which is a simple model compared with the HMM, and gestures are classified using a model of the histogram distribution. The system in the third row uses a depth camera based on TOF, where the motion pattern of a human arm model is utilized to define motion patterns; the authors confirm that using depth information for hand trajectory estimation improves the gesture recognition rate. Moreover, the proposed system needs no segmentation algorithm. The system was examined using 2D and 2.5D approaches, where 2.5D performs better than 2D and gives a recognition rate of 95%.

2.2.4. Skeleton-Based Recognition

Skeleton-based recognition specifies model parameters that can improve the detection of complex features [16]. Various representations of the skeleton data for the hand model can be used for classification; they describe geometric attributes and constraints and easily translate features and correlations of the data, focusing on geometric and statistical features. The most common features used are the joint orientations, the distances between joints, the skeletal joint locations, the angles between joints, and the trajectories and curvature of the joints. Table 4 presents a set of research papers that use different segmentation techniques based on skeletal recognition to detect the ROI.

Hand segmentation using the depth sensor of the Kinect camera, followed by locating the fingertips using 3D connections, Euclidean distance and geodesic distance over the hand skeleton pixels to increase accuracy, was proposed in [58]. A new 3D hand gesture recognition approach based on a deep learning model using parallel convolutional neural networks (CNNs) to process the positions of the hand skeleton joints was introduced in [59]; the proposed system has the limitation that it works only with complete sequences. In [60], the optimal viewpoint was estimated and the point cloud of the gesture was transformed using a curve skeleton to specify the topology, then Laplacian-based contraction was applied to obtain the skeleton points, and the Hungarian algorithm was applied to calculate the matching scores of the skeleton point sets; however, the joint tracking information acquired by the Kinect is not accurate enough, which gives results with constant vibration. A novel method based on skeletal features extracted from RGB-recorded video of sign language, which presents difficulties in extracting accurate skeletal data because of occlusions, was offered in [61]. A dynamic hand gesture approach using a depth and skeletal dataset for skeleton-based recognition was presented in [62], where a supervised learning classifier (SVM) with a linear kernel was used for classification. Another dynamic hand gesture recognition system, proposed in [63], used the Kinect sensor depth metadata for acquisition and segmentation and extracted an orientation feature; the support vector machine (SVM) algorithm and the HMM were utilized for classification and recognition to evaluate system performance, and the SVM gave better results than the HMM in specifications such as elapsed time and average recognition rate. A hybrid method for hand segmentation based on depth and color data acquired by the Kinect sensor with the help of skeletal data was proposed in [64]. In this method, an image threshold is applied to the depth frame and a super-pixel segmentation method is used to extract the hand from the color frame; the two results are then combined for robust segmentation. Figure 9 shows an example of skeleton recognition.

Figure 9. Example of skeleton recognition using a depth and skeleton dataset to represent the hand skeleton model [62].
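To make the skeletal features discussed in this subsection concrete, the sketch below (Python, illustrative only and not taken from any of the cited papers) computes joint-to-joint distances and one bone angle from 3D hand-joint coordinates; the joint names and layout are assumptions made for the example.

```python
import numpy as np

def joint_distance(a, b):
    """Euclidean distance between two 3D joints."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def joint_angle(parent, joint, child):
    """Angle (degrees) at `joint` formed by the bones to `parent` and `child`."""
    u = np.asarray(parent) - np.asarray(joint)
    v = np.asarray(child) - np.asarray(joint)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def skeleton_features(joints):
    """Simple feature vector (pairwise distances + one bone angle) from a
    dict of named 3D joint positions; the joint names are assumed here."""
    names = sorted(joints)
    dists = [joint_distance(joints[a], joints[b])
             for i, a in enumerate(names) for b in names[i + 1:]]
    angles = [joint_angle(joints["wrist"], joints["index_base"], joints["index_tip"])]
    return np.array(dists + angles)

# Example with made-up coordinates (metres).
hand = {"wrist": (0.0, 0.0, 0.60), "index_base": (0.02, 0.05, 0.58),
        "index_tip": (0.03, 0.09, 0.57)}
print(skeleton_features(hand))
```

Such a vector of distances and angles is what a classifier (SVM, CNN, etc.) would typically receive in the skeleton-based methods summarized in Table 4.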
Table 4. Set of research papers that have used skeleton-based recognition for hand gesture application.
Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Invariant Factor | Distance from Camera
[58] | Kinect camera depth sensor | 512 × 424 pixels | Euclidean distance & geodesic distance | skeleton pixels extracted | fingertip tracking method | – | – | real-time hand tracking | hand tracking | – | –
[59] | Intel RealSense depth camera | – | skeleton data | hand-skeletal joints' positions | convolutional neural network (CNN) | 91.28% / 84.35% | 14 / 28 gestures | classification method | Dynamic Hand Gesture-14/28 (DHG) dataset | only works on complete sequences | –
[60] | Kinect camera | 240 × 320 pixels | Laplacian-based contraction | skeleton point clouds | Hungarian algorithm | 80% | 12 gestures | hand gesture recognition method | ChaLearn Gesture Dataset (CGD2011) | lower performance in the 0° viewpoint condition | –
[61] | recorded RGB video sequence | – | vision-based approach & skeletal data | hand and body skeletal features | skeleton classification network | – | – | sign language recognition | LSA64 dataset | difficulties in extracting skeletal data because of occlusions | –
[62] | Intel RealSense depth camera | 640 × 480 pixels | depth and skeletal dataset | hand gesture | supervised learning classifier: support vector machine (SVM) with a linear kernel | 88.24% / 81.90% | 14 / 28 gestures | hand gesture application | SHREC 2017 track "3D Hand Skeletal Dataset" | – | –
[63] | Kinect v2 camera sensor | 512 × 424 pixels | depth metadata | dynamic hand gesture | SVM | 95.42% | 10 gestures (numbers) / 26 gestures (letters) | Arabic numbers (0–9) and letters | author's own dataset | low recognition rate for "O", "T" and "2" | –
[64] | Kinect RGB camera & depth sensor | 640 × 480 | skeleton data | hand blob | – | – | – | hand gesture | Malaysian sign language | – | –
Table footer: –: none.
According to the information in Table 4, the depth camera provides good accuracy for segmentation because it is not affected by lighting variations or cluttered backgrounds. However, the main issue is the detection range. The Kinect V1 sensor has an embedded system that returns the information received by the depth sensor as metadata describing the coordinates of the human body joints. The Kinect V1 provides information that can be used to track up to 20 skeletal joints, which helps to model the hand skeleton, while the Kinect V2 sensor can track 25 joints, and up to six people at the same time with full joint tracking, over a detection range of 0.5–4.5 m.
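As a rough illustration of how such skeletal metadata might be consumed, the hypothetical sketch below extracts the right-hand joints from a 25-joint, Kinect-V2-style skeleton array and checks the stated 0.5–4.5 m detection range; the joint indices and array layout are assumptions for the example, not the actual SDK interface.

```python
import numpy as np

# Assumed joint indices for a 25-joint Kinect-V2-style skeleton (illustrative only).
WRIST_RIGHT, HAND_RIGHT, HAND_TIP_RIGHT, THUMB_RIGHT = 10, 11, 23, 24
MIN_RANGE_M, MAX_RANGE_M = 0.5, 4.5   # stated detection range of the sensor

def right_hand_joints(skeleton):
    """Extract the right-hand joints from a (25, 3) array of joint positions (metres)."""
    idx = [WRIST_RIGHT, HAND_RIGHT, HAND_TIP_RIGHT, THUMB_RIGHT]
    return skeleton[idx]

def hand_in_range(skeleton):
    """Check that the tracked hand lies inside the usable depth range."""
    z = skeleton[HAND_RIGHT][2]
    return MIN_RANGE_M <= z <= MAX_RANGE_M

# Example: a dummy skeleton placed at about 1.2 m from the camera.
skel = np.zeros((25, 3))
skel[:, 2] = 1.2
print(right_hand_joints(skel).shape, hand_in_range(skel))
```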
2.2.5. Depth-Based Recognition
Various approaches have been proposed for solving hand gesture recognition using different types of cameras.
A depth camera provides 3D geometric information about the object [65]. Two major approaches have previously
been utilized: TOF principles and light coding. The 3D data from a depth camera directly reflect the depth
field, whereas a color image contains only a projection [66].
With this approach, lighting, shade and color do not affect the resulting image. However, the cost,
size and availability of the depth camera will limit its use [67]. Table 5 presents a set of research papers
that use different segmentation techniques based on depth recognition to detect ROI.
The finger earth mover’s distance (FEMD) approach was evaluated in terms of speed and precision,
and then compared with the shape-matching algorithm using the depth map and color image acquired
by the Kinect camera [65]. Improved depth threshold segmentation was offered in [68], by combining
depth and color information using the hierarchical scan method, then hand segmentation by the local
neighbor method; this approach gives a result over a range of up to two meters. A new method
was proposed in [69], based on a near depth range of less than 0.5 m where skeletal data were not
provided by Kinect. This method was implemented using two image frames, depth and infrared.
A depth threshold was used in order to segment the hand, then a K-means algorithm was applied to
obtain the pixels of both of the user's hands [70]. Next, Graham's scan algorithm was used to detect the convex
hulls of the hand in order to merge with the result of the contour tracing algorithm to detect the
fingertip. The depth image frame was analyzed to extract 3D hand gestures in real time, which were
executed using frame differences to detect moving objects [71]. The foremost region was utilized and
classified using an automatic state machine algorithm. The skin–motion detection technique was used
to detect the hand, then Hu moments were applied for feature extraction, after which an HMM was used
for gesture recognition [72]. The depth range was utilized for hand segmentation, then Otsu's method
was used to apply a threshold value to the color frame after it was converted into a gray frame [14].
A kNN classifier was then used to classify gestures. In [73], where the hand was segmented based on
depth information using a distance method, background subtraction and iterative techniques were
applied to remove the depth image shadow and decrease noise. In [74], the segmentation used 3D
depth data selected using a threshold range. In [75], the proposed algorithm used an RGB color frame,
which was converted to a binary frame using Otsu's global threshold. After that, a depth range was selected
for hand segmentation and then the two methods were aligned. Finally, the kNN algorithm was used
with Euclidian distance for finger classification. Depth data and an RGB frame were used together
for robust hand segmentation and the segmented hand matched with the dataset classifier to identify
the fingertip [76]. This framework was based on distance from the device and shape based matching.
Fingertip selection using a depth threshold and the K-curvature algorithm based on depth data was
presented in [77]. A novel segmentation method was implemented in [78] by integrating RGB and
depth data, and classification was offered using speeded up robust features (SURF). Depth information
with skeletal and color data were used in [79], to detect the hand, then the segmented hand was
matched with the dataset using SVM and artificial neural networks (ANN) for recognition. The authors
concluded that ANN was more accurate than SVM. Figure 10 shows an example of segmentation using
Kinect depth sensor.
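The generic depth-based pipeline described above (depth thresholding, largest-contour extraction, convex hull and fingertip/finger counting) can be sketched with OpenCV as follows; the depth range, defect threshold and synthetic test frame are illustrative assumptions rather than the parameters of any cited system.

```python
import cv2
import numpy as np

def count_fingers(depth_mm, near=400, far=800, min_defect_px=20):
    """Rough finger count from a single depth frame (uint16, millimetres)."""
    # 1. Depth threshold: keep pixels inside the assumed hand range.
    mask = ((depth_mm >= near) & (depth_mm <= far)).astype(np.uint8) * 255
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    # 2. Take the largest contour as the hand region.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)

    # 3. Convex hull + convexity defects approximate the valleys between fingers.
    hull = cv2.convexHull(hand, returnPoints=False)
    if hull is None or len(hull) < 4:
        return 0
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    valleys = sum(1 for i in range(defects.shape[0])
                  if defects[i, 0, 3] / 256.0 > min_defect_px)
    return valleys + 1 if valleys else 0

# Example on a synthetic frame; a real system would read frames from the sensor.
frame = np.full((424, 512), 1500, dtype=np.uint16)
yy, xx = np.ogrid[:424, :512]
frame[(yy - 212) ** 2 + (xx - 256) ** 2 < 60 ** 2] = 600  # fake hand blob at ~0.6 m
print("fingers:", count_fingers(frame))
```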
Figure 10. Depth-based recognition: (a) hand joint distance from camera; (b) different feature extraction using Kinect depth sensor.
Table 5. Set of research papers that have used depth-based detection for hand gesture and finger counting application.
Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Invariant Factor | Distance from Camera
[65] | Kinect V1 | RGB 640 × 480, depth 320 × 240 | threshold & near-convex shape | finger gesture | finger–earth mover's distance (FEMD) | 93.9% | 10 gestures | human–computer interaction (HCI) | – | –
[68] | Kinect V2 | RGB 1920 × 1080, depth 512 × 424 | local neighbor method & threshold segmentation | finger gesture | convex hull fingertip detection algorithm | 96% | 6 gestures | natural human–robot interaction | – | (500–2000) mm
[69] | Kinect V2 | infrared 512 × 424, depth 512 × 424 | operation of depth sensor and infrared image segmentation | finger counting & hand gesture | number of separate areas | – | two hand gestures | finger count & mouse-movement controlling interaction | – | <500 mm
[70] | Kinect V1 | RGB 640 × 480, depth 320 × 240 | depth thresholds | finger counting & hand gesture | finger name collection & vector matching classifier | 84% one hand, 90% two hands | 9 gestures | chatting with speech | – | (500–800) mm
[71] | Kinect V1 | RGB 640 × 480, depth 320 × 240 | frame difference algorithm | hand gesture | automatic state machine (ASM) | 94% | hand gestures | human–computer interaction | – | –
[72] | Kinect V1 | RGB 640 × 480, depth 320 × 240 | skin & motion detection & Hu moments | hand gesture | discrete hidden Markov model (DHMM) | – | 10 gestures | human–computer interfacing | – | –
[14] | Kinect V1 | depth 640 × 480 | range of depth image | hand gestures 1–5 | kNN classifier & Euclidian distance | 88% | 5 gestures | electronic home appliances | – | (250–650) mm
[73] | Kinect V1 | depth 640 × 480 | distance method | hand gesture | – | – | hand gestures | human–computer interaction (HCI) | – | –
[74] | Kinect V1 | depth 640 × 480 | threshold range | hand gesture | – | – | hand gestures | hand gesture rehabilitation system | – | 400–1500 mm
[75] | Kinect V2 | RGB 1920 × 1080, depth 512 × 424 | Otsu's global threshold | finger gesture | kNN classifier & Euclidian distance | 90% | finger count | human–computer interaction (HCI) | hand not identified if not connected with the boundary | 250–650 mm
[76] | Kinect V1 | RGB 640 × 480, depth 640 × 480 | depth-based data and RGB data together | finger gesture | distance from the device and shape-based matching detection | 91% | 6 gestures | finger mouse interface | – | 500–800 mm
[77] | Kinect V1 | depth 640 × 480 | depth threshold and K-curvature | finger counting | depth threshold and K-curvature | 73.7% | 5 gestures | picture selection application | fingertips detected though the hand was moving or rotating | –
[78] | Kinect V1 | RGB 640 × 480, depth 320 × 240 | integration of RGB and depth information | hand gesture | forward recursion & SURF | 90% | hand gestures | virtual environment | – | –
[79] | Kinect V2 | depth 512 × 424 | skeletal data stream & depth & color data streams | hand gesture | support vector machine (SVM) & artificial neural networks (ANN) | 93.4% for SVM, 98.2% for ANN | 24 alphabet hand gestures | American Sign Language | – | 500–800 mm
Table footer: –: none.
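As a minimal illustration of the frame-difference step used, for example, by [71] in Table 5 to detect the moving hand, the sketch below marks pixels that change between two consecutive grey frames; the threshold value and synthetic frames are assumptions for the example.

```python
import numpy as np

def moving_region_mask(prev_gray, curr_gray, thresh=25):
    """Binary mask of pixels that changed between two consecutive grey frames,
    a simple stand-in for the frame-difference step described above."""
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return (diff > thresh).astype(np.uint8) * 255

# Example with two synthetic frames in which a small patch "moves".
f0 = np.zeros((240, 320), dtype=np.uint8)
f1 = f0.copy()
f1[100:120, 150:170] = 200
mask = moving_region_mask(f0, f1)
print(int(mask.sum() / 255), "changed pixels")
```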
2.2.6. 3D Model-Based Recognition

The 3D model approach essentially depends on a 3D kinematic hand model that has a large number of degrees of freedom, where the hand parameters are estimated by comparing the input image with the two-dimensional appearance projected by the three-dimensional hand model. In addition, the 3D model represents the human hand for pose estimation by forming a volumetric, skeletal or 3D model that is identical to the user's hand, and the 3D model parameters are updated through the matching process. A depth parameter can be added to the model to increase accuracy. Table 6 presents a set of research papers based on 3D models.

A study by Tekin et al. [80] proposed a new model to understand interactions between 3D hands and objects using a single RGB image, where a single image is trained end-to-end using a neural network and the hand and object poses are jointly estimated in 3D. Wan et al. [81] proposed 3D hand pose estimation from a single depth map using a self-supervision neural network that approximates the hand surface with a set of spheres. A novel method of estimating the full 3D hand shape and pose was presented by Ge et al. [82] based on a single RGB image, where a Graph Convolutional Neural Network (Graph CNN) was utilized to reconstruct a full 3D mesh of the hand surface. Another study by Taylor et al. [83] proposed a new system for tracking the human hand by combining a surface model with a new energy function that is continuously optimized jointly over pose and correspondences and can track the hand several meters from the camera. Malik et al. [84] proposed a novel CNN-based algorithm that automatically learns to segment the hand from a raw depth image and estimates the 3D hand pose, including the structural constraints of the hand skeleton. Tsoli et al. [85] presented a novel method to track a complex deformable object in interaction with a hand. Chen et al. [86] proposed a self-organizing hand network, SO-HandNet, which achieves 3D hand pose estimation via semi-supervised learning, where an end-to-end regression method is utilized to estimate the 3D hand pose from a single depth image. Another study by Ge et al. [87] proposed a point-to-point regression method for 3D hand pose estimation in single depth images. Wu et al. [88] proposed a novel hand pose estimation from a single depth image by combining a detection-based method and a regression-based method to improve accuracy. Cai et al. [89] presented a way to adapt a weakly labeled real-world dataset from a fully annotated synthetic dataset with the aid of low-cost depth images, taking only RGB inputs for 3D joint predictions. Figure 11 shows an example of a 3D hand model interacting with a virtual system.

Figure 11. 3D hand model interaction with virtual system [83].

There are some reported limitations: the 3D hand model requires a large dataset of images to formulate the characteristic shapes of the hand in the multi-view case; moreover, the matching process is time consuming and computationally costly, and is less able to handle unclear views.
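The model-fitting idea behind these 3D approaches can be illustrated with a toy example: hand model parameters are adjusted so that a small sphere-set model explains the observed 3D points, loosely in the spirit of the sphere approximation in [81]. The parameterization, cost function and optimizer below are illustrative assumptions, not the method of any cited paper.

```python
import numpy as np
from scipy.optimize import minimize

def sphere_centres(params):
    """Toy kinematic 'hand': a palm sphere plus one finger of two spheres.
    params = [palm_x, palm_y, palm_z, finger_angle] (illustrative only)."""
    px, py, pz, ang = params
    palm = np.array([px, py, pz])
    seg = 0.04 * np.array([np.cos(ang), np.sin(ang), 0.0])   # 4 cm finger segments
    return np.stack([palm, palm + seg, palm + 2 * seg])

def fitting_cost(params, observed_pts):
    """Mean distance from each observed 3D point to the nearest model sphere centre."""
    centres = sphere_centres(params)
    d = np.linalg.norm(observed_pts[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Synthetic "depth" points sampled near a hand at (0, 0, 0.6 m) with a bent finger.
rng = np.random.default_rng(0)
truth = sphere_centres([0.0, 0.0, 0.6, 0.8])
observed = np.concatenate([c + 0.003 * rng.standard_normal((50, 3)) for c in truth])

fit = minimize(fitting_cost, x0=[0.05, -0.05, 0.55, 0.0], args=(observed,),
               method="Nelder-Mead")
print("estimated parameters:", np.round(fit.x, 3))
```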
Table 6. Set of research papers that have used 3D model-based recognition for HCI, VR and human behavior application.
Author | Type of Camera | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Type of Error | Hardware Run | Application Area | Dataset Type | Runtime Speed
[80] | RGB camera | network directly predicts the control points in 3D | 3D hand poses, 6D object poses, object classes and action categories | PnP algorithm & single-shot neural network | fingertips 48.4 mm, object coordinates 23.7 mm | real-time speed of 25 fps on an NVIDIA Tesla M40 | framework for understanding human behavior through 3D hand and object interactions | First-Person Hand Action (FPHA) dataset | 25 fps
[81] | PrimeSense depth cameras | depth maps | 3D hand pose estimation & sphere model renderings | pose estimation neural network | mean joint error 12.6 mm (stack = 1), 12.3 mm (stack = 2) | – | hand pose estimation using a self-supervision method | NYU Hand Pose Dataset | –
[82] | RGB-D camera | single RGB image fed directly to the network | 3D hand shape and pose | networks trained with full supervision | mesh error 7.95 mm, pose error 8.03 mm | Nvidia GTX 1080 GPU | model to estimate 3D hand shape from a monocular RGB image | Stereo hand pose tracking benchmark (STB) & Rendered Hand Pose Dataset (RHD) | 50 fps
[83] | Kinect V2 camera | segmentation mask from the Kinect body tracker | hand | machine learning | marker error 5% of frames in each sequence & pixel classification error | CPU only | interactions with virtual and augmented worlds | finger paint dataset & subset of the NYU dataset used for comparison | high frame rate
[84] | raw depth image | CNN-based hand segmentation | 3D hand pose regression pipeline | CNN-based algorithm | 3D joint location error 12.9 mm | Nvidia Geforce GTX 1080 Ti GPU | applications of virtual reality (VR) | dataset containing 8000 original depth images created by the authors | –
[85] | Kinect V2 camera | bounding box around the hand & hand mask | appearance and kinematics of the hand | – | percentage of template vertices over all frames | – | interaction with a deformable object & tracking | synthetic dataset generated with the Blender modeling software | –
[86] | RGBD data from 3 Kinect devices | regression-based method & hierarchical feature extraction | 3D hand pose estimation | 3D hand pose estimation via semi-supervised learning | mean error 7.7 mm | NVIDIA TITAN Xp GPU | human–computer interaction (HCI), computer graphics and virtual/augmented reality | ICVL, MSRA and NYU datasets for evaluation | 58 fps
[87] | depth image | single depth images | 3D point cloud of hand as network input, heat-maps as output | 3D hand pose | mean error distances | Nvidia TITAN Xp GPU | HCI, computer graphics and virtual/augmented reality | NYU, ICVL and MSRA datasets for evaluation | 41.8 fps
[88] | depth images | predicting heat maps of hand joints in detection-based methods | dense feature maps through intermediate supervision in a regression-based framework | hand pose estimation | mean error 6.68 mm, maximal per-joint error 8.73 mm | GeForce GTX 1080 Ti | HCI, virtual and mixed reality | 'HANDS 2017' challenge dataset & first-person hand action dataset for evaluation | –
[89] | RGB-D cameras | – | 3D hand pose estimation | weakly supervised method | mean error 0.6 mm | GeForce GTX 1080 GPU with CUDA 8.0 | HCI, virtual and mixed reality | Rendered hand pose (RHD) dataset | –
Table footer: –: none.
2.2.7. Deep-Learning Based Recognition

Artificial intelligence offers a good and reliable technique for a wide range of modern applications because it uses a learning principle. Deep learning uses multiple layers to learn from the data and gives good predictions as output. The main challenge facing this technique is the dataset required to train the algorithm, which may affect processing time. Table 7 presents a set of research papers that use different techniques based on deep-learning recognition to detect the ROI.

The authors of [90] proposed seven popular hand gestures captured by a mobile camera, generating 24,698 image frames. Feature extraction and an adapted deep convolutional neural network (ADCNN) were utilized for hand classification. The experiment gave results of 100% for the training data and 99% for the testing data, with an execution time of 15,598 s. Another proposed system used a webcam in order to track the hand, then used a skin color (Y–Cb–Cr color space) technique and morphology to remove the background; in addition, kernel correlation filters (KCF) were used to track the ROI. The resulting image is fed into a deep convolutional neural network (CNN), where the CNN model was used to compare the performance of two modified networks based on AlexNet and VGG Net. The recognition rates for the training and testing data were 99.90% and 95.61%, respectively [91]. A new method based on a deep convolutional neural network, where the resized image is fed directly into the network, ignoring the segmentation and detection stages in order to classify hand gestures directly, was proposed in [92]; the system works in real time and gives results of 97.1% with a simple background and 85.3% with a complex background. In [93], the depth image produced by the Kinect sensor was used to segment the color image, then skin color modeling was combined with a convolutional neural network, where the error back-propagation algorithm was applied to modify the thresholds and weights of the neural network, and an SVM classification algorithm was added to the network to enhance the results. Another research study used a Gaussian Mixture Model (GMM) to filter out non-skin colors of an image, which was then used to train a CNN to recognize seven hand gestures, with an average recognition rate of 95.96% [94]. The next proposed system used a long-term recurrent convolutional network-based action classifier, where multiple frames sampled from the recorded video sequence are fed to the network; in order to extract the representative frames, a semantic segmentation-based de-convolutional neural network is used, and tiled image patterns and tiled binary patterns are utilized to train the de-convolutional network [95]. A double-channel convolutional neural network (DC-CNN) was proposed in [96], where the original image is preprocessed to detect the edges of the hand before being fed to the network; each of the two CNN channels has separate weights, and a softmax classifier is used to classify the output results. The proposed system gives a recognition rate of 98.02%. Finally, a new neural network based on SPD manifold learning for skeleton-based hand gesture recognition was proposed in [97]. Figure 12 shows an example of a deep-learning convolutional neural network.

Figure 12. Simple example of a deep learning convolutional neural network architecture.
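A minimal sketch of the kind of CNN classifier described above, in which resized gesture images are fed directly to a small convolutional network, is given below using Keras; the input size, number of classes and layer sizes are illustrative assumptions and do not reproduce any cited architecture.

```python
import tensorflow as tf

NUM_GESTURES = 7           # e.g. seven gesture classes, as in several studies above
INPUT_SHAPE = (64, 64, 3)  # assumed size of the resized input images

def build_gesture_cnn():
    """Small CNN: two conv/pool blocks followed by a dense softmax classifier."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=INPUT_SHAPE),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(NUM_GESTURES, activation="softmax"),
    ])

model = build_gesture_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then be, e.g.:
# model.fit(train_images, train_labels, epochs=10,
#           validation_data=(val_images, val_labels))
```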
Table 7. Set of research papers that have used deep-learning-based recognition for hand gesture application.
Author | Type of Camera | Resolution | Techniques/Methods for Segmentation | Feature Extract Type | Classify Algorithm | Recognition Rate | No. of Gestures | Application Area | Dataset Type | Hardware Run
[90] | different mobile cameras | HD and 4k | features extraction by CNN | hand gestures | Adapted Deep Convolutional Neural Network (ADCNN) | training set 100%, test set 99% | 7 hand gestures | (HCI) communication for people injured by stroke | created from recorded video frames | Core i7-6700 CPU @ 3.40 GHz
[91] | webcam | – | skin color detection and morphology & background subtraction | hand gestures | deep convolutional neural network (CNN) | training set 99.9%, test set 95.61% | 6 hand gestures | home appliance control (smart homes) | 4800 images collected for training and 300 for testing | –
[92] | RGB image | 640 × 480 pixels | no segmentation stage; image fed directly to the CNN after resizing | hand gestures | deep convolutional neural network | simple backgrounds 97.1%, complex background 85.3% | 7 hand gestures | command consumer electronics devices such as mobile phones and TVs | Mantecón et al. dataset for direct testing | GPU with 1664 cores, base clock of 1050 MHz
[93] | Kinect | – | skin color modeling combined with convolution neural network image | hand gestures | convolution neural network & support vector machine | 98.52% | 8 hand gestures | – | image information collected by Kinect | CPU E5-1620 v4, 3.50 GHz
[94] | Kinect | image size 200 × 200 | skin color (Y–Cb–Cr color space) & Gaussian Mixture model | hand gestures | convolution neural network | average 95.96% | 7 hand gestures | human hand gesture recognition system | image information collected by Kinect | –
[95] | recorded video sequences | – | semantic segmentation-based deconvolution neural network | hand gesture motion | deep convolution network (LRCN) | 95% | 9 hand gestures | intelligent vehicle applications | Cambridge gesture recognition dataset | Nvidia Geforce GTX 980 graphics
[96] | image | original images in the database 248 × 256 or 128 × 128 pixels | Canny operator edge detection | hand gesture & posture | double-channel convolutional neural network (DC-CNN) with softmax classifier | 98.02% | 10 hand gestures | man–machine interaction | Jochen Triesch Database (JTD) & NAO Camera hand posture Database (NCD) | Core i5 processor
[97] | Kinect | – | – | hand gesture | skeleton-based neural network based on SPD manifold learning | 85.39% | 14 hand gestures | – | Dynamic Hand Gesture (DHG) dataset & First-Person Hand Action (FPHA) dataset | non-optimized CPU 3.4 GHz
Table footer: –: none.
3. Application Areas of Hand Gesture Recognition Systems

Research into hand gestures has become an exciting and relevant field; it offers a means of natural interaction and reduces the cost of the sensors used in data gloves. Conventional interactive methods depend on different devices such as a mouse, keyboard, touch screen, joystick for gaming and consoles for machine control. The following sections describe some popular applications of hand gestures. Figure 13 shows the most common application areas that deal with hand gesture recognition techniques.

Figure 13. Most common application areas of hand gesture interaction systems (the image of Figure 13 is adapted from [12,14,42,76,83,98,99]).

3.1. Clinical and Health
During clinical operations, a surgeon may need details about the patient's entire body structure or a detailed organ model in order to shorten the operating time or increase the accuracy of the result. This is achieved by using a medical imaging system such as an MRI, CT or X-ray system [10,99], which collects data from the patient's body and displays them on the screen as a detailed image. The surgeon can interact with the viewed images by performing hand gestures in front of the camera using a computer vision technique. These gestures can enable operations such as zooming, rotating, image cropping and going to the next or previous slide without using any peripheral device such as a mouse, keyboard or touch screen. Any additional equipment requires sterilization, which can be difficult in the case of keyboards and touch screens. In addition, hand gestures can be used for assistive purposes such as wheelchair control [43].
3.2. Sign Language Recognition
Sign language is an alternative method used by people who are unable to communicate with others by speech. It consists of a set of gestures wherein every gesture represents one letter, number or expression. Many research papers have proposed recognition of sign language for deaf-mute people, using a glove-attached sensor worn on the hand that gives responses according to hand movement. Alternatively, it may involve uncovered hand interaction with the camera, using computer vision techniques to identify the gesture. For both approaches mentioned above, the dataset used for classification of gestures is matched against a real-time gesture made by the user [11,42,50].

3.3. Robot Control
Robot technology is used in many application fields such as industry, assistive services [100],
stores, sports and entertainment. Robotic control systems use machine learning techniques, artificial
intelligence and complex algorithms to execute a specific task, which lets the robotic system interact
naturally with the environment and make an independent decision. Some research proposes computer
vision technology with a robot to build assistive systems for elderly people. Other research uses
computer vision to enable a robot to ask a human for a proper path inside a specific building [12].

3.4. Virtual Environment


Virtual environments are based on a 3D model that needs a 3D gesture recognition system in
order to interact in real time as an HCI. These gestures may be used for modification and viewing or
for recreational purposes, such as playing a virtual piano. The gesture recognition system utilizes a
dataset to match it with an acquired gesture in real time [13,78,83].

3.5. Home Automation


Hand gestures can be used efficiently for home automation. Shaking a hand or performing some
gesture can easily enable control of lighting, fans, television, radio, etc. They can be used to improve
older people’s quality of life [14].

3.6. Personal Computer and Tablet


Hand gestures can be used as an alternative input device that enables interaction with a computer
without a mouse or keyboard, such as dragging, dropping and moving files through the desktop
environment, as well as cut and paste operations [19,69,76]. Moreover, they can be used to control slide
show presentations [15]. In addition, they are used with a tablet to permit deaf-mute people to interact
with other people by moving their hand in front of the tablet's camera. This requires the installation of an
application that translates sign language to text, which is displayed on the screen. This is analogous to
the conversion of acquired voice to text.

3.7. Gestures for Gaming


The best example of gesture interaction for gaming purposes is the Microsoft Kinect Xbox,
which has a camera placed over the screen and connects with the Xbox device through the cable port.
The user can interact with the game by using hand motions and body movements that are tracked by
the Kinect camera sensor [16,98].

4. Research Gaps and Challenges


From the previous sections, it is easy to identify the research gap, since most research studies
focus on computer applications, sign language and interaction with a 3D object through a virtual
environment. However, many research papers deal with enhancing frameworks for hand gesture
recognition or developing new algorithms rather than executing a practical application with regard to
health care. The biggest challenge encountered by the researcher is in designing a robust framework
that overcomes the most common issues with fewer limitations and gives an accurate and reliable
result. Most proposed hand gesture systems can be divided into two categories of computer vision
techniques. First, a simple approach uses image processing techniques via the OpenNI or OpenCV libraries, and possibly other tools, to provide interaction in real time, which is time consuming because of the real-time processing. This has some limitations, such as background issues, illumination variation, distance limits and multi-object or multi-gesture problems. A second approach matches the input gesture against a dataset of gestures, where considerably more complex patterns require more complex algorithms; deep learning and other artificial intelligence techniques are used to match the interaction gesture in real time with dataset gestures already containing specific postures or gestures.
Although this approach can identify a large number of gestures, it has some drawbacks, such as missing some gestures because of differences in the accuracy of the classification algorithms. In addition, it takes more time than the first approach because of the dataset matching step when a large dataset is used. Furthermore, the dataset of gestures cannot be reused by other frameworks.

5. Conclusions
Hand gesture recognition addresses a fault in interaction systems. Controlling things by hand
is more natural, easier, more flexible and cheaper, and there is no need to fix problems caused by
hardware devices, since none are required. From the previous sections, it is clear that much effort needs
to be put into developing reliable and robust algorithms, aided by camera sensors with suitable
characteristics, in order to overcome common issues and achieve reliable results. Each technique
mentioned above, however, has its advantages and disadvantages and may perform well in some
challenges while being inferior in others.

Author Contributions: Conceptualization, A.A.-N. & M.O.; funding acquisition, A.A.-N. & J.C.; investigation,
M.O.; methodology, M.O. & A.A.-N.; project administration, A.A.-N. and J.C.; supervision, A.A.-N. & J.C.; writing–
original draft, M.O.; writing– review & editing, M.O., A.A.-N. & J.C. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: The authors would like to thank the staff in Electrical Engineering Technical College, Middle
Technical University, Baghdad, Iraq and the participants for their support to conduct the experiments.
Conflicts of Interest: The authors of this manuscript have no conflicts of interest relevant to this work.

References
1. Zhigang, F. Computer gesture input and its application in human computer interaction. Mini Micro Syst.
1999, 6, 418–421.
2. Mitra, S.; Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2007,
37, 311–324. [CrossRef]
3. Ahuja, M.K.; Singh, A. Static vision based Hand Gesture recognition using principal component analysis.
In Proceedings of the 2015 IEEE 3rd International Conference on MOOCs, Innovation and Technology in
Education (MITE), Amritsar, India, 1–2 October 2015; pp. 402–406.
4. Kramer, R.K.; Majidi, C.; Sahai, R.; Wood, R.J. Soft curvature sensors for joint angle proprioception.
In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San
Francisco, CA, USA, 25–30 September 2011; pp. 1919–1926.
5. Jesperson, E.; Neuman, M.R. A thin film strain gauge angular displacement sensor for measuring finger joint
angles. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and
Biology Society, New Orleans, LA, USA, 4–7 November 1988; pp. 807–vol.
6. Fujiwara, E.; dos Santos, M.F.M.; Suzuki, C.K. Flexible optical fiber bending transducer for application in
glove-based sensors. IEEE Sens. J. 2014, 14, 3631–3636. [CrossRef]
7. Shrote, S.B.; Deshpande, M.; Deshmukh, P.; Mathapati, S. Assistive Translator for Deaf & Dumb People. Int. J.
Electron. Commun. Comput. Eng. 2014, 5, 86–89.
8. Gupta, H.P.; Chudgar, H.S.; Mukherjee, S.; Dutta, T.; Sharma, K. A continuous hand gestures recognition
technique for human-machine interaction using accelerometer and gyroscope sensors. IEEE Sens. J. 2016, 16,
6425–6432. [CrossRef]
9. Lamberti, L.; Camastra, F. Real-time hand gesture recognition using a color glove. In Proceedings of
the International Conference on Image Analysis and Processing, Ravenna, Italy, 14–16 September 2011;
pp. 365–373.
10. Wachs, J.P.; Kölsch, M.; Stern, H.; Edan, Y. Vision-based hand-gesture applications. Commun. ACM 2011, 54,
60–71. [CrossRef]
11. Pansare, J.R.; Gawande, S.H.; Ingle, M. Real-time static hand gesture recognition for American Sign Language
(ASL) in complex background. JSIP 2012, 3, 22132. [CrossRef]
12. Van den Bergh, M.; Carton, D.; De Nijs, R.; Mitsou, N.; Landsiedel, C.; Kuehnlenz, K.; Wollherr, D.;
Van Gool, L.; Buss, M. Real-time 3D hand gesture interaction with a robot for understanding directions from
humans. In Proceedings of the 2011 Ro-Man, Atlanta, GA, USA, 31 July–3 August 2011; pp. 357–362.
13. Wang, R.Y.; Popović, J. Real-time hand-tracking with a color glove. ACM Trans. Graph. 2009, 28, 1–8.
14. Desai, S.; Desai, A. Human Computer Interaction through hand gestures for home automation using
Microsoft Kinect. In Proceedings of the International Conference on Communication and Networks, Xi’an,
China, 10–12 October 2017; pp. 19–29.
15. Rajesh, R.J.; Nagarjunan, D.; Arunachalam, R.M.; Aarthi, R. Distance Transform Based Hand Gestures
Recognition for PowerPoint Presentation Navigation. Adv. Comput. 2012, 3, 41.
16. Kaur, H.; Rani, J. A review: Study of various techniques of Hand gesture recognition. In Proceedings of
the 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems
(ICPEICES), Delhi, India, 4–6 July 2016; pp. 1–5.
17. Murthy, G.R.S.; Jadon, R.S. A review of vision based hand gestures recognition. Int. J. Inf. Technol. Knowl.
Manag. 2009, 2, 405–410.
18. Khan, R.Z.; Ibraheem, N.A. Hand gesture recognition: A literature review. Int. J. Artif. Intell. Appl. 2012, 3,
161. [CrossRef]
19. Suriya, R.; Vijayachamundeeswari, V. A survey on hand gesture recognition for simple mouse control.
In Proceedings of the International Conference on Information Communication and Embedded Systems
(ICICES2014), Chennai, India, 27–28 February 2014; pp. 1–5.
20. Sonkusare, J.S.; Chopade, N.B.; Sor, R.; Tade, S.L. A review on hand gesture recognition system. In Proceedings
of the 2015 International Conference on Computing Communication Control and Automation, Pune, India,
26–27 February 2015; pp. 790–794.
21. Garg, P.; Aggarwal, N.; Sofat, S. Vision based hand gesture recognition. World Acad. Sci. Eng. Technol. 2009,
49, 972–977.
22. Dipietro, L.; Sabatini, A.M.; Member, S.; Dario, P. A survey of glove-based systems and their applications.
IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 461–482. [CrossRef]
23. LaViola, J. A survey of hand posture and gesture recognition techniques and technology. Brown Univ. Provid.
RI 1999, 29. Technical Report no. CS-99-11.
24. Ibraheem, N.A.; Khan, R.Z. Survey on various gesture recognition technologies and techniques. Int. J.
Comput. Appl. 2012, 50, 38–44.
25. Hasan, M.M.; Mishra, P.K. Hand gesture modeling and recognition using geometric features: A review.
Can. J. Image Process. Comput. Vis. 2012, 3, 12–26.
26. Shaik, K.B.; Ganesan, P.; Kalist, V.; Sathish, B.S.; Jenitha, J.M.M. Comparative study of skin color detection
and segmentation in HSV and YCbCr color space. Procedia Comput. Sci. 2015, 57, 41–48. [CrossRef]
27. Ganesan, P.; Rajini, V. YIQ color space based satellite image segmentation using modified FCM clustering
and histogram equalization. In Proceedings of the 2014 International Conference on Advances in Electrical
Engineering (ICAEE), Vellore, India, 9–11 January 2014; pp. 1–5.
28. Brand, J.; Mason, J.S. A comparative assessment of three approaches to pixel-level human skin-detection.
In Proceedings of the 15th International Conference on Pattern Recognition. ICPR-2000, Barcelona, Spain,
3–7 September 2000; Volume 1, pp. 1056–1059.
29. Jones, M.J.; Rehg, J.M. Statistical color models with application to skin detection. Int. J. Comput. Vis. 2002, 46,
81–96. [CrossRef]
30. Brown, D.A.; Craw, I.; Lewthwaite, J. A som based approach to skin detection with application in real time
systems. BMVC 2001, 1, 491–500.
31. Zarit, B.D.; Super, B.J.; Quek, F.K.H. Comparison of five color models in skin pixel classification. In Proceedings
of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time
Systems, In Conjunction with ICCV’99 (Cat. No. PR00378). Corfu, Greece, 26–27 September 1999; pp. 58–63.
32. Albiol, A.; Torres, L.; Delp, E.J. Optimum color spaces for skin detection. In Proceedings of the 2001
International Conference on Image Processing (Cat. No. 01CH37205), Thessaloniki, Greece, 7–10 October
2001; Volume 1, pp. 122–124.
33. Sigal, L.; Sclaroff, S.; Athitsos, V. Estimation and prediction of evolving color distributions for skin
segmentation under varying illumination. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2000 (Cat. No. PR00662). Hilton Head Island, SC, USA, 15 June 2000; Volume 2,
pp. 152–159.
34. Chai, D.; Bouzerdoum, A. A Bayesian approach to skin color classification in YCbCr color space.
In Proceedings of the 2000 TENCON Proceedings. Intelligent Systems and Technologies for the New
Millennium (Cat. No. 00CH37119), Kuala Lumpur, Malaysia, 24–27 September 2000; Volume 2, pp. 421–424.
35. Menser, B.; Wien, M. Segmentation and tracking of facial regions in color image sequences. In Proceedings
of the Visual Communications and Image Processing 2000, Perth, Australia, 20–23 June 2000; Volume 4067,
pp. 731–740.
36. Kakumanu, P.; Makrogiannis, S.; Bourbakis, N. A survey of skin-color modeling and detection methods.
Pattern Recognit. 2007, 40, 1106–1122. [CrossRef]
37. Perimal, M.; Basah, S.N.; Safar, M.J.A.; Yazid, H. Hand-Gesture Recognition-Algorithm based on Finger
Counting. J. Telecommun. Electron. Comput. Eng. 2018, 10, 19–24.
38. Sulyman, A.B.D.A.; Sharef, Z.T.; Faraj, K.H.A.; Aljawaryy, Z.A.; Malallah, F.L. Real-time numerical 0-5
counting based on hand-finger gestures recognition. J. Theor. Appl. Inf. Technol. 2017, 95, 3105–3115.
39. Choudhury, A.; Talukdar, A.K.; Sarma, K.K. A novel hand segmentation method for multiple-hand gesture
recognition system under complex background. In Proceedings of the 2014 International Conference on
Signal Processing and Integrated Networks (SPIN), Noida, India, 20–21 February 2014; pp. 136–140.
40. Stergiopoulou, E.; Sgouropoulos, K.; Nikolaou, N.; Papamarkos, N.; Mitianoudis, N. Real time hand detection
in a complex background. Eng. Appl. Artif. Intell. 2014, 35, 54–70. [CrossRef]
41. Khandade, S.L.; Khot, S.T. MATLAB based gesture recognition. In Proceedings of the 2016 International
Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–27 August 2016; Volume 1,
pp. 1–4.
42. Karabasi, M.; Bhatti, Z.; Shah, A. A model for real-time recognition and textual representation of malaysian
sign language through image processing. In Proceedings of the 2013 International Conference on Advanced
Computer Science Applications and Technologies, Kuching, Malaysia, 23–24 December 2013; pp. 195–200.
43. Zeng, J.; Sun, Y.; Wang, F. A natural hand gesture system for intelligent human-computer interaction and
medical assistance. In Proceedings of the 2012 Third Global Congress on Intelligent Systems, Wuhan, China,
6–8 November 2012; pp. 382–385.
44. Hsieh, C.-C.; Liou, D.-H.; Lee, D. A real time hand gesture recognition system using motion history image.
In Proceedings of the 2010 2nd international conference on signal processing systems, Dalian, China, 5–7 July
2010; Volume 2, pp. V2–394.
45. Van den Bergh, M.; Koller-Meier, E.; Bosché, F.; Van Gool, L. Haarlet-based hand gesture recognition for 3D
interaction. In Proceedings of the 2009 Workshop on Applications of Computer Vision (WACV), Snowbird,
UT, USA, 7–8 December 2009; pp. 1–8.
46. Van den Bergh, M.; Van Gool, L. Combining RGB and ToF cameras for real-time 3D hand gesture interaction.
In Proceedings of the 2011 IEEE workshop on applications of computer vision (WACV), Kona, HI, USA,
5–7 January 2011; pp. 66–72.
47. Chen, L.; Wang, F.; Deng, H.; Ji, K. A survey on hand gesture recognition. In Proceedings of the 2013
International conference on computer sciences and applications, Wuhan, China, 14–15 December 2013;
pp. 313–316.
48. Shimada, A.; Yamashita, T.; Taniguchi, R. Hand gesture based TV control system—Towards both user-&
machine-friendly gesture applications. In Proceedings of the 19th Korea-Japan Joint Workshop on Frontiers
of Computer Vision, Incheon, Korea, 30 January–1 February 2013; pp. 121–126.
49. Chen, Q.; Georganas, N.D.; Petriu, E.M. Real-time vision-based hand gesture recognition using haar-like
features. In Proceedings of the 2007 IEEE instrumentation & measurement technology conference IMTC
2007, Warsaw, Poland, 1–3 May 2007; pp. 1–6.
50. Kulkarni, V.S.; Lokhande, S.D. Appearance based recognition of american sign language using gesture
segmentation. Int. J. Comput. Sci. Eng. 2010, 2, 560–565.
51. Fang, Y.; Wang, K.; Cheng, J.; Lu, H. A real-time hand gesture recognition method. In Proceedings of the
2007 IEEE International Conference on Multimedia and Expo, Beijing, China, 2–5 July 2007; pp. 995–998.
52. Licsár, A.; Szirányi, T. User-adaptive hand gesture recognition system with interactive training. Image Vis.
Comput. 2005, 23, 1102–1114. [CrossRef]
53. Zhou, Y.; Jiang, G.; Lin, Y. A novel finger and hand pose estimation technique for real-time hand gesture
recognition. Pattern Recognit. 2016, 49, 102–114. [CrossRef]
54. Pun, C.-M.; Zhu, H.-M.; Feng, W. Real-time hand gesture recognition using motion tracking. Int. J. Comput.
Intell. Syst. 2011, 4, 277–286. [CrossRef]
55. Bayazit, M.; Couture-Beil, A.; Mori, G. Real-time Motion-based Gesture Recognition Using the GPU.
In Proceedings of the MVA, Yokohama, Japan, 20–22 May 2009; pp. 9–12.
56. Molina, J.; Pajuelo, J.A.; Martínez, J.M. Real-time motion-based hand gestures recognition from time-of-flight
video. J. Signal Process. Syst. 2017, 86, 17–25. [CrossRef]
57. Prakash, J.; Gautam, U.K. Hand Gesture Recognition. Int. J. Recent Technol. Eng. 2019, 7, 54–59.
58. Xi, C.; Chen, J.; Zhao, C.; Pei, Q.; Liu, L. Real-time Hand Tracking Using Kinect. In Proceedings of the 2nd
International Conference on Digital Signal Processing, Tokyo, Japan, 25–27 February 2018; pp. 37–42.
59. Devineau, G.; Moutarde, F.; Xi, W.; Yang, J. Deep learning for hand gesture recognition on skeletal data.
In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition
(FG 2018), Xi’an, China, 15–19 May 2018; pp. 106–113.
60. Jiang, F.; Wu, S.; Yang, G.; Zhao, D.; Kung, S.-Y. Viewpoint-independent hand gesture recognition with Kinect.
Signal Image Video Process. 2014, 8, 163–172. [CrossRef]
61. Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Sign language recognition based on hand and body skeletal
data. In Proceedings of the 2018-3DTV-Conference: The True Vision-Capture, Transmission and Display of
3D Video (3DTV-CON), Helsinki, Finland, 3–5 June 2018; pp. 1–4.
62. De Smedt, Q.; Wannous, H.; Vandeborre, J.-P.; Guerry, J.; Saux, B.L.; Filliat, D. 3D hand gesture recognition
using a depth and skeletal dataset: SHREC’17 track. In Proceedings of the Workshop on 3D Object Retrieval,
Lyon, France, 23–24 April 2017; pp. 33–38.
63. Chen, Y.; Luo, B.; Chen, Y.-L.; Liang, G.; Wu, X. A real-time dynamic hand gesture recognition system
using Kinect sensor. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics
(ROBIO), Zhuhai, China, 6–9 December 2015; pp. 2026–2030.
64. Karbasi, M.; Muhammad, Z.; Waqas, A.; Bhatti, Z.; Shah, A.; Koondhar, M.Y.; Brohi, I.A. A Hybrid Method
Using Kinect Depth and Color Data Stream for Hand Blobs Segmentation; Science International: Lahore, Pakistan,
2017; Volume 29, pp. 515–519.
65. Ren, Z.; Meng, J.; Yuan, J. Depth camera based hand gesture recognition and its applications in
human-computer-interaction. In Proceedings of the 2011 8th International Conference on Information,
Communications & Signal Processing, Singapore, 13–16 December 2011; pp. 1–5.
66. Dinh, D.-L.; Kim, J.T.; Kim, T.-S. Hand gesture recognition and interface via a depth imaging sensor for smart
home appliances. Energy Procedia 2014, 62, 576–582. [CrossRef]
67. Raheja, J.L.; Minhas, M.; Prashanth, D.; Shah, T.; Chaudhary, A. Robust gesture recognition using Kinect:
A comparison between DTW and HMM. Optik 2015, 126, 1098–1104. [CrossRef]
68. Ma, X.; Peng, J. Kinect sensor-based long-distance hand gesture recognition and fingertip detection with
depth information. J. Sens. 2018, 2018, 5809769. [CrossRef]
69. Kim, M.-S.; Lee, C.H. Hand Gesture Recognition for Kinect v2 Sensor in the Near Distance Where Depth
Data Are Not Provided. Int. J. Softw. Eng. Its Appl. 2016, 10, 407–418. [CrossRef]
70. Li, Y. Hand gesture recognition using Kinect. In Proceedings of the 2012 IEEE International Conference on
Computer Science and Automation Engineering, Beijing, China, 22–24 June 2012; pp. 196–199.
71. Song, L.; Hu, R.M.; Zhang, H.; Xiao, Y.L.; Gong, L.Y. Real-time 3d hand gesture detection from depth images.
Adv. Mater. Res. 2013, 756, 4138–4142. [CrossRef]
72. Pal, D.H.; Kakade, S.M. Dynamic hand gesture recognition using Kinect sensor. In Proceedings of the 2016
International Conference on Global Trends in Signal Processing, Information Computing and Communication
(ICGTSPICC), Jalgaon, India, 22–24 December 2016; pp. 448–453.
73. Karbasi, M.; Bhatti, Z.; Nooralishahi, P.; Shah, A.; Mazloomnezhad, S.M.R. Real-time hands detection in
depth image by using distance with Kinect camera. Int. J. Internet Things 2015, 4, 1–6.
74. Bakar, M.Z.A.; Samad, R.; Pebrianti, D.; Aan, N.L.Y. Real-time rotation invariant hand tracking using 3D data.
In Proceedings of the 2014 IEEE International Conference on Control System, Computing and Engineering
(ICCSCE 2014), Batu Ferringhi, Malaysia, 28–30 November 2014; pp. 490–495.
75. Desai, S. Segmentation and Recognition of Fingers Using Microsoft Kinect. In Proceedings of the International
Conference on Communication and Networks, Paris, France, 21–25 May 2017; pp. 45–53.
76. Lee, U.; Tanaka, J. Finger identification and hand gesture recognition techniques for natural user interface.
In Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, Bangalore, India,
24–27 September 2013; pp. 274–279.
77. Bakar, M.Z.A.; Samad, R.; Pebrianti, D.; Mustafa, M.; Abdullah, N.R.H. Finger application using K-Curvature
method and Kinect sensor in real-time. In Proceedings of the 2015 International Symposium on Technology
Management and Emerging Technologies (ISTMET), Langkawai Island, Malaysia, 25–27 August 2015;
pp. 218–222.
78. Tang, M. Recognizing Hand Gestures with Microsoft’s Kinect; Department of Electrical Engineering of Stanford
University: Palo Alto, CA, USA, 2011.
79. Bamwenda, J.; Özerdem, M.S. Recognition of Static Hand Gesture with Using ANN and SVM. Dicle Univ.
J. Eng. 2019, 10, 561–568.
80. Tekin, B.; Bogo, F.; Pollefeys, M. H+O: Unified egocentric recognition of 3D hand-object poses and interactions.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–20 June 2019; pp. 4511–4520.
81. Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Self-supervised 3d hand pose estimation through training by fitting.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
15–20 June 2019; pp. 10853–10862.
82. Ge, L.; Ren, Z.; Li, Y.; Xue, Z.; Wang, Y.; Cai, J.; Yuan, J. 3d hand shape and pose estimation from a single rgb
image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 15–20 June 2019; pp. 10833–10842.
83. Taylor, J.; Bordeaux, L.; Cashman, T.; Corish, B.; Keskin, C.; Sharp, T.; Soto, E.; Sweeney, D.; Valentin, J.;
Luff, B. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and
correspondences. ACM Trans. Graph. 2016, 35, 1–12. [CrossRef]
84. Malik, J.; Elhayek, A.; Stricker, D. Structure-aware 3D hand pose regression from a single depth image.
In Proceedings of the International Conference on Virtual Reality and Augmented Reality, London, UK,
22–23 October 2018; pp. 3–17.
85. Tsoli, A.; Argyros, A.A. Joint 3D tracking of a deformable object in interaction with a hand. In Proceedings of
the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 484–500.
86. Chen, Y.; Tu, Z.; Ge, L.; Zhang, D.; Chen, R.; Yuan, J. So-handnet: Self-organizing network for 3d hand pose
estimation with semi-supervised learning. In Proceedings of the IEEE International Conference on Computer
Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6961–6970.
87. Ge, L.; Ren, Z.; Yuan, J. Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the
European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 475–491.
88. Wu, X.; Finnegan, D.; O’Neill, E.; Yang, Y.-L. Handmap: Robust hand pose estimation via intermediate
dense guidance map supervision. In Proceedings of the European Conference on Computer Vision (ECCV),
Munich, Germany, 8–14 September 2018; pp. 237–253.
89. Cai, Y.; Ge, L.; Cai, J.; Yuan, J. Weakly-supervised 3d hand pose estimation from monocular rgb images.
In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September
2018; pp. 666–682.
90. Alnaim, N.; Abbod, M.; Albar, A. Hand Gesture Recognition Using Convolutional Neural Network for
People Who Have Experienced A Stroke. In Proceedings of the 2019 3rd International Symposium on
Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 11–13 October 2019;
pp. 1–6.
91. Chung, H.; Chung, Y.; Tsai, W. An efficient hand gesture recognition system based on deep CNN.
In Proceedings of the 2019 IEEE International Conference on Industrial Technology (ICIT), Melbourne,
Australia, 13–15 February 2019; pp. 853–858.
92. Bao, P.; Maqueda, A.I.; del-Blanco, C.R.; García, N. Tiny hand gesture recognition without localization via a
deep convolutional network. IEEE Trans. Consum. Electron. 2017, 63, 251–257. [CrossRef]
93. Li, G.; Tang, H.; Sun, Y.; Kong, J.; Jiang, G.; Jiang, D.; Tao, B.; Xu, S.; Liu, H. Hand gesture recognition based
on convolution neural network. Cluster Comput. 2019, 22, 2719–2729. [CrossRef]
94. Lin, H.-I.; Hsu, M.-H.; Chen, W.-K. Human hand gesture recognition using a convolution neural network.
In Proceedings of the 2014 IEEE International Conference on Automation Science and Engineering (CASE),
Taipei, Taiwan, 18–22 August 2014; pp. 1038–1043.
95. John, V.; Boyali, A.; Mita, S.; Imanishi, M.; Sanma, N. Deep learning-based fast hand gesture recognition using
representative frames. In Proceedings of the 2016 International Conference on Digital Image Computing:
Techniques and Applications (DICTA), Gold Coast, Australia, 30 November–2 December 2016; pp. 1–8.
96. Wu, X.Y. A hand gesture recognition algorithm based on DC-CNN. Multimed. Tools Appl. 2019, 1–13.
[CrossRef]
97. Nguyen, X.S.; Brun, L.; Lézoray, O.; Bougleux, S. A neural network based on SPD manifold learning for
skeleton-based hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12036–12045.
98. Lee, D.-H.; Hong, K.-S. Game interface using hand gesture recognition. In Proceedings of the 5th International
Conference on Computer Sciences and Convergence Information Technology, Seoul, Korea, 30 November–2
December 2010; pp. 1092–1097.
99. Gallo, L.; Placitelli, A.P.; Ciampi, M. Controller-free exploration of medical image data: Experiencing the
Kinect. In Proceedings of the 2011 24th International Symposium on Computer-Based Medical Systems
(CBMS), Bristol, UK, 27–30 June 2011; pp. 1–6.
100. Zhao, X.; Naguib, A.M.; Lee, S. Kinect based calling gesture recognition for taking order service of elderly
care robot. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive
Communication, Edinburgh, UK, 25–29 August 2014; pp. 525–530.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).