
Pattern Recognition 61 (2017) 139–152

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

An adaptive local binary pattern for 3D hand tracking

Joongrock Kim a, Sunjin Yu b, Dongchul Kim c, Kar-Ann Toh a, Sangyoun Lee a,*

a Department of Electrical and Electronic Engineering, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea
b Department of Broadcasting and Film, Cheju Halla University, 38, Halladaehak-ro, Jeju-si, Jeju-do, South Korea
c Department of Computer Science, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea

ARTICLE INFO

Article history:
Received 5 June 2013
Received in revised form 25 July 2016
Accepted 26 July 2016
Available online 27 July 2016

Keywords:
3D hand tracking
Hand gesture recognition
Human computer interaction
Natural user interface

ABSTRACT

Ever since the availability of real-time three-dimensional (3D) data acquisition sensors such as the time-of-flight camera and the Kinect depth sensor, the performance of gesture recognition has been largely enhanced. However, since conventional two-dimensional (2D) image based feature extraction methods such as the local binary pattern (LBP) generally use texture information, they cannot be applied to depth or range images, which do not contain texture information. In this paper, we propose an adaptive local binary pattern (ALBP) for effective depth image based applications. In contrast to the conventional LBP, which is only rotation invariant, the proposed ALBP is invariant to both rotation and depth distance in range images. Using ALBP, we can extract object features without using texture or color information. We further apply the proposed ALBP to hand tracking using depth images to show its effectiveness and usefulness. Our experimental results validate the proposal.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

A natural user interface (NUI) for human–computer interaction (HCI) has gained much attention recently, where its main goal is to recognize natural motion without any previous learning [1–5]. Among existing NUI techniques, hand gesture recognition is recognized as one of the most effective ways to convey a user's intention to the computer [6,7].

Existing 2D image based hand gesture recognition systems use a variety of cue information such as color and texture [8–11]. However, the performance of these systems is affected by external conditions such as illumination and complex backgrounds. In addition, the motion space for gesture recognition is limited by the camera's inability to perceive any change in visual depth.

Recently, 3D depth sensors which provide depth images with 3D scene information in real time have been introduced into the commercial market (e.g. [22,23]). In terms of environmental illumination variation and background complexity, depth images taken by a 3D depth sensor are not affected by these external factors, in contrast to RGB images. In addition, not only can the 3D shape of an object be analyzed directly, but any change in the depth of an object can also be perceived via depth images [24–26].

Although a 3D depth image based system can overcome the limitations of 2D image based systems, few studies on NUI using 3D information can be found [19–21,31]. Conventional feature extraction methods such as the local binary pattern (LBP) [27,28] cannot be directly applied to extract useful features due to the lack of texture information in depth images. In addition, many feature extraction methods such as PCA and LDA [29,30] are not invariant to depth changes in the scene.

In this paper, we propose a novel feature extraction technique, called adaptive local binary pattern (ALBP), for depth image based applications. In contrast to the conventional LBP, which is only rotation invariant, the proposed ALBP is invariant to both rotation and depth distance in range images. Using ALBP, we can extract object features without using texture or color information. Moreover, the proposed ALBP can verify detected features without the need of a classifier, as required in typical applications of LBP. Consequently, we apply the proposed ALBP to hand tracking using depth images to show its effectiveness and usefulness.

The paper is organized as follows. In the next section, we briefly review related existing hand tracking methods. In Section 3, we introduce the proposed ALBP and ALBP based hand tracking using depth images. Section 4 presents an extensive performance comparison with state-of-the-art hand tracking methods. Finally, our conclusion is given in Section 5.

* Corresponding author.
E-mail addresses: jurock@yonsei.ac.kr (J. Kim), sjyu@chu.ac.kr (S. Yu), dckim@msl.yonsei.ac.kr (D. Kim), katoh@yonsei.ac.kr (K.-A. Toh), syleee@yonsei.ac.kr (S. Lee).

http://dx.doi.org/10.1016/j.patcog.2016.07.039
0031-3203/© 2016 Elsevier Ltd. All rights reserved.

Table 1
Solutions for the main problems of object tracking.

Solution    Similar color objects   Complex background   No motion   Distance variation
Color                               √
Model       √                                            √
Motion      √                       √
Hybrid      √                       √                    √
3D          √                       √                    √           √

Table 2
State-of-the-art literature survey of LBP based object tracking.

Method   LBP feature     Additional features
[51]     LBP histogram   MeanShift + particle filter
[52]     LBP histogram   MeanShift
[53]     LBP histogram   Skin color + particle filter
[54]     LBP histogram   MeanShift + color histogram + expectation–maximization (EM)
[55]     LBP histogram   Color + particle filter
[56]     LBP histogram   Color + motion
[57]     LBP texture     CamShift
[58]     LBP texture     3D rotation model

2. Related works

2.1. Previous hand tracking methods

A variety of information has been employed to achieve reliable hand detection and tracking for HCI. Generally, two types of information have been adopted: 2D image based and 3D depth image based.

Table 1 summarizes the cue information adopted in existing tracking schemes and the corresponding problems they tackle. As seen in the table, one of the most commonly used approaches for 2D hand tracking is to use color information [8]. Although color information is a relatively reliable cue for hand tracking, it has difficulty when a background object has a color similar to skin color. Therefore, motion information is frequently combined with color information in order to deal with background noise such as the face behind the hand [9–11]. Also, pre-defined 2D and 3D hand models are frequently found in hand tracking to tackle variations in illumination, pose and occlusion [12–14]. Moreover, hybrid methods combining several hand tracking algorithms are currently a focused research topic [15–18] (Table 2).

Recently, 3D depth images including distance information have become available from time-of-flight (ToF) cameras or the Kinect sensor in real time [22,23]. This promotes wide applications of 3D tracking and gesture recognition such as interactive TV and games. In [19], a click event is used as initialization to activate a hand tracking algorithm. In [20], an approach for head and hand tracking including hand posture is proposed based on range or depth images. In addition, a simple method for hand detection and tracking using a Kalman filter in depth images from Kinect is proposed in [21] (Table 3).

Apart from the above problems in object tracking, there are other challenges particular to our hand tracking application. Firstly, the hands can be occluded by other objects such as a person standing right in front of the sensor. Secondly, the location of the hands may not be tracked accurately due to fast movement. Thirdly, the hand may hold an object, which can affect tracking. We will discuss the process which can deal with such situations in Section 3.

2.2. Local binary patterns based hand tracking

Local binary patterns (LBP) is an effective method to extract powerful features for texture classification [27,28]. In particular, it has been successfully adopted for object tracking. To improve the tracking performance, additional features such as color and motion are combined with the LBP feature. Although color feature based tracking methods such as MeanShift [51,52,54], CamShift [57] and skin color detection [53] are used to detect regions of interest, they are much affected by ambient lighting conditions. Hence, the LBP feature, which is robust to light variations, is added to color features in object tracking [51–54,57]. LBP texture is also combined with a 3D rotation model [58] and with motion features [56].

Table 3
State-of-the-art literature survey of hand detection and tracking using depth images.

Method       Hand detection                     Hand tracking
[40]         Depth threshold                    Closest pixel + major axis
[41], [42]   Depth threshold (bounding box)
[43]         Depth threshold                    Skin color detection
[44]         Contour features                   No tracking
[45]         Size filtering                     Depth threshold + size filtering
[46]         Kinect SDK                         Kinect SDK
[47]         OpenNI                             OpenNI

2.3. Depth image based hand tracking

Many depth image based hand gesture recognition technologies have been proposed in the literature [40–47] for touchless interfaces with devices such as TVs, PCs and gaming devices. In these applications, it is important to detect an accurate hand location to achieve reliable recognition performance. Ever since the introduction of range cameras such as the Kinect sensor and ToF cameras, it has become possible to recognize natural hand motions precisely in 3D space.

Many researchers have been active in developing hand detection and tracking techniques utilizing depth images acquired from range cameras. At the first stage, simple depth threshold methods on depth images were adopted for hand detection and tracking [40–42]. Since the distance information can be used to analyze the depth images, it is easy to detect objects within a certain range. Although such approaches are simple and fast, the tracking results are noisy and imprecise. At the second stage, a combination of depth thresholding with other features such as skin color, contour and size information was proposed to make up for the weakness of the depth threshold method alone [43–45]. However, such a combined method usually requires additional computational resources. Moreover, it has yet to achieve the desired reliable performance. On the other hand, many hand gesture recognition algorithms have adopted open libraries such as the Kinect SDK and OpenNI and show relatively reliable performance [46,47]. However, it is not documented which type of algorithm has been adopted for hand detection and tracking.

PrimeSense has provided two kinds of software, OpenNI and NITE [32]. First, OpenNI is the driver used to capture the depth data from Kinect. NITE is the middleware module based on OpenNI for hand tracking and body tracking. Since NITE has many

benefits, minimal CPU load, and multiplatform support, many developers have been trying to apply it for natural hand-based control or full-body control. However, there are major limitations in developing NITE based natural user interfaces. Above all, since NITE is not provided as open source code, it cannot be modified for developer based applications. Also, it must be used with OpenNI, so it cannot be used with depth sensors other than Kinect. However, since it supplies a reliable and robust hand tracker, we choose NITE as the major competitor.

3. Proposed methodology

In this section, we propose an effective feature extraction technique, called adaptive local binary pattern (ALBP), for depth image based applications. Subsequently, we apply the proposed ALBP to hand tracking using depth images. We achieve a reliable 3D tracking performance which is invariant to different depth changes and hand movements. Fig. 1 shows an overview of the proposed system.

3.1. Proposed adaptive local binary pattern for depth image based applications

The conventional local binary pattern (LBP) is a fast and effective feature extraction method for texture based classification of gray-scale images [27,28]. However, the LBP cannot be applied to recognize target objects in depth images, since depth images do not contain any texture information. In addition, although LBP is invariant to image rotation, it is not invariant to changes in depth distance. Moreover, an additional classification stage using conventional classifiers such as LDA [38] and SVM [39] is required to classify features extracted by LBP in applications such as object recognition and object detection.

Here, we propose an adaptive local binary pattern (ALBP) to extract useful features from depth images. The proposed ALBP can extract features which are invariant to rotation and depth variation without compromising speed. In addition, using ALBP, we can verify the shape of objects directly without the need of any classifier.

Essentially, the proposed ALBP consists of three stages. Firstly, a regression function is estimated to set the radius of the ALBP in depth images. Next, a rotation and depth invariant ALBP is proposed using the estimated regression function. Finally, a fast version of ALBP is proposed to reduce the processing time.

3.1.1. Estimation of regression function in depth images

Firstly, we estimate the size of a hand projected on depth images according to the depth between the sensor and the hand. The purpose is to determine an adaptive size of the ALBP according to depth information. Since the pixel value in depth images represents the depth distance or range, we estimate the relation between the size of the hand and the pixel value through a polynomial regression function as follows:

    r(g) = β_0 + β_1 g + β_2 g^2 + ⋯ + β_N g^N                      (1)

where g is the pixel intensity of a depth image, r(g) is the width size with respect to g, β_k (k = 0, 1, …, N) are the regression coefficients and N is the order of the regression function.

From K observations of g and r, a cumulated matrix can be written as

    ⎡ r_1 ⎤   ⎡ 1  g_1  g_1^2  ⋯  g_1^N ⎤ ⎡ β_0 ⎤
    ⎢ r_2 ⎥ = ⎢ 1  g_2  g_2^2  ⋯  g_2^N ⎥ ⎢ β_1 ⎥
    ⎢  ⋮  ⎥   ⎢ ⋮               ⋱   ⋮  ⎥ ⎢  ⋮  ⎥
    ⎣ r_K ⎦   ⎣ 1  g_K  g_K^2  ⋯  g_K^N ⎦ ⎣ β_N ⎦                   (2)

In matrix notation, we can re-write it as

    R = GB                                                           (3)

where R is the column vector [r_1, r_2, …, r_K]^T, B is the column vector [β_0, β_1, β_2, …, β_N]^T and G is the matrix connecting R and B.

The coefficient vector B can be obtained using least squares parameter estimation as:

    B̂ = (G^T G)^(−1) G^T R                                          (4)

Finally, we can find the fitted regression function for the size of the ALBP as follows:

    r̂(g) = β̂_0 + β̂_1 g + β̂_2 g^2 + ⋯ + β̂_N g^N                  (5)

where β̂_i (i = 0, 1, …, N) are the estimated regression coefficients.
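As an illustration of Eqs. (1)–(5), the following sketch fits the polynomial by least squares. It is not the authors' implementation; it assumes NumPy is available and that paired measurements of depth pixel intensity g and measured hand width r (in pixels) have been collected beforehand:

```python
import numpy as np

def fit_albp_radius(g, r, order=5):
    """Fit r(g) = b0 + b1*g + ... + bN*g^N by least squares (Eqs. (2)-(4)).

    g : 1-D array of depth pixel intensities (K observations)
    r : 1-D array of measured hand widths in pixels (K observations)
    """
    g = np.asarray(g, dtype=float)
    r = np.asarray(r, dtype=float)
    # K x (N+1) matrix G of Eq. (2): one row [1, g, g^2, ..., g^N] per observation.
    G = np.vander(g, N=order + 1, increasing=True)
    # B = (G^T G)^-1 G^T R of Eq. (4); lstsq solves the same problem more stably.
    B, *_ = np.linalg.lstsq(G, r, rcond=None)
    return B

def predict_radius(B, g):
    """Evaluate the fitted regression r^(g) of Eq. (5) at a pixel intensity g."""
    return sum(b * g ** k for k, b in enumerate(B))

# Hypothetical usage with made-up observations (not the paper's data):
# B = fit_albp_radius([800, 1500, 2500], [90, 45, 25])
# radius = predict_radius(B, 1200)
```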

Fig. 1. An overview of the proposed system.



Fig. 2. Examples of adaptive local binary pattern with different I and r.

3.1.2. Proposed adaptive local binary pattern

In this stage, a texture T is first defined with respect to a pixel and its local neighborhood of radius r as the joint distribution of the gray levels of I (I > 0) image pixels, as shown in Fig. 2:

    T_r(g_c) = t(g_0, g_1, g_2, …, g_{I−1})                          (6)

where g_c is the image intensity of the center pixel of the ALBP, and g_i (i = 0, 1, …, I−1) are the pixel intensities of the circular neighborhood around the center pixel. I is the number of circular neighborhood points of the ALBP.

If the value of the center pixel is subtracted from the values of the neighbors, the local texture can be represented as a joint distribution of the differences:

    T_r(g_c) = t(g_0 − g_c, g_1 − g_c, g_2 − g_c, …, g_{I−1} − g_c)   (7)

To separate an object from the background in a depth image, only the signs of the differences with respect to a threshold are considered:

    T_r(g_c) = t(s(g_0 − g_c), s(g_1 − g_c), …, s(g_{I−1} − g_c))     (8)

where

    s(x) = 1 if x ≥ threshold,  0 if x < threshold.

A binomial weight 2^i is assigned to each sign s(g_i − g_c), transforming the differences in a neighborhood into a unique ALBP code. The ALBP_{I,r} operator can be presented as:

    ALBP_{I,r}(x_c, y_c) = Σ_{i=0}^{I−1} s(g_i − g_c) 2^i            (9)

where i is the pattern index, (x_c, y_c) is the center position of the pattern, g_c is the pixel value at the center position and g_i is the value at index i.

To achieve invariance with respect to image rotation, each ALBP binary code must be shifted to a reference code, namely the minimum code obtained by circularly shifting the original code. This transformation can be written as:

    ALBP_{I,r} = min { ROR(ALBP_{I,r}, k) | k = 0, 1, …, I−1 }        (10)

where the function ROR(x, k) performs a circular bitwise right shift on the I-bit binary number x by k positions. The ROR operation is defined as:

    ROR(ALBP_{I,r}, k) = Σ_{i=k}^{I−1} s(g_i − g_c) 2^{i−k} + Σ_{i=0}^{k−1} s(g_i − g_c) 2^{I−k+i}    (11)

To achieve invariance with respect to depth, the radius of the ALBP is set using r(g_c) determined in the previous stage, and this gives rise to

    ALBP^rdi_{I, r(g_c)} = min { ROR(ALBP_{I, r(g_c)}, k) | k = 0, 1, …, I−1 }    (12)

where r(g_c) is the radius estimated by the regression. The size of the pattern is thus adaptively determined by g_c, the pixel value in the depth image, which reflects the distance between the sensor and the object.

3.1.3. A fast version of ALBP

Although the computational cost of the above ALBP is not too heavy for real-time application, we nevertheless propose a technique to speed up the process. The following two quantities are adopted for computational speed-up:

• The pattern of signs of the differences with respect to a threshold in Eq. (8) (e.g. t(1,1,1,1,1,1,1) and t(1,1,1,1)).
• The number of transitions, i.e. the bit changes from 1 to 0 or from 0 to 1, which can be calculated as

    Σ_{i=0}^{I−2} p(s_i − s_{i+1}) + p(s_0 − s_{I−1})                 (13)

  where s_i denotes s(g_i − g_c) and

    p(x) = 0 if x = 0,  1 otherwise.

By adopting these two quantities, there is no need to find the circularly shifted minimum value of the ALBP. That is, the original ALBP finds the minimum binary code through I ROR operations (a total of I^2 summations). In the fast version of ALBP, we instead calculate the number of transitions, which needs only I summations.

In theory, the complexity of the fast version of ALBP, as well as that of the original ALBP, can be given directly. Given N data and s samples, the original ALBP has complexity O(N·s^2) in big-O notation, while the calculation time of the fast version is proportional to N·s, i.e. O(N·s), as a result of checking only the number of transitions. Therefore, it requires less computational cost than the original ALBP.
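A minimal Python sketch of the operators in Eqs. (8)–(13) is given below. It is illustrative only and assumes the neighborhood intensities g_0, …, g_{I−1} have already been sampled on the circle of radius r(g_c):

```python
def sign(x, threshold):
    """s(x) of Eq. (8): 1 if the difference reaches the threshold, else 0."""
    return 1 if x >= threshold else 0

def albp_code(gc, neighbors, threshold):
    """Eq. (9): binomially weighted sum of the thresholded differences."""
    return sum(sign(gi - gc, threshold) << i for i, gi in enumerate(neighbors))

def ror(code, k, bits):
    """Eq. (11): circular bitwise right shift of an I-bit code by k positions."""
    mask = (1 << bits) - 1
    return ((code >> k) | (code << (bits - k))) & mask

def albp_rdi(gc, neighbors, threshold):
    """Eqs. (10) and (12): rotation (and depth) invariant code, i.e. the
    minimum code over all circular shifts."""
    bits = len(neighbors)
    code = albp_code(gc, neighbors, threshold)
    return min(ror(code, k, bits) for k in range(bits))

def num_transitions(gc, neighbors, threshold):
    """Eq. (13): number of 0/1 changes along the circular sign pattern,
    used by the fast version of ALBP."""
    s = [sign(gi - gc, threshold) for gi in neighbors]
    return sum(s[i] != s[(i + 1) % len(s)] for i in range(len(s)))
```

Applied to the worked example of Section 3.1.4 below (center value 25, neighbors 87, 195, 200, 194, 193, 201, 27, 73), these functions yield the minimum codes 00011111b and 01111111b and two transitions for thresholds of 70 and 30 respectively, matching the values reported there.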

Fig. 3. An example for the ALBP feature extraction.

Fig. 4. The proposed hand detection consists of hand candidates detection and candidates verification.

3.1.4. An example

As an example, a depth image is shown in Fig. 3. The ALBP consists of 8 points (g_0, g_1, g_2, …, g_7), and each pixel intensity is as below. The radius of the pattern is determined using r(g_c), which is r(25):

g_c   g_0   g_1   g_2   g_3   g_4   g_5   g_6   g_7
25    87    195   200   194   193   201   27    73

Then, each neighborhood pixel value is subtracted by the center pixel value (g_c):

g_0−g_c   g_1−g_c   g_2−g_c   g_3−g_c   g_4−g_c   g_5−g_c   g_6−g_c   g_7−g_c
62        170       175       169       168       176       2         48

• Assuming a threshold value of 70 is selected, the results are as follows:
  ○ The binary code of ALBP by signed difference is 01111100b.
  ○ The binary code of ALBP^rdi is 00011111b.
  ○ The number of transitions is 2 (s_0 to s_1 and s_5 to s_6).
• Assuming another threshold value of 30, the results are as follows:
  ○ The binary code of ALBP by signed difference is 11111101b.
  ○ The binary code of ALBP^rdi is 01111111b.
  ○ The number of transitions is 2 (s_5 to s_6 and s_6 to s_7).

The results of ALBP thus depend on the radius of the pattern and the threshold value applied to the signs of difference. That is, we can find and verify a specific shape that we want to extract from a depth image by selecting the radius of the pattern and the threshold value.

Fig. 5. The results of ALBP with assumptions for hand detection: (a) there is no other object within 30 cm, and (b) the body is seen behind the hand within 70 cm; white and black points represent 0 and 1 of the signs of difference, respectively.

3.2. Proposed depth image based 3D hand tracking

A depth image contains range information, but not texture and color information. It is difficult to detect and trace an object without such information. Using the proposed ALBP, we can extract and verify the shape of target objects in depth images at the same time. In this paper, we apply the proposed ALBP to detect and trace the position of hands, as a specific object, in depth images.

Firstly, detection of an initial position to be tracked in depth images is introduced. Based on the detected initial hand position, the ALBP based hand tracking is then performed. Subsequently, a trajectory filtering to reduce the effect of depth noise is adopted using an Unscented Kalman Filter (UKF) [35]. Finally, a process for handling tracking failure and occlusion by other objects is proposed.

3.2.1. ALBP based hand detection in depth images

We use the user's hand, reaching toward the sensor, as the initial (focus) cue for hand detection. The proposed hand detection is divided into two steps: hand candidates detection and candidates verification, as shown in Fig. 4.

In the hand candidates detection step, we search all positions of hand candidates in depth images. We assume that there is nothing nearby the hand within a 30 cm range when the user reaches out his hand toward the sensor, as shown in Fig. 5(a).

To find hand candidates in a depth image, two sizes of ALBP are used, with radii selected as 1.5 × r(g) and 2.0 × r(g) and a threshold of 30 cm. When the user reaches his hand toward the sensor, the center of the ALBP is set according to the hand position in the depth image, where the pixel value at the center of the ALBP has a smaller depth value than those of its neighbors. According to the proposed methodology, the observed ALBP values of reaching hand candidates should satisfy the following conditions:

• ALBP^rdi_{I, 1.5×r(g_c)} = 11111111b = 255d, when I is 8.
• ALBP^rdi_{I, 2.0×r(g_c)} = 11111111b = 255d, when I is 8.
• All signs of difference are "1".
• The number of transitions is "0".

Next we search for those positions which satisfy the above conditions over all regions of the depth image as the hand candidates, as shown in Fig. 5(a).

In the candidates verification step, we verify whether the detected candidate points are the real user's hand. We assume that the user's body always exists within 70 cm behind the hand during the reaching motion, as shown in Fig. 5(b). We apply the ALBP with radius 1.5 × r(g) and a threshold value of 70 cm. Accordingly, the ALBP results of real user's hands should satisfy the following conditions:

• ALBP^rdi_{I, 1.5×r(g_c)} = 00111111b, 00011111b, …, or 00000011b, when I is 8.
• The number of signs of difference whose value is "0" is larger than or equal to I/4 and smaller than or equal to 3I/4.
• The number of transitions of ALBP^rdi is "2".

Fig. 5(b) shows the result of ALBP in the candidates verification step. We finally decide a real hand's location which is detected at the same location continuously during the previous 5 frames. To reduce the processing time, a skip mode that searches the coordinates at intervals of 2 pixels along the x- and y-axes is used in the detection step.

3.2.2. ALBP based hand tracking in depth images

Based on the detected hand location, hand tracking is initiated to estimate and track the hand's location rapidly and precisely. The hand tracking can be divided into three steps as shown in Fig. 6: (1) update of search range, (2) extraction of hand features and (3) selection of a point to be tracked.

Firstly, in the update of search range step, we need to set a search range for fast estimation of hand locations. The search ranges of the x and y coordinates are set based on 6 × r(g), since we assume that the maximum speed of hand movement is about 0.6 m/frame. The distance range of the z coordinate is set as ±15 cm.

Fig. 7. Search range for multiple hand tracking: (1) original search range, (2) search range for another hand tracking.

In the case of multiple hands tracking, we decrease the search ranges of the x and y coordinates to 3 × r(g). We make the new search range for another hand excluded from the original search region to avoid interference between the two hands, as shown in Fig. 7.

In the extraction of hand features step, hand feature points are extracted by ALBP within the search range. The detected region in the search range includes not only the real hand, but also non-hand regions such as the forearm. We apply the ALBP, with size r(g) and a threshold of 10 cm, to verify real hand-shaped features including fingertips in depth images. The ALBP results for real hands have the following characteristics:

• ALBP^rdi_{I, r(g_c)} = 11111111b, 01111111b, …, or 00011111b, when I is 8.
• The number of signs of difference whose value is "0" is less than or equal to I/4.
• The number of transitions of ALBP^rdi is "0" or "2".

Those feature points that satisfy the above conditions are extracted in the search region, as shown in Fig. 8.

In the selection of hand position step, we select a point to be tracked from the extracted feature points. Since the center of the extracted feature points might be a non-hand location such as the empty region between fingers, the tracking point should be selected on the hand itself. The nearest extracted point to the center point is the most suitable point to be tracked, since it is least affected by noise in the depth image.

Finally, the search region is updated and the process of steps 1 through 3 is repeated during tracking.
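The detection and verification rules of Section 3.2.1 can be written directly in terms of the thresholded signs and transition counts. The fragment below is an illustrative restatement of the listed conditions, not the authors' code; the ring arrays are assumed to hold the neighborhood depth values already sampled at the stated radii, in the same units as the thresholds:

```python
def signs(gc, ring, threshold):
    """Thresholded signs s(g_i - g_c) of Eq. (8) for one neighborhood ring."""
    return [1 if gi - gc >= threshold else 0 for gi in ring]

def transitions(s):
    """Number of 0/1 changes along the circular sign pattern, Eq. (13)."""
    return sum(s[i] != s[(i + 1) % len(s)] for i in range(len(s)))

def is_hand_candidate(gc, ring_1_5, ring_2_0, threshold=30):
    """Hand candidates detection: on both rings (radii 1.5*r(gc) and 2.0*r(gc))
    every neighbor must lie beyond the 30 cm threshold, i.e. all signs are 1
    and there are no transitions (ALBP^rdi = 11111111b = 255 for I = 8)."""
    return all(all(s) and transitions(s) == 0
               for s in (signs(gc, ring_1_5, threshold),
                         signs(gc, ring_2_0, threshold)))

def is_verified_hand(gc, ring_1_5, threshold=70):
    """Candidates verification: with a 70 cm threshold the body behind the hand
    produces a run of 0-signs, so between I/4 and 3I/4 of the signs are 0 and
    the circular pattern has exactly two transitions."""
    s = signs(gc, ring_1_5, threshold)
    zeros = s.count(0)
    return len(s) // 4 <= zeros <= 3 * len(s) // 4 and transitions(s) == 2
```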

Fig. 6. Overview of proposed hand tracking.



Fig. 8. Some examples of ALBP for the real hands in depth images.

3.2.3. Trajectory filtering

There are two issues regarding hand gesture recognition even for simple hand movements: (1) stabilization of depth noise and (2) fast movement of the hand.

The edge of the hand in a depth image changes continuously due to noise even when the hand does not move. This can influence the performance of gesture recognition, since the trajectory contains noise arising from the depth image. Therefore, the trajectory of hand tracking should be filtered such that it is not affected by small movements of the hand. On the other hand, there is not enough trajectory data when the hand moves fast. That is, the sparse trajectory data caused by fast hand movement need to be filled in by interpolation.

Since the position of hand tracking in depth images can be formulated as an estimation problem, the formulation can be implemented using an Extended Kalman Filter (EKF) or an Unscented Kalman Filter (UKF) [35,36]. In this paper, we adopted the UKF for smoothing the trajectory of the hand, since the UKF shows better performance than the EKF based on our observations.

3.2.4. Handling tracking failure and occlusion

Tracking failure can occur because of fast movement or occlusion by unexpected objects. Moreover, the situation where no hand can be found within the depth image might occur, which means that the user wants to stop hand tracking. Upon tracking failure, we include a detection mode to re-start tracking and a tracking mode to track the hand after detection.

First, we keep the tracking mode with respect to the last detected region before losing the hand for 30 frames, to re-track a hand which comes back into the search region. Next, the detection mode is activated to re-start tracking. We assume that occluding objects generally exist beyond the search region due to the short distance between the hand and the sensor. In the unlikely event of occlusion, we consider it a tracking failure, and a new round of hand detection takes place.

The assumption of no occlusion within the 30 cm range was made for application scenarios such as user–PC interfaces, user–TV interfaces, and gaming, where an indoor and somewhat controlled environment can be assumed. In the unexpected event of occlusion by small objects occurring between the hand and the sensor, the system stops tracking and a fresh initialization takes place to restart tracking if the hand is detected again. Similarly, tracking ceases if the center of tracking is corrupted by noise, and the system enters hand detection mode to restart tracking once a hand is detected.

4. Experiments

In this study, we use a Kinect depth sensor which captures RGB and depth images of 640 × 480 at 30 fps. The data acquisition is implemented using Open Natural Interaction (OpenNI) [34], while the other modules, including detection and tracking, have been implemented in C on a machine with a 3.93 GHz Intel Core i7 870 and 4 GB of physical memory. The number of ALBP neighborhood points is set at 16 (I = 16), which was selected based on the most effective performance in terms of tracking accuracy and processing time.

In this section, we perform four experiments to evaluate the proposed methods, as summarized and enumerated in Table 4.

Table 4
Goal and performance measure for each experiment.

Experiment   Goals                                                  Performance measure
E1           Fitting regression function                            Root mean square (RMS) error
E2           Hand detection accuracy                                Detection rate
E3           (a) Evaluate the tracking performance in 2D space      RMS error, t-test
             (b) Evaluate the tracking performance in 3D space      RMS error, t-test
E4           Process time                                           Process time (ms)

4.1. Accuracy of regression function

To adapt the size of the ALBP according to the distance information in depth images, we need to estimate the relation between the size of the ALBP and the hand distance. We use a polynomial regression function to map this relationship. For initialization, we assume the size of the hand to be around 20 cm for constructing the regression function. Then, we measure the pixel size in the depth image with respect to the initial 20 cm object within the range of 60–750 cm at intervals of 20 cm, as shown in Fig. 9.

The errors, in terms of pixel difference between the ground truth and the fit result of the regression function, are shown in Fig. 10.

In Table 5, we show regressions using several polynomial orders to estimate the hand size. Based on this experimental observation, a 5th order regression function is seen to estimate the hand size well. The estimated polynomial regression function used in this study is r̂(x) = 327.9824 − 3.2828x + 0.0152x^2 − 3.5441×10^(−5)x^3 + 3.9893×10^(−8)x^4 − 1.731×10^(−11)x^5.

In order to cater for images with substantial regional variations, we performed an additional experiment on regression based on 9 image regions, as shown in Fig. 11. The regressions based on the 9 image regions are shown in Table 6. We also set the order of these regression functions to 5.

4.2. ALBP based hand detection

In this experiment, we evaluate the accuracy of hand detection according to its range distance. The range distance of the hand for evaluation is set from 1 m to 7 m at intervals of 50 cm. This range is selected according to our target applications such as user–PC interfaces, user–TV interfaces, and gaming. We evaluate detection accuracy using precision and recall based on the classical true positive (TP), false positive (FP) and false negative (FN) counts:

    Precision = TP / (TP + FP)                                       (14)

Fig. 9. Sample images used in estimating regression function according to range distance: left, middle and right images show an object at 60 cm, 400 cm and 700 cm
respectively.

    Recall = TP / (TP + FN)                                          (15)

Table 7 shows the detection performance over several range distances. Each detection rate is calculated based on the result obtained from 2000 attempts, wherein 20 repeated hand detections are taken from 100 people standing at different distances from the sensor. This table shows deterioration of the detection rate beyond 4.5 m due to the effective range of the Kinect sensor. The result shows good detection performance at distances of 1–4 m, with a 100% recall rate. However, the detection rate rapidly decreases when the distance is over 4.5 m, since the size of the hand is too small to be recognized in the depth image. To reduce false detections, we take the range of 0.5–4 m as the reliable distance for hand detection in this study.

4.3. ALBP based hand tracking

In this experiment, we use two kinds of measures to compare the proposed hand tracking with other state-of-the-art hand tracking methods:

• The average RMS error: the root mean square (RMS) error between the ground truth and the estimated tracking position.
• Paired t-test: this checks whether the difference in performance is statistically significant. The null hypothesis that the compared tracking positions are similar is rejected at a significance level of 0.05.

To verify the robustness of the proposed hand tracking under different ranges of operation and different hand movements, we made a data set based on 100 identities, each with five gestures at different standing distances (1 m, 2 m and 3 m), as shown in Fig. 12. Each gesture is carried out 5 times repeatedly.

The results filtered by UKF are compared with the results filtered by EKF to see the impact of trajectory smoothing relative to the ground truth.

The tracking accuracy is compared with several existing hand trackers, namely:

• NITE, which is PrimeSense's Natural Interaction Technology for End-users [32].
• CamShift, which is an object tracking method based on color information [33].
• 3D hand tracking using a Kalman filter in depth space [21].
• Fixed-sized ALBP based tracking, where the target size is not adjusted.

Table 8 lists the abilities of each hand tracking method in 2D and 3D space.

The ground truth is manually selected as the 1/3 position from the tip of the hand, as shown in Fig. 13.
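A sketch of how the two evaluation measures above can be computed is shown below. It assumes per-frame tracked and ground-truth positions are available as arrays, and it uses SciPy for the paired t-test; this is one common way to set up the paired comparison, given for illustration only:

```python
import numpy as np
from scipy import stats

def rms_error(tracked, ground_truth):
    """Average RMS error between tracked and ground-truth positions
    (one row per frame, columns are coordinates)."""
    tracked = np.asarray(tracked, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean(np.sum((tracked - ground_truth) ** 2, axis=1)))

def compare_trackers(errors_a, errors_b, alpha=0.05):
    """Paired t-test on per-sequence errors of two trackers.
    A p-value below alpha rejects the null hypothesis that the two
    trackers perform equally, at the chosen significance level."""
    t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
    return t_stat, p_value, p_value < alpha
```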

Fig. 10. Error between ground truth and result of regression function according to polynomial order.

Table 5
Regressions using several polynomial orders to estimate the hand size.

Polynomial order   Regression
2nd   r̂(x) = 160.3399 − 0.5226x + 0.0005x^2
3rd   r̂(x) = 217.5454 − 1.1755x + 0.0023x^2 − 1.5597×10^(−6)x^3
4th   r̂(x) = 270.6893 − 2.0503x + 0.0066x^2 − 9.4003×10^(−6)x^3 + 4.8399×10^(−9)x^4
5th   r̂(x) = 327.9824 − 3.2828x + 0.0152x^2 − 3.5441×10^(−5)x^3 + 3.9893×10^(−8)x^4 − 1.731×10^(−11)x^5
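The tabulated regressions can be evaluated directly; the short sketch below uses the 5th order coefficients from Table 5 with NumPy (the distances are example inputs in centimeters, not values from the paper's experiments):

```python
import numpy as np

# 5th order regression from Table 5, highest-degree coefficient first for np.polyval.
coeffs_5th = [-1.731e-11, 3.9893e-8, -3.5441e-5, 0.0152, -3.2828, 327.9824]

for distance_cm in (60, 200, 400, 700):
    # Predicted pixel width of the (roughly 20 cm) reference hand at this range.
    print(distance_cm, np.polyval(coeffs_5th, distance_cm))
```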

4.3.1. Tracking performance evaluation in 2D image space

We compare the performance of the proposed hand tracking with several other tracking methods in 2D image space. We select the CamShift algorithm [33], since it is among the most widely adopted tracking methods for 2D color images. Also, we compare the proposed method with NITE [32] and with 3D hand tracking using a Kalman filter in depth space [21] as state-of-the-art 3D hand trackers. It has not been revealed which algorithms are used in the NITE hand tracker. However, several papers have adopted it for 3D hand tracking since it showed the most reliable tracking performance in applications [59,46,47]. In addition, the proposed method is compared with a method which uses a fixed-sized ALBP, with sizes fixed at 5 and 10.

The error between the ground truth and the position of hand tracking is calculated for the 5 repeated hand movements from 100 people, for five different gestures recorded at different distances (1 m, 2 m and 3 m).

There are many performance evaluation methods for object tracking, such as center error [66], region overlap [61], tracking length [62], failure rate [63,64], pixel-based precision and statistical difference [60,65]. In this paper, we have evaluated the tracking performance based on the average RMS error and the standard deviations. Subsequently, we performed a paired t-test to compare the proposed method with the others.

Fig. 11. 9 image regions to estimate regressions.

Fig. 14(a) shows an example of the RMS error between the ground truth and the tracking position with respect to the circular hand gesture. We summarize the results in Tables 9 and 10, which show the average RMS error and the standard deviations. Table 9 shows the results with respect to different distances (1 m, 2 m and 3 m), and Table 10 shows the results with respect to different hand gestures. In terms of the RMS error and standard deviations, a statistical significance paired t-test is performed to compare the proposed method with other existing tracking methods. The null hypothesis that both compared tracking points are similarly close to the ground truth is rejected at a significance level of 0.05.

4.3.2. Tracking performance evaluation in 3D space

In this experiment, we evaluate the proposed hand tracking against other tracking methods in 3D world space. We compare the performance of the proposed hand tracking with NITE [32], 3D

Table 6
Regressions based on 9 image regions.

Region   Regression
1   r̂(x) = 354.0246 − 4.6721x + 0.0306x^2 − 1.0375×10^(−4)x^3 + 1.7397×10^(−7)x^4 − 1.1384×10^(−10)x^5
2   r̂(x) = 383.0833 − 5.3762x + 0.0371x^2 − 1.3386×10^(−4)x^3 + 2.4265×10^(−7)x^4 − 1.7428×10^(−10)x^5
3   r̂(x) = 379.3135 − 5.3359x + 0.0369x^2 − 1.3263×10^(−4)x^3 + 2.3816×10^(−7)x^4 − 1.6923×10^(−10)x^5
4   r̂(x) = 302.6163 − 3.2413x + 0.0163x^2 − 3.8956×10^(−5)x^3 + 3.9028×10^(−8)x^4 − 8.67×10^(−12)x^5
5   r̂(x) = 376.4027 − 5.3080x + 0.0361x^2 − 1.263×10^(−4)x^3 + 2.1995×10^(−7)x^4 − 1.5117×10^(−10)x^5
6   r̂(x) = 396.5094 − 5.8350x + 0.0416x^2 − 1.5378×10^(−4)x^3 + 2.8375×10^(−7)x^4 − 2.0697×10^(−10)x^5
7   r̂(x) = 377.3735 − 5.2374x + 0.0358x^2 − 1.2738×10^(−4)x^3 + 2.2853×10^(−7)x^4 − 1.6417×10^(−10)x^5
8   r̂(x) = 443.5954 − 6.8975x + 0.0510x^2 − 1.9210×10^(−4)x^3 + 3.5775×10^(−7)x^4 − 2.6109×10^(−11)x^5
9   r̂(x) = 437.2232 − 6.5823x + 0.0466x^2 − 1.6617×10^(−4)x^3 + 2.9005×10^(−7)x^4 − 1.9692×10^(−10)x^5

Table 7
Detection performance according to distance.

Distance (m)    1     1.5   2     2.5   3     3.5   4     4.5   5     5.5   6     6.5
Precision (%)   100   100   100   100   100   97.6  93.2  79.8  55.4  33.1  5.7   0
Recall (%)      100   100   100   100   100   100   100   83.3  62.7  41.7  12.6  0

hand tracking using a Kalman filter in depth space [21] and fixed-sized ALBP based tracking in 3D world space.

Table 8
Abilities to perform hand tracking in 2D and 3D space.

Compared hand tracking                             In 2D image space   In 3D world space
Proposed hand tracking                             √                   √
NITE [32]                                          √                   √
Fixed-sized ALBP based hand tracking               √                   √
3D hand tracking using Kalman filter [21]          √                   √
CamShift [33]                                      √

Fig. 14(b) shows the RMS error of the infinite hand motion. The unit of the RMS error is millimeters, since errors are calculated in the coordinates of the world reference frame. Since the pixel intensity of a depth image represents the real depth (Z_w) from the camera to the scene, we can convert an (x_im, y_im) image coordinate to an (X_w, Y_w, Z_w) world coordinate as follows:

    (X_w, Y_w, Z_w) = ( Z_w (x_im − x_0) / f,  Z_w (y_im − y_0) / f,  Z_w )        (16)

where (x_0, y_0) is the principal point and f is the focal length of the depth sensor; these internal parameters can be acquired by camera calibration [37].

We summarize all results in Tables 11 and 12, which show the average RMS error and the standard deviations. In terms of the RMS error and standard deviations, a statistical significance paired t-test is performed to compare the proposed method with the other tracking methods.
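A direct transcription of Eq. (16) is given below, assuming the principal point and focal length come from a prior camera calibration; the numeric values shown are placeholders, not the calibration used in this study:

```python
def image_to_world(x_im, y_im, z_w, x0, y0, f):
    """Back-project an image coordinate with known depth z_w into world
    coordinates using the pinhole model of Eq. (16)."""
    x_w = z_w * (x_im - x0) / f
    y_w = z_w * (y_im - y0) / f
    return x_w, y_w, z_w

# Hypothetical intrinsics for a 640 x 480 depth sensor (placeholder values).
print(image_to_world(400, 300, 2000.0, x0=320.0, y0=240.0, f=575.0))
```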

4.3.3. Summary of results

We summarize the average RMS error and standard deviations with respect to all gestures and different distances, as shown in Figs. 15 and 16. Based on the above experimental results, the proposed hand tracking method shows the best performance for different distances and positional variations of the hand in 2D space, as well as in 3D space.

In addition, the impact of the UKF over the EKF upon our tracking system can be observed from Figs. 15 and 16, where their tracking performances are compared. From these figures we observe that the tracking result filtered by the UKF is closer to the ground truth than the result filtered by the EKF. This observation is congruent with [48–50], where the UKF outperforms the EKF under object tracking conditions.

Fig. 13. Ground truth is manually selected as the 1/3 position from the tip of the hands.

In 2D image space, the average error decreases with increasing distance, because the variation of the hand position in the 2D image caused by hand motion decreases with increasing distance, as shown in Fig. 16(a). However, the average error increases with increasing distance in 3D world space, since the world coordinate obtained from the image coordinate scales with the distance (Z_W), as X_W = Z_W (x_im − x_0)/f and Y_W = Z_W (y_im − y_0)/f. Even at a similar RMS error in 2D image coordinates, the increasing distance (Z_W) makes the error in 3D world coordinates larger, as shown in Fig. 16(b).

The CamShift algorithm shows the worst tracking performance. Since the CamShift algorithm uses color information, tracking failure frequently occurs when objects with similar color, such as the face, are located near the hands. In addition, with 3D hand tracking using a Kalman filter in depth space [21], the tracking does not indicate a precise hand position, since the tracking point is obtained from the central point of an ellipse fitted to the detected hand in the initial hand detection process. Moreover, with fixed-sized ALBP based hand tracking, it is difficult to adaptively extract size-variable hands in depth images. The fixed-sized ALBP based method works well for a hand at a fixed distance, but does not work well when the distance of the hand varies.

Fig. 12. Five different gestures from an identity: (a) circle gesture, (b) infinite gesture, (c) triangle gesture, (d) vertical (up to down and down to up) gesture and (e) horizontal gesture (left to right and right to left).

Fig. 14. The RMS error between the ground truth and the tracking position with respect to (a) circular hand gesture in 2D image space and (b) infinite hand motions in 3D
world space.

Table 9
Average errors and standard deviations of the proposed method, NITE [32], fixed-sized (5 and 10 ALBP sizes) ALBP based method, 3D hand tracking using Kalman filter in depth space [21] and CamShift [33] with respect to distance (1 m, 2 m and 3 m) (H: '1' rejects the null hypothesis that the gathered RMS errors are equal at a significance level of 0.05, '0' otherwise).

Method                  1 m (Avg ± std, H)    2 m (Avg ± std, H)    3 m (Avg ± std, H)
Proposed (UKF)          13.11 ± 2.37          8.48 ± 1.94           4.37 ± 1.20
Proposed (EKF)          15.37 ± 2.74          9.84 ± 2.27           5.13 ± 1.38
NITE                    16.59 ± 4.23, 1       10.68 ± 2.64, 1       5.21 ± 1.74, 1
Fixed-sized (5)         23.93 ± 7.84, 1       18.33 ± 3.78, 1       13.27 ± 2.76, 1
Fixed-sized (10)        21.57 ± 6.02, 1       16.27 ± 3.23, 1       12.42 ± 2.14, 1
3D hand tracker [21]    24.43 ± 9.56, 1       20.26 ± 6.02, 1       15.92 ± 4.27, 1
CamShift                61.50 ± 20.37, 1      45.55 ± 11.37, 1      36.32 ± 8.93, 1

Table 11
Average errors and standard deviations of the proposed method, NITE [32], fixed-sized (5 and 10 ALBP sizes) ALBP based method and 3D hand tracking using Kalman filter in depth space [21] with respect to varying distance (1 m, 2 m and 3 m) (H: 1 rejects the null hypothesis that the gathered RMS errors are equal at a significance level of 0.05, 0 otherwise).

Method                  1 m (Avg ± std, H)    2 m (Avg ± std, H)    3 m (Avg ± std, H)
Proposed (UKF)          12.72 ± 2.07          13.93 ± 2.79          15.48 ± 3.02
Proposed (EKF)          14.21 ± 2.32          15.37 ± 3.17          16.25 ± 3.71
NITE                    14.35 ± 3.24, 1       15.63 ± 3.93, 1       16.66 ± 4.30, 1
Fixed-sized (5)         27.82 ± 7.10, 1       31.74 ± 7.39, 1       33.82 ± 7.96, 1
Fixed-sized (10)        21.94 ± 5.93, 1       26.20 ± 6.13, 1       28.94 ± 7.28, 1
3D hand tracker [21]    32.32 ± 10.48, 1      28.94 ± 11.98, 1      43.30 ± 13.24, 1

Table 10
Average errors of the proposed method, NITE, fixed-sized (5 and 10 ALBP sizes) ALBP based method, 3D hand tracking using Kalman filter in depth space and CamShift with respect to different hand motions: (a) circle, (b) infinite, (c) triangle, (d) vertical and (e) horizontal gesture (H: 1 rejects the null hypothesis that the gathered RMS errors are equal at a significance level of 0.05, 0 otherwise).

Method                  Circle              Infinite            Triangle             Vertical            Horizontal
Proposed (UKF)          7.93 ± 1.91         8.75 ± 2.37         9.62 ± 2.43          5.73 ± 1.43         5.50 ± 1.52
Proposed (EKF)          9.21 ± 2.27         9.32 ± 3.76         11.52 ± 2.73         6.98 ± 1.73         5.50 ± 1.52
NITE                    9.60 ± 2.67, 1      11.67 ± 3.02, 1     10.75 ± 3.24, 1      7.35 ± 1.96, 1      6.99 ± 1.93, 1
Fixed-sized (5)         17.47 ± 3.86, 1     19.78 ± 4.76, 1     19.49 ± 4.80, 1      15.38 ± 3.24, 1     16.90 ± 2.54, 1
Fixed-sized (10)        16.14 ± 3.27, 1     17.32 ± 4.32, 1     17.96 ± 4.56, 1      14.09 ± 3.15, 1     16.07 ± 3.07, 1
3D hand tracker [21]    19.52 ± 6.19, 1     19.12 ± 8.06, 1     21.23 ± 9.27, 1      17.87 ± 5.66, 1     20.49 ± 6.98, 1
CamShift                46.27 ± 10.15, 1    51.83 ± 13.21, 1    52.90 ± 15.04, 1     36.59 ± 9.98, 1     36.16 ± 14.32, 1

4.4. Processing time

Processing time is one of the most important factors in real-time applications. The processing times for hand detection and tracking are shown in Table 13. Each processing time is calculated as the average over 10,000 frames.

The processing time for hand detection is 32 ms (31 fps) at a 640 × 480 depth resolution. The processing time becomes faster

Table 12
Average errors and standard deviations of the proposed method, NITE, fixed-sized (5 and 10 ALBP sizes) ALBP based method and 3D hand tracking using Kalman filter in depth space with respect to different hand gestures: (a) circle, (b) infinite, (c) triangle, (d) vertical and (e) horizontal gesture (H: 1 rejects the null hypothesis that the gathered RMS errors are equal at a significance level of 0.05, 0 otherwise).

Method               Circle              Infinite            Triangle             Vertical            Horizontal
Proposed (UKF)       13.93 ± 2.38        18.80 ± 3.27        18.39 ± 3.43         9.17 ± 2.10         12.29 ± 2.32
Proposed (EKF)       15.17 ± 3.15        19.73 ± 3.56        20.03 ± 3.89         9.56 ± 2.32         13.12 ± 2.72
NITE                 15.63 ± 3.47, 1     20.89 ± 3.89, 1     19.03 ± 4.02, 1      11.21 ± 2.84, 1     13.09 ± 2.83, 1
Fixed-sized (5)      31.74 ± 7.32, 1     33.57 ± 9.20, 1     34.07 ± 10.21, 1     25.70 ± 6.47, 1     29.98 ± 6.48, 1
Fixed-sized (10)     26.20 ± 6.01, 1     28.29 ± 8.47, 1     29.82 ± 9.47, 1      21.43 ± 5.21, 1     25.74 ± 5.12, 1
3D hand tracker      39.10 ± 11.76, 1    42.22 ± 13.23, 1    33.78 ± 14.53, 1     33.58 ± 8.47, 1     35.52 ± 9.47, 1

Fig. 15. Total average error and standard deviation with respect to distances. x- and y-axes represent the total average error and the distances respectively. The unit of (a) and (b) is pixels and mm, respectively. The vertical line on top of each bar denotes the standard deviation.

Fig. 16. Total average error and standard deviation with respect to hand gestures. x- and y-axes represent the total average error and the hand gestures respectively. The unit of (a) and (b) is pixels and mm, respectively. The vertical line on top of each bar denotes the standard deviation.

with increasing hand distance, since the search region becomes smaller. The hand tracking was performed in 28 ms (35 fps), 15 ms (66 fps) and 12 ms (83 fps) at 1 m, 2 m and 3 m distances, respectively.

Table 13
Processing time for hand detection and tracking.

                     Hand detection   Hand tracking
                                      1 m    2 m    3 m
Process time (ms)    32               28     15     12

The processing time of the fast ALBP, introduced to reduce the computational cost, and of the original ALBP with respect to a single pattern is 1.8 μs and 2.6 μs, respectively.

5. Conclusion and future works

We proposed a novel hand detection and tracking method using an adaptive local binary pattern (ALBP) in range images. The proposed ALBP is not only simple to implement in real time, but also effective for hand detection and tracking. Since the radius of the ALBP, which represents the size of the hand, is adaptively changed according to the pixel intensity of the range image, we can extract depth invariant features.

The main advantages of the proposed ALBP are its simplicity of implementation and its effectiveness. Experimental results show that it outperforms other state-of-the-art hand trackers over distance and positional variations of the hand. The proposed system shows good capacity for real-time processing on images with 640 × 480 resolution.

A number of open problems must be solved to develop the proposed method into a NUI solution for wide applications. One would be to investigate the possibility of allowing multiple-hand tracking. Another would be to combine the proposed hand tracking with gesture recognition algorithms.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01061315).

References

[1] D.M. Gavrila, The visual analysis of human movement: a survey, Comput. Vision Image Underst. 73 (1) (1999) 82–98.
[2] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Comput. Vis. Image Underst. 73 (3) (1999) 428–440.
[3] Y. Wu, T. Huang, Vision-based gesture recognition: a review, in: International Gesture Workshop on Gesture-Based Communication in Human–Computer Interaction, 1999, pp. 103–115.
[4] T. Kirishima, K. Sato, K. Chihara, Real-time gesture recognition by learning and selective control of visual interest points, IEEE Trans. PAMI 27 (3) (2005) 351–364.
[5] Ron George, Joshua Blake, Objects, containers, gestures, and manipulations: universal foundational metaphors of natural user interfaces, in: CHI10 Natural User Interfaces Workshop, 2010.
[6] Q. Chen, N.D. Georganas, E.M. Petriu, Hand gesture recognition using Haar-like features and a stochastic context-free grammar, IEEE Trans. Instrum. Meas. 57 (8) (2008) 1562–1571.
[7] N. Dardas, N. Georganas, Real time hand gesture detection and recognition using bag-of-features and multi-class support vector machine, IEEE Trans. Instrum. Meas. (2011).
[8] Qiu-yu Zhang, Mo-yi Zhang, Jian-qiang Hu, A method of hand gesture segmentation and tracking with appearance based on probability model, in: 2008 Second International Symposium on Intelligent Information Technology Application, 2008.
[9] S.K. Hee, K. Gregorij, B. Ruzena, Hand tracking and motion detection from the sequence of stereo color image frames, in: Proceedings of the ICIT 2008, 2008.
[10] S. Zhong, F. Hao, Hand tracking by particle filtering with elite particles mean shift, in: IEEE Workshop on Frontier of Computer Science, 2008, pp. 163–167.
[11] Q.Y. Zhang, M.Y. Zhang, J.Q. Hu, Hand gesture contour tracking based on skin color probability and state estimation model, JMM 4 (6) (2009) 349–355.
[12] V.A. Prisacariu, I. Reid, Robust 3D hand tracking for human computer interaction, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2011, pp. 368–375.
[13] B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Model-based hand tracking using a hierarchical Bayesian filter, IEEE Trans. Pattern Anal. Mach. Intell. 28 (9) (2006) 1372–1385.
[14] T. Gumpp, P. Azad, K. Welke, E. Oztop, R. Dillmann, G. Cheng, Unconstrained real-time markerless hand tracking for humanoid interaction, in: International Conference on Humanoid Robots, 2006.
[15] J.S. Chang, E.Y. Kim, K. Jung, H.J. Kim, Real time hand tracking based on active contour model, ICCSA 2005 (3483) (2005) 999–1006.
[16] S. Bilal, R. Akmelawati, M.J.E. Salami, A.A. Shafie, E.M. Bouhabba, A hybrid method using haar-like and skin-color algorithm for hand posture detection, recognition and tracking, in: International Conference on Mechatronics and Automation (ICMA), 2010, pp. 934–939.
[17] Z. Pan, Y. Li, M. Zhang, C. Sun, K. Guo, X. Tang, Z. Zhou, A real-time multi-cue hand tracking algorithm based on computer vision, IEEE VR 200 (2010) 219–223.
[18] Yun-Fu Liu, Che-Hao Chang, Hoang-Son Nguyen, Improved hand tracking system, IEEE Trans. Circuits Syst. Video Technol. 22 (2012) 693–701.
[19] Chia-Ping Chen, Yu-Ting Chen, Ping-Han Lee, Yu-Pao Tsai, Shawmin Lei, Real-time hand tracking on depth images, in: 2011 Visual Communications and Image Processing (VCIP), 2011, pp. 1–4.
[20] Xavier Suau, Josep R. Casas, Javier Ruiz-Hidalgo, Real-time head and hand tracking based on 2.5D data, in: 2011 IEEE International Conference on Multimedia and Expo, 2011.
[21] Sangheon Park, Sunjin Yu, Joongrock Kim, Sungjin Kim, Sangyoun Lee, 3D hand tracking using Kalman filter in depth space, EURASIP J. Adv. Signal Process. 2012 (2012) 36.
[22] A. Kolb, E. Barth, R. Koch, R. Larsen, Time-of-Flight Sensors in Computer Graphics, EUROGRAPHICS STAR Report, 2009.
[23] 〈http://www.xbox.com/en-us/kinect〉.
[24] Y. Cui, S. Schuon, C. Derek, S. Thrun, C. Theobalt, 3D shape scanning with a time-of-flight camera, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1173–1180.
[25] L. Xia, C.C. Chen, J.K. Aggarwal, Human detection using depth information by Kinect, in: Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2011, pp. 15–22.
[26] Zhou Ren, Jingjing Meng, Junsong Yuan, Depth camera based hand gesture recognition and its applications in human–computer-interaction, in: 8th International Conference on Information, Communications and Signal Processing (ICICS), 2011, pp. 1–5.
[27] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[28] T. Ojala, K. Valkealahti, E. Oja, M. Pietikainen, Texture discrimination with multidimensional distributions of signed gray-level differences, Pattern Recognit. 34 (3) (2001) 727–739.
[29] J.E. Jackson, A User's Guide to Principal Components, Wiley, Hoboken, New Jersey, 1991.
[30] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley Interscience, Hoboken, New Jersey, 2004.
[31] Jun Xie, Can Xie, Wei Bian, Dacheng Tao, Feature fusion for 3D hand gesture recognition by learning a shared hidden space, Pattern Recognit. Lett. 33 (4) (2012) 476–484.
[32] 〈http://www.primesense.com/en/nite〉.
[33] G.R. Bradski, Computer vision face tracking for use in a perceptual user interface, Intel Technol. J. (1998).
[34] 〈http://openni.org〉.
[35] R. Merse, E. Wan, The unscented Kalman filter for nonlinear estimation, in: Proceedings of the IEEE Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), 2000.
[36] S.J. Julier, J.K. Uhlmann, A new extension of the Kalman filter to nonlinear systems, in: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997, pp. 182–193.
[37] Richard Hartley, Andrew Zisserman, Multiple View Geometry, Cambridge University Press, Cambridge, England, 2000.
[38] Z. Li, U. Park, A.K. Jain, A discriminative model for age invariant face recognition, IEEE Trans. Inf. Forensics Secur. 6 (3) (2011) 1028–1037.
[39] H. Lu, D. Wang, R. Zhang, Y.-W. Chen, Video object pursuit by tri-tracker with on-line learning from positive and negative candidates, IET Image Process. 5 (2011) 101–111.
[40] D.L. Marino Lizarazo, J.A. Tumialan Borja, Hand position tracking using a depth image from a RGB-D camera, in: 2015 IEEE International Conference on Industrial Technology (ICIT), 2015, pp. 1680–1687.
[41] Wu Xiaoyu, Cheng Yang, Youwen Wang, Hui Li, Shengmiao Xu, An intelligent interactive system based on hand gesture recognition algorithm and Kinect, in: 2012 Fifth International Symposium on Computational Intelligence and Design (ISCID), vol. 2, 2012, pp. 294–298.
[42] V. Frati, D. Prattichizzo, Using Kinect for hand tracking and rendering in wearable haptics, in: World Haptics Conference (WHC), 2011, pp. 317–321.
[43] Yanmin Zhu, Bo Yuan, Real-time hand gesture recognition with Kinect for playing racing video games, in: 2014 International Joint Conference on Neural Networks (IJCNN), 2014, pp. 3240–3246.
[44] Le Van Bang, Anh Tu Nguyen, Yu Zhu, Hand detecting and positioning based on depth image of Kinect sensor, Int. J. Inf. Electron. Eng. 4 (3) (2014) 176–179.
[45] Xavier Suau, Javier Ruiz-Hidalgo, Josep R. Casas, Real-time head and hand tracking based on 2.5D data, IEEE Trans. Multimed. 14 (3) (2012) 575–585.
[46] M. Zabri Abu Bakar, R. Samad, D. Pebrianti, N.L.Y. Aan, Real-time rotation invariant hand tracking using 3D data, in: 2014 IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2014, pp. 490–495.
[47] Zhou Ren, Junsong Yuan, Jingjing Meng, Zhengyou Zhang, Robust part-based hand gesture recognition using Kinect sensor, IEEE Trans. Multimed. 15 (5) (2013) 1110–1120.
[48] N.B.F. da Silva, D.B. Wilson, K.R.L.J. Branco, Performance evaluation of the extended Kalman filter and unscented Kalman filter, in: 2015 International Conference on Unmanned Aircraft Systems (ICUAS), 2015, pp. 733–741.
[49] Nicola Bellotto, Hu Huosheng, People tracking with a mobile robot: a comparison of Kalman and particle filters, in: The 13th IASTED International Conference on Robotics and Applications, 2007, pp. 388–393.
[50] R. Zhan, J. Wan, Iterated unscented Kalman filter for passive target tracking, IEEE Trans. Aerosp. Electron. Syst. 43 (3) (2007) 1155–1163.
[51] Jianhua Ye, Zhengguang Liu, Jun Zhang, A face tracking algorithm based on LBP histograms and particle filtering, in: 2010 Sixth International Conference on Natural Computation (ICNC), vol. 7, 2010, pp. 3550–3553.
[52] P. Pouladzadeh, M. Semsarzadeh, B. Hariri, S. Shirmohammadi, An enhanced mean-shift and LBP-based face tracking method, in: 2011 IEEE International Conference on Virtual Environments, Human–Computer Interfaces and Measurement Systems Proceedings, 2011.
[53] Wang Chuan-xu, Li Zuo-yong, A new face tracking algorithm based on local binary pattern and skin color information, in: International Symposium on Computer Science and Computational Technology, 2008, pp. 657–660.
[54] Markus Hueser, Tim Baier, Jianwei Zhang, Learning of demonstrated grasping skills by stereoscopic tracking of human head configuration, in: Proceedings of the 2006 IEEE International Conference on Robotics and Automation, 2006, pp. 2795–2800.
[55] S. Rahimi, Ali Aghagolzadeh, Hadi Seyedarabi, Three camera-based human tracking using weighted color and cellular LBP histograms in a particle filter framework, in: 2013 21st Iranian Conference on Electrical Engineering (ICEE), IEEE, Mashhad, Iran, 2013, pp. 1–6.

[56] Valtteri Takala, Matti Pietikainen, Multi-object tracking using color, texture and motion, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007, pp. 1–7.
[57] Xian Wu, Lihong Li, Jianhuang Lai, Jian Huang, A framework of face tracking with classification using CAMShift-C and LBP, in: Fifth International Conference on Image and Graphics (ICIG'09), 2009, pp. 217–222.
[58] Teera Siriteerakul, Yoichi Sato, Veera Boonjing, Estimating change in head pose from low resolution video using LBP-based tracking, in: 2011 International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS), 2011, pp. 1–6.
[59] Mu Hsen Hsu, T.K. Shih, Jen Shiun Chiang, Real-time finger tracking for visual instruments, in: 2014 7th International Conference on Ubi-Media Computing and Workshops (UMEDIA), 2014, pp. 133–138.
[60] S.-I. Jang, K. Choi, K.-A. Toh, Andrew B.J. Teoh, J. Kim, Object tracking based on an online learning network with total error rate minimization, Pattern Recognit. 48 (1) (2015) 126–139.
[61] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1305–1312.
[62] J. Kwon, K.M. Lee, Tracking of a non-rigid object via patch based dynamic appearance modeling and adaptive basin hopping Monte Carlo sampling, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1208–1215.
[63] Matej Kristan, Janez Pers, Matej Perse, Stanislav Kovacic, Closed-world tracking of multiple interacting targets for indoor-sports applications, Comput. Vision Image Underst. 113 (5) (2009) 598–611.
[64] M. Kristan, S. Kovacic, A. Leonardis, J. Pers, A two-stage dynamic model for visual tracking, IEEE Trans. Syst., Man, Cybern., Part B 40 (6) (2010) 1505–1520.
[65] Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking and Navigation, John Wiley & Sons, Inc., Hoboken, New Jersey, 2001, pp. 438–440.
[66] B. Karasulu, S. Korukoglu, A software for performance evaluation and comparison of people detection and tracking methods in video processing, Multimed. Tools Appl. 55 (3) (2011) 677–723.

Joongrock Kim received his M.S. degree from the Graduate Program in Biometrics at Yonsei University, Seoul, Korea. Currently, he is a Ph.D. candidate in Electrical and Electronic Engineering at Yonsei University, Seoul, Korea. His research interests include human computer interaction, biometrics and computer vision.

Sunjin Yu received his M.S. degree from the Graduate Program in Biometrics at Yonsei University, Seoul, Korea. He received his Ph.D. degree in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea. Currently, he is an assistant professor in the Department of Broadcasting and Film, Cheju Halla University, Cheju-Do, Korea. His research interests include 3D face modeling and human computer interaction.

Dongchul Kim is currently a Ph.D. candidate in Computer Science at Yonsei University, Seoul, Korea. His research interests are in the fields of human computer interaction and augmented reality.

Kar-Ann Toh is a full professor in the School of Electrical and Electronic Engineering at Yonsei University, South Korea. He received the Ph.D. degree from Nanyang Technological University (NTU), Singapore. He worked for two years in the aerospace industry prior to his post-doctoral appointments at research centers in NTU from 1998 to 2002. He was affiliated with the Institute for Infocomm Research in Singapore from 2002 to 2005 prior to his current appointment in Korea. His research interests include biometrics, pattern classification, optimization and neural networks. He is a co-inventor of a US patent and has made several PCT filings related to biometric applications. Besides being active in publications, Dr. Toh has served as a member of the technical program committee for international conferences related to biometrics and artificial intelligence. He is currently an associate editor of Pattern Recognition Letters and a senior member of the IEEE.

Sangyoun Lee received his B.S. and M.S. degrees in Electronic Engineering from Yonsei University, Seoul, South Korea in 1987 and 1989 respectively. He received his Ph.D.
degree in Electrical and Computer Engineering from Georgia Tech., Atlanta, GA, in 1999. He was a senior researcher in Korea Telecom from 1989 to 2004. He is now a full
professor of the School of Electrical and Electronic Engineering, Yonsei University, Korea. His research interests include pattern recognition, computer vision, video coding
and biometrics.
