Pattern Recognition: Joongrock Kim, Sunjin Yu, Dongchul Kim, Kar-Ann Toh, Sangyoun Lee
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Article history:
Received 5 June 2013
Received in revised form 25 July 2016
Accepted 26 July 2016
Available online 27 July 2016

Keywords: 3D hand tracking; Hand gesture recognition; Human computer interaction; Natural user interface

Abstract: Ever since the availability of real-time three-dimensional (3D) data acquisition sensors such as time-of-flight and the Kinect depth sensor, the performance of gesture recognition has been greatly enhanced. However, since conventional two-dimensional (2D) image based feature extraction methods such as the local binary pattern (LBP) generally rely on texture information, they cannot be applied to depth or range images, which contain no texture information. In this paper, we propose an adaptive local binary pattern (ALBP) for effective depth image based applications. In contrast to the conventional LBP, which is only rotation invariant, the proposed ALBP is invariant to both rotation and depth distance in range images. Using ALBP, we can extract object features without using texture or color information. We further apply the proposed ALBP to hand tracking using depth images to show its effectiveness and usefulness. Our experimental results validate the proposal.

© 2016 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.patcog.2016.07.039
140 J. Kim et al. / Pattern Recognition 61 (2017) 139–152
Table 1
Solutions for the main problems of object tracking. The problems considered are: similar color objects, complex background, no motion, and distance variation. The color-based solution addresses one of these problems, the model-based and motion-based solutions each address two, the hybrid solution addresses three, and the 3D solution addresses all four.

Table 2
State-of-the-art literature survey of LBP based object tracking.

Method | LBP feature | Additional features
[51] | LBP histogram | MeanShift + particle filter
[52] | | MeanShift
[53] | | Skin color + particle filter
[54] | | MeanShift + color histogram + expectation–maximization (EM)
[55] | | Color + particle filter
[56] | | Color + motion
[40] | Depth threshold | Closest pixel + major axis
[41] | Depth threshold | |
[42] | (bounding box) | |
[43] | Depth threshold | Skin color detection
[44] | Contour features | No tracking
[45] | Size filtering | Depth threshold + size filtering
[46] | Kinect SDK | Kinect SDK
[47] | OpenNI | OpenNI

Local binary patterns (LBP) is an effective method to extract powerful features for texture classification [27,28]. In particular, it has been successfully adopted for object tracking. To improve tracking performance, additional features such as color and motion are combined with the LBP feature. Although color feature based tracking methods such as MeanShift [51,52,54], CamShift [57] and skin color detection [53] are used to detect regions of interest, they are strongly affected by ambient lighting conditions. Hence, the LBP feature, which is robust to light variations, is added to color features in object tracking [51,52,53,54,57].
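As background for the tracking methods surveyed above, the standard 8-neighbor LBP code of Ojala et al. [27] can be sketched as follows. This is a minimal illustration, not code from the paper; the function name and the 3×3 patch layout are assumptions made for the example.

```python
# Minimal sketch of the standard 8-neighbor LBP code (Ojala et al. [27]);
# the helper name and neighbor ordering are illustrative, not from this paper.
def lbp_code(patch):
    """patch: 3x3 list of gray levels; returns the 8-bit LBP code of the
    center pixel, reading neighbors clockwise from the top-left corner."""
    gc = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for i, g in enumerate(neighbors):
        if g >= gc:          # s(g_i - g_c) = 1 when the difference is non-negative
            code |= 1 << i   # binomial weight 2^i for neighbor i
    return code

print(lbp_code([[10, 20, 30], [5, 15, 25], [0, 1, 2]]))
```

Because the code depends only on the signs of gray-level differences, a uniform brightness change leaves it unchanged, which is the light-variation robustness exploited by the trackers in Table 2.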
benefits, minimal CPU load, and multiplatform support, many developers have been trying to apply it for natural hand-based control or full-body control. However, there are major limitations to developing NITE based natural user interfaces. Above all, since NITE is not provided as open source code, it cannot be modified for developer based applications. Also, it must be used with OpenNI, so it cannot be used with depth sensors other than Kinect. Nevertheless, since it supplies a reliable and robust hand tracker, we choose NITE as the major competitor.

3. Proposed methodology

In this section, we propose an effective feature extraction technique, called adaptive local binary pattern (ALBP), for depth image based applications. Subsequently, we apply the proposed ALBP to hand tracking using depth images. We achieve a reliable 3D tracking performance which is invariant to different depth changes and hand movements. Fig. 1 shows an overview of the proposed system.

3.1. Proposed adaptive local binary pattern for depth image based applications

The conventional local binary pattern (LBP) is a fast and effective feature extraction method for texture based classification of gray-scale images [27,28]. However, LBP cannot be applied to recognize target objects in depth images, since depth images do not contain any texture information. In addition, although LBP is invariant to image rotation, it is not invariant to changes in depth distance. Moreover, an additional classification stage using conventional classifiers such as LDA [38] and SVM [39] is required to classify the features extracted by LBP in applications such as object recognition and object detection.

Here, we propose an adaptive local binary pattern (ALBP) to extract useful features from depth images. The proposed ALBP can extract features which are invariant to rotation and depth variation without compromising speed. In addition, using ALBP, we can verify the shape of objects directly without the need for any classifier.

Essentially, the proposed ALBP consists of three stages. Firstly, a regression function is estimated to set the radius of ALBP in depth images. Next, a rotation and depth invariant ALBP is constructed using the estimated regression function. Finally, a fast version of ALBP is proposed to reduce the processing time.

3.1.1. Estimation of regression function in depth images

Firstly, we estimate the size of a hand projected on depth images according to the depth between the sensor and the hand. The purpose is to determine an adaptive size of the ALBP according to depth information. Since the pixel value in depth images represents the depth distance or range, we estimate the relation between the size of the hand and the pixel value through a polynomial regression function as follows:

r(g) = β0 + β1·g + β2·g^2 + ⋯ + βN·g^N    (1)

where g is the pixel intensity of a depth image, r(g) is the width size with respect to g, βk (k = 0, 1, …, N) are the regression coefficients, and N is the order of the regression function.

From K observations of g and r, a cumulated matrix can be written as

[ r1 ]   [ 1  g1   g1^2  ⋯  g1^N ] [ β0 ]
[ r2 ] = [ 1  g2   g2^2  ⋯  g2^N ] [ β1 ]
[ ⋮  ]   [ ⋮             ⋱   ⋮  ] [ ⋮  ]
[ rK ]   [ 1  gK   gK^2  ⋯  gK^N ] [ βN ]    (2)

In matrix notation, we can rewrite this as

R = GB    (3)

where R is the column vector [r1, r2, …, rK]^T, B is the column vector [β0, β1, β2, …, βN]^T, and G is the matrix connecting R and B.

The coefficient vector B can be obtained using least squares parameter estimation as:

B̂ = (G^T G)^(−1) G^T R    (4)

Finally, we can find the fitted regression function for the size of ALBP as follows:

r̂(g) = β̂0 + β̂1·g + β̂2·g^2 + ⋯ + β̂N·g^N    (5)

where β̂i (i = 0, 1, …, N) are the estimated regression coefficients.
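The least-squares fit of Eqs. (1)–(5) can be sketched directly. The snippet below is a minimal illustration under made-up sample data; the function names are assumptions, not from the paper.

```python
# Sketch of the least-squares estimation in Eqs. (1)-(4): build the
# Vandermonde matrix G from K depth observations and solve the normal
# equations (G^T G) B = G^T R. Sample data are made up for illustration.
import numpy as np

def fit_radius_regression(g_obs, r_obs, order):
    # rows of G are [1, g, g^2, ..., g^N], as in Eq. (2)
    G = np.vander(np.asarray(g_obs, float), order + 1, increasing=True)
    B = np.linalg.solve(G.T @ G, G.T @ np.asarray(r_obs, float))  # Eq. (4)
    return B  # estimated coefficients [beta_0, ..., beta_N]

def predict_radius(B, g):
    # Eq. (5): evaluate the fitted polynomial at depth value g
    return sum(b * g**k for k, b in enumerate(B))

# toy example: the apparent radius shrinks linearly with the depth value
B = fit_radius_regression([50, 100, 150, 200], [40, 30, 20, 10], order=1)
print(predict_radius(B, 125))
```

For the small orders used in the paper (up to 5), solving the normal equations directly is adequate; for higher orders a QR-based solver such as `numpy.linalg.lstsq` would be numerically safer.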
3.1.2. Proposed adaptive local binary pattern

In this stage, a texture T is first defined with respect to a pixel with local neighborhoods of radius size r, as the joint distribution of the gray levels of I (I > 0) image pixels, as shown in Fig. 2:

T = t(gc, g0, g1, …, g(I−1)),  r = r(gc)    (6)

where r(gc) is the estimated radius based on the regression, gc is the image intensity of the center pixel of ALBP, and gi (i = 0, 1, …, I−1) are the pixel intensities of the circular neighborhoods around the center pixel. I is the number of circular neighborhood points of ALBP. The size of the pattern is thus adaptively determined by gc, which is the pixel value in the depth image, i.e., the distance between the sensor and the object.

If the value of the center pixel is subtracted from the values of the neighbors, the local texture can be represented as a joint distribution of the difference values:

Tr(gc) = t(g0 − gc, g1 − gc, g2 − gc, …, g(I−1) − gc)    (7)

To separate an object from the background in a depth image, only the signs of the differences with respect to a threshold are considered:

Tr(gc) ≈ t(s(g0 − gc), s(g1 − gc), …, s(g(I−1) − gc))    (8)

A binomial weight 2^i is assigned to each sign s(gi − gc), transforming the differences in a neighborhood into a unique ALBP code. The ALBP(I,r) operator can be presented as:

ALBP(I,r)(xc, yc) = Σ_{i=0}^{I−1} s(gi − gc) 2^i    (9)

where i is the pattern index, (xc, yc) is the center position of the pattern, gc is the pixel value of the center position and gi is the value at index i.

To achieve invariance with respect to image rotation, each ALBP binary code must be shifted to a reference code, which is the minimum code obtained by circularly shifting the original code. This transformation can be written as:

ALBP(I,r) = min{ ROR(ALBP(I,r), k) | k = 0, 1, …, I − 1 }    (10)

where the function ROR(x, k) performs a circular bitwise right shift on the I-bit binary number x, k times. The ROR operation is defined as:

ROR(ALBP(I,r), k) = Σ_{i=k}^{I−1} s(gi − gc) 2^(i−k) + Σ_{i=0}^{k−1} s(gi − gc) 2^(I−k+i)    (11)

3.1.3. A fast version of ALBP

Although the computational cost of the above ALBP is not too heavy for real-time application, we nevertheless propose a technique to speed up the process. The following two quantities have been adopted for the computational speed-up:

- The pattern of signs of the differences with respect to a threshold in Eq. (8) (e.g. t(1, 1, 1, 1, 1, 1, 1) and t(1, 1, 1, 1)).
- The number of transitions, i.e. the bit changes from 1 to 0 or from 0 to 1, which can be calculated as a sum over i = 0, …, I − 2 of the absolute sign changes |s(g(i+1) − gc) − s(gi − gc)|.

By adopting these two quantities, there is no need to find the circularly shifted minimum value of ALBP. That is, the original ALBP finds a minimum binary code through I ROR operations (I^2 summations in total). In the fast version of ALBP, however, we calculate the number of transitions, which needs only I summations, instead of finding the circularly shifted minimum value of ALBP.

In theory, the complexity of the fast version of ALBP, as well as that of the original ALBP, can be given directly. Given N data and s samples, the original ALBP has complexity O(N·s^2) in big-O notation, while the calculation time of the fast version of ALBP is proportional to N·s, i.e. O(N·s), as a result of checking only the number of transitions. Therefore, it needs less computational cost than the original ALBP.

3.1.4. An example

As an example, a depth image is shown in Fig. 3. The ALBP consists of 8 points (g0, g1, g2, …, g7), and each pixel intensity is as below. The radius of the pattern is determined using r(gc), which is r(25):
g0 − gc | g1 − gc | g2 − gc | g3 − gc | g4 − gc | g5 − gc | g6 − gc | g7 − gc

Fig. 4. The proposed hand detection consists of hand candidates detection and candidates verification.
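The computation of Eqs. (9)–(11), together with the transition count used by the fast version in Section 3.1.3, can be sketched as follows. The sign values below are made up for illustration (the numerical intensities of this example are in the original figure), and the helper names are assumptions.

```python
# Sketch of the ALBP pipeline: code assembly (Eq. 9), rotation-invariant
# minimum over circular shifts (Eqs. 10-11), and the transition count used
# by the fast version. Sign values below are illustrative, not the paper's.
def albp_code(signs):
    """signs: list of s(g_i - g_c) in {0, 1} for the I circular neighbors."""
    return sum(s << i for i, s in enumerate(signs))  # Eq. (9)

def ror(code, k, I):
    """Circular bitwise right shift of an I-bit code by k positions (Eq. 11)."""
    return ((code >> k) | (code << (I - k))) & ((1 << I) - 1)

def rotation_invariant(code, I):
    """Eq. (10): minimum code over all I circular shifts."""
    return min(ror(code, k, I) for k in range(I))

def transitions(signs):
    """Fast version: number of 0/1 changes around the circle (I summations)."""
    I = len(signs)
    return sum(signs[i] != signs[(i + 1) % I] for i in range(I))

signs = [1, 1, 0, 0, 0, 1, 1, 1]          # five contiguous 1s on the circle
code = albp_code(signs)
print(code, rotation_invariant(code, 8), transitions(signs))
```

Note that `rotation_invariant` performs I shift-and-compare steps per pixel, whereas `transitions` makes a single pass over the I signs, which is the saving the fast version exploits.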
The results of ALBP depend on the radius of the pattern and the threshold value for the signs of difference of ALBP. That is, we can find and verify a specific shape that we want to extract from a depth image by selecting the radius of the pattern and the threshold value.

3.2. Proposed depth image based 3D hand tracking

A depth image contains range information, but not texture or color information. It is difficult to detect and trace an object without such information. Using the proposed ALBP, we can extract and verify the shape of target objects in depth images at the same time. In this paper, we apply the proposed ALBP to detect and trace the position of hands, as a specific object, in depth images.

Firstly, detection of an initial position to be tracked in depth images is introduced. Based on the detected initial hand position, the ALBP based hand tracking is then performed. Subsequently, trajectory filtering using an Unscented Kalman Filter (UKF) [35] is adopted to reduce the effect of depth noise. Finally, a process for handling tracking failure and occlusion by other objects is proposed.

3.2.1. ALBP based hand detection in depth images

We use the user's hand, with an act of reaching toward the sensor, as an initial (focus) cue for hand detection. The proposed hand detection is divided into two steps: hand candidate detection and candidate verification, as shown in Fig. 4.

Fig. 5. The results of ALBP with assumptions for hand detection: (a) there is no other object within 30 cm, and (b) the body is seen behind the hand within 70 cm; white and black points represent 0 and 1 of signs of difference, respectively.
Fig. 8. Some examples of ALBP for the real hands in depth images.
The edge of the hand in a depth image changes continuously due to noise, even when the hand does not move. This might influence the performance of gesture recognition, since the trajectory contains noise arising from the depth image. Therefore, the trajectory of hand tracking should be filtered such that it is not affected by small movements of the hand. On the other hand, there is not enough trajectory data when the hand moves fast. That is, the sparse trajectory data caused by fast hand movement needs to be filled in by interpolation.

Since the position of hand tracking in depth images can be formulated as an estimation problem, the formulation can be implemented using an Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF) [35,36]. In this paper, we adopted the UKF for smoothing the hand trajectory, since the UKF shows better performance than the EKF based on our observations.

3.2.4. Handling tracking failure and occlusion

Tracking failure can occur because of fast movement or occlusion by unexpected objects. Moreover, the situation where no hand can be found within the depth image might occur, which means that the user wants to stop hand tracking. Upon tracking failure, we include a detection mode to re-start tracking and a tracking mode to track the hand after detection.

First, we keep the tracking mode with respect to the last detected region for 30 frames after losing the hand, in order to re-track a hand which comes back into the search region. Next, the detection mode is activated to re-start tracking. We assume occluding objects generally exist beyond the search region due to the short distance between the hand and the sensor. In the unlikely event of occlusion, we consider it a tracking failure, and a new round of hand detection shall take place.

The assumption of no occlusion within a 30 cm range was made for application scenarios such as user–PC interfaces, user–TV interfaces, and gaming, which take place in an indoor and somewhat controlled environment. In the unexpected event of occlusion by small objects occurring between the hand and the sensor, the system stops tracking and a fresh initialization shall take place to restart tracking if the hand is detected again. Similarly, tracking will cease if the center of tracking is corrupted by noise, and the system shall enter hand detection mode for a tracking restart if a hand is detected.

4. Experiments

In this study, we use a Kinect depth sensor which captures RGB and depth images of 640×480 at 30 fps. The data acquisition is implemented in Open Natural Interaction (OpenNI) [34], while the other modules, including detection and tracking, have been implemented in C on a machine with a 3.93 GHz Intel Core i7 870 and 4 GB of physical memory. The number of ALBP points is set at 16 (I = 16), which was selected based on the most effective performance in terms of tracking accuracy and processing time.

In this section, we perform four experiments to evaluate the proposed methods, as summarized and enumerated in Table 4.

4.1. Accuracy of regression function

To adapt the size of ALBP according to the distance information in depth images, we need to estimate the relation between the size of ALBP and the hand distance. We use a polynomial regression function to map the relationship. For initialization, we assume the size of the hand to be around 20 cm for constructing the regression function. Then, we measure the pixel size in depth images with respect to the initial 20 cm object within the range of 60–750 cm at intervals of 20 cm, as shown in Fig. 9.

The errors in terms of pixel difference between the ground truth and the fitted regression function are shown in Fig. 10.

In Table 5, we show regressions using several polynomial orders to estimate the hand size. Based on this experimental observation, a 5th order regression function is seen to estimate the hand size well. The estimated polynomial regression function used in this study is r̂(x) = 327.9824 − 3.2828x + 0.0152x^2 − 3.5441×10^(−5)x^3 + 3.9893×10^(−8)x^4 − 1.731×10^(−11)x^5.

In order to cater for images with substantial regional variations, we performed an additional experiment on regression based on 9 image regions, as shown in Fig. 11.

Regressions based on the 9 image regions are shown in Table 6. We also set the order of these regression functions to 5.
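The 5th order function reported above can be evaluated directly. The coefficients below are copied from the text; the function name and the sample depth values are illustrative assumptions.

```python
# Sketch evaluating the 5th order regression function of Section 4.1 by
# Horner's rule; coefficients are those printed in the text, the sample
# depth values (100 and 400) are made up for illustration.
coeffs = [327.9824, -3.2828, 0.0152, -3.5441e-5, 3.9893e-8, -1.731e-11]

def hand_size(x):
    """Estimated hand width (in pixels) at depth value x."""
    r = 0.0
    for c in reversed(coeffs):
        r = r * x + c
    return r

# the estimated size shrinks as the hand moves away from the sensor
print(round(hand_size(100), 2), round(hand_size(400), 2))
```

Horner's rule avoids computing the powers x^2 … x^5 explicitly, which keeps per-pixel evaluation cheap when the radius must be recomputed for every center pixel.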
Table 4
Goal and performance measure for each experiment.

Experiment | Goals | Performance measure
E1 | Fitting regression function | Root mean square (RMS) error
E2 | Hand detection accuracy | Detection rate
E3 | (a) Evaluate the tracking performance in 2D space | RMS error, t-test
   | (b) Evaluate the tracking performance in 3D space | RMS error, t-test
E4 | Process time | Process time (ms)

4.2. ALBP based hand detection

In this experiment, we evaluate the accuracy of hand detection according to its range distance. The range distance of the hand for evaluation is set from 1 m to 7 m at intervals of 50 cm. This range is selected according to our target applications such as user–PC interfaces, user–TV interfaces, and gaming. We evaluate detection accuracy using precision and recall based on the classical true positive (TP), false positive (FP) and false negative (FN) rates:

Precision = TP / (TP + FP)    (14)
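Eq. (14) and its standard recall counterpart (the equation is elided in this copy, but recall is conventionally TP / (TP + FN)) can be sketched as follows; the counts below are made-up examples.

```python
# Minimal sketch of the precision/recall measures used in Section 4.2;
# recall follows the standard definition TP / (TP + FN). Counts are made up.
def precision(tp, fp):
    """Fraction of reported detections that are real hands, Eq. (14)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real hands that were detected."""
    return tp / (tp + fn)

# e.g. 93 correct detections, 7 false alarms, no missed hands
print(precision(93, 7), recall(93, 0))
```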
Fig. 9. Sample images used in estimating regression function according to range distance: left, middle and right images show an object at 60 cm, 400 cm and 700 cm
respectively.
Table 7 shows the detection performance over several range distances. Each detection rate is calculated based on the results of 2000 attempts, wherein 20 repeated hand detections are taken from each of 100 people standing at different distances from the sensor. The table shows deterioration of the detection rate beyond 4.5 m due to the effective range of the Kinect sensor.

The results show good detection performance at distances of 1–4 m, with a 100% recall rate. However, the detection rate rapidly decreases when the distance is over 4.5 m, since the size of the hand becomes too small to be recognized in the depth image. To reduce false detections, we take the range of 0.5–4 m as the reliable distance for hand detection in this study.

4.3. ALBP based hand tracking

In this experiment, we use two kinds of measures to compare the proposed hand tracking with other state-of-the-art hand tracking methods:

- The average RMS error: the Root Mean Square (RMS) error between the ground truth and the estimated tracking position.
- Paired t-test: this checks whether the difference in performances is statistically significant. The null hypothesis is that the gathered RMS errors are equal, tested at a significance level of 0.05.

To verify the robustness of the proposed hand tracking under different ranges of operation and different hand movements, we made a data set based on 100 identities, each with five gestures at different standing distances (1 m, 2 m and 3 m), as shown in Fig. 12. Each gesture is carried out 5 times repeatedly.

The results filtered by UKF are compared with the results filtered by EKF to see which trajectory smoothing is more similar to the ground truth.

The tracking accuracy is compared with several existing hand trackers, namely:

- NITE, from PrimeSense's Natural Interaction Technology for End-user [32].
- CamShift, an object tracking method based on color information [33].
- 3D hand tracking using a Kalman filter in depth space [21].
- Fixed-sized ALBP based tracking, where the target size is not adjusted.

Table 8 presents the abilities of each hand tracking method in 2D and 3D spaces. The ground truth is manually selected as the 1/3 position from the tip of the hand, as shown in Fig. 13.
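The two evaluation measures above can be sketched as follows. The per-trial error values are made up for illustration, and the paired t statistic is implemented from its textbook definition rather than taken from the paper.

```python
# Sketch of the two evaluation measures of Section 4.3: RMS error between
# ground-truth and tracked positions, and the paired t statistic computed
# from per-trial errors of two trackers. All sample values are illustrative.
import math

def rms_error(truth, tracked):
    """Root mean square error between matched position sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(truth, tracked))
                     / len(truth))

def paired_t(errors_a, errors_b):
    """Paired t statistic over per-trial error differences d_i = a_i - b_i."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

print(rms_error([0, 0, 0, 0], [3, 4, 0, 0]))
```

The resulting t statistic is compared against the Student-t critical value for n − 1 degrees of freedom at the 0.05 level; H = 1 in the tables below means the null hypothesis of equal errors is rejected.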
Fig. 10. Error between ground truth and result of regression function according to polynomial order.
Table 5
Regressions using several polynomial orders to estimate the hand size.

Table 6
Regressions based on 9 image regions (5th order functions; entries for regions 7–9 shown):

7 | r̂(x) = 377.3735 − 5.2374x + 0.0358x^2 − 1.2738×10^(−4)x^3 + 2.2853×10^(−7)x^4 − 1.6417×10^(−10)x^5
8 | r̂(x) = 443.5954 − 6.8975x + 0.0510x^2 − 1.9210×10^(−4)x^3 + 3.5775×10^(−7)x^4 − 2.6109×10^(−11)x^5
9 | r̂(x) = 437.2232 − 6.5823x + 0.0466x^2 − 1.6617×10^(−4)x^3 + 2.9005×10^(−7)x^4 − 1.9692×10^(−10)x^5
Table 7
Detection performance according to distance (distance increases from left to right).

Precision (%) | 100 | 100 | 100 | 100 | 100 | 97.6 | 93.2 | 79.8 | 55.4 | 33.1 | 5.7 | 0
Recall (%)    | 100 | 100 | 100 | 100 | 100 | 100  | 100  | 83.3 | 62.7 | 41.7 | 12.6 | 0
hand tracking using a Kalman filter in depth space [21] and fixed-sized LBP based tracking in 3D world space.

Fig. 14(b) shows the RMS error of the infinite hand motion. The unit of the RMS error is millimeters, since the errors are calculated in the coordinates of the world reference frame. Since a pixel intensity of a depth image represents the real depth (Zw) information from the camera to the scene, we can convert an (x_im, y_im) image coordinate to an (Xw, Yw, Zw) world coordinate as follows:

(Xw, Yw, Zw) = ( Zw(x_im − x0)/f,  Zw(y_im − y0)/f,  Zw )    (16)

where f is the focal length and (x0, y0) is the principal point of the camera.

Table 8
Abilities to perform hand tracking in 2D and 3D space.

Compared hand tracking | In 2D image space | In 3D world space
Proposed hand tracking | √ | √
NITE [32] | √ | √
Fixed-sized LBP based hand tracking | √ | √
3D hand tracking using Kalman filter [21] | √ | √
CamShift [33] | √ |
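The back-projection of Eq. (16) can be sketched as follows; the intrinsic values (focal length and principal point) are illustrative placeholders, not the paper's calibration.

```python
# Sketch of Eq. (16): a depth pixel (x_im, y_im) with depth value Z_w is
# mapped to world coordinates via the pinhole model. The focal length f and
# principal point (x0, y0) below are made-up example intrinsics.
def image_to_world(x_im, y_im, z_w, f=525.0, x0=320.0, y0=240.0):
    x_w = z_w * (x_im - x0) / f
    y_w = z_w * (y_im - y0) / f
    return (x_w, y_w, z_w)

# a pixel at the principal point maps straight onto the optical axis
print(image_to_world(320.0, 240.0, 1000.0))
```

Because Z_w multiplies the normalized image offsets, the same pixel displacement corresponds to a larger metric displacement at greater depth, which is why the 3D errors in the tables below grow with distance.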
Fig. 12. Five different gestures from one identity: (a) circle gesture, (b) infinite gesture, (c) triangle gesture, (d) vertical (up-to-down and down-to-up) gesture and (e) horizontal gesture (left-to-right and right-to-left).
Fig. 14. The RMS error between the ground truth and the tracking position with respect to (a) circular hand gesture in 2D image space and (b) infinite hand motions in 3D
world space.
Table 9
Average errors and standard deviations of the proposed method, NITE [32], fixed-sized (5 and 10 ALBP sizes) ALBP based method, 3D hand tracking using Kalman filter in depth space [21] and CamShift [33] with respect to the distance (1 m, 2 m and 3 m) (H: '1' rejects the null hypothesis that the gathered RMS errors evaluated are equal at a significance level of 0.05, '0' otherwise).

Method | 1 m (Avg ± std, H) | 2 m (Avg ± std, H) | 3 m (Avg ± std, H)
Proposed (UKF) | 13.11 ± 2.37 | 8.48 ± 1.94 | 4.37 ± 1.20
Proposed (EKF) | 15.37 ± 2.74 | 9.84 ± 2.27 | 5.13 ± 1.38
NITE | 16.59 ± 4.23, 1 | 10.68 ± 2.64, 1 | 5.21 ± 1.74, 1
Fixed-sized (5) | 23.93 ± 7.84, 1 | 18.33 ± 3.78, 1 | 13.27 ± 2.76, 1
Fixed-sized (10) | 21.57 ± 6.02, 1 | 16.27 ± 3.23, 1 | 12.42 ± 2.14, 1
3D hand tracker [21] | 24.43 ± 9.56, 1 | 20.26 ± 6.02, 1 | 15.92 ± 4.27, 1
CamShift | 61.50 ± 20.37, 1 | 45.55 ± 11.37, 1 | 36.32 ± 8.93, 1

Table 11
Average errors and standard deviations of the proposed method, NITE [32], fixed-sized (5 and 10 ALBP sizes) ALBP based method and 3D hand tracking using Kalman filter in depth space [21] with respect to varying distance (1 m, 2 m and 3 m) (H: '1' rejects the null hypothesis that the gathered RMS errors evaluated are equal at a significance level of 0.05, '0' otherwise).

Method | 1 m (Avg ± std, H) | 2 m (Avg ± std, H) | 3 m (Avg ± std, H)
Proposed (UKF) | 12.72 ± 2.07 | 13.93 ± 2.79 | 15.48 ± 3.02
Proposed (EKF) | 14.21 ± 2.32 | 15.37 ± 3.17 | 16.25 ± 3.71
NITE | 14.35 ± 3.24, 1 | 15.63 ± 3.93, 1 | 16.66 ± 4.30, 1
Fixed-sized (5) | 27.82 ± 7.10, 1 | 31.74 ± 7.39, 1 | 33.82 ± 7.96, 1
Fixed-sized (10) | 21.94 ± 5.93, 1 | 26.20 ± 6.13, 1 | 28.94 ± 7.28, 1
3D hand tracker [21] | 32.32 ± 10.48, 1 | 28.94 ± 11.98, 1 | 43.30 ± 13.24, 1
Table 10
Average errors of the proposed method, NITE, fixed-sized (5 and 10 ALBP sizes) ALBP based method, 3D hand tracking using Kalman filter in depth space and CamShift with respect to different hand motions: (a) circle, (b) infinite, (c) triangle, (d) vertical and (e) horizontal gesture (H: '1' rejects the null hypothesis that the gathered RMS errors evaluated are equal at a significance level of 0.05, '0' otherwise).

Method | (a) circle | (b) infinite | (c) triangle | (d) vertical | (e) horizontal
Proposed (UKF) | 7.93 ± 1.91 | 8.75 ± 2.37 | 9.62 ± 2.43 | 5.73 ± 1.43 | 5.50 ± 1.52
Proposed (EKF) | 9.21 ± 2.27 | 9.32 ± 3.76 | 11.52 ± 2.73 | 6.98 ± 1.73 | 5.50 ± 1.52
NITE | 9.60 ± 2.67, 1 | 11.67 ± 3.02, 1 | 10.75 ± 3.24, 1 | 7.35 ± 1.96, 1 | 6.99 ± 1.93, 1
Fixed-sized (5) | 17.47 ± 3.86, 1 | 19.78 ± 4.76, 1 | 19.49 ± 4.80, 1 | 15.38 ± 3.24, 1 | 16.90 ± 2.54, 1
Fixed-sized (10) | 16.14 ± 3.27, 1 | 17.32 ± 4.32, 1 | 17.96 ± 4.56, 1 | 14.09 ± 3.15, 1 | 16.07 ± 3.07, 1
3D hand tracker [21] | 19.52 ± 6.19, 1 | 19.12 ± 8.06, 1 | 21.23 ± 9.27, 1 | 17.87 ± 5.66, 1 | 20.49 ± 6.98, 1
CamShift | 46.27 ± 10.15, 1 | 51.83 ± 13.21, 1 | 52.90 ± 15.04, 1 | 36.59 ± 9.98, 1 | 36.16 ± 14.32, 1
4.4. Processing time

The processing time is one of the most important factors in real-time applications. The processing times for hand detection and tracking are shown in Table 13. Each processing time is calculated as the average over 10,000 frames.

The processing time for hand detection is 32 ms (31 fps) at 640×480 depth resolution. The processing time becomes faster
Table 12
Average errors and standard deviations of the proposed method, NITE, fixed-sized (5 and 10 ALBP sizes) ALBP based method and 3D hand tracking using Kalman filter in depth space with respect to different hand gestures: (a) circle, (b) infinite, (c) triangle, (d) vertical and (e) horizontal gesture (H: '1' rejects the null hypothesis that the gathered RMS errors evaluated are equal at a significance level of 0.05, '0' otherwise).

Method | (a) circle | (b) infinite | (c) triangle | (d) vertical | (e) horizontal
Proposed (UKF) | 13.93 ± 2.38 | 18.80 ± 3.27 | 18.39 ± 3.43 | 9.17 ± 2.10 | 12.29 ± 2.32
Proposed (EKF) | 15.17 ± 3.15 | 19.73 ± 3.56 | 20.03 ± 3.89 | 9.56 ± 2.32 | 13.12 ± 2.72
NITE | 15.63 ± 3.47, 1 | 20.89 ± 3.89, 1 | 19.03 ± 4.02, 1 | 11.21 ± 2.84, 1 | 13.09 ± 2.83, 1
Fixed-sized (5) | 31.74 ± 7.32, 1 | 33.57 ± 9.20, 1 | 34.07 ± 10.21, 1 | 25.70 ± 6.47, 1 | 29.98 ± 6.48, 1
Fixed-sized (10) | 26.20 ± 6.01, 1 | 28.29 ± 8.47, 1 | 29.82 ± 9.47, 1 | 21.43 ± 5.21, 1 | 25.74 ± 5.12, 1
3D hand tracker | 39.10 ± 11.76, 1 | 42.22 ± 13.23, 1 | 33.78 ± 14.53, 1 | 33.58 ± 8.47, 1 | 35.52 ± 9.47, 1
according to the increase of the hand distance, since the search region

Fig. 15. Total average error and standard deviation with respect to distances. The x- and y-axes represent the distances and the total average error, respectively. The units of (a) and (b) are pixels and mm, respectively. The vertical line on top of each bar denotes the standard deviation.

Fig. 16. Total average error and standard deviation with respect to hand gestures. The x- and y-axes represent the hand gestures and the total average error, respectively. The units of (a) and (b) are pixels and mm, respectively. The vertical line on top of each bar denotes the standard deviation.
A number of open problems must be solved to further the proposed method towards a NUI solution for wide applications. One would be to investigate the possibility of allowing multiple-hand tracking. Another would be to combine the proposed hand tracking with gesture recognition algorithms.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01061315).

References

[1] D.M. Gavrila, The visual analysis of human movement: a survey, Comput. Vis. Image Underst. 73 (1) (1999) 82–98.
[2] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Comput. Vis. Image Underst. 73 (3) (1999) 428–440.
[3] Y. Wu, T. Huang, Vision-based gesture recognition: a review, in: International Gesture Workshop on Gesture-Based Communication in Human–Computer Interaction, 1999, pp. 103–115.
[4] T. Kirishima, K. Sato, K. Chihara, Real-time gesture recognition by learning and selective control of visual interest points, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 351–364.
[5] R. George, J. Blake, Objects, containers, gestures, and manipulations: universal foundational metaphors of natural user interfaces, in: CHI '10 Natural User Interfaces Workshop, 2010.
[6] Q. Chen, N.D. Georganas, E.M. Petriu, Hand gesture recognition using Haar-like features and a stochastic context-free grammar, IEEE Trans. Instrum. Meas. 57 (8) (2008) 1562–1571.
[7] N. Dardas, N. Georganas, Real time hand gesture detection and recognition using bag-of-features and multi-class support vector machine, IEEE Trans. Instrum. Meas. (2011).
[8] Q. Zhang, M. Zhang, J. Hu, A method of hand gesture segmentation and tracking with appearance based on probability model, in: 2008 Second International Symposium on Intelligent Information Technology Application, 2008.
[9] S.K. Hee, K. Gregorij, B. Ruzena, Hand tracking and motion detection from the sequence of stereo color image frames, in: Proceedings of ICIT 2008, 2008.
[10] S. Zhong, F. Hao, Hand tracking by particle filtering with elite particles mean shift, in: IEEE Workshop on Frontier of Computer Science, 2008, pp. 163–167.
[11] Q.Y. Zhang, M.Y. Zhang, J.Q. Hu, Hand gesture contour tracking based on skin color probability and state estimation model, JMM 4 (6) (2009) 349–355.
[12] V.A. Prisacariu, I. Reid, Robust 3D hand tracking for human computer interaction, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2011, pp. 368–375.
[13] B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Model-based hand tracking using a hierarchical Bayesian filter, IEEE Trans. Pattern Anal. Mach. Intell. 28 (9) (2006) 1372–1385.
[14] T. Gumpp, P. Azad, K. Welke, E. Oztop, R. Dillmann, G. Cheng, Unconstrained real-time markerless hand tracking for humanoid interaction, in: International Conference on Humanoid Robots, 2006.
[15] J.S. Chang, E.Y. Kim, K. Jung, H.J. Kim, Real time hand tracking based on active contour model, ICCSA 2005 (3483) (2005) 999–1006.
[16] S. Bilal, R. Akmelawati, M.J.E. Salami, A.A. Shafie, E.M. Bouhabba, A hybrid method using haar-like and skin-color algorithm for hand posture detection,
[24] …, Recognition, 2010, pp. 1173–1180.
[25] L. Xia, C.C. Chen, J.K. Aggarwal, Human detection using depth information by Kinect, in: Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2011, pp. 15–22.
[26] Z. Ren, J. Meng, J. Yuan, Depth camera based hand gesture recognition and its applications in human–computer interaction, in: 8th International Conference on Information, Communications and Signal Processing (ICICS), 2011, pp. 1–5.
[27] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[28] T. Ojala, K. Valkealahti, E. Oja, M. Pietikainen, Texture discrimination with multidimensional distributions of signed gray-level differences, Pattern Recognit. 34 (3) (2001) 727–739.
[29] J.E. Jackson, A User's Guide to Principal Components, Wiley, Hoboken, New Jersey, 1991.
[30] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley Interscience, Hoboken, New Jersey, 2004.
[31] J. Xie, C. Xie, W. Bian, D. Tao, Feature fusion for 3D hand gesture recognition by learning a shared hidden space, Pattern Recognit. Lett. 33 (4) (2012) 476–484.
[32] 〈http://www.primesense.com/en/nite〉.
[33] G.R. Bradski, Computer vision face tracking for use in a perceptual user interface, Intel Technology Journal (1998).
[35] R. van der Merwe, E. Wan, The unscented Kalman filter for nonlinear estimation, in: Proceedings of the IEEE Symposium 2000 on Adaptive Systems for Signal Processing, Communication and Control (AS-SPCC), 2000.
[34] 〈http://openni.org〉.
[36] S.J. Julier, J.K. Uhlmann, A new extension of the Kalman filter to nonlinear systems, in: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997, pp. 182–193.
[37] R. Hartley, A. Zisserman, Multiple View Geometry, Cambridge University Press, Cambridge, England, 2000.
[38] Z. Li, U. Park, A.K. Jain, A discriminative model for age invariant face recognition, IEEE Trans. Inf. Forensics Secur. 6 (3) (2011) 1028–1037.
[39] H. Lu, D. Wang, R. Zhang, Y.-W. Chen, Video object pursuit by tri-tracker with on-line learning from positive and negative candidates, IET Image Process. 5 (2011) 101–111.
[40] D.L. Marino Lizarazo, J.A. Tumialan Borja, Hand position tracking using a depth image from a RGB-D camera, in: 2015 IEEE International Conference on Industrial Technology (ICIT), 2015, pp. 1680–1687.
[41] W. Xiaoyu, C. Yang, Y. Wang, H. Li, S. Xu, An intelligent interactive system based on hand gesture recognition algorithm and Kinect, in: 2012 Fifth International Symposium on Computational Intelligence and Design (ISCID), vol. 2, 2012, pp. 294–298.
[42] V. Frati, D. Prattichizzo, Using Kinect for hand tracking and rendering in wearable haptics, in: World Haptics Conference (WHC), 2011, pp. 317–321.
[43] Y. Zhu, B. Yuan, Real-time hand gesture recognition with Kinect for playing racing video games, in: 2014 International Joint Conference on Neural Networks (IJCNN), 2014, pp. 3240–3246.
[44] L. Van Bang, A.T. Nguyen, Y. Zhu, Hand detecting and positioning based on depth image of Kinect sensor, Int. J. Inf. Electron. Eng. 4 (3) (2014) 176–179.
[45] X. Suau, J. Ruiz-Hidalgo, J.R. Casas, Real-time head and hand tracking based on 2.5D data, IEEE Trans. Multimed. 14 (3) (2012) 575–585.
[46] M. Zabri Abu Bakar, R. Samad, D. Pebrianti, N.L.Y. Aan, Real-time rotation invariant hand tracking using 3D data, in: 2014 IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 2014, pp. 490–495.
[47] Z. Ren, J. Yuan, J. Meng, Z. Zhang, Robust part-based hand gesture recognition using Kinect sensor, IEEE Trans. Multimed. 15 (5) (2013) 1110–1120.
[48] N.B.F. da Silva, D.B. Wilson, K.R.L.J. Branco, Performance evaluation of the extended Kalman filter and unscented Kalman filter, in: 2015 International Conference on Unmanned Aircraft Systems (ICUAS), 2015, pp. 733–741.
[49] N. Bellotto, H. Huosheng, People tracking with a mobile robot: a comparison
recognition and tracking, in: International Conference on Mechatronics and parison of Kalman and particle filters, in: The 13th IASTED International
Automation (ICMA), 2010, pp. 934–939. Conference on Robotics and Applications, 2007, pp. 388–393.
[17] Z. Pan, Y. Li, M. Zhang, C. Sun, K. Guo, X. Tang, Z. Zhou, A real-time multi-cue [50] R. Zhan, J. Wan, Iterated unscented kalman filter for passive target tracking,
hand tracking algorithm based on computer vision, IEEE VR 200 (2010) IEEE Trans. Aerosp. Electron. Syst. 43 (3) (2007) 1155–1163.
219–223. [51] Ye, Jianhua, Zhengguang Liu, Jun Zhang, A face tracking algorithm based on
[18] Yun-Fu Liu, Che-Hao Chang, Hoang-Son Nguyen, Improved hand tracking LBP histograms and particle filtering, in: 2010 Sixth International Conference
system, IEEE Trans. Circuits Syst. Video Technol. 22 (2012) 693–701. on Natural Computation (ICNC), vol. 7, 2010, pp. 3550–3553.
[19] Chia-Ping Chen, Yu-Ting Chen, Ping-Han Lee, Yu-Pao Tsai, Shawmin Lei, Real- [52] P. Pouladzadeh, M. Semsarzadeh, B. Hariri, S. Shirmohammadi, An enhanced
time hand tracking on depth images, in: 2011 Visual Communications and mean-shift and LBP-based face tracking method, in: 2011 IEEE International
Image Processing VCIP, 2011, pp. 1–4. Conference on Virtual Environments, Human–Computer Interfaces and Mea-
[20] Xavier Suau, Josep R. Casas, Javier Ruiz-Hidalgo, Real-time head and hand surement Systems Proceedings, 2011.
tracking based on 2.5D data, in: 2011 IEEE International Conference on Mul- [53] Wang Chuan-xu, Li Zuo-yong, A new face tracking algorithm based on local
timedia and Expo, 2011. binary pattern and skin color information, in: International Symposium on
[21] Sangheon Park, Sunjin Yu, Joongrock Kim, Sungjin Kim, Sangyoun Lee, 3D Computer Science and Computational Technology, 2008, pp. 657–660.
hand tracking using Kalman filter in depth space, EURASIP J. Adv. Signal [54] Hueser, Markus, Tim Baier, Jianwei Zhang, Learning of demonstrated grasping
Process. 2012 (2012) 36. skills by stereoscopic tracking of human head configuration, in: Proceedings
[22] A. Kolb, E. Barth, R. Koch, R. Larsen, Time-of-Flight Sensors in Computer 2006 IEEE International Conference on Robotics and Automation, 2006, 2006,
Graphics, EUROGRAPHICS STAR Report, 2009. pp. 2795–2800.
[23] 〈http://www.xbox.com/en-us/kinect〉. [55] S. Rahimi, Ali Aghagolzadeh, Hadi Seyedarabi, Three camera-based human
[24] Y. Cui, S. Schuon, C. Derek, S. Thrun, C. Theobalt, 3D shape scanning with a tracking using weighted color and cellular LBP histograms in a particle filter
time-of-flight camera, in: IEEE Conference on Computer Vision and Pattern framework, in: 2013 21st Iranian Conference on Electrical Engineering (ICEE),
152 J. Kim et al. / Pattern Recognition 61 (2017) 139–152
IEEE, Mashhad, Iran, 2013, pp. 1–6. 2011 IEEE Conference onComputer Vision and Pattern Recognition (CVPR),
[56] Takala, Valtteri, Matti Pietikainen, Multi-object tracking using color, texture 2011, pp. 1305–1312.
and motion, in: IEEE Conference on Computer Vision and Pattern Recognition, [62] J. Kwon, K.M. Lee, Tracking of a non-rigid object via patch based dynamic
2007. CVPR'07, 2007, pp. 1–7. appearance modeling and adaptive basin hopping Monte Carlo sampling, in:
[57] Xian Wu, Lihong Li, Jianhuang Lai, Jian Huang, A framework of face tracking 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
with classification using CAMShift-C and LBP, in: Fifth International Con- 2009, pp. 1208–1215.
ference on Image and Graphics, 2009, ICIG'09, 2009, pp. 217–222. [63] Matej Kristan, Janez Pers, Matej Perse, Stanislav Kovacic, Closed-world track-
[58] Siriteerakul, Teera, Yoichi Sato, Veera Boonjing, Estimating change in head ing of multiple interacting targets for indoor-sports applications, Comput.
pose from low resolution video using LBP-based tracking, in: 2011 Interna- Vision. Image Underst. 113 (5) (2009) 598–611.
tional Symposium on Intelligent Signal Processing and Communications Sys- [64] M. Kristan, S. Kovacic, A. Leonardis, J. Pers, A two-stage dynamic model for
tems (ISPACS), 2011, pp. 1–6.
visual tracking, IEEE Trans. Syst., Man, Cybern., Part B 40 (6) (2010) 1505–1520.
[59] Mu Hsen Hsu, T.K. Shih, Jen Shiun Chiang, Real-time finger tracking for visual
[65] Y. Bar-Shalom, X.R. Li, T. Kirubarajan, Estimation with Applications to Tracking
instruments, in: 2014 7th International Conference on Ubi-Media Computing
and Navigation, 11, John Wiley & Sons, Inc., Hoboken, New Jersey 2001, pp.
and Workshops (UMEDIA), 2014, pp. 133–138.
438–440.
[60] S.-I. Jang, K. Choi, K.-A. Toh, Andrew B.J. Teoh, J. Kim, Object tracking based on
[66] B. Karasulu, S. Korukoglu, A software for performance evaluation and com-
an online learning network with total error rate minimization, Pattern Re-
cognit. 48 (1) (2015) 126–139. parison of people detection and tracking methods in video processing, Mul-
[61] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: timed. Tools Appl. 55 (3) (2011) 677–723.
Joongrock Kim received his M.S. degree from the Graduate Program in Biometrics at Yonsei University, Seoul, Korea. He is currently a Ph.D. candidate in Electrical and Electronic Engineering at Yonsei University. His research interests include human computer interaction, biometrics and computer vision.
Sunjin Yu received his M.S. degree from the Graduate Program in Biometrics and his Ph.D. degree in Electrical and Electronic Engineering, both from Yonsei University, Seoul, Korea. He is currently an assistant professor in the Department of Broadcasting and Film, Cheju Halla University, Cheju-Do, Korea. His research interests include 3D face modeling and human computer interaction.
Dongchul Kim is currently a Ph.D. candidate in Computer Science at Yonsei University, Seoul, Korea. His research interests are in the fields of human computer interaction and augmented reality.
Kar-Ann Toh is a full professor in the School of Electrical and Electronic Engineering at Yonsei University, South Korea. He received his Ph.D. degree from Nanyang Technological University (NTU), Singapore. He worked for two years in the aerospace industry prior to his post-doctoral appointments at research centers in NTU from 1998 to 2002. He was affiliated with the Institute for Infocomm Research in Singapore from 2002 to 2005 prior to his current appointment in Korea. His research interests include biometrics, pattern classification, optimization and neural networks. He is a co-inventor of a US patent and has made several PCT filings related to biometric applications. In addition to his active publication record, Dr. Toh has served on the technical program committees of international conferences related to biometrics and artificial intelligence. He is currently an associate editor of Pattern Recognition Letters and a senior member of the IEEE.
Sangyoun Lee received his B.S. and M.S. degrees in Electronic Engineering from Yonsei University, Seoul, South Korea, in 1987 and 1989, respectively. He received his Ph.D. degree in Electrical and Computer Engineering from Georgia Tech, Atlanta, GA, in 1999. He was a senior researcher at Korea Telecom from 1989 to 2004. He is now a full professor in the School of Electrical and Electronic Engineering, Yonsei University, Korea. His research interests include pattern recognition, computer vision, video coding and biometrics.