3D gaze tracking method using Purkinje images on eye optical model and pupil accommodation


Ji Woo Lee a, Chul Woo Cho a, Kwang Yong Shin a, Eui Chul Lee b, Kang Ryoung Park a,*
a Division of Electronics and Electrical Engineering, Dongguk University, 26, Pil-dong 3-ga, Jung-gu, Seoul 100-715, Republic of Korea
b Division of Computer Science, Sangmyung University, 7 Hongji-dong, Jongno-gu, Seoul 110-743, Republic of Korea
* Corresponding author. E-mail address: parkgr@dongguk.edu (K.R. Park)
Article info
Article history:
Received 29 August 2011
Received in revised form
28 October 2011
Accepted 4 December 2011
Available online 23 December 2011
Keywords:
3D gaze position
3D optical structure of human eye model
First and fourth Purkinje images
MLP
Abstract
Gaze tracking detects the position a user is looking at. Most research on gaze estimation has focused on calculating the X, Y gaze position on a 2D plane. However, as the importance of stereoscopic displays and 3D applications has increased greatly, research into 3D gaze estimation of not only the X, Y gaze position but also the Z gaze position has gained attention for the development of next-generation interfaces. In this paper, we propose a new method for estimating the 3D gaze position based on the illuminative reflections (Purkinje images) on the surfaces of the cornea and lens by considering the 3D optical structure of the human eye model.
This research is novel in the following four ways compared with previous work. First, we theoretically analyze the generation models of Purkinje images based on the 3D human eye model for 3D gaze estimation. Second, the relative positions of the first and fourth Purkinje images to the pupil center, the inter-distance between these two Purkinje images, and the pupil size are used as the features for calculating the Z gaze position. The pupil size is used on the basis of the fact that pupil accommodation happens according to the gaze position in the Z direction. Third, with these features as inputs, the final Z gaze position is calculated using a multi-layered perceptron (MLP). Fourth, the X, Y gaze position on the 2D plane is calculated from the position of the pupil center based on a geometric transform that considers the calculated Z gaze position.
Experimental results showed that the average errors of the 3D gaze estimation were about 0.96° (0.48 cm) on the X-axis, 1.60° (0.77 cm) on the Y-axis, and 4.59 cm along the Z-axis in 3D space.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
Gaze tracking estimates a user's gaze position. Most gaze tracking studies have focused on estimating the X, Y gaze position on a 2D plane [1,2]. There have been four approaches to 2D gaze tracking. The first one is the skin electrode based method, which attaches skin electrodes around the eye and measures the electric potential difference between the retina and cornea [3]; the user's gaze position can be estimated from this difference, and the eye movement can be represented as a continuous value. However, a drawback is that the performance can be affected by the user's eye blinking. In addition, the attached electrodes can be inconvenient and objectionable to users. The second approach is the contact lens based method; the method combining a contact lens and coil systems on the eye [4] belongs to this category. This method is based on the measurement of the user's natural eye movement. However, wearing a contact lens can be inconvenient for the user.
The third approach is the remote camera based method [5–7], which needs near-infrared (NIR) light illuminators and one or two cameras. This method is very convenient and can be used for various applications. However, in order to cope with natural head movements, the method needs more than two cameras or additional pan–tilt devices [5]. The last approach is the wearable device based method. A small camera and NIR light illuminators are attached to a wearable device; various types are available, such as glasses, a helmet, or stereoscopic glasses [8,9,11]. The method estimates the gaze positions based on the relative position between the reflective pattern of the NIR illuminator and the center of the pupil region in the captured eye image. Because the camera is attached to the head-mounted device, it can always capture the static eye region irrespective of a user's head movements. However, in order to obtain the gaze position on the monitor screen, the head movements should be estimated in addition to the eye movement; this requires NIR illuminators on monitor corners, an additional camera, or motion tracking
sensor [8,9,11]. In [8], an additional frontal viewing camera was
attached to the wearable device to track the head movement. In
the previous study [9], four NIR illuminators were attached to the
four monitor corners for gaze tracking. In [11], a motion tracking
sensor was attached to the wearable stereoscopic glasses to track
the head movements.
In contrast to 2D gaze tracking, 3D gaze detection estimates a user's gaze position not only on the X, Y plane but also in the Z (depth) direction. Recently, the importance of 3D graphics and stereoscopic displays has been greatly emphasized, and 3D gaze tracking technology has also become a focus of human–computer interfaces for 3D content [13–15]. Studies on 3D gaze tracking follow two main approaches. The first approach is the remote camera based method, which uses cameras on the desktop computer [16]. Kwon et al. proposed a method that estimates the 3D gaze position for gaze-based interaction with a 3D display [16]. It needs one monocular camera and two NIR illuminators. The camera is placed under the monitor screen and captures the facial image including both eye regions. The two NIR illuminators produce two specular reflections on both eyes. Thus, while a user gazes at a position on the monitor screen, the distances between the pupil center and the two centers of the specular reflections in the image are used to estimate the gaze position. The pupil center distance (PCD), which represents the inter-distance between the two pupil centers of both eyes, is used for calculating the Z gaze position [16]. This method is convenient for users because no device needs to be worn. However, since the method uses a camera with no panning or tilting, it restricts the natural movements of the user's head. Also, because it uses one wide-view camera to capture both eyes of a user in an image, the resolution of the eye image is very low, which degrades the accuracy of the gaze detection. In [17], Hennessey and Lawrence adapted the 3D gaze tracking method to a volumetric display. In their method, the 3D gaze position is calculated based on the intersection of the two line-of-sight (LOS) vectors from both eyes, and they evaluated the performance with a 3D tic-tac-toe game. This method has restrictions similar to those of [16] since it also uses one camera (with no panning or tilting) to capture both eyes of a user in an image.
The second approach is the wearable device based method, which requires a user to wear the device [18–21]. Essig et al. suggested a method that estimates the 3D gaze point of a user with a neural network approach [18]. The method needs a wearable device that contains two cameras to capture the left and right eye images at the same time. A user needs to gaze at 27 (3×3×3) points in the three-dimensional space for calibration, and the 3D gaze position is calculated based on the calibrated references. This method obtains high-resolution eye images due to the use of two cameras. However, it is inconvenient for users, who have to wear a heavy device containing the two cameras. In addition, gazing at 27 positions for calibration is very cumbersome. Similarly, the other head-mounted 3D gaze tracking methods require two or three cameras on the wearable device [19–21]. In addition to two cameras for capturing eye images, an additional camera is used to capture the frontal viewing scene, and the user's 3D gaze position is calculated by combining the two pupil positions of the user's two eyes with the information of the frontal viewing image [21]. Using two or three cameras requires complicated calibrations and also increases the system's weight and cost.
All previous studies on 3D gaze tracking used two or three cameras, which increases the weight, cost, and calibration complexity [18–21]. To overcome these problems, we propose a new 3D gaze tracking method based on the pupil center and Purkinje images that uses a monocular eye tracker. To detect the pupil center and the first and fourth Purkinje images (dual Purkinje images) in the captured eye image, local binarization and component labeling are performed based on the coarse center detected by a circular edge detector. When a user gazes at a position in three-dimensional space, the positions of the first and fourth Purkinje images are changed by the lens accommodation of the human eye. In other words, as a user gazes at a farther position, the lens of the user's eye becomes thinner, the distance between the first and fourth Purkinje images becomes longer, and the pupil size gets larger. Based on this, the X, Y gaze position on the 2D plane is calculated from the position of the pupil center based on a geometric transform. The relative positions of the first and fourth Purkinje images to the pupil center, the inter-distance between these two Purkinje images, and the pupil size are the features used to calculate the Z gaze position. With these features as inputs, the final Z gaze position is calculated using a multi-layered perceptron (MLP). The proposed method uses just one small universal serial bus (USB) camera and an NIR illuminator, which capture a high-resolution eye image and are convenient for users.
Table 1 shows a summary of the comparisons between previous works and the proposed method.
The remainder of this paper is organized as follows. In Section 2, the proposed method is explained. Experimental results and conclusions are presented in Sections 3 and 4, respectively.
2. Proposed method
2.1. Overview of the proposed method
Fig. 1 shows an overview of the proposed method. The NIR illuminator produces the Purkinje images on the eye, and the camera captures the NIR eye image using the proposed device, as shown in Fig. 2. The center position and diameter of the pupil are then detected. After that, we detect the first and fourth Purkinje images based on the analyzed 3D eyeball model, as explained in Section 2.4. Detailed explanations of steps 2 and 3 of Fig. 1 are given in Section 2.3. Six features are extracted, and these six features (F1, F2: x, y pixel positions of the first Purkinje image relative to the pupil center; F3, F4: x, y pixel positions of the fourth Purkinje image relative to the pupil center; F5: pixel distance between the first and fourth Purkinje images; F6: pupil size) are used as the inputs of the multi-layered perceptron (MLP). The MLP produces the user's depth gaze position (z) as output. Detailed explanations of steps 4 and 5 are given in Section 2.7. The two extracted features (F7, F8: x, y pixel position of the pupil center) are used to calculate the planar gaze position (x, y) based on the geometric transformation considering the calculated depth gaze position (z). Detailed explanations of steps 6 and 7 are given in Section 2.8. Thus, the 3D gaze position (x, y, z) is obtained from the depth gaze position (z) together with the planar gaze position (x, y).
2.2. Proposed wearable gaze tracking device
In this study, a lightweight glasses-type device including an eye-capture camera and an NIR light emitting diode (LED) was used [8,24,25], as shown in Fig. 2. For the eye-capture camera, one small universal serial bus (USB) camera (C600) is used [22]. The NIR rejection filter of the camera is removed, and an NIR passing (visible light rejection) filter is attached onto the camera [8–12,24,25,33]. Therefore, the eye image is lit only by the NIR illuminator, and the brightness of the captured eye image is not changed by visible lighting of the environment, which makes it easier to detect the pupil area. The NIR-LED is used not only for illuminating the eye region but also for producing the dual Purkinje images (first and fourth Purkinje images). A zoom lens is attached onto the eye-capture camera in order to acquire a magnified image of the eye and Purkinje images. The distance between the camera lens and the surface of the eye is about 30 mm. The angle between the optical axis of the camera and the visual axis of the user's eye is about 30°, as shown in Fig. 5. The focal length of the zoom lens is 20.3 mm, and the magnification factor is 3.08. Detailed specifications of the camera and NIR-LED are as follows:
Eye-capture camera: spatial resolution 640×480 pixels; image acquisition speed 30 frames per second.
NIR-LED: wavelength 850 nm; illuminative angle ±30°.
2.3. Detecting the pupil region and Purkinje images
This section corresponds to steps 2 and 3 of Fig. 1. The pupil region is the most important factor for eye gaze tracking [23]. To detect the pupil region in the eye image, circular edge detection (CED) is performed [8–12,24,25,30,33].
Fig. 1. Flowchart of the proposed method.
Fig. 2. Proposed eye image capture device with one USB camera and an NIR-LED.
Table 1
Comparisons between the previous works and the proposed method.

2D gaze tracking methods
Skin electrode based method: skin electrodes are attached around the eye and the electric potential difference between the retina and cornea is measured [3].
  Strength: the eye movement can be represented as a continuous value.
  Weakness: performance can be affected by the user's eye blinking; users can find the attached electrodes inconvenient and objectionable.
Contact lens based method: the method combining a contact lens and coil systems on the eye [4] belongs to this category.
  Strength: based on the user's natural eye movement.
  Weakness: wearing a contact lens can be inconvenient for the user.
Remote camera based method: needs NIR illuminators and one or two cameras to calculate the gaze position [5–7].
  Strength: convenient, since it does not require the user to wear any device or sensor.
  Weakness: more than two cameras or additional pan–tilt devices are needed to cope with natural head movements.
Wearable device based method: a small camera and NIR illuminators are attached to a wearable device of various types, such as glasses, a helmet, or stereoscopic glasses [8,9,11].
  Strength: because the camera is attached to the head-mounted device, it always captures the static eye region irrespective of the user's head movements.
  Weakness: to obtain the gaze position on the monitor screen, the head movements must be estimated in addition to the eye movement, which requires NIR illuminators on the monitor corners, an additional camera, or a motion tracking sensor.

3D gaze tracking methods
Remote camera based method: uses NIR illuminators and a wide-view camera that captures both eyes to calculate the 3D gaze position from the information of the left and right eye images [16,17].
  Strength: convenient for users because it does not require them to wear any device.
  Weakness: since the method uses a camera with no panning or tilting, it restricts the natural movements of the user's head; because one wide-view camera captures both eyes in a single image, the resolution of the eye image is low, which degrades the accuracy of the gaze detection [16].
Wearable device based method: uses two or three cameras on the wearable device [18–21]; the 3D gaze position can be calculated by combining the two pupil positions of the two eyes with the information of the frontal viewing image [21].
  Strength: high-resolution eye images are obtained using two cameras, so the accuracy of the gaze detection is high.
  Weakness: inconvenient for users, who have to wear a heavy device containing the two or three cameras; using two or three cameras requires complicated calibrations and increases the system's weight and cost.
Proposed method: uses the pupil center, pupil size, and Purkinje images with a monocular eye tracker.
  Strength: using only one eye camera and one NIR illuminator, the device is light and convenient for users.
  Weakness: using a wearable device is more inconvenient for users than the remote camera based method.
Based on the pupil region detected by CED, local binarization, a morphological closing operation, and geometric center calculation are performed as follows [8–12,24,25,30,33]. The detailed procedure is shown in Fig. 3.
Fig. 3(a) shows the CED, in which two scalable circles (external and internal circles) are moved over the eye image. The region where the difference between the gray levels of the external and internal circles is maximal is the pupil region, as shown in Fig. 3(b). Based on the pupil region, a rectangular area is defined, and local binarization is performed, as shown in Fig. 3(c). The threshold value for binarization is determined by the p-tile method [31]; based on the rough pupil size detected by CED, the p% of the p-tile method can be determined [25]. In order to fill in the white specular reflection regions inside the black pupil region, a morphological closing operation [26] is then performed in the binarized rectangular region, as shown in Fig. 3(d). After that, through calculation of the geometric center of the black pixels of the pupil region, the center position of the pupil is acquired, as shown in Fig. 3(e).
From the detected pupil center, one circular region is defined, as shown in Fig. 3(f); its radius is the same as that of the pupil. In this region, the first and fourth Purkinje images are detected by local binarization. The optimal radius of this region was determined experimentally in terms of the detection accuracy of the Purkinje images. After detecting the two regions of the first and fourth Purkinje images, the geometric center position of each is calculated, as shown in Fig. 3(f).
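The following Python sketch illustrates steps (c)–(e) of Fig. 3 (p-tile binarization, morphological closing, and geometric-center calculation), assuming a rough pupil center and radius have already been found by CED. It is an illustration only, not the authors' implementation; the function name and the 1.5× ROI margin are our assumptions.

```python
# Illustrative sketch (not the authors' code) of steps (c)-(e) of Fig. 3,
# assuming a rough pupil center/radius from the circular edge detector (CED).
import cv2
import numpy as np

def refine_pupil_center(gray, rough_center, rough_radius):
    """Return a refined pupil center (x, y) and the binary pupil mask."""
    x0, y0 = int(rough_center[0]), int(rough_center[1])
    r = int(rough_radius * 1.5)                      # rectangular ROI around the rough pupil
    y_lo, x_lo = max(y0 - r, 0), max(x0 - r, 0)
    roi = gray[y_lo:y0 + r, x_lo:x0 + r]

    # p-tile thresholding: the expected pupil area fixes the fraction p of
    # darkest pixels to keep (Section 2.3 derives p from the CED pupil size).
    p = np.pi * rough_radius ** 2 / roi.size * 100.0
    thresh = np.percentile(roi, p)
    pupil = (roi <= thresh).astype(np.uint8) * 255   # dark pixels -> pupil candidate

    # Morphological closing fills the bright specular-reflection holes
    # inside the dark pupil blob (Fig. 3(d)).
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    pupil = cv2.morphologyEx(pupil, cv2.MORPH_CLOSE, kernel)

    # Geometric center of the pupil pixels (Fig. 3(e)).
    m = cv2.moments(pupil, binaryImage=True)
    cx = m["m10"] / m["m00"] + x_lo
    cy = m["m01"] / m["m00"] + y_lo
    return (cx, cy), pupil
```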
2.4. Generation model of dual Purkinje images
In the human eye, there are four optical mirror surfaces: the anterior and posterior cornea surfaces, and the anterior and posterior lens surfaces. Fig. 4 shows the generation model of the four Purkinje images by the NIR-LED and camera [27], and Gullstrand's eye model [28]. The image of the incident light reflected by the surface of the anterior cornea sphere (ACS) is called the first Purkinje image. Those reflected by the surfaces of the posterior cornea sphere (PCS), anterior lens sphere (ALS), and posterior lens sphere (PLS) are called the second, third, and fourth Purkinje images, respectively. As shown in Fig. 4, since the first, second, and third Purkinje images are generated close together on the lower side of the pupil center, they are often merged in the captured image, and it is usually difficult to discriminate these three Purkinje images (the
Fig. 3. Procedure of detecting the pupil region and Purkinje images: (a) circular edge detection (CED) template and eye image, (b) rough pupil region detected by CED, (c) local binarization, (d) region filling, (e) calculating the geometric center of the pupil, and (f) detecting the dual Purkinje images.
Fig. 4. Generation model of four Purkinje images by NIR-LED and camera [27].
first, second, and third Purkinje images) in the image. So, the first Purkinje image has been widely used instead of the second and third ones, since the first one is the brightest and largest due to its closer position to the camera compared with the second and third ones [27]. Different from the first, second, and third ones, the fourth Purkinje image is generated separately on the upper side of the pupil center, as shown in Fig. 4, which makes it easier to locate in the captured image [27]. The first and fourth Purkinje images are together called the dual Purkinje image (DPI), which is used for estimating the depth gaze position in this paper.
2.5. Change in distance between dual Purkinje images according to depth gaze position
When a user gazes at a far object, the user's eye lens becomes thin to focus on the object, as shown in Fig. 5(a). In contrast, when a user gazes at a near object, the user's eye lens thickens to focus on the object, as shown in Fig. 5(b). As mentioned in Section 2.4, since the Purkinje images are based on Gullstrand's eye model [28], the theoretical positions of the dual Purkinje image can be determined from the curvature and diameter of the cornea and lens. Therefore, a change in lens curvature induces a change in the positions of the dual Purkinje image. Consequently, a change in the depth gaze position induces a change in the distance between the first and fourth Purkinje images, as shown in Fig. 5. Fig. 6 shows the experimental results confirming this phenomenon.
2.6. Change in pupil size according to depth gaze position
As the depth directional gaze position of a user changes, lens accommodation occurs, as mentioned in Section 2.5. Previous studies have examined the relation between lens accommodation and pupil size [12,29,32]. In [12], pupil accommodation caused by depth fixation in the real world was shown; in detail, pupil accommodation was examined using a real-world target that moves back and forth (from 10 cm to 50 cm of Z distance) [12]. To confirm this in this research, we also performed the experiment shown in Fig. 7. We requested 15 subjects to gaze at five positions of different depth distances (from (1) at 10 cm to (5) at 50 cm) on the left eye's visual line, as shown in Fig. 7(a). The Z distances from the frontal surface of the eye to each position were measured manually.
Fig. 5. Theoretical analysis of the changes in distance between the first and fourth Purkinje images according to the depth gaze position of a user: (a) when a user gazes at a far object; (b) when a user gazes at a near object.
Fig. 6. Experiment proving the changes in distance between the first and the fourth Purkinje images according to the depth gaze position of a user: (a) when a user gazes at a far object; (b) when a user gazes at a near object.
As shown in Fig. 7(b), the pupil size became larger as the depth directional gaze position became farther.
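As a simple illustration of this trend, the pupil diameters reported in Fig. 7(b) can be fitted with a least-squares line; the small Python snippet below is only a sanity check of the monotonic relation, not an analysis performed in the paper.

```python
# Quick illustration of the monotonic trend reported in Fig. 7(b): pupil size
# (pixels) grows with the depth gaze distance (cm). The fitted line is only a
# sanity check of the trend, not an analysis from the paper.
import numpy as np

z_cm = np.array([10.0, 20.0, 30.0, 40.0, 50.0])           # reference depth positions
pupil_px = np.array([120.0, 150.0, 165.0, 185.0, 200.0])  # pupil diameters from Fig. 7(b)

slope, intercept = np.polyfit(z_cm, pupil_px, 1)           # least-squares line
print(f"pupil size ~ {slope:.2f} px/cm * Z + {intercept:.1f} px")
# -> roughly +2 pixels of pupil diameter per additional cm of gaze depth
```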
2.7. Estimating depth gaze position using MLP
As shown in Sections 2.5 and 2.6, we confirmed that the positions of the DPI and the pupil size change according to changes in the depth directional gaze position. Thus, we estimated the depth gaze position using the following six features (F1–F6):

F1 = x1 - cx
F2 = y1 - cy
F3 = x4 - cx
F4 = y4 - cy
F5 = sqrt((x1 - x4)^2 + (y1 - y4)^2)
F6 = pupil diameter in pixels

Here, (x1, y1), (x4, y4), and (cx, cy) are the positions of the first Purkinje image, the fourth Purkinje image, and the pupil center, respectively. With these six features, we obtain the depth gaze position (Zd) using the multi-layered perceptron (MLP) [34], as shown in Fig. 8.
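A minimal sketch of how these six features could be assembled from the detected centers is given below; the helper name and argument layout are our assumptions, not the authors' code.

```python
# Minimal sketch (assumed helper, not the authors' code) of the six features
# F1-F6 of Section 2.7, given the detected pupil center (cx, cy), pupil diameter,
# and the centers of the first and fourth Purkinje images.
import math

def depth_features(pupil_center, pupil_diameter_px, p1, p4):
    cx, cy = pupil_center
    x1, y1 = p1          # first Purkinje image center (pixels)
    x4, y4 = p4          # fourth Purkinje image center (pixels)
    return [
        x1 - cx,                       # F1
        y1 - cy,                       # F2
        x4 - cx,                       # F3
        y4 - cy,                       # F4
        math.hypot(x1 - x4, y1 - y4),  # F5: distance between the two Purkinje images
        pupil_diameter_px,             # F6: pupil size in pixels
    ]
```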
Using the back-propagation algorithm for training the MLP, we obtain the optimal parameters (wij, w'j1), which are used to estimate the user's depth gaze position (Zd). As shown in Fig. 8, there are six input nodes and one output node.
The user's depth gaze position (Zd) can be represented as follows:

Zd = func2(w'11·O_h1 + w'21·O_h2 + w'31·O_h3 + ... + w'n1·O_hn)    (1)

where O_hi is the output value of the hidden node hi, w'i1 is the weight value between the hidden node hi and the output node (o1), and func2( ) is the kernel function of the output node (o1). For the hidden node (hi) and the output node (o1), various kinds of functions, such as linear and sigmoid functions, can be used. For example, if func2( ) is a sigmoid function, Eq. (1) can be written as follows:

Zd = 1 / (1 + exp(-(w'11·O_h1 + w'21·O_h2 + w'31·O_h3 + ... + w'n1·O_hn)))    (2)
O_h1, O_h2, O_h3, ..., O_hn can be represented as follows:
O_h1 = func1(F1·w11 + F2·w21 + F3·w31 + ... + F6·w61)
O_h2 = func1(F1·w12 + F2·w22 + F3·w32 + ... + F6·w62)
O_h3 = func1(F1·w13 + F2·w23 + F3·w33 + ... + F6·w63)
...
O_hn = func1(F1·w1n + F2·w2n + F3·w3n + ... + F6·w6n)    (3)
where func1( ) is the kernel function of the hidden node (hi).
Fig. 7. Examples of pupil size change according to the depth gaze position: (a) gazing at the five reference positions in the depth direction on the left eye's visual line at (1) 10 cm, (2) 20 cm, (3) 30 cm, (4) 40 cm, and (5) 50 cm; (b) changes in the left eye's pupil size at the five reference positions of (a): (1) 120 pixels, (2) 150 pixels, (3) 165 pixels, (4) 185 pixels, and (5) 200 pixels.
Fig. 8. MLP for estimating the depth gaze position of a user.
By replacing O_h1, O_h2, O_h3, ..., O_hn in Eq. (1) with Eq. (3), Zd can be represented as follows:
Zd = func2(w'11·func1(F1·w11 + F2·w21 + F3·w31 + ... + F6·w61)
         + w'21·func1(F1·w12 + F2·w22 + F3·w32 + ... + F6·w62)
         + w'31·func1(F1·w13 + F2·w23 + F3·w33 + ... + F6·w63)
         + ...
         + w'n1·func1(F1·w1n + F2·w2n + F3·w3n + ... + F6·w6n))    (4)
Fig. 9 shows that the mean square error (MSE) for various numbers of hidden nodes decreased over the learning epochs of MLP training. Based on this, we determined the optimal number of hidden nodes to be nine, with which the minimum MSE was obtained.
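The forward pass of Eqs. (1)–(4) can be sketched as follows, with six inputs, nine hidden nodes (the optimum found in Fig. 9), and sigmoid kernels for func1 and func2. The weights here are random placeholders standing in for the values learned by back-propagation, so the snippet only illustrates the structure of the computation, not the trained model.

```python
# Minimal NumPy sketch of the MLP of Eqs. (1)-(4): six input features, nine
# hidden nodes, one output node, with sigmoid kernels func1 and func2.
# Weights would be learned by back-propagation; here they are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 9))   # w_ij: input i -> hidden j   (Eq. (3))
W_out = rng.normal(scale=0.1, size=9)    # w'_j1: hidden j -> output   (Eq. (1))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(features):
    """features: F1..F6 -> normalized depth gaze output Zd."""
    f = np.asarray(features, dtype=float)
    o_h = sigmoid(f @ W)          # hidden activations O_h1 .. O_h9 (func1)
    z_d = sigmoid(o_h @ W_out)    # Eq. (2): sigmoid kernel for the output node
    return float(z_d)             # would be rescaled to centimeters after training

print(forward([12.0, -5.0, -8.0, 20.0, 31.0, 150.0]))
```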
2.8. Estimating planar gaze position using geometric transformation considering the calculated depth gaze position
This section describes the method of calculating the gaze position in the planar space (X, Y). The extracted pupil center position (Cx, Cy) is used for calculating a user's gaze position in the planar space.
Fig. 9. Training procedures of the MLP according to the number of hidden nodes.
Fig. 10. Relation between the pupil's movable area and a user's view plane [8–11,24,25,30,33].
Fig. 11. Nine reference points gazed at for the five different depth positions: (a) five depth positions from 10 cm to 50 cm; (b) nine reference points in the planar region.
As shown in Fig. 10, the transform matrix (mapping function) T between the movable pupil center region ((Cx1, Cy1), (Cx2, Cy2), (Cx3, Cy3), and (Cx4, Cy4)) and a user's view region ((Sx1, Sy1), (Sx2, Sy2), (Sx3, Sy3), and (Sx4, Sy4)) is obtained using a geometric transformation [8–11,24,25,30,33]. The pupil's movable area ((Cx1, Cy1), (Cx2, Cy2), (Cx3, Cy3), and (Cx4, Cy4)) is determined by gazing at the four corners of the user's view region at the initial stage of user calibration. The equation of the geometric transformation is as follows [8–11,24,25,30,33]; matrix T of Fig. 10 can be calculated from matrix S and the inverse of matrix C, and the eight parameters a–h can then be obtained [8–11,24,25,30,33]:
S = T·C

[ Sx1  Sx2  Sx3  Sx4 ]   [ a  b  c  d ]   [ Cx1      Cx2      Cx3      Cx4     ]
[ Sy1  Sy2  Sy3  Sy4 ] = [ e  f  g  h ] · [ Cy1      Cy2      Cy3      Cy4     ]
[ 0    0    0    0   ]   [ 0  0  0  0 ]   [ Cx1·Cy1  Cx2·Cy2  Cx3·Cy3  Cx4·Cy4 ]
[ 0    0    0    0   ]   [ 0  0  0  0 ]   [ 1        1        1        1       ]    (5)
Since 15 persons took part in the experiment of planar gaze detection at the five Z distances (10, 20, 30, 40, and 50 cm) of Fig. 11(a), 75 (15×5) T matrices were obtained. That is, each person has five T matrices at the five Z distances (10, 20, 30, 40, and 50 cm). Examples of real T matrix values are as follows:

Ta10 = [ 1.17361  0.25277  0.0001       274.26 ]
       [ 0.0748   1.66715  0.0003       202.97 ]
       [ 0        0        0            0      ]
       [ 0        0        0            0      ]

Ta30 = [ 3.27567  0.93095  0.000917914  1159.7 ]
       [ 0.5602   3.84433  0.000810824  698.73 ]
       [ 0        0        0            0      ]
       [ 0        0        0            0      ]

Ta50 = [ 4.73389  0.93574  0.00054243   1824.6 ]
       [ 0.22047  8.23577  0.003079604  1740   ]
       [ 0        0        0            0      ]
       [ 0        0        0            0      ]

Tb10 = [ 1.1155   0.20704  4.3×10^-5    260.35 ]
       [ 0.2299   1.48583  0.00031      147.21 ]
       [ 0        0        0            0      ]
       [ 0        0        0            0      ]

Ta10, Ta30, and Ta50 are the T matrices obtained from one person at Z distances of 10, 30, and 50 cm, respectively. Tb10 is the T matrix obtained from another person at a Z distance of 10 cm.
After we obtain matrix T at the initial stage of user calibration, the user's gaze position (S'x, S'y) can be obtained in the planar space (X, Y) from the two features (C'x, C'y) of the pupil center, as shown in Eq. (6) [8–11,24,25,30,33]:

[ S'x ]   [ a  b  c  d ]   [ C'x     ]
[ S'y ] = [ e  f  g  h ] · [ C'y     ]
[ 0   ]   [ 0  0  0  0 ]   [ C'x·C'y ]
[ 0   ]   [ 0  0  0  0 ]   [ 1       ]    (6)
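A compact sketch of Eqs. (5) and (6) is given below: T is obtained from the four calibration correspondences as S·C^-1, and a new pupil center is then mapped to a planar gaze position. The function names are illustrative assumptions, not the authors' code.

```python
# Sketch of the calibration mapping of Eqs. (5) and (6), assuming the four
# calibration pupil centers (Cxi, Cyi) and the four corners of the view region
# (Sxi, Syi) are known. Only the 2x4 block (a..h) of T is meaningful.
import numpy as np

def calibrate_T(pupil_corners, screen_corners):
    """pupil_corners, screen_corners: lists of four (x, y) pairs. Returns the 2x4 block of T."""
    C = np.array([[cx for cx, cy in pupil_corners],
                  [cy for cx, cy in pupil_corners],
                  [cx * cy for cx, cy in pupil_corners],
                  [1.0] * 4])                          # 4x4 matrix C of Eq. (5)
    S = np.array([[sx for sx, sy in screen_corners],
                  [sy for sx, sy in screen_corners]])  # first two rows of S
    return S @ np.linalg.inv(C)                        # T = S * C^-1 (parameters a..h)

def map_gaze(T, pupil_center):
    """Eq. (6): map a pupil center (C'x, C'y) to a planar gaze position (S'x, S'y)."""
    cx, cy = pupil_center
    return T @ np.array([cx, cy, cx * cy, 1.0])
```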
However, this method has the problem that the matrix T must be changed according to the planar spaces at different Z distances. For example, the matrix T obtained from the planar space at the Z distance of 10 cm of Fig. 11(a) produces errors in the gaze position for the planar space at the Z distance of 50 cm of Fig. 11(a).
Table 2
Average ZGE (standard deviation of the error) of depth gaze estimation by the proposed method and other methods (unit: cm).

                Linear regression (LR)   Support vector regression (SVR)   MLP, linear kernel for output node   MLP, sigmoid kernel for output node (proposed)
Training data   4.90 (2.50)              11.80 (8.04)                      5.04 (2.36)                          2.26 (2.11)
Test data       6.67 (3.22)              11.89 (7.94)                      6.69 (3.09)                          4.59 (3.96)
Table 3
Average ZGE (standard deviation of the error) of depth gaze estimation by the proposed method and other methods according to Z distance (unit: cm).

                Z distance (cm)   LR             SVR             MLP (linear output kernel)   MLP (sigmoid output kernel)
Training data   10                0 (0)          3.33 (3.20)     0 (0)                        0 (0)
                20                5.44 (4.65)    13.05 (10.39)   5.77 (4.80)                  0 (0)
                30                9.81 (4.24)    14.01 (8.18)    10.18 (3.38)                 2.72 (2.67)
                40                0 (0)          13.47 (7.18)    0 (0)                        3.85 (3.62)
                50                9.23 (3.62)    15.15 (11.26)   9.23 (3.62)                  4.71 (4.24)
Testing data    10                1.92 (1.92)    3.33 (3.20)     1.92 (1.92)                  1.83 (1.82)
                20                7.70 (5.01)    11.71 (9.53)    8.16 (4.80)                  2.56 (2.51)
                30                11.39 (3.85)   14.14 (7.86)    12.02 (3.62)                 4.27 (3.43)
                40                2.72 (2.67)    13.61 (6.81)    1.92 (1.92)                  6.42 (5.55)
                50                9.62 (2.67)    16.67 (12.31)   9.43 (3.20)                  7.89 (6.47)
Table 4
Accuracies of the planar gaze estimation at the five depth positions (depth gaze position calculated by the proposed method).

Depth position (cm)   Average XGE (deg./cm)   Average YGE (deg./cm)
10                    1.05/0.18               1.98/0.35
20                    1.05/0.37               2.06/0.72
30                    1.18/0.62               1.07/0.56
40                    0.81/0.57               1.75/1.22
50                    0.73/0.64               1.13/0.99
Average error         0.96/0.48               1.60/0.77
Table 5
Accuracies of the planar gaze estimation at the five depth positions (depth gaze position measured manually).

Depth position (cm)   Average XGE (deg./cm)   Average YGE (deg./cm)
10                    1.02/0.18               1.64/0.29
20                    1.01/0.35               1.55/0.54
30                    1.08/0.57               1.04/0.54
40                    0.79/0.55               1.43/1.00
50                    0.72/0.63               1.05/0.92
Average error         0.92/0.46               1.34/0.66
This is because the user's view plane of Fig. 10 changes according to the Z distance even if the pupil movable area of Fig. 10 remains the same. To overcome this problem, we use the following method. In the user calibration stage, a user is required to gaze at the 20 positions (four gaze positions in a planar space × five planar spaces at five Z distances) of Fig. 11(a). From this, five T matrices for the five planar spaces are calculated. In the testing stage, when the depth gaze position has been calculated using the method presented in Section 2.7, the T matrix of the corresponding Z distance is selected, and the (X, Y) gaze position is calculated using the selected T matrix. For example, if the depth gaze position is calculated as 22 cm, the T matrix calculated when the user gazed at the four positions of the planar space at 20 cm of Fig. 11(a) is used for calculating the (X, Y) gaze position.
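A sketch of this depth-dependent selection, reusing the hypothetical map_gaze helper from the sketch after Eq. (6), could look as follows.

```python
# Sketch of the depth-dependent T-matrix selection of Section 2.8 (assumed
# helper names): five T matrices are calibrated at Z = 10, 20, 30, 40, 50 cm,
# and the one whose calibration depth is closest to the MLP's estimated depth
# is applied to the current pupil center.
CAL_DEPTHS_CM = (10, 20, 30, 40, 50)

def planar_gaze(T_per_depth, pupil_center, z_estimate_cm):
    """T_per_depth: dict {depth_cm: 2x4 T block}; returns (x, y) on the selected plane."""
    nearest = min(CAL_DEPTHS_CM, key=lambda d: abs(d - z_estimate_cm))
    return map_gaze(T_per_depth[nearest], pupil_center)  # map_gaze from the sketch above

# Example: an estimated depth of 22 cm selects the T matrix calibrated at 20 cm.
```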
3. Experimental results
The proposed method for 3D gaze estimation was tested on a desktop computer with an Intel Core 2 Quad 2.33 GHz CPU and 4 GB RAM. The algorithm was implemented with Microsoft Foundation Class (MFC) based C programming, and the image capturing software for the proposed camera device used the DirectX 9.0 software development kit (SDK). In our experiments, a user gazed at reference points in the 3D space, as shown in Fig. 11. The distances between the reference points for the five depth positions (at 10 cm, 20 cm, 30 cm, 40 cm, and 50 cm) are 1 cm, 2 cm, 3 cm, 4 cm, and 5 cm, respectively, as shown in Fig. 11. Fifteen subjects participated in the experiment, and each subject underwent six trials of gazing at the nine reference points for the five depth positions from 10 cm to 50 cm. Half of the trial data were randomly selected and used for training; the other half were used for testing. This procedure was repeated five times, and the average accuracy was measured.
In the first experiment, we measured the error of the depth gaze estimation, which is shown in Table 2. As the metric for evaluating accuracy, we use the Z gaze error (ZGE) between the calculated Z distance (Zc) and the reference Z distance (Zr), as shown in Eq. (7):

ZGE = |Zc - Zr|    (7)
Since there is no previous research measuring a user's 3D gaze position using just one camera and one eye image, comparisons with the previous studies were not performed in the experiments. Instead, we compared the accuracy of the proposed method to that of methods using linear regression (LR) and support vector regression (SVR).
Linear regression (LR) is a method that defines the relation between one output value and one (or more than one) input value using linear functions [35,36]. Support vector regression (SVR) is a supervised learning method that uses nonlinear regression based on a scalar function. SVR uses a nonlinear kernel function, which can project the input data into a high-dimensional space, and SVR can then use linear regression to fit a hyper-plane [25,37,38]:

f(x) = w·Φ(x) + b    (8)

Eq. (8) is established by flattening the hyper-plane. The data points lie outside a margin ε surrounding the hyper-plane when minimizing ||w||^2 and the sum of errors [7,25,37,38]. Any high-dimensional projection can then be calculated using nonlinear kernel functions [25]. For a fair comparison, the same six features explained in Section 2.7 were used for the LR, SVR, and MLP-based methods. As shown in Table 2, the accuracy of the proposed method was better than those of the other methods.
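For illustration, the comparison of Table 2 could be reproduced in outline with scikit-learn as below; the regressor choices mirror LR, SVR, and an MLP with nine hidden nodes, but the data are random placeholders rather than the authors' measurements, so the printed numbers are meaningless.

```python
# Hedged sketch of the regression comparison of Table 2 using scikit-learn:
# the same six features predict the depth gaze position with LR, SVR, and an
# MLP. The random data below are placeholders, not the authors' 15-subject set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 6)), rng.normal(size=(100, 6))
z_train = rng.choice([10, 20, 30, 40, 50], size=200).astype(float)
z_test = rng.choice([10, 20, 30, 40, 50], size=100).astype(float)

models = {
    "LR": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "MLP (9 hidden, sigmoid)": MLPRegressor(hidden_layer_sizes=(9,),
                                            activation="logistic", max_iter=2000),
}
for name, model in models.items():
    model.fit(X_train, z_train)
    zge = mean_absolute_error(z_test, model.predict(X_test))  # average ZGE, Eq. (7)
    print(f"{name}: average ZGE = {zge:.2f} cm")
```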
Fig. 12. Reference (red diamond) and average calculated (blue cross) gaze positions in 3D space. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 13. X–Y plane view of Fig. 12 (red diamond: reference positions; blue cross: calculated gaze positions): (a) Z = 10 cm, (b) Z = 20 cm, (c) Z = 30 cm, (d) Z = 40 cm, and (e) Z = 50 cm. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 14. X–Z plane view of Fig. 12 (red diamond: reference positions; blue cross: calculated gaze positions): (a) Y = 1 cm, (b) Y = 2 cm, (c) Y = 3 cm, (d) Y = 4 cm, (e) Y = 5 cm, (f) Y = 6 cm, (g) Y = 7 cm, (h) Y = 8 cm, (i) Y = 9 cm, (j) Y = 10 cm, and (k) Y = 11 cm. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 15. Y–Z plane view of Fig. 12 (red diamond: reference positions; blue cross: calculated gaze positions): (a) X = 1 cm, (b) X = 2 cm, (c) X = 3 cm, (d) X = 4 cm, (e) X = 5 cm, (f) X = 6 cm, (g) X = 7 cm, (h) X = 8 cm, (i) X = 9 cm, (j) X = 10 cm, and (k) X = 11 cm. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
In the next experiment, we measured the accuracies of the depth gaze estimation for the proposed method and the other methods according to the Z distance, as shown in Table 3. As shown in Table 3, the accuracy of the proposed method was better than those of the other methods.
In the next experiment, we measured the accuracy of gaze estimation in the planar space (X, Y). As the metric for evaluating accuracy on the X-axis, we use the X gaze error (XGE) between the calculated X gaze position (xc) and the reference gaze position (xr), as shown in Eq. (9). As the metric for evaluating accuracy on the Y-axis, we use the Y gaze error (YGE) between the calculated Y gaze position (yc) and the reference gaze position (yr), as shown in Eq. (10):

XGE = |xc - xr|    (9)

YGE = |yc - yr|    (10)
As shown in Fig. 11, each user gazed at the nine reference points for the five depth positions from 10 cm to 50 cm. The accuracies of the planar gaze estimation at the five depth positions are shown in Table 4. The average XGE and YGE between the estimated and ground-truth positions (reference points of Fig. 11) were 0.96° (0.48 cm) and 1.60° (0.77 cm), respectively. The XGE (°) and YGE (°) can be obtained as follows:

XGE (°) = tan^-1(XGE (cm) / Z distance (cm) between the user's eye and the gazed plane)    (11)

YGE (°) = tan^-1(YGE (cm) / Z distance (cm) between the user's eye and the gazed plane)    (12)
The YGE was larger than the XGE because the eye camera captures the eye image at a slant from below the eye, as shown in Fig. 2. The gaze errors shown in Table 4 include the errors of the depth gaze estimation (ZGE of Eq. (7)) of Tables 2 and 3, since the calculated depth gaze position was used to select the corresponding T matrix for calculating the planar gaze position, as explained in Section 2.8.
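As a worked example of Eqs. (11) and (12), converting a centimeter error to an angular error is a single arctangent; the 30 cm depth below is chosen only for illustration, and the 0.48 cm value is the average XGE from Table 4.

```python
# Worked example of Eqs. (11) and (12): converting a planar gaze error in cm to
# degrees for a given Z distance between the eye and the gazed plane.
import math

def error_deg(error_cm, z_distance_cm):
    return math.degrees(math.atan(error_cm / z_distance_cm))

print(f"{error_deg(0.48, 30.0):.2f} deg")  # ~0.92 deg for a 0.48 cm error at 30 cm
```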
Thus, to calculate the accuracy of the planar gaze estimation without the error of the depth gaze estimation, we also measured the accuracy of gaze estimation in the planar space (X, Y) with a known depth gaze position. In other words, we assumed that the depth gaze position was known (measured manually) instead of being calculated by the proposed method of Section 2.7. As shown in Fig. 11, the 15 users gazed at the nine reference points for the five depth positions from 10 cm to 50 cm. The accuracies of the planar gaze estimation at the five depth positions are shown in Table 5. The average XGE and YGE between the estimated and ground-truth positions (reference points of Fig. 11) were 0.92° (0.46 cm) and 1.34° (0.66 cm), respectively. Comparing Tables 4 and 5, the errors in the latter are smaller than those in the former since the errors of the depth gaze estimation are not included.
Fig. 12 shows the reference and calculated gaze positions of three persons in 3D space. Figs. 13–15 show the positions in the X–Y, X–Z, and Y–Z plane views, respectively.
In the last experiment, we measured the processing time of the proposed method. The processing times for detecting the pupil region and the Purkinje images were 16 and 0 ms, respectively. The time for calculating the gaze position along the Z axis was 0 ms, and that for calculating the gaze position in the X, Y plane was 20 ms.
In general, the pupil size becomes larger as a user gazes at a point farther from his or her position, as shown in Fig. 16(a). However, when the user has poor eyesight, there are some cases where the pupil size does not change even though the depth gaze position becomes farther, as shown in Fig. 16(b); this causes an error in the depth gaze estimation.
There is another cause of error in the X and Y gaze positions. As shown in Fig. 17, even if a user gazes at the same position, the pupil center position in the image changes due to movements of our device of Fig. 2, which causes an error when calculating the gaze position in the X, Y plane.
4. Conclusions
In this paper, we propose a new method for estimating the 3D gaze position based on the illuminative reflections on the surfaces of the cornea and lens (Purkinje images) by considering the 3D structure of the human eye. We use a lightweight glasses-type device for capturing the eye image that includes one USB camera and one NIR-LED.
Fig. 16. Error case of depth gaze estimation: (a) good case; (b) error case.
Fig. 17. Error in the X and Y gaze position due to the movement of our device: (a) gazing at a point; (b) gazing at the same point as (a), but the device has moved.
We theoretically analyzed the generation models of the Purkinje images based on the 3D human eye model for 3D gaze estimation. The relative positions of the first and fourth Purkinje images to the pupil center, the inter-distance between these two Purkinje images, and the pupil size are used as the six features for calculating the Z gaze position. With these features as inputs, the final Z gaze position is calculated using a multi-layered perceptron (MLP). The X, Y gaze position on the 2D plane is calculated from the position of the pupil center based on a geometric transform that considers the calculated Z gaze position. Experimental results showed that the average errors of the 3D gaze estimation were about 0.96° (0.48 cm) on the X-axis, 1.60° (0.77 cm) on the Y-axis, and 4.59 cm along the Z-axis in 3D space. In future work, we will test the proposed method with more people of various ages and races in various environments, and examine the feasibility of combining the proposed one-eye-based method with a method using two eyes.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011-0004362), and in part by the Public Welfare and Safety Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (No. 2011-0020976).
References
[1] Lin C-S, Huan C-C, Chan C-N, Yeh M-S, Chiu C-C. Design of a computer game using an eye-tracking device for eye's activity rehabilitation. Opt Lasers Eng 2004;42(1):91–108.
[2] Lin C-S, Ho C-W, Chang K-C, Hung S-S, Shei H-J, Yeh M-S. A novel device for head gesture measurement system in combination with eye-controlled human–machine interface. Opt Lasers Eng 2006;44(6):597–614.
[3] Bulling A, Roggen D, Tröster G. Wearable EOG goggles: seamless sensing and context-awareness in everyday environments. J Ambient Intell Smart Environ 2009;1(2):157–71.
[4] Young L, Sheena D. Survey of eye movement recording methods. Behav Res Methods Instrum 1975;7(5):397–429.
[5] Yoo DH, Chung MJ. A novel non-intrusive eye gaze estimation using cross-ratio under large head motion. Comput Vision Image Understanding 2005;98(1):25–51.
[6] Wang J-G, Sung E. Study on eye gaze estimation. IEEE Trans Syst, Man, Cybern, Part B 2002;32(3):332–50.
[7] Murphy-Chutorian E, Doshi A, Trivedi MM. Head pose estimation for driver assistance systems: a robust algorithm and experimental evaluation. In: Proceedings of the 2007 IEEE Intelligent Transportation Systems Conference; 2007. p. 709–14.
[8] Cho CW, Lee JW, Lee EC, Park KR. Robust gaze-tracking method using frontal-viewing and eye-tracking cameras. Opt Eng 2009;48(12):127202-1–15.
[9] Ko YJ, Lee EC, Park KR. A robust gaze detection method by compensating for facial movements based on corneal specularities. Pattern Recognition Letters 2008;29(10):1474–85.
[10] Bang JW, Lee EC, Park KR. New computer interface combining gaze tracking and brainwave measurements. IEEE Transactions on Consumer Electronics; accepted for publication.
[11] Lee EC, Park KR, Whang MC, Park J. Robust gaze tracking method for stereoscopic virtual reality systems. Lecture Notes in Computer Science 2007;4552:700–9.
[12] Lee EC, Lee JW, Park KR. Experimental investigations of pupil accommodation factors. Invest Ophthalmol Visual Sci 2011;52(9):6478–85.
[13] http://www.avatarmovie.com/ [accessed 28.10.11].
[14] http://adisney.go.com/disneypictures/aliceinwonderland/ [accessed 28.10.11].
[15] http://www.imdb.com/title/tt0892791/ [accessed 28.10.11].
[16] Kwon Y-M, Jeon K-W, Ki J, Shahab QM, Jo S, Kim S-K. 3D gaze estimation and interaction to stereo display. Int J Virtual Reality 2006;5(3):41–5.
[17] Hennessey C, Lawrence P. 3D point-of-gaze estimation on a volumetric display. In: Proceedings of the 2008 symposium on eye tracking research and applications; 2008. p. 59.
[18] Essig K, Pomplun M, Ritter H. A neural network for 3D gaze recording with binocular eye trackers. Int J Parallel, Emergent Distributed Syst 2006;21(2):79–95.
[19] Pfeiffer T, Latoschik ME, Wachsmuth I. Evaluation of binocular eye trackers and algorithms for 3D gaze interaction in virtual reality environments. J Virtual Reality Broadcast 2008;5(16).
[20] Sumi K, Sugimoto A, Matsuyama T, Toda M, Tsukizawa S. Active wearable vision sensor: recognition of human activities and environments. In: Proceedings of the international conference on informatics research for development of knowledge society infrastructure; 2004. p. 15–22.
[21] Mitsugami I, Ukita N, Kidode M. Estimation of 3D gazed position using view lines. In: Proceedings of the international conference on image analysis and processing; 2003. p. 466–71.
[22] http://www.logitech.com [accessed 28.10.11].
[23] Yamato M, Monden A, Matsumoto K, Inoue K, Torii K. Quick button selection with eye gazing for general GUI environments. In: Proceedings of the international conference on software: theory and practice; 2000. p. 712–9.
[24] Lee EC, Woo JC, Kim JH, Whang M, Park KR. A brain–computer interface method combined with eye tracking for 3D interaction. J Neurosci Methods 2010;190(2):289–98.
[25] Cho CW, Lee JW, Shin KY, Lee EC, Park KR, Lee HK, et al. Gaze tracking method for an IPTV interface based on support vector regression. ETRI J; submitted for publication.
[26] Gonzalez RC, Woods RE. Digital image processing. 2nd ed. NJ: Prentice-Hall; 2002.
[27] Lee EC, Ko YJ, Park KR. Fake iris detection method using Purkinje images based on gaze position. Opt Eng 2008;47(6):067204-1–16.
[28] Gullstrand A. Helmholtz's physiological optics. Opt Soc Am 1924:350–8.
[29] http://en.wikipedia.org/wiki/Accommodation_(eye) [accessed 28.10.11].
[30] Lee HC, Luong DT, Cho CW, Lee EC, Park KR. Gaze tracking system at a distance for controlling IPTV. IEEE Trans Consum Electron 2010;56(4):2577–83.
[31] Jain R, Kasturi R, Schunck BG. Machine vision. McGraw-Hill; 1995.
[32] Ripps H, Chin NB, Siegel IM, Breinin GM. The effect of pupil size on accommodation, convergence, and the AC/A ratio. Invest Ophthalmol Visual Sci 1962;1:127–35.
[33] Heo H, Lee EC, Park KR, Kim CJ, Whang M. A realistic game system using multi-modal user interfaces. IEEE Trans Consum Electron 2010;56(3):1364–72.
[34] Freeman JA, Skapura DM. Neural networks: algorithms, applications, and programming techniques. Addison-Wesley; 1991.
[35] Zou KH, Tuncali K, Silverman SG. Correlation and simple linear regression. Radiology 2003;227:617–22.
[36] Zanutto EL. A comparison of propensity score and linear regression analysis of complex survey data. J Data Sci 2006;4(1):67–91.
[37] Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression machines. Adv Neural Inf Process Syst 1997;9:155–61.
[38] Smola AJ, Schölkopf B. A tutorial on support vector regression. Stat Comput 2004;14(3):199–222.