Mobile Eye Tracking


Technical University of Denmark

Project course
DTU Compute

Mobile Eye Tracking

By: Elias Lundgaard Pedersen (S143969)
Supervisor: Per Bækgaard

January 18, 2018


Contents
1 Abstract 2

2 Introduction 2
2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Enable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Current tracking model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Challenges in pupil detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 External parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.3 Face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.4 Pupil localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.5 Errors and noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Related work 6
3.1 Models for pupil localization and tracking . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Face detection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Algorithms for face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Methods for pupil detection and tracking . . . . . . . . . . . . . . . . . . . . . 7
3.1.4 Failure detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 State-of-the-art model performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4 Redesigning the tracking model 9


4.1 Speeding up the tracking model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5 Implementation 11
5.1 Comparison with state-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 Processing performance of new model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

6 Conclusion 13

References 13

7 Appendix A: R-studio code for benchmarking with BioID 16

Mobile eye tracking
Elias Lundgaard
January 2018

1 Abstract
Enable, an eye tracking application designed for disabled people, is constrained to use no extra hardware and to run on a mobile device. Eye tracking in the wild is an especially challenging task in computer vision, and requires sophisticated algorithms and tracking models for stable use. In this paper, some major challenges in Enable's current tracking model are identified. A series of state-of-the-art solutions for an improved eye tracking model, and solutions to the identified challenges, are then researched. The methods are evaluated on three parameters: accuracy, computational performance and ease of implementation, and are used for redesigning the tracking model of Enable. Implementation of r-CNNs for face detection, facial landmarking and several performance measures results in a significant increase in accuracy (a score of 96.38 for e ≤ 0.2 on the BioID database) for pupil detection, and better bounding boxes for eye regions. It comes at a cost in computational performance, with a reduction of 7 FPS. The paper concludes that implementing the remaining redesigns can potentially bring Enable to state-of-the-art and make it work stably on a mobile device in the wild.

2 Introduction
Eye tracking on mobile phones without additional hardware has previously been extremely challenging. This is mostly due to the advanced concept of eye gaze tracking, which has so far required several known landmarks in the eye, such as corneal reflections [1]. This is typically done using infrared light diodes (IREDs) placed in front of the user. Since gaze-based tracking systems are often used for complete UI interaction, stereoscopic cameras are used to model the eye in 3D. Furthermore, the anatomical accuracy of eye gaze is 2-4° [2], and as a result, video based eye tracking has a limited precision. Calibration is another issue with eye gaze tracking; it is often required multiple times during use, to great annoyance to the user. A few companies, such as Samsung [3], have attempted adding software based eye tracking features such as automatic scrolling to their phones, but these have been regarded as failures and never been widely adopted. To avoid these issues, Enable, an eye tracking application designed for disabled people, uses gaze gestures instead of gaze points. This makes it possible to run on mobile devices as a pure software solution, which is desirable in terms of scalability. The device and environmental parameters do, however, pose some challenging problems for the tracking model, which are addressed in this paper.

2.1 Problem statement


The use case of Enable poses a series of design constraints. First of all, the application must run on a mid to high-end mobile device such as a smartphone (iPhone 6 equivalent or newer) or tablet (iPad 2 equivalent or newer). This sets up specific requirements for computational performance. Secondly, the application must be reliable and have no involuntary activations. This means the pupil detection and tracking algorithm must be as accurate as possible, in all types of environments. Finally, since the project is run by a company with limited resources, the model must be easily implementable. Specifically, it is desired that the algorithms used are readily available as open-source projects or can be implemented without commercial licenses and within a limited time-frame. These three points make up the major constraints and assessment parameters for the redesign.

Figure 1: The Enable proof of concept application. In the debug window, the left pupil x-coordinates are logged. If the difference in this value is below or above a certain threshold value, a corresponding gesture is logged, in this case LEFT. The pupil is located by detecting the face (middle) and eye regions (right).

Figure 2: The current tracking model for Enable: webcam → face detection (Haar cascades) → eye region and center detection (constants) → pupil detection (gradients) → pupil state (threshold values) → action (on/off).

2.2 Enable
Enable is a platform for controlling smart home devices with eye gesture interaction as the primary input. The platform runs locally on a smartphone in the form of an application, and does not require hardware other than what is already integrated in the phone. Enable is targeted at motor-impaired audiences, with the purpose of assisting them in independently interacting with their physical environment. To evade the many issues of using eye gaze, Enable adopts pupil gesture interaction as its input modality [1]. By "swiping" from left to right or down to up, the user can control a limited number of objects on the smartphone screen. A video demonstrating the concept is available online, and the concept is pictured in figure 1. This method is preferred because it addresses almost all of the issues with traditional eye tracking. It doesn't require high precision pupil localization, and therefore doesn't need IREDs or stereoscopic cameras. Because the point of gaze isn't required, neither is calibration, and the solution works out of the box. This makes it well suited for mobile interaction with low quality cameras and changing environmental factors.

2.2.1 Current tracking model


The process for Enable's pupil gesture interaction is illustrated in figure 2.
A brief rundown of each step in figure 2 is now presented:

• Face detection is performed using the Viola-Jones (VJ) object detection framework [4]. The VJ framework uses Haar-like features in a cascaded Adaboost classifier. The Haar-like features used in Enable are the ones shipped with OpenCV 3.3 for frontal face detection.

• Eye cropping is done using mean values of eye localization in a front-facing face [5].

• Pupil center localization is performed on the cropped eye region image. The localization algorithm used is Fabian Timm et al.'s means of gradients [5], which uses intersecting gradient vectors to determine the pupil center location.

Figure 3: The image quality changes depending on the environment. In (a) the face has changing brightness due to shadows, in (b) the face is darkened by strong backlight, and in (c) lack of lighting makes the face appear dark and full of noise.

• Pupil state detection is done using constant threshold values. If the threshold value in the X or Y direction is exceeded, the pupil state is set to LEFT (L), RIGHT (R), UP (U) or DOWN (D). If the sequence of states matches a predetermined gesture, e.g. LRLR, the corresponding action is executed.
The current Enable eye tracking model is primarily based on readily available and easily implementable existing software libraries. Therefore, there is plenty of room for improvement.
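
To make the pipeline concrete, the following is a minimal sketch of this kind of tracking loop in C++ with OpenCV. The cascade file name, the fixed eye-region fractions, the gesture threshold, and the placeholder pupil localizer (a simple darkest-point estimate standing in for Timm et al.'s means of gradients) are illustrative assumptions, not Enable's actual values.

#include <opencv2/opencv.hpp>

// Placeholder pupil localizer: the darkest point of a blurred eye crop, standing in for
// Timm et al.'s means-of-gradients algorithm.
static cv::Point findPupilCenter(const cv::Mat& eyeRegion) {
    cv::Mat blurred;
    cv::GaussianBlur(eyeRegion, blurred, cv::Size(5, 5), 0);
    double minVal;
    cv::Point minLoc;
    cv::minMaxLoc(blurred, &minVal, nullptr, &minLoc, nullptr);
    return minLoc;
}

int main() {
    // OpenCV's bundled frontal face cascade (path is an assumption).
    cv::CascadeClassifier faceCascade("haarcascade_frontalface_default.xml");
    cv::VideoCapture cam(0);

    const double kGestureThreshold = 5.0;  // assumed pixel threshold for a LEFT/RIGHT state
    double lastPupilX = -1.0;

    cv::Mat frame, gray;
    while (cam.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        // 1) Face detection with the Viola-Jones cascade.
        std::vector<cv::Rect> faces;
        faceCascade.detectMultiScale(gray, faces);
        if (faces.empty()) continue;
        const cv::Rect& face = faces[0];

        // 2) Left eye region from fixed mean proportions of the face box (fractions are illustrative).
        cv::Rect leftEye(face.x + int(face.width * 0.13), face.y + int(face.height * 0.25),
                         int(face.width * 0.30), int(face.height * 0.25));
        leftEye &= cv::Rect(0, 0, gray.cols, gray.rows);   // keep the crop inside the frame
        if (leftEye.area() == 0) continue;

        // 3) Pupil centre on the cropped eye region.
        cv::Point pupil = findPupilCenter(gray(leftEye));

        // 4) Pupil state from a constant threshold on the change in x-coordinate; a sequence of
        //    states (e.g. LRLR) would then be matched against the predefined gestures.
        if (lastPupilX >= 0.0) {
            double dx = pupil.x - lastPupilX;
            if (dx < -kGestureThreshold) { /* log a LEFT state  */ }
            if (dx >  kGestureThreshold) { /* log a RIGHT state */ }
        }
        lastPupilX = pupil.x;
    }
    return 0;
}

The essential point of the sketch is that every stage relies on fixed constants, which is exactly what the redesign in section 4 replaces.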

2.3 Challenges in pupil detection


Enable, and pupil detection in general, involves several challenges, both environmental and technological. Both processing power and accuracy influence the performance of an algorithm, and several factors have to be taken into account. Enable currently faces all the challenges that come with using purely monocular camera frames in an "in-the-wild" environment.

2.3.1 External parameters


Since the use case of Enable constrains it from using external hardware, no other lighting than the existing light in the environment can be supplied. This can result in very dark, low contrast images, or very high contrast in strong back light (fig. 3). Since the pupil detection is done purely from the image, it has to be possible in almost all types of environments. A general rule of thumb is that if a person can see an object in an image, so can a machine. Modern commercial low quality cameras such as smartphone cameras and webcams handle changing environmental lighting extremely well, but changes in brightness, contrast and noise will still be a challenging factor.
Anatomical parameters such as pupil color [6], skin color, blinking, disabilities (e.g. lazy eye), and aesthetics such as make-up may also be influencing factors.

2.3.2 Face detection


Since face detection is a fundamental part of eye detection, the challenges in these fields of computer vision are closely related. These factors include face size in the recorded frame, face pose, the structural components of the face, the facial expression, occlusions such as glasses, image orientation, and the number of faces in the frame [7]. Especially face size and pose are challenging in the case of VJ, which Enable uses.

2.3.3 Face alignment


Face alignment, or facial landmark localization, is an important topic in computer vision research. It deals with localization of certain features in the face, such as eye corners, nostrils and cheekbones. This is very important in eye tracking because it can be used as a frame of reference for the pupil localization [8]. Furthermore, landmarks are required for head pose estimation, which in turn can be used to eliminate the VOR problem described in section 2.3.4.

Figure 4: Different error cases: (a) one closed eye, (b) two closed eyes, (c) angled face, (d) obstructed eye, (e) squinting eyes, (f) sideways face.

Enable currently uses the bounding box of the eye regions for locating the pupil. In the case of Enable, this box is derived from mean geometric values. This method, however, is very imprecise due to changing face sizes and varying facial features between different users. By using landmarks, the pupil localization could be improved substantially.

2.3.4 Pupil localization


The process of locating the pupil in the eye can be tedious due to many factors. First of all, the eye is not flat, but spherical [1]. This makes 2D gaze tracking hard, and requires the 2D coordinates to be mapped to 3D models [8] or the eye coordinates to be measured in 3D using stereoscopic cameras. The pupil is not always circular: it can be covered by the eyelid, or placed in the corners of the eye, which changes the basic geometry. Shape based methods such as Circular Hough Transforms (CHT) [6], Circular Edge Detection (CED) and Longest Line Detection (LLD) [9] do not perform well under these circumstances.
Compensation movement, or the vestibulo-ocular reflex (VOR), may pose a major challenge in this project. Since Enable uses the relative motion of the eye, it is easily tricked by head movements. Even though the pupil centre changes position in the eye, the user may just be adjusting his or her head position, resulting in an involuntary gesture. Since Enable must be usable in a normal everyday context, head fixation is not an option. Zhu et al. [10] propose some interesting methods, one being a dynamic computational head compensation model, but it is computationally heavy.

2.3.5 Errors and noise


The eye tracking model of Enable faces serious issues when confronted with small changes. These errors include:

• False positives for the pupil location on one or two closed eyelids (figure 4a and 4b).

• Spikes in pupil positions in some frames. This is because the pupil is found independently in every frame, and the position is not correlated with previous frames.

• Different values in pupil positions for the two eyes. Theoretically the pupils should be located in the same place in both eyes under normal circumstances.

• No adjustment to face orientation. The software assumes that the face is always front-facing and aligned horizontally (figure 4c and 4f).

• Unstable measurements during normal use.

3 Related work
There are without doubt plenty of challenges to address. Luckily, recent advancements in computer vision, and especially artificial intelligence (AI), have made previously complex and inefficient solutions considerably simpler and more lightweight.

3.1 Models for pupil localization and tracking


When looking for models for pupil localization it is important to note that the actual pupil localization
in the eye regions is not necessarily the most challenging factor. The hardest part is arguably to find
the face on which the pupil localization is done. For doing this, detection algorithms are used.

3.1.1 Face detection algorithms


Face detection is a subset of object detection, and has been a major focus in computer vision research due to its many uses [7]. Until the groundbreaking paper of Viola et al. in 2004, face detection in real time was incredibly challenging [4]. After this breakthrough, numerous algorithms have been proposed, with the focus primarily on improving the classifiers and features of the VJ model. More than 179 classifiers have been proposed, and an abundance of more advanced features like HOG, SIFT, SURF, ACF and NPD has been developed to meet the ever increasing demand for robust face detection [11]. In the brilliant article Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? [12], Fernández-Delgado et al. review 179 of the available classifiers and find only minor differences between them, with the random forests classifier performing best. The most critical drawback of these classifiers is their limited ability to handle non-frontal faces in the wild. This is important, as one cannot expect the user to always face directly towards the camera, e.g. when looking down at their phone.
OpenCV uses the VJ model, while the dlib library uses a Support Vector Machine (SVM) classifier for face detection.
Recent advances in AI have brought significant improvements in face detection compared to the more traditional classifier approach, most notably the Faster R-CNN model proposed by Ren et al. [13]. It is a region-based, very deep convolutional object detection neural network (R-CNN) with state-of-the-art precision and good computational performance. Several improvements for face detection have later been proposed [11], and a method for making the detector efficient at handling faces of different sizes has recently been published [14]. Other deep learning approaches, like that of Li et al. [15], also achieve very high performance compared to traditional models.

3.1.2 Algorithms for face alignment


Before the advent of deep learning methods, cascaded regression was generally considered state-of-the-art for face alignment [8]. As with face detection algorithms, CNNs have recently been shown to be much more effective, and the calculations of a trained detector can be done in almost real time [16]. Since Enable currently finds the eye regions from constant values, implementing face alignment in the tracking model could significantly improve accuracy.
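
Since the ensemble-of-regression-trees aligner of Kazemi et al. [16] ships with dlib, a hedged sketch of pulling eye landmarks out of a detected face box could look roughly as follows. The pretrained model file and the input image name follow dlib's example programs and are assumptions here, not part of Enable.

#include <dlib/image_processing.h>
#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_io.h>
#include <iostream>

int main() {
    dlib::frontal_face_detector detector = dlib::get_frontal_face_detector();

    // Pretrained 68-point model distributed with dlib (file name as in dlib's examples).
    dlib::shape_predictor sp;
    dlib::deserialize("shape_predictor_68_face_landmarks.dat") >> sp;

    dlib::array2d<unsigned char> img;
    dlib::load_image(img, "face.png");   // placeholder input image

    for (const dlib::rectangle& face : detector(img)) {
        dlib::full_object_detection shape = sp(img, face);
        // In the 68-point markup, indices 36-41 outline one eye and 42-47 the other, which
        // gives much tighter eye regions than fixed fractions of the face box.
        std::cout << "eye corner landmarks: " << shape.part(36) << " " << shape.part(45) << std::endl;
    }
    return 0;
}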

3.1.3 Methods for pupil detection and tracking
Eye tracking has been a popular research topic since the middle of the 20th century [17]. As a result, there exists a very large number of methods for computer based pupil tracking. Since Enable cannot make use of external lamps such as IREDs, only methods involving passive light and mono cameras are covered here. As with face detection, classifiers such as SVMs can also be used for pupil detection [18] [19].
Feature based methods use the color distribution in the eye region. Enable uses this approach through means of gradients [5]. This method works well under good lighting conditions, but is susceptible to changing light and image focus. Hansen et al. propose a model using a Marginalized Contour Model for iris tracking [20], but this approach assumes that the user does not move his or her head. Some models combine different approaches to improve performance [21]. Vater et al. [22] combine isophote features with a cascade classifier to reach high accuracy.
As with most other object detection tasks, deep learning and AI have shown themselves to be a potential replacement for these algorithms. Tian et al. [23] propose using Adaptive Gradient Boosting and outperform most state-of-the-art methods in terms of accuracy and time. The best performing algorithm is the improved supervised descent method (SDM) proposed by Zhou et al. [24].

3.1.4 Failure detection


The errors mentioned in the previous section are by no means unknown to people working with eye tracking. A common fix for strange pupil position measurements is to constrain the pupils of both eyes to one another. This has been successfully implemented by Wang et al. with the k-means method in the form of a double eye gaze constraint [25]. Wang et al. furthermore implement an eye close detector to avoid false positives on closed eyes or blinking. To address unstable measurements of pupil locations or facial landmarks, smoothing algorithms such as moving averages, Kalman filtering or optical flow can be implemented.
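
As an illustration of the smoothing option, the following is a minimal sketch that stabilizes detected points with Lucas-Kanade optical flow in OpenCV. Blending the flow prediction with the fresh detection, and the blend factor itself, are assumptions made for the sketch rather than the method of any of the cited papers; the same helper could be applied to facial landmarks as well as to pupil centres.

#include <opencv2/opencv.hpp>
#include <vector>

// Blend freshly detected points with their optical-flow prediction from the previous frame,
// falling back to the raw detection when the flow for a point fails.
static std::vector<cv::Point2f> stabilizePoints(const cv::Mat& prevGray, const cv::Mat& gray,
                                                const std::vector<cv::Point2f>& prevPts,
                                                const std::vector<cv::Point2f>& detectedPts,
                                                float blend = 0.5f)  // assumed weighting
{
    if (prevPts.empty() || prevPts.size() != detectedPts.size())
        return detectedPts;

    std::vector<cv::Point2f> flowPts;
    std::vector<unsigned char> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, flowPts, status, err);

    std::vector<cv::Point2f> out(detectedPts.size());
    for (size_t i = 0; i < detectedPts.size(); ++i) {
        out[i] = status[i] ? blend * flowPts[i] + (1.0f - blend) * detectedPts[i]
                           : detectedPts[i];
    }
    return out;
}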
This section described a number of approaches for addressing the challenges of pupil tracking. An overall observation is the significant improvement of CNNs and deep neural networks (DNNs) compared to traditional computer vision approaches. Ultimately, DNNs will be able to replace some if not all of Enable's detection algorithms, making it perform to state-of-the-art. Determining which approaches to start with is the next challenge.

3.2 State-of-the-art model performances


The many different approaches and algorithms for eye tracking do not make it any easier to choose the "right" approach. Most modern approaches covered here address the original design constraints for Enable. Some, however, are better than others. The models and algorithms can be assessed from three main parameters: accuracy, processing performance and ease of implementation. The first two points can easily be examined for face detection using FDDB [26], a well known face detection benchmarking platform. The (badly designed) performance curve from the webpage can be seen in figure 5. It is apparent that all of the top scores are deep learning methods such as Faster R-CNN, described earlier. Furthermore, the performance difference between the top scores is almost negligible. This makes it easy to choose an algorithm from the ease of implementation constraint. dlib recently implemented a deep learning API for object detection, and using this API dlib made a face detector which performed arguably better than Faster R-CNN, a state-of-the-art method. This makes it an obvious choice, since dlib is open source and free for commercial use.
If the Enable application is ported to mobile platforms like Android or iOS, the built-in face detector may be a better option, since platform-specific detectors make use of hardware acceleration.
For eye detection, another database, BioID, is used for benchmarking. The results for the most relevant algorithms are presented in table 1. e denotes the normalized error score on the dataset, and the final column the time consumed for processing each image. Generally, e ≤ 0.05 corresponds to very high precision, and e ≤ 0.2 to the pupil prediction being somewhere in the eye region [27]. Also here, the deep learning approaches of [23] and [24] surpass feature based methods. This means that the performance of Enable can be improved simply by implementing a newer algorithm for eye detection. There is, however, no readily implementable deep learning method for eye localization as of the time of writing.

Figure 5: FDDB discrete scores of all recently published methods. The metric is the Receiver Operating Characteristic (ROC) curve, which measures true positive rate against false positive rate. Generally a steeper slope is better, since this means better sensitivity. A perfect classifier would sit at the point (0, 1), since this would mean no false negatives and no false positives. Source: FDDB results page.

Table 1: Comparison of normalized error scores on the BioID database for state-of-the-art pupil detection algorithms. Each e column gives the percentage of predictions whose normalized error is within that bound. Consumed time is the time spent predicting the pupil position in each image.

Method Authors e ≤ 0.05 e ≤ 0.1 e ≤ 0.2 Consumed Time (ms)


Adaptive GBDT Tian et al. 2017 [23] 91.54 96.07 99.89 2.0
Modular approach Zhang et al. 2016 [21] 85.66 93.68 97.00 -
Means of gradients Timm et al. 2011 [5] 82.5 93.4 96.4 -
Improved SDM Zhou et al. 2015 [24] 93.8 99.8 99.9 2.0

Since the trade-off is acceptable (Timm et al.'s algorithm is only a few percentage points behind state-of-the-art), this will be a future improvement, and thus not implemented in this paper.
One of the most popular methods for face landmarking uses an ensemble of regression trees. Kazemi et al. [16] showed in 2014 that this method could accurately do face alignment in under one millisecond, which makes it one of the most powerful and widely used algorithms today. Other methods with extreme resilience to unfamiliar poses have also been proposed [8]. Kazemi et al.'s method is already implemented in dlib, and is therefore the preferred face alignment method for this project.

4 Redesigning the tracking model


As covered in the previous section, Enable has plenty of options for improving performance. These options include:

1. Using r-CNNs for face detection

2. Using facial landmarks instead of fixed values for determining eye regions and pupil change

3. Aligning face posture and orientation to address VOR issues

4. Smoothing pupil positions and facial landmarks using optical flow

5. Implementing Improved SDM or similar machine learning techniques for pupil localization

6. Implementing an eye close detector

7. Implementing a double eye gaze constraint for error correction

From these realizations, a redesigned tracking model is presented in figure 6. As mentioned earlier, time constraints and ease of implementation decide which points to prioritize. Points 1, 2, 4 and 6 are easily implementable. Points 3 and 7 are probably among the most important initiatives for reliable interaction, and point 4 would make the pupil tracking itself suitable for more delicate applications such as gaze point tracking.

4.1 Speeding up the tracking model


Since the application primarily will run on mobile devices, optimizing the model for speed is very important. The redesigned tracking model has already taken some action with optimized detection algorithms, but several intuitive measures can still be taken. Face detection is the most computationally expensive task, since it has to go through all pixels for reliable detection. The task, however, does not need to be done on every frame. A normal phone camera records at 30 frames per second (FPS), but head movements are normally not fast enough to change the bounding box for eye region detection between consecutive frames. A classic approach is to simply skip every second or third frame, since this will speed up the face detector two or three times. The facial landmark detector will instead use the bounding box of the face from the previous frame on every second frame.
Furthermore, the face detector does not need a high resolution image to detect faces. By reducing the resolution of the frame for the face detector, it can be significantly sped up since fewer pixels need analyzing. We can then upscale the bounding box for landmark detection in high resolution.
Figure 6: The redesigned tracking model, comprising modules for face detection (r-CNN), landmark localization (regression trees), face posture alignment, eye region extraction, eye-closed detection, pupil centre localization (Adaptive GBDT), landmark stabilization and pupil constraining (optical flow and face alignment), pupil state and gesture detection, and the resulting action (on/off). Please note orange modules will not be implemented in this report due to time constraints.

Figure 7: Measures for speeding up the detector. Each frame is converted to grayscale; when the frame count is even, the frame is downscaled for face detection (r-CNN) and the resulting bounding box is upscaled for landmark localization (regression trees); otherwise the previous face box is reused.

As a downside, smaller faces cannot be detected, but since Enable requires the user to be relatively close to the screen, this is not a critical issue.
None of the detection algorithms requires full RGB images for precise detection. Since converting frames to grayscale significantly reduces their array size, this is a very easy measure to take.
In figure 7, the measures taken for speeding up the detector are illustrated. A final measure for speeding up the detector is enabling hardware acceleration by compiling the software in release mode. This enables instruction set extensions like SSE4 [28], which significantly speed up runtime. Summing up, the measures for speeding up the detector are (a code sketch of the first three follows the list):

1. Skip every second frame for face detection


2. Downscale image for face detection
3. Convert frame to grayscale for all detectors
4. Enable hardware acceleration.
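
A minimal sketch of the first three measures in C++ with OpenCV follows; the downscale factor and the detectFaces() helper (backed here by a Haar cascade purely as a placeholder for whichever detector is used) are illustrative assumptions. The fourth measure is a build setting rather than code.

#include <opencv2/opencv.hpp>
#include <vector>

// Placeholder face detector; in the redesigned model this would be the CNN detector.
static std::vector<cv::Rect> detectFaces(const cv::Mat& grayImage) {
    static cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(grayImage, faces);
    return faces;
}

int main() {
    cv::VideoCapture cam(0);
    cv::Mat frame, gray, smallGray;
    cv::Rect lastFace;
    const double scale = 0.5;  // assumed downscale factor for the detection pass
    long frameCount = 0;

    while (cam.read(frame)) {
        // Measure 3: all detectors work on a single-channel image.
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        if (frameCount++ % 2 == 0) {
            // Measure 2: detect on a downscaled copy, then map the box back to full resolution.
            cv::resize(gray, smallGray, cv::Size(), scale, scale);
            std::vector<cv::Rect> faces = detectFaces(smallGray);
            if (!faces.empty()) {
                const cv::Rect& f = faces[0];
                lastFace = cv::Rect(int(f.x / scale), int(f.y / scale),
                                    int(f.width / scale), int(f.height / scale));
            }
        }
        // Measure 1: every other frame reuses the previous face box for the landmark detector.
        if (lastFace.area() > 0) {
            // ... run landmark localization and pupil detection inside gray(lastFace) ...
        }
    }
    return 0;
}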

5 Implementation
To compare the redesigned tracking model with the original, the models and methods mentioned earlier are now implemented in the application. To test the improvements, the model is evaluated on two parameters: accuracy and computational performance. The first parameter can be tested using the BioID database and compared to the original model performance and state-of-the-art. The second parameter can be tested by obtaining the average framerate on a pre-recorded video.

5.1 Comparison with state-of-the-art


The BioID database, as introduced earlier, contains 1521 annotated grayscale images. It is generally considered one of the most challenging datasets due to its changing environmental factors. The metric for comparison is given by:

e = \frac{\max\{D_L, D_R\}}{D}    (1)

where D_L and D_R represent the Euclidean distances between the measured pupil positions and the ground truth pupil locations for the left and right eye respectively, and D denotes the Euclidean distance between the two ground truth pupil locations:

D = \sqrt{(Le_x - Re_x)^2 + (Le_y - Re_y)^2}    (2)

where Le and Re are the ground truth left and right pupil positions as 2D points. The R code used for this benchmarking is listed in Appendix A.
As formal research papers usually only test the pupil detection algorithm on the known eye region, frames where the face is not detected are discarded in this test. As the face detection algorithm of dlib uses image pyramids to detect differently scaled faces, the number of pyramid levels has to be fitted to the small resolution of BioID. Two configurations, with 2 and 5 pyramid levels respectively, were tested; note that processing performance varies accordingly.

Table 2: Comparison of normalized error scores on the BioID database for the redesigned tracking model and the most relevant published methods. Numbers marked in bold correspond to the best performance in the given column. There is a significant improvement in performance in the redesigned model compared to the original model; however, it does not quite reach the performance of [5] in spite of using the same algorithm.

Method Authors e ≤ 0.05 e ≤ 0.1 e ≤ 0.2 Not detected

Improved SDM Zhou et al. 2015 [24] 93.8 99.8 99.9 -
Means of gradients Timm et al. 2011 [5] 82.5 93.4 96.4 -
Enable previous - 60.62 78.00 79.83 48 (3.2%)
Enable redesigned 2 pyramids - 70.68 91.41 96.80 427 (28.1%)
Enable redesigned 5 pyramids - 71.14 90.33 96.38 1 (0.07%)

Figure 8: Tracking in the benchmarking video file with the old (a) and new (b) tracking model. Note how the eye regions are much tighter in the redesigned model because of facial landmarking.

Likewise, the original Haar-based detector needs to be tweaked to detect smaller faces. I used a minimum size of 15 px and a maximum size of 50 px.
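
For the Haar-based detector this tweak amounts to the minSize and maxSize arguments of detectMultiScale; a small sketch with the sizes stated above (the scale factor and neighbour count are OpenCV's defaults, and the file names are examples, not values from this report):

#include <opencv2/objdetect.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");
    // BioID images are 384x286 greyscale.
    cv::Mat img = cv::imread("BioID_0000.pgm", cv::IMREAD_GRAYSCALE);

    std::vector<cv::Rect> faces;
    // Restrict the detector to faces between 15 px and 50 px.
    cascade.detectMultiScale(img, faces, 1.1, 3, 0, cv::Size(15, 15), cv::Size(50, 50));
    return faces.empty() ? 1 : 0;
}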
The results on the BioID database are presented in table 2. As expected, the redesigned model can theoretically not surpass that of [5], since they use the same pupil detection algorithm. Compared with the previous tracking model, however, the performance has increased significantly. At e ≤ 0.2 the performance is very close to that of [5], which means the new face detection algorithm is almost infallible. However, the redesigned model does not come close to the same performance at e ≤ 0.05 as [5], which is strange considering both methods use the same algorithm. This may be due to some parameter values differing in the detection algorithm, but should definitely be investigated.
Furthermore, only one face is not detected when using 5 pyramids, which is excellent considering the challenging nature of BioID. This means that the next major steps for increasing the accuracy are to find out what decreases the performance at e ≤ 0.05, and afterwards to implement the Improved SDM or similar for pupil localization.

5.2 Processing performance of new model


A metric for processing performance is the average FPS the model can perform on a given video
sequence. It is important that both tested algorithms can actually detect all faces and pupil centres, as these are the most computationally expensive tasks.

Method Framerate (FPS)
Enable original 20
Enable redesigned 13

Table 3: Framerates for the two iterations of Enable's tracking model on a test video. There is a significant decrease compared to the previous model.

A video file of 1280x720 px resolution and 20 second duration is loaded into each iteration of Enable. The framerate is calculated by subtracting the start time from the timestamp of each frame to obtain the elapsed time ∆t, dividing the number of frames by ∆t, and then averaging all measured FPS values. Practically this is done using the built-in tick counter of OpenCV.

FPS = \frac{N_{frames}}{\Delta t}    (3)
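
A minimal sketch of this measurement using OpenCV's tick counter; processFrame() is a hypothetical stand-in for one pass of the tracking pipeline, and the file name of the test clip is an assumption.

#include <opencv2/opencv.hpp>
#include <cstdio>
#include <cstdint>

// Hypothetical stand-in for one pass of the tracking pipeline.
static void processFrame(const cv::Mat& frame) { (void)frame; }

int main() {
    cv::VideoCapture video("benchmark_1280x720.mp4");  // assumed name of the 20 s test clip
    cv::Mat frame;
    long nFrames = 0;
    int64_t start = cv::getTickCount();

    while (video.read(frame)) {
        processFrame(frame);
        ++nFrames;
    }

    // FPS = N_frames / delta-t, as in equation (3).
    double elapsedSec = (cv::getTickCount() - start) / cv::getTickFrequency();
    std::printf("Average FPS: %.1f\n", nFrames / elapsedSec);
    return 0;
}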
Applying this procedure results in the measured FPS for the redesigned tracking model presented in table 3. This is a significant trade-off in speed, and is caused by the CNN face detector. The model can still be tweaked to improve speed slightly; however, it does not seem possible to achieve the desired 30 FPS using this algorithm. Using CNNs is nonetheless a more future proof solution than classifiers, and much more efficient implementations already exist [29]. Furthermore, since Enable is designed for mobile implementation, face detection performance may not be a problem due to built-in solutions and API calls on mobile platforms.

6 Conclusion
In this paper, a thorough search of methods, models, and algorithms has resulted in a series of proposals for optimizing Enable. The proposals include methods for improving accuracy, computational performance and ease of implementation. Some of these methods have been implemented, including CNN face detection, facial landmarking, smoothing pupil positions using optical flow, and a series of measures for speeding up the detector. This has resulted in a significant improvement in accuracy, with relatively high scores on the BioID normalized-error benchmark. The CNN face detection algorithm is almost infallible, and detected all but one face in the challenging BioID database. This stability comes at a cost of processing performance, and a significant drop in FPS. Some implementations were not tested in this paper. The facial landmarking can be used for improved gesture detection, and give more precise pupil coordinates in the eye. This is relevant to test since it will impact the ease of interaction.
There are still plenty of methods for optimization, like blink detection and gaze constraints, that are not yet implemented. These implementations will further stabilize Enable, which is certainly needed before it can be used in an everyday environment. There is still quite a way to go to reach state-of-the-art performance, but the few implementations done in this paper show promising results, indicating that it is possible in the near future.

References
[1] Heiko Drewes. Eye Gaze Tracking. Interactive Displays: Natural Human-Interface Technologies,
pages 251–283, 2014.
[2] Robert J K Jacob. Eye Tracking in Advanced Interface Design. Virtual Environments and
Advanced Interface Design, pages 258–290, 1995.
[3] Agam Shah. Inside Samsung Galaxy S4’s face and eye-tracking technology, 2013.

[4] Paul Viola and Michael J Jones. Robust Real-Time Face Detection. International Journal of
Computer Vision, 57(2):137–154, 5 2004.
[5] Fabian Timm and Erhardt Barth. Accurate Eye Centre Localisation By Means of Gradients.
Proceedings of the International Conference on Computer Vision Theory and Applications, 1:125–
130, 2011.

[6] Amer Al-Rahayfeh and Miad Faezipour. Eye Tracking and Head Movement Detection: A State-of-
Art Survey. IEEE Journal of Translational Engineering in Health and Medicine, 1(July):2100212–
2100212, 2013.
[7] N Ismail and M I M Sabri. Review of existing algorithms for face detection and recognition.
Proceedings of the 8th WSEAS International . . . , 2009.
[8] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D Face
Alignment problem? (and a dataset of 230,000 3D facial landmarks). 2017.

[9] Tomasz Kocejko, Adam Bujnowski, and Jerzy Wtorek. Eye mouse for disabled. 2008 Conference
on Human System Interaction, HSI 2008, (December 2014):199–202, 2008.
[10] Zhiwei Zhu and Qiang Ji. Novel Eye Gaze Tracking Techniques Under Natural Head Movement.
IEEE Transactions on Biomedical Engineering, 54(12):2246–2260, 12 2007.

[11] Xudong Sun, Pengcheng Wu, and Steven C. H. Hoi. Face Detection using Deep Learning: An
Improved Faster RCNN Approach. 2017.
[12] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(6):1137–1149, 2017.
[14] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-Aware Face Detec-
tion. 2017.
[15] Yunzhu Li, Benyuan Sun, Tianfu Wu, and Yizhou Wang. Face detection with end-to-end integra-
tion of a convNet and a 3D model. Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9907 LNCS:420–436, 2016.
[16] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of
regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages
1867–1874. IEEE, 6 2014.
[17] E. T. Hall. The Hidden Dimension. Anchor Books, Garden City, 1st edition, 1966.
[18] Karlene Nguyen, Cindy Wagner, David Koons, and Myron Flickner. Differences in the infrared
bright pupil response of human eyes. Proceedings of the symposium on Eye tracking research &
applications - ETRA ’02, page 133, 2002.
[19] Su Yeong Gwon, Chul Woo Cho, Hyeon Chang Lee, Won Oh Lee, and Kang Ryoung Park.
Robust eye and pupil detection method for gaze tracking. International Journal of Advanced
Robotic Systems, 10, 2013.

[20] Dan Witzner Hansen and Arthur E.C. Pece. Eye tracking in the wild. Computer Vision and
Image Understanding, 98(1):155–181, 2005.
[21] Wenhao Zhang, Melvyn Lionel Smith, Lyndon Neal Smith, and Abdul Rehman Farooq. Eye
centre localisation: an unsupervised modular approach. Sensor Review, 36(3):277–286, 2016.

[22] Sebastian Vater and Fernando Puente Leon. Combining isophote and cascade classifier informa-
tion for precise pupil localization. In 2016 IEEE International Conference on Image Processing
(ICIP), pages 589–593. IEEE, 9 2016.
[23] Dong Tian, Guanghui He, Jiaxiang Wu, Hongtao Chen, and Yong Jiang. An accurate eye pupil
localization approach based on adaptive gradient boosting decision tree. VCIP 2016 - 30th
Anniversary of Visual Communication and Image Processing, pages 0–3, 2017.

[24] Mingcai Zhou, Xiying Wang, Haitao Wang, Jingu Heo, and DongKyung Nam. Precise eye lo-
calization with improved SDM. In 2015 IEEE International Conference on Image Processing
(ICIP), pages 4466–4470. IEEE, 9 2015.
[25] Congyi Wang, Fuhao Shi, Shihong Xia, and Jinxiang Chai. Realtime 3D eye gaze animation
using a single RGB camera. ACM Transactions on Graphics, 35(4):1–14, 2016.
[26] V. Jain and E. Learned-Miller. FDDB: A Benchmark for Face Detection in Unconstrained Settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.

[27] Amine Kacete, Jerome Royan, Renaud Seguier, Michel Collobert, and Catherine Soladie. Real-
time eye pupil localization using Hough regression forest. 2016 IEEE Winter Conference on
Applications of Computer Vision, WACV 2016, 2016.
[28] Yongnian Le. Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4), 2008.
[29] Ilya Kalinovskii and Vladimir Spitsyn. Compact Convolutional Neural Network Cascade for Face
Detection. 8 2015.

7 Appendix A: R-studio code for benchmarking with BioID
library(ggplot2)
library(plotly)
library(reshape2)
library(plyr)
library(devtools)
library(dplyr)

#import new values


setwd("/Users/eliaspedersen/eyeLike-bioid/build/BioId-FD-pyramid5")
file_list <- list.files()

for (file in file_list){

  # if the merged measure doesn't exist, create it
  if (!exists("measure")){
    measure <- read.table(file, header=FALSE, sep=" ")
  } else {
    # if the merged measure does exist, append to it
    temp_measure <- read.table(file, header=FALSE, sep=" ")
    measure <- rbind(measure, temp_measure)
    rm(temp_measure)
  }
}

#import ground truth


setwd("/Users/eliaspedersen/eyeLike-bioid/build/BioID-FD-Eyepos-V1")
file_list <- list.files()

for (file in file_list){

  # if the merged ground_truth doesn't exist, create it
  if (!exists("ground_truth")){
    ground_truth <- read.table(file, header=FALSE, sep="\t")
  } else {
    # if the merged ground_truth does exist, append to it
    temp_ground_truth <- read.table(file, header=FALSE, sep="\t")
    ground_truth <- rbind(ground_truth, temp_ground_truth)
    rm(temp_ground_truth)
  }
}

#import markup points


setwd("/Users/eliaspedersen/eyeLike-bioid/build/points_20")
file_list <- list.files()

for (file in file_list){

  # if the merged pts doesn't exist, create it
  if (!exists("pts")){
    pts <- read.table(file, header=FALSE, skip = 11, nrows = 4, sep=" ")
    pts <- c(t(pts))
  } else {
    # if the merged pts does exist, append to it
    temp_pts <- read.table(file, header=FALSE, skip = 11, nrows = 4, sep=" ")
    temp_pts <- c(t(temp_pts))
    pts <- rbind(pts, temp_pts)
    rm(temp_pts)
  }
}

#Remove non applicable (NaN) measurements


row_sub = apply(measure, 1, function(row) all(row!=0))
measure <- measure[row_sub,]
ground_truth <- ground_truth[row_sub,]

#Find Euclidean distances

DL <- sqrt((measure$V1-ground_truth$V1)^2 + (measure$V2-ground_truth$V2)^2)
DR <- sqrt((measure$V3-ground_truth$V3)^2 + (measure$V4-ground_truth$V4)^2)

abs_ground_truth <- sqrt((ground_truth$V1-ground_truth$V3)^2 +
                         (ground_truth$V2-ground_truth$V4)^2)

#combine in data frame


DLDR <- cbind(DL, DR)
#Take maximum value
max_value<-apply(DLDR, 1, max)

#divide by ground truth Euclidean distance


distribution <- max_value/abs_ground_truth

#Find percentage of entries with precision


e005 <- sum(distribution <= 0.05)/length(distribution)
e01 <- sum(distribution <= 0.1)/length(distribution)
e015 <- sum(distribution <= 0.15)/length(distribution)
e02 <- sum(distribution <= 0.2)/length(distribution)
