Mobile Eye Tracking
Project course
DTU Compute
By: Elias Lundgaard Pedersen (s143969)
Supervisor: Per Bækgaard
2 Introduction 2
2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Enable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Current tracking model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Challenges in pupil detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 External parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Face detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.3 Face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.4 Pupil localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.5 Errors and noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Related work 6
3.1 Models for pupil localization and tracking . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Face detection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Algorithms for face alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Methods for pupil detection and tracking . . . . . . . . . . . . . . . . . . . . . 7
3.1.4 Failure detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 State-of-the-art model performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5 Implementation 11
5.1 Comparison with state-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 Processing performance of new model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6 Conclusion 13
References 13
Mobile eye tracking
Elias Lundgaard
January 2018
1 Abstract
Enable, an eye tracking application designed for disabled people, is constrained to use no extra
hardware and to run on a mobile device. Eye tracking in the wild is an especially challenging task in
computer vision, and requires sophisticated algorithms and tracking models for stable use. In this
paper, some major challenges in Enable's current tracking model are identified. A series of
state-of-the-art solutions for an improved eye tracking model, addressing the identified challenges,
is then researched. The methods are evaluated on three parameters: accuracy, computational
performance, and ease of implementation, and are used to redesign Enable's tracking model.
Implementing r-CNNs for face detection, facial landmarking, and several performance measures
results in a significant increase in pupil detection accuracy (a score of 96.38 for e ≤ 0.2 on the
BioID database) and tighter bounding boxes for the eye regions, at the cost of a 7 FPS reduction in
processing performance. The paper concludes that implementing the remaining redesigns can
potentially bring Enable to state-of-the-art performance and stable in-the-wild use on a mobile device.
2 Introduction
Eye tracking on mobile phones without additional hardware has previously been extremely challenging.
This is mostly due to the demands of eye gaze tracking, which so far has required several known
landmarks in the eye, such as corneal reflections [1], typically obtained using infrared light-emitting
diodes (IREDs) placed in front of the user. Since gaze-based tracking systems are often used for
complete UI interaction, stereoscopic cameras are used to model the eye in 3D. Furthermore, the
anatomical accuracy of eye gaze is 2-4° [2], so video-based eye tracking has limited precision.
Calibration is another issue with eye gaze tracking; it is often required multiple times during use, to
the great annoyance of the user. A few companies, such as Samsung [3], have attempted adding
software-based eye tracking features such as automatic scrolling to their phones, but these have been
regarded as failures and never widely adopted. To avoid these issues, Enable, an eye tracking
application designed for disabled people, uses gaze gestures instead of gaze points. This makes it
possible to run on mobile devices as a pure software solution, which is desirable in terms of
scalability. The device and environmental parameters do, however, pose some challenging problems
for the tracking model, which this paper addresses.
Figure 1: The Enable proof of concept application. In the debug window, the left pupil x-coordinates are logged. If
the change in this value falls below or above a certain threshold value, a corresponding gesture is logged, in this case
LEFT. The pupil is located by detecting the face (middle) and eye regions (right).
[Pipeline diagram: webcam → face detection (Haar cascades) → eye region + center detection (constants, gradients) → pupil detection → pupil state (threshold values) → action (on/off)]
2.2 Enable
Enable is a platform for controlling smart home devices with eye gesture interaction as the primary
input. The platform runs locally on a smartphone in the form of an application, and requires no
hardware beyond what is already integrated in the phone. Enable targets motor-impaired audiences,
with the purpose of assisting them in independently interacting with their physical environment. To
evade the many issues of using eye gaze, Enable adopts pupil gesture interaction as its input
modality [1]. By "swiping" from left to right or bottom to top, the user can control a limited
number of objects on the smartphone screen. A video demonstrating the concept can be seen here,
or pictured in figure 1. This method is preferred because it addresses almost all of the issues with
traditional eye tracking: it doesn't require high-precision pupil localization, and therefore needs
neither IREDs nor stereoscopic cameras. Because the point of gaze isn't required, neither is
calibration, and the solution works out of the box. This makes it well suited for mobile interaction
with low-quality cameras and changing environmental factors.
• Face detection is performed using the Viola-Jones (VJ) object detection framework [4]. The
VJ framework uses Haar-like features in a cascaded Adaboost classifier. The Haar-like features
used in Enable are the ones of OpenCV 3.3 for frontal face detection.
• Eye cropping is found by using mean values of eye localization in a front-facing face [5].
• Pupil center localization is performed on the cropped eye region image. The localization
algorithm used is Fabian Timm et al.'s means of gradients [5], which uses intersecting gradient
vectors to determine the pupil center location.
Figure 3: The image quality changes depending on the environment. In (a) the face has varying brightness due to
shadows, in (b) the face is darkened by strong backlight, and in (c) a lack of lighting makes the face appear dark
and full of noise.
• Pupil state detection is done using constant threshold values. If the threshold in the X
or Y direction is exceeded, the pupil state is set to LEFT (L), RIGHT (R), UP (U) or DOWN
(D). If the sequence of states matches a predetermined gesture, e.g. LRLR, the corresponding
action is executed.
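The state-and-gesture step above can be sketched as follows. The threshold values and the gesture table are illustrative placeholders, not Enable's actual constants:

```python
# Sketch of threshold-based pupil state and gesture detection.
# The thresholds and the gesture table are illustrative placeholders,
# not the constants used in Enable itself.

X_THRESHOLD = 5.0  # px; horizontal displacement that triggers L/R
Y_THRESHOLD = 5.0  # px; vertical displacement that triggers U/D

GESTURES = {
    ("L", "R", "L", "R"): "toggle_light",  # e.g. the LRLR gesture
}

def pupil_state(prev, cur):
    """Classify the pupil movement between two frames as L/R/U/D or None."""
    dx = cur[0] - prev[0]
    dy = cur[1] - prev[1]
    if abs(dx) >= abs(dy):
        if dx <= -X_THRESHOLD:
            return "L"
        if dx >= X_THRESHOLD:
            return "R"
    else:
        if dy <= -Y_THRESHOLD:
            return "U"
        if dy >= Y_THRESHOLD:
            return "D"
    return None

def match_gesture(states):
    """Return the action for the most recent state sequence, if any."""
    for pattern, action in GESTURES.items():
        if tuple(states[-len(pattern):]) == pattern:
            return action
    return None

# A left-right-left-right swipe expressed as pupil x-coordinates:
positions = [(50, 30), (40, 30), (52, 30), (41, 30), (53, 30)]
states = []
for prev, cur in zip(positions, positions[1:]):
    s = pupil_state(prev, cur)
    if s:
        states.append(s)

print(states)                 # ['L', 'R', 'L', 'R']
print(match_gesture(states))  # toggle_light
```

A production version would also debounce the states and reset the sequence after a timeout, but the threshold-and-match structure is the same.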
The current Enable eye tracking model is primarily based on readily available and easily implementable
existing software libraries. There is therefore plenty of room for improvement.
Figure 4: Failure cases, including (a) one closed eye, (b) two closed eyes, and (c) an angled face.
Enable currently uses the bounding box of the eye regions for locating the pupil. In the case of
Enable, this box is derived from mean geometric values. This method, however, is very imprecise
due to changing face sizes and varying facial features between different users. By using landmarks,
the pupil localization could be improved substantially.
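As a sketch of the landmark-based alternative, the eye bounding box can be taken directly from the landmark coordinates. The six-point eye contour and the margin below are assumptions for illustration, following the common 68-landmark scheme:

```python
import numpy as np

def eye_bounding_box(landmarks, margin=0.2):
    """Bounding box (x, y, w, h) around a set of eye landmark points.

    `landmarks` is an (N, 2) array of eye-contour points, e.g. the six
    points per eye in the 68-landmark scheme. `margin` pads the box
    relative to its own size so the pupil is never clipped.
    """
    pts = np.asarray(landmarks, dtype=float)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    w, h = x1 - x0, y1 - y0
    pad_x, pad_y = margin * w, margin * h
    return (x0 - pad_x, y0 - pad_y, w + 2 * pad_x, h + 2 * pad_y)

# Six illustrative eye-contour points:
eye = [(36, 40), (39, 38), (42, 38), (45, 40), (42, 42), (39, 42)]
print(eye_bounding_box(eye, margin=0.0))  # (36.0, 38.0, 9.0, 4.0)
```

Because the box follows the detected contour, it adapts to face size and individual facial features, unlike the fixed geometric means.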
• False positive pupil locations on one or two closed eyelids (figures 4a and 4b).
• Spikes in pupil positions in some frames. This happens because the pupil is located anew in
every frame, and the position is not correlated with previous frames.
• Different pupil position values for the two eyes. Theoretically, the pupils should be located in the
same place in both eyes under normal circumstances.
• No adjustment to face orientation. The software assumes that the face is always front-facing and
horizontally aligned (figures 4c and 4f).
• Unstable measurements during normal use.
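The spike issue in particular stems from locating the pupil independently in every frame. A minimal sketch of a remedy, correlating each measurement with the previous estimate (the constants are illustrative, not tuned values):

```python
class PupilSmoother:
    """Exponentially smooth pupil positions and reject single-frame spikes.

    alpha controls responsiveness; max_jump (px) is the largest per-frame
    movement accepted as genuine. Both values are illustrative.
    """
    def __init__(self, alpha=0.5, max_jump=20.0):
        self.alpha = alpha
        self.max_jump = max_jump
        self.state = None

    def update(self, x, y):
        if self.state is None:
            self.state = (x, y)
            return self.state
        px, py = self.state
        if abs(x - px) > self.max_jump or abs(y - py) > self.max_jump:
            return self.state  # ignore the spike, keep the previous estimate
        self.state = (px + self.alpha * (x - px),
                      py + self.alpha * (y - py))
        return self.state

s = PupilSmoother(alpha=0.5, max_jump=20.0)
print(s.update(100, 50))  # (100, 50)   first frame
print(s.update(104, 50))  # (102.0, 50.0)
print(s.update(300, 50))  # (102.0, 50.0)  spike rejected
```

Note that simply holding the previous estimate on a spike would also suppress fast genuine gestures; a fuller solution would use optical flow or a Kalman filter, as discussed later.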
3 Related work
There are, without doubt, plenty of challenges to attend to. Luckily, recent advancements in computer
vision, and especially in artificial intelligence (AI), have made previously complex and inefficient
solutions much easier and more lightweight.
3.1.3 Methods for pupil detection and tracking
Eye tracking has been a popular research topic since the middle of the 20th century [17]. As a
result, a very large number of methods for computer-based pupil tracking exist. Since Enable
cannot make use of external light sources such as IREDs, only methods using passive light and mono
cameras are covered here. As with face detection, classifiers such as SVMs can also be used for pupil
detection [18] [19].
Feature-based methods use the color distribution in the eye region. Enable uses this approach through
means of gradients [5]. This method works well under good lighting conditions, but is susceptible to
changing light and image focus. Hansen et al. propose a model using a marginalized contour model
for iris tracking [20], but this approach assumes that the user does not move his or her head. Some
models combine different approaches to improve performance [21]. Vater et al. [22] combine isophote
features with a cascade classifier to reach high accuracy.
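For reference, the means-of-gradients objective can be sketched in a few lines of numpy. This is a brute-force illustration of the idea in [5], not the optimized implementation Enable uses: the pupil center is the point whose displacement vectors to all significant gradient locations best align with the gradients themselves.

```python
import numpy as np

def means_of_gradients(gray):
    """Brute-force sketch of the means-of-gradients objective [5].

    For every candidate centre c, sum the squared dot products between the
    unit displacement vectors d_i = (x_i - c)/|x_i - c| and the unit image
    gradients g_i; the pupil centre maximises this sum. A real
    implementation restricts the candidates and vectorises the search.
    """
    gy, gx = np.gradient(gray.astype(float))  # gradients along rows, cols
    mag = np.hypot(gx, gy)
    mask = mag > 0.3 * mag.max()              # keep only significant gradients
    ys, xs = np.nonzero(mask)
    gxn, gyn = gx[mask] / mag[mask], gy[mask] / mag[mask]

    h, w = gray.shape
    best, best_c = -1.0, (0, 0)
    for cy in range(h):
        for cx in range(w):
            dx, dy = xs - cx, ys - cy
            norm = np.hypot(dx, dy)
            norm[norm == 0] = 1.0             # the centre pixel contributes 0
            dot = (dx / norm) * gxn + (dy / norm) * gyn
            score = np.mean(dot ** 2)
            if score > best:
                best, best_c = score, (cx, cy)
    return best_c

# Synthetic eye: a dark disc (the pupil) on a bright background.
img = np.full((21, 21), 200.0)
yy, xx = np.mgrid[:21, :21]
img[(xx - 13) ** 2 + (yy - 8) ** 2 <= 16] = 30.0

print(means_of_gradients(img))  # pupil centre estimate, near (13, 8)
```

On the circular disc the displacement vectors from the true center are parallel to the radial gradients, which is exactly why the objective peaks there; on real eye images the dark iris/pupil boundary plays that role.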
As with most other object detection tasks, deep learning and AI have shown themselves to be a potential
replacement for these algorithms. Tian et al. [23] propose using adaptive gradient boosting and
outperform most state-of-the-art methods in terms of accuracy and time. The best performing
algorithm is the improved supervised descent method (SDM) proposed by Zhou et al. [24].
Figure 5: FDDB discrete scores of all recently published methods. The metric is the receiver
operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. Generally, a
steeper slope is better, since this means better sensitivity. A perfect classifier would sit at the point (0, 1), since this
would mean no false negatives and no false positives. Source: FDDB results page.
Table 1: Comparison of normalized error scores on the BioID database for state-of-the-art pupil detection algorithms. e
denotes the error metric: the percentage of predictions within a certain range. Consumed time is the time
spent predicting the pupil position in each image.

Since the trade-off is acceptable (Timm et al.'s algorithm is only a few percentage points behind the state
of the art), this is left as a future improvement and is not implemented in this paper.
One of the most popular face landmarking methods uses an ensemble of regression trees.
Kazemi et al. [16] showed in 2014 that this method can accurately perform face alignment in under one
millisecond, which makes it one of the most powerful and widely used algorithms today. Other methods
with extreme resilience to unfamiliar poses have been proposed [8]. Kazemi et al.'s method is already
implemented in dlib, and is therefore the preferred face alignment method for this project.
From these findings, a redesigned tracking model is presented in figure 6. As mentioned
earlier, time constraints and ease of implementation decide which points to prioritize. Points 1, 2,
4 and 6 are easily implementable. Points 3 and 7 are probably the most important initiatives
for reliable interaction, and point 4 would make the pupil tracking itself suitable for more delicate
applications such as gaze point tracking.
Figure 6: The redesigned tracking model (flowchart nodes: get frame, face detection (r-CNN), face
detected?, first frame?, locate landmarks, align face posture, extract eye regions, pupil state, action
(on/off)). Please note that the orange modules will not be implemented in this report due to time
constraints.
Figure 7: Measures for speeding up the detector (flowchart nodes: get frame, face detection (r-CNN),
locate landmarks (regression trees), use previous face box).
As a downside, smaller faces cannot be detected, but since Enable requires the user to be relatively close
to the screen, this is not a critical issue.
None of the detection algorithms requires full RGB images for precise detection. Since converting
frames to grayscale reduces the array size to a third, this is a very easy measure to take.
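A quick numpy sketch illustrates the size reduction. The BT.601 luminance weights below follow the convention OpenCV uses for grayscale conversion, and the random frame is a stand-in for a webcam frame:

```python
import numpy as np

# A synthetic 720p RGB frame standing in for a webcam frame.
frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)

# BT.601 luminance weights, the same convention OpenCV's cvtColor uses.
gray = (0.299 * frame[..., 0] +
        0.587 * frame[..., 1] +
        0.114 * frame[..., 2]).astype(np.uint8)

print(frame.nbytes // gray.nbytes)  # 3: one byte per pixel instead of three
```

In practice the conversion would of course be done by the vision library itself, but the factor-three reduction in data to scan is the point.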
Figure 7 illustrates the measures taken for speeding up the detector. A final measure for
speeding up the detector is enabling hardware acceleration by compiling the software in release mode.
This enables instruction set extensions like SSE4 [28], which significantly speed up runtime. Summing up,
the measures for speeding up the detector are: reusing the previous face box to skip redundant
detections, converting frames to grayscale, and compiling in release mode with hardware acceleration
enabled.
5 Implementation
To compare the redesigned tracking model with the original, the models and methods mentioned
earlier are now implemented in the application. The improvements are evaluated on two parameters:
accuracy and computational performance. The first is tested on the BioID database and compared
with the original model and the state of the art; the second is tested by measuring the average
framerate on a pre-recorded video.
e = max{D_L, D_R} / D (1)
where D_L and D_R represent the Euclidean distances between the measured pupil positions and the
ground truth pupil locations for the left and right eye respectively, and D denotes the Euclidean
distance between the ground truth pupil locations:

D = sqrt((Le_x − Re_x)^2 + (Le_y − Re_y)^2) (2)
where Le and Re are the left and right ground truth pupil positions as 2D points. As formal
research papers usually only test the pupil detection algorithm on the known eye region, frames where
the face is not detected are discarded in this test. As the face detection algorithm of dlib uses image
pyramids to detect faces at different scales, the number of pyramids has to be fitted to the small
resolution of BioID. Using respectively 2 and 5 pyramids, it is important to note that processing performance
Table 2: Comparison of normalized error scores on the BioID database for the redesigned tracking model and the most
relevant published methods. Numbers marked in bold correspond to the best performance in the given column. There is
a significant improvement in performance in the redesigned model compared to the original model. However, it doesn't
quite reach the performance of [5], in spite of using the same algorithm.
Figure 8: Tracking in the benchmarking video file with the old (a) and new (b) tracking model. Note how the eye regions
are much tighter in the redesigned model because of the facial landmarking.
varies accordingly. Likewise, the original Haar-based detector needs to be tweaked to detect smaller faces; I
used a minimum size of 15 px and a maximum size of 50 px.
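Equations (1) and (2) translate directly into code. A small sketch of the per-image metric follows; the sample coordinates are made up for illustration:

```python
import numpy as np

def normalized_error(pred_left, pred_right, gt_left, gt_right):
    """Normalized pupil detection error e = max(dL, dR) / d.

    dL and dR are the distances from the predictions to the ground truth
    positions, and d is the inter-pupil ground-truth distance,
    as in equations (1) and (2).
    """
    gt_l = np.asarray(gt_left, dtype=float)
    gt_r = np.asarray(gt_right, dtype=float)
    d_l = np.linalg.norm(np.asarray(pred_left, dtype=float) - gt_l)
    d_r = np.linalg.norm(np.asarray(pred_right, dtype=float) - gt_r)
    return max(d_l, d_r) / np.linalg.norm(gt_l - gt_r)

# Ground-truth pupils 100 px apart, left prediction off by 5 px:
e = normalized_error((105, 50), (200, 50), (100, 50), (200, 50))
print(e)  # 0.05, i.e. just within the strict e <= 0.05 criterion
```

Normalizing by the inter-pupil distance is what makes the scores comparable across the differently sized faces in BioID.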
The results on the BioID database are presented in table 2. As expected, the redesigned model cannot
theoretically surpass [5], since they use the same pupil detection algorithm. Compared with the
previous tracking model, however, the performance has increased significantly. At e ≤ 0.2 the
performance is infinitesimally close to that of [5], which means the new face detection algorithm is
almost infallible. However, the redesigned model does not come close to the performance of [5] at
e ≤ 0.05, which is strange considering both methods use the same algorithm. This may be due to
differing parameter values in the detection algorithm, and should definitely be investigated.
Furthermore, only one face is not detected when using 5 pyramids, which is excellent considering
the challenging nature of BioID. The next major step for increasing accuracy is therefore to find out
what decreases the performance at e ≤ 0.05, and afterwards to implement the improved SDM or a
similar method for pupil localization.
Method               Framerate (FPS)
Enable original      20
Enable redesigned    13

Table 3: Framerates for the two iterations of Enable's tracking model on a test video. There is a significant decrease
compared to the previous model.
as these are the most computationally expensive tasks. A video file of 1280x720 px resolution and 20
seconds duration is loaded into each iteration of Enable. The framerate is calculated by subtracting
the start time from the timestamp of each frame to obtain the elapsed time ∆t [ms], dividing the
number of frames by it, and averaging all measured FPS values. In practice this is done using the
built-in tick counter of OpenCV.

FPS = N_frames / ∆t (3)
The measured FPS for the redesigned tracking model is presented in table 3. The drop in speed is a
significant trade-off, and is caused by the CNN face detector. The model can still be tweaked to
improve performance slightly; however, it does not seem possible to achieve the desired 30 FPS using
this algorithm. Still, using CNNs is definitely a more future-proof solution than cascade classifiers,
and much more efficient implementations already exist [29]. Furthermore, since Enable is designed
for mobile implementation, face detection performance may not be a problem, due to built-in
solutions and API calls on mobile platforms.
6 Conclusion
In this paper, a thorough search of methods, models, and algorithms has resulted in a series of
proposals for optimizing Enable. The proposals include methods for improving accuracy, computational
performance, and ease of implementation. Some of these methods have been implemented, including
CNN face detection, facial landmarking, smoothing of pupil positions using optical flow, and a series of
measures for speeding up the detector. This has resulted in a significant improvement in accuracy,
with relatively high scores on the normalized error metric. The CNN face detection algorithm is
almost infallible, detecting all but one face in the challenging BioID database. This stability comes
at the cost of processing performance, with a significant drop in FPS. Some implementations were
not tested in this paper. The facial landmarking can be used for improved gesture detection and
more precise pupil coordinates in the eye; this is relevant to test since it will impact the ease of
interaction.
There are still plenty of optimization methods, like blink detection and gaze constraints, that are
not yet implemented. These will further stabilize Enable, which is certainly needed before it can be
used in an everyday environment. There is still quite a way to go to reach state-of-the-art
performance, but the few implementations done in this paper show promising signs that this is
possible in the near future.
References
[1] Heiko Drewes. Eye Gaze Tracking. Interactive Displays: Natural Human-Interface Technologies,
pages 251–283, 2014.
[2] Robert J K Jacob. Eye Tracking in Advanced Interface Design. Virtual Environments and
Advanced Interface Design, pages 258–290, 1995.
[3] Agam Shah. Inside Samsung Galaxy S4’s face and eye-tracking technology, 2013.
[4] Paul Viola and Michael J Jones. Robust Real-Time Face Detection. International Journal of
Computer Vision, 57(2):137–154, 5 2004.
[5] Fabian Timm and Erhardt Barth. Accurate Eye Centre Localisation By Means of Gradients.
Proceedings of the International Conference on Computer Vision Theory and Applications, 1:125–
130, 2011.
[6] Amer Al-Rahayfeh and Miad Faezipour. Eye Tracking and Head Movement Detection: A State-of-
Art Survey. IEEE Journal of Translational Engineering in Health and Medicine, 1(July):2100212–
2100212, 2013.
[7] N Ismail and M I M Sabri. Review of existing algorithms for face detection and recognition.
Proceedings of the 8th WSEAS International . . . , 2009.
[8] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D Face
Alignment problem? (and a dataset of 230,000 3D facial landmarks). 2017.
[9] Tomasz Kocejko, Adam Bujnowski, and Jerzy Wtorek. Eye mouse for disabled. 2008 Conference
on Human System Interaction, HSI 2008, (December 2014):199–202, 2008.
[10] Zhiwei Zhu and Qiang Ji. Novel Eye Gaze Tracking Techniques Under Natural Head Movement.
IEEE Transactions on Biomedical Engineering, 54(12):2246–2260, 12 2007.
[11] Xudong Sun, Pengcheng Wu, and Steven C. H. Hoi. Face Detection using Deep Learning: An
Improved Faster RCNN Approach. 2017.
[12] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we Need
Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine
Learning Research, 15:3133–3181, 2014.
[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(6):1137–1149, 2017.
[14] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-Aware Face Detec-
tion. 2017.
[15] Yunzhu Li, Benyuan Sun, Tianfu Wu, and Yizhou Wang. Face detection with end-to-end integra-
tion of a convNet and a 3D model. Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9907 LNCS:420–436, 2016.
[16] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of
regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages
1867–1874. IEEE, 6 2014.
[17] Edward T. Hall. The Hidden Dimension. Anchor Books, Garden City, 1 edition, 1966.
[18] Karlene Nguyen, Cindy Wagner, David Koons, and Myron Flickner. Differences in the infrared
bright pupil response of human eyes. Proceedings of the symposium on Eye tracking research &
applications - ETRA ’02, page 133, 2002.
[19] Su Yeong Gwon, Chul Woo Cho, Hyeon Chang Lee, Won Oh Lee, and Kang Ryoung Park.
Robust eye and pupil detection method for gaze tracking. International Journal of Advanced
Robotic Systems, 10, 2013.
[20] Dan Witzner Hansen and Arthur E.C. Pece. Eye tracking in the wild. Computer Vision and
Image Understanding, 98(1):155–181, 2005.
[21] Wenhao Zhang, Melvyn Lionel Smith, Lyndon Neal Smith, and Abdul Rehman Farooq. Eye
centre localisation: an unsupervised modular approach. Sensor Review, 36(3):277–286, 2016.
[22] Sebastian Vater and Fernando Puente Leon. Combining isophote and cascade classifier informa-
tion for precise pupil localization. In 2016 IEEE International Conference on Image Processing
(ICIP), pages 589–593. IEEE, 9 2016.
[23] Dong Tian, Guanghui He, Jiaxiang Wu, Hongtao Chen, and Yong Jiang. An accurate eye pupil
localization approach based on adaptive gradient boosting decision tree. VCIP 2016 - 30th
Anniversary of Visual Communication and Image Processing, pages 0–3, 2017.
[24] Mingcai Zhou, Xiying Wang, Haitao Wang, Jingu Heo, and DongKyung Nam. Precise eye lo-
calization with improved SDM. In 2015 IEEE International Conference on Image Processing
(ICIP), pages 4466–4470. IEEE, 9 2015.
[25] Congyi Wang, Fuhao Shi, Shihong Xia, and Jinxiang Chai. Realtime 3D eye gaze animation
using a single RGB camera. ACM Transactions on Graphics, 35(4):1–14, 2016.
[26] V. Jain and E. Learned-Miller. FDDB: A Benchmark for Face Detection in Unconstrained
Settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[27] Amine Kacete, Jerome Royan, Renaud Seguier, Michel Collobert, and Catherine Soladie. Real-
time eye pupil localization using Hough regression forest. 2016 IEEE Winter Conference on
Applications of Computer Vision, WACV 2016, 2016.
[28] Yongnian Le. Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4),
2008.
[29] Ilya Kalinovskii and Vladimir Spitsyn. Compact Convolutional Neural Network Cascade for Face
Detection. 8 2015.
7 Appendix A: R-studio code for benchmarking with BioID
library(ggplot2)
library(plotly)
library(reshape2)
library(plyr)
library(devtools)
library(dplyr)
# if the merged pts doesn't exist, create it
if (!exists("pts")) {
  # the point rows are whitespace-separated x y pairs
  pts <- read.table(file, header = FALSE, skip = 11, nrows = 4)
  pts <- c(t(pts))
}