
Fakultät für Informatik
Lehrstuhl für Robotik, Künstliche Intelligenz und Echtzeitsysteme

Survey on Inferring Eye Gaze Fixation for Driver's Attention Prediction

Omran Kaddah
Faculty of Informatics, Technical University of Munich
Email: omransy1994@gmail.com

Advanced Methods for Emotion Recognition in Highly Automated Driving (SS19)
Advisor: Sina Shafaei, M.Sc.
Supervisor: Sina Shafaei, M.Sc.
Submission: July 2019

Abstract—Inferring a driver's gaze for attention prediction systems is a computer vision problem that is still being actively studied in order to improve current advanced driver-assistance systems (ADAS). This survey presents a brief literature review of work introduced in recent years on eye gaze tracking and fixation inference, and on driver's attention prediction. Particular focus is given to one recent paper on estimating the gaze region, as well as to two papers on predicting a driver's attention. Since both are supervised machine learning problems, the survey reviews the datasets available in each area and the machine learning algorithms that have been used, with a focus on convolutional neural networks (CNNs). An evaluation of the models studied in more depth is also presented.
I. INTRODUCTION

The importance of attention prediction systems lies in warning drivers in situations where they are not paying attention to important objects in the driving scene. Studies have shown that driver distraction, for novice and experienced drivers alike [1], is one of the main causes of road accidents [2]. With the advent of strong computational power and convolutional neural networks (CNNs), recent CNN-based attention prediction models have shown high performance. However, as we will discuss, there are aspects where improvements are still needed, especially regarding the datasets used to train the models; we take up the paper of Xia et al. [3], which pointed out these aspects, for further study. We also look into the attention prediction model proposed by Palazzi et al. [4]. In the papers discussed in this survey, predicting where a driver attends is learned from human drivers, so the operation also depends on inferring the driver's eye fixation. The problem of eye tracking has been studied for a long time, and current state-of-the-art systems offer high-performance tracking. However, it is still constrained to controlled environments. We therefore also take up the paper of Jha and Busso [5], who proposed a method to address this problem.
II. INFERRING EYE FIXATION

The problem of tracking eye gaze fixation is a machine learning problem. Thus, the survey will discuss some of the available datasets and the state-of-the-art models.

A. Datasets

Many datasets are available. For brevity, three of them are discussed briefly; all of them have 2D continuous ground-truth values:

1) MSP-Gaze [6]: This dataset was used to train the model of [5], which is discussed in the next section. It was recorded via mouse clicks on a circle that appeared at random positions on the screen; each click simultaneously triggered a picture of the subject. A Microsoft Kinect sensor was also used to record the distance between the head and the monitor. The two eyes in each image were later extracted with the Viola-Jones algorithm. The database contains 324,771 samples from 44 subjects, with a similar distribution between genders and various ethnic backgrounds. Two drawbacks for driving settings are that the lighting conditions under which the pictures were taken were rather static, with steady, sufficient illumination, which is not always realistic in a driving situation, and that the monitor is smaller than a car's windshield, which means a smaller field of view than in a car.

2) EYEDIAP [7]: Researchers used to apply their models to different datasets collected under different conditions. This dataset was proposed as a standardized dataset, with the goal that researchers would use it to clearly compare gaze tracking algorithms and identify their advantages and disadvantages. To achieve this, a set of recording sessions was designed, each characterized by a combination of the four main variables that can affect gaze estimation accuracy: visual target, head pose, participants, and recording conditions. More details on these variables can be found in [7]. The dataset includes 3D information and 3D annotations. It consists of 237 minutes of footage from a 25 fps camera and was recorded in a lab.

3) MPIIGaze [8]: This dataset contains 213,659 full-face images and also includes 3D annotations. Similar to EYEDIAP, it was recorded under various conditions with different subjects. However, it was not recorded in a lab, and it was collected over several months, which the authors considered advantageous. Its conditions can also be similar to in-car conditions; however, the field of view is still different.
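To make the MSP-Gaze preprocessing step concrete, the following is a minimal sketch of Viola-Jones-style eye extraction using OpenCV's stock Haar cascades. The cascade file names are OpenCV's own; the crop size and surrounding pipeline are illustrative assumptions, not the actual MSP-Gaze tooling:

    # Minimal sketch: crop the two eye regions from a subject image with
    # OpenCV's Haar cascades (a Viola-Jones detector). Crop size is an
    # illustrative assumption, not the MSP-Gaze pipeline itself.
    import cv2

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")

    def extract_eyes(image_path, out_hw=(36, 60)):
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        eyes = []
        for (fx, fy, fw, fh) in face_cascade.detectMultiScale(gray, 1.3, 5):
            face_roi = gray[fy:fy + fh, fx:fx + fw]
            # Detect eyes only inside the face region to reduce false positives.
            for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(face_roi)[:2]:
                crop = face_roi[ey:ey + eh, ex:ex + ew]
                eyes.append(cv2.resize(crop, (out_hw[1], out_hw[0])))
        return eyes  # up to two fixed-size eye patches

Fixed-size eye patches of this kind are the typical input to the appearance-based gaze models discussed next.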
B. Models

Most early models, for example the popular EyeLink Toolbox for MATLAB [9], required special equipment and user calibration. Most of them also used infrared-sensitive video technology and tracked the eye pupil with various image processing techniques. More recent works such as [10] require only the appearance of the eye to estimate gaze. The algorithm in [10] is based on principal component analysis: images of the two eyes are projected onto the 30 most important principal components, found by taking the 30 eigenvectors corresponding to the highest eigenvalues, and linear regression is then applied to learn the horizontal and vertical gaze pixels. Other models used different algorithms, such as Random Forests (RF) in the work of Sugano et al. [11] and Support Vector Regression (SVR) in Schneider et al. [12].

With the early 2010s and powerful Graphics Processing Units (GPUs), CNNs became a popular algorithm for solving many computer vision problems. In 2015, Zhang et al. [13] showed that a CNN-based approach to eye gaze estimation outperforms alternatives such as SVR, RF, and k-nearest neighbors (KNN). Their method treats the task as a regression problem with two outputs, the x and y coordinates of the eye fixation. Jha and Busso [5] took a different approach and interpreted the task as a classification problem over pixels: for each pixel, how probable it is that the eye gaze fixation falls there. This approach is called regression by classification, and it outputs a region of eye gaze fixation instead of a single pixel. They used a CNN with three downsampling layers followed by three upsampling layers, such that the output after the softmax function is a 2D grid whose cells hold the probability of the fixation lying in each cell. The detailed CNN architecture can be found in Figure 1.

Fig. 1. CNN architecture, taken from [5].

They also addressed the cost-insensitivity of cross entropy, namely that the model is penalized equally whether the prediction is in the vicinity of the true label or far from it. Therefore, instead of one-hot encoding, a Gaussian distribution is placed around the true label, and a Gaussian filter is also applied to the output of the convolutional layer before the softmax function.
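To make the label-smoothing step concrete, here is a minimal sketch of replacing a one-hot fixation label with a Gaussian placed around the true cell, as in the regression-by-classification setup described above. The grid size and standard deviation are illustrative assumptions, not the values used in [5]:

    # Minimal sketch: build a Gaussian "soft" label grid around the true
    # fixation cell instead of a one-hot label. Grid size and sigma are
    # illustrative assumptions.
    import numpy as np

    def gaussian_label(true_row, true_col, rows=16, cols=16, sigma=1.5):
        r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
        dist2 = (r - true_row) ** 2 + (c - true_col) ** 2
        label = np.exp(-dist2 / (2.0 * sigma ** 2))
        return label / label.sum()  # normalize to a probability distribution

    # Cross entropy against the soft label penalizes near misses less than
    # far misses, addressing the distance-insensitivity of one-hot targets.
    soft = gaussian_label(5, 8)
    pred = np.full((16, 16), 1.0 / 256)          # e.g. a uniform prediction
    loss = -(soft * np.log(pred + 1e-12)).sum()  # cross entropy H(soft, pred)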
C. Results

Table 1 compares the different models. Unfortunately, the model of Jha and Busso [5] and the baseline model were not trained on the same dataset as the other models: they were trained on MSP-Gaze, while the others were trained on the MPIIGaze dataset. Therefore, a fully fair comparison between all the models is not possible. Nevertheless, Zhang et al. [8] trained the different models on the same data, and the CNN had the best results. The baseline model mentioned in the table of Jha and Busso [5] had the same architecture as the regression-by-classification CNN, except that the upsampling layers were replaced by three dense layers with two outputs at the end, making it a regression model.

Table 1. Errors of each model
                  Jha and      Baseline   CNN, Zhang    SVR [12]   RF [11]
                  Busso [5]    [5]        et al. [8]
Mean              6.89°        6.89°      5.4°          6.6°       6.7°
Median            6.11°        9.69°      -             -          -
95th percentile   14.2°        19.45°     -             -          -
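The entries in Table 1 are angular errors between predicted and true gaze directions. As a hedged sketch (the aggregation is standard, but the direction vectors below are random placeholders), the reported statistics could be computed like this:

    # Minimal sketch: mean / median / 95th-percentile angular error between
    # predicted and ground-truth gaze direction vectors. Inputs are
    # illustrative; only the aggregation mirrors Table 1.
    import numpy as np

    def angular_errors_deg(pred, true):
        pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
        true = true / np.linalg.norm(true, axis=1, keepdims=True)
        cos = np.clip((pred * true).sum(axis=1), -1.0, 1.0)
        return np.degrees(np.arccos(cos))

    errs = angular_errors_deg(np.random.randn(1000, 3), np.random.randn(1000, 3))
    print(errs.mean(), np.median(errs), np.percentile(errs, 95))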
III. DRIVER'S ATTENTION PREDICTION

This is a supervised machine learning problem. The datasets, the models, the way they were trained, and the results are discussed in turn.

A. Datasets

A number of datasets are described as driver attention datasets; not all of them are relevant for this survey, but they are worth mentioning. Datasets such as [14] [15], which used image saliency to annotate driver attention in static scenes, are rather small and not sufficient for training strong models. A dataset such as [16] was annotated only for six coarse gaze regions (road, instrument cluster, left, rear-view mirror, center stack, and right), and the exterior scene was not recorded. The focus of this survey is attention on the road: what objects are there, and which objects a driver should pay attention to in a driving situation. For this purpose, two datasets are discussed:

1) DR(eye)VE [17]: This dataset was recorded in cars (referred to as in-car settings) by a car-mounted camera with a wide field of view, on different rides under different weather conditions. It has a duration of 6 hours, making it the largest driver attention dataset. More details on the dataset's statistics can be found in Table 2. The dataset uses attention maps as annotation; these are density maps of how probable it is that the attention is fixated on a given spot in the frame. The attention maps were generated over a temporal window of 25 frames, over which the fixations are accumulated and then smoothed by a spatio-temporal Gaussian. The dataset has drawbacks, which were pointed out by Xia et al. [3]. Firstly, it does not take into consideration that humans tend to have covert attention, that is, attending to multiple important objects in a scene and being aware of them even though the eye gaze is fixated in a specific direction or on one object in the scene [18]. Secondly, gazes irrelevant to the driving situation (false positives) are included, as drivers move their eyes to static objects such as trees, buildings, and anything with a salient appearance [19]. Thirdly, the diversity of the dataset is limited, as it was collected from only 74 rides.

2) BDD-A [3]: An abbreviation for Berkeley DeepDrive Attention. It was collected in the lab from 45 participants, who were asked to imagine they were driving instructors sitting in the copilot seat next to a student driver, and to press a key whenever they felt it necessary to correct or warn the student of potential dangers. This dataset also uses attention maps as annotation. To overcome the drawbacks of DR(eye)VE, the attention maps were made by aggregating the eye gaze fixations of at least four participants and then smoothing them. As Xia et al. [3] suggest, this addresses both the covert-attention and the false-positive problem, because the resulting attention maps are the average of multiple people's attention: covert attention is likely to be covered, and false positives fade out because it is unlikely that all subjects look at the same irrelevant object. It also accounts for the finding of psychological studies [20] [21] that a human's scanpath of a scene is highly subjective, and that therefore, on a given frame, an individual's eye gaze fixation may not be directly usable as a gaze map. More details on the statistics of the datasets can be found in Table 2.
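As a sketch of how such attention-map annotations can be produced, fixation points from a temporal window can be accumulated and blurred as follows. The 25-frame window matches the DR(eye)VE description above; the frame size and Gaussian width are illustrative assumptions, and the temporal part of the spatio-temporal smoothing is omitted for brevity:

    # Minimal sketch: turn raw fixation points from a 25-frame window into
    # a smoothed attention map. Frame size and sigma are illustrative
    # assumptions; only the accumulate-then-smooth scheme follows the text.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def attention_map(fixations, height=216, width=384, sigma_px=12.0):
        """fixations: list of (row, col) gaze points from a 25-frame window."""
        acc = np.zeros((height, width), dtype=np.float64)
        for r, c in fixations:
            if 0 <= r < height and 0 <= c < width:
                acc[int(r), int(c)] += 1.0          # accumulate fixations
        smoothed = gaussian_filter(acc, sigma=sigma_px)  # spatial smoothing only
        total = smoothed.sum()
        return smoothed / total if total > 0 else smoothed

Aggregating the fixations of several observers, as BDD-A does, simply means pooling their points into the same accumulator before smoothing.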
B. Models

The focus of this section is on CNNs, but that does not mean the problem of attention prediction is tied only to CNNs; early models, such as [14] [15], did not use CNNs.

The CNN model of the reviewed paper by Palazzi et al. [4], tested on the DR(eye)VE dataset, consists of three parts, each called a Focus of Attention (FoA) branch: a branch that works in the RGB domain, a branch that focuses on motion through an optical flow representation, and a semantic segmentation branch in which objects in the driving scene are classified. Each branch has the same architecture and takes two inputs, a sequence of randomly cropped frames and a resized version of the original sequence. Both inputs are fed to two COARSE modules that share weights. The COARSE module, based on the C3D architecture [22], models the temporal dependency of a sequence of frames using two 3-dimensional (3D) convolution layers with 3D pooling layers in between, and ends with bilinear upsampling that outputs a representation with the same dimensions as the input. The output of the cropped version is used during training for augmentation of the data and of the ground-truth fixation maps, while the output of the resized frames is stacked under the last input frame and then passed through a sequence of 2D convolutional layers as a refinement step. The output is the attention map for the last frame of the sequence. The final output of the whole model is the accumulated output of the branches, normalized to produce a probability distribution (see Figure 3). The authors state that their architectural choices rely on results presented in their paper, such as the consistent patterns exhibited by the driver's FoA, motion cues, and the strong effect of object priors on the driver's gaze. As the loss, they used the Kullback-Leibler (KL) divergence between the prediction and the ground truth.

Fig. 3. CNN architecture of the model proposed by Palazzi et al. [4].
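A minimal sketch of a COARSE-like module in the spirit of C3D [22] is shown below: two 3D convolutions with 3D pooling in between, ending in bilinear upsampling back to the input resolution. Channel counts, kernel sizes, and the temporal pooling are illustrative assumptions, not the values in [4]:

    # Minimal sketch of a COARSE-like module: 3D convs + 3D pooling, then
    # bilinear upsampling to the input's spatial size. Hyperparameters are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    class CoarseModule(nn.Module):
        def __init__(self, in_ch=3, mid_ch=64, out_ch=1):
            super().__init__()
            self.conv1 = nn.Conv3d(in_ch, mid_ch, kernel_size=3, padding=1)
            self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # pool space only
            self.conv2 = nn.Conv3d(mid_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, clip):              # clip: (batch, C, T, H, W)
            x = torch.relu(self.conv1(clip))
            x = self.pool(x)
            x = torch.relu(self.conv2(x))
            x = x.mean(dim=2)                 # collapse the temporal axis
            # Bilinear upsampling back to the input's spatial size.
            return nn.functional.interpolate(
                x, size=clip.shape[-2:], mode="bilinear", align_corners=False)

    out = CoarseModule()(torch.randn(1, 3, 16, 112, 112))  # -> (1, 1, 112, 112)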
The model proposed by Xia et al. [3] along with BDD-A uses a pretrained AlexNet feature extractor [23], followed by a 2D upsampling layer, then three fully convolutional layers, then a long short-term memory (LSTM) layer that models the temporal dependency of the sequence of frames; finally, a Gaussian blur is applied to the input of the softmax function, making the output grid a collection of cells with values between zero and one, i.e., a probability of attention fixation at each spot in the frame (see Figure 2). Xia et al. [3] also proposed a method they call human weighted sampling (HWS): before training the model, the mean attention map over all samples is calculated, then the KL divergence between each frame's attention map and the mean attention map is computed, and the result is assigned as a sampling weight to that frame. In this way, a sequence of frames also has a sampling weight, namely the sum of the weights of its frames. The purpose of HWS, as stated by the authors, is to address the issue of rare critical events: the more a frame diverges from the average frame, the more likely it is to depict a critical situation.

Fig. 2. CNN architecture of the model proposed by Xia et al. [3].
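A minimal sketch of the HWS weighting just described: the KL-to-the-mean computation follows the text, while the array shapes and the epsilon are illustrative assumptions:

    # Minimal sketch of human weighted sampling (HWS): weight each frame by
    # the KL divergence of its attention map from the mean attention map.
    import numpy as np

    def hws_weights(maps, eps=1e-8):
        """maps: array of shape (n_frames, H, W); each map sums to 1."""
        mean_map = maps.mean(axis=0)
        mean_map /= mean_map.sum()
        # KL(map || mean_map) per frame; larger = more unusual = sampled more.
        kl = (maps * (np.log(maps + eps) - np.log(mean_map + eps))).sum(axis=(1, 2))
        return kl

    def sequence_weight(frame_weights):
        # A sequence's sampling weight is the sum of its frames' weights.
        return frame_weights.sum()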
Table 2. Dataset statistics
Dataset     # Rides   Duration (hours)   # Drivers   # Gaze providers   # Cars (per frame)   # Pedestrians (per frame)   # Braking events
DR(eye)VE   74        6                  8           8                  1.0                  0.04                        464
BDD-A       1,232     3.5                1,232       45                 4.4                  0.25                        1,427

C. Models comparison and results

The model proposed in [4] was given more attention here than the model in [3]; Xia et al. [3], on the other hand, focused more on the dataset and on how to train on it. Unfortunately, there can be no fair comparison between the two models presented above, as they were trained on different datasets; only with fine-tuning on a common dataset could one make a fair comparison. However, Xia et al. [3] fine-tuned an older version of the Palazzi et al. model [19], which did not yet use the multi-branch architecture, on the BDD-A dataset, and it was outperformed by the BDD-A model proposed by Xia et al. [3] and trained on BDD-A. Table 3 shows the results next to each other; nevertheless, one should keep in mind that no fair comparison can be made.

Table 3. Model results on the BDD-A and DR(eye)VE datasets
                           BDD-A                                 DR(eye)VE
                           KL                  CC                KL                  CC
Model                      Mean  CI            Mean  CI          Mean  CI            Mean  CI
BDD-A [3]                  1.24  (1.21, 1.28)  0.58  (0.56, 0.59)  -     -             -     -
BDD-A (HWS) [3]            1.24  (1.21, 1.27)  0.59  (0.57, 0.60)  -     -             -     -
Palazzi et al. 2017 [19]   1.95  (1.87, 2.04)  0.50  (0.48, 0.52)  1.42  (0.35, 2.49)  0.55  (0.27, 0.83)
Palazzi et al. 2018 [4]    -     -             -     -             1.40  -             0.56  -

KL := KL divergence; CC := correlation coefficient; CI := confidence interval
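For reference, a minimal sketch of the two metrics in Table 3, the KL divergence and the linear correlation coefficient between a predicted and a ground-truth attention map; the epsilon and normalization details are assumptions:

    # Minimal sketch of the evaluation metrics in Table 3: KL divergence
    # and correlation coefficient (CC) between predicted and ground-truth
    # attention maps. Epsilon is an illustrative assumption.
    import numpy as np

    def kl_divergence(gt, pred, eps=1e-8):
        gt = gt / gt.sum()
        pred = pred / pred.sum()
        return float((gt * (np.log(gt + eps) - np.log(pred + eps))).sum())

    def correlation_coefficient(gt, pred):
        return float(np.corrcoef(gt.ravel(), pred.ravel())[0, 1])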

IV. CONCLUSION

Although the eye tracking models presented in these papers showed good results, one can see that there is still a need to adapt the eye gaze fixation works to car settings, and researchers should be encouraged to work on a unified, standardized dataset to ease model comparison. One could also acquire datasets of the driver's eye gaze tracking and of attention prediction simultaneously. Further studies of attention behavior and eye movement should also be done to help shape the architecture of eye tracking models, for example, whether it is a good idea to use an LSTM layer, which makes use of temporal information, in models that infer eye gaze fixation. The same problem of a unified and standardized dataset can be observed for driver's attention prediction models: as seen earlier, it was not possible to compare the models fairly without fine-tuning on one specific dataset. More attention should also be given to the attention prediction models themselves. For example, the model in [3] uses an AlexNet feature extractor; although it is powerful, there are now architectures such as MobileNetV2 [24] with 12 times fewer parameters, the same number of operations per forward pass, and better accuracy. One might also use a pretrained upsampling layer from a semantic segmentation model, which could make things easier for the following layers, as the objects in the driving scene would already be classified.
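As a sketch of the backbone swap suggested above, the following uses torchvision's pretrained MobileNetV2 as a drop-in feature extractor; the small head on top is an illustrative assumption, not the actual BDD-A architecture:

    # Minimal sketch: replace an AlexNet feature extractor with a pretrained
    # MobileNetV2 backbone from torchvision. The head is a hypothetical
    # placeholder projecting features to a 1-channel attention grid.
    import torch
    import torch.nn as nn
    from torchvision import models

    class AttentionBackbone(nn.Module):
        def __init__(self):
            super().__init__()
            # Pretrained MobileNetV2 feature extractor (frozen here).
            self.features = models.mobilenet_v2(weights="IMAGENET1K_V1").features
            for p in self.features.parameters():
                p.requires_grad = False
            self.head = nn.Conv2d(1280, 1, kernel_size=1)  # hypothetical head

        def forward(self, x):            # x: (batch, 3, H, W)
            f = self.features(x)         # (batch, 1280, H/32, W/32)
            return self.head(f)          # coarse attention logits

    logits = AttentionBackbone()(torch.randn(1, 3, 224, 224))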
REFERENCES

[1] S. G. Klauer, F. Guo, B. G. Simons-Morton, M. C. Ouimet, S. E. Lee, and T. A. Dingus, "Distracted driving and risk of road crashes among novice and experienced drivers," New England Journal of Medicine, vol. 370, no. 1, pp. 54-59, 2014.
[2] M. A. Regan, C. Hallett, and C. P. Gordon, "Driver distraction and driver inattention: Definition, relationship and taxonomy," Accident Analysis & Prevention, vol. 43, no. 5, pp. 1771-1781, 2011.
[3] Y. Xia, D. Zhang, J. Kim, K. Nakayama, K. Zipser, and D. Whitney, "Predicting driver attention in critical situations," in Asian Conference on Computer Vision. Springer, 2018, pp. 658-674.
[4] A. Palazzi, D. Abati, F. Solera, R. Cucchiara et al., "Predicting the driver's focus of attention: the DR(eye)VE project," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1720-1733, 2018.
[5] S. Jha and C. Busso, "Estimation of gaze region using two dimensional probabilistic maps constructed using convolutional neural networks," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3792-3796.
[6] N. Li and C. Busso, "Calibration free, user-independent gaze estimation with tensor analysis," Image and Vision Computing, vol. 74, pp. 10-20, 2018.
[7] K. A. Funes Mora, F. Monay, and J.-M. Odobez, "EYEDIAP: A database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras," in Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 2014, pp. 255-258.
[8] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "MPIIGaze: Real-world dataset and deep appearance-based gaze estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 162-175, 2017.
[9] F. W. Cornelissen, E. M. Peters, and J. Palmer, "The EyeLink Toolbox: eye tracking with MATLAB and the Psychophysics Toolbox," Behavior Research Methods, Instruments, & Computers, vol. 34, no. 4, pp. 613-617, 2002.
[10] N. Li and C. Busso, "Evaluating the robustness of an appearance-based gaze estimation method for multimodal interfaces," in Proceedings of the 15th ACM International Conference on Multimodal Interaction. ACM, 2013, pp. 91-98.
[11] Y. Sugano, Y. Matsushita, and Y. Sato, "Learning-by-synthesis for appearance-based 3D gaze estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1821-1828.
[12] T. Schneider, B. Schauerte, and R. Stiefelhagen, "Manifold alignment for person independent appearance-based gaze estimation," in 2014 22nd International Conference on Pattern Recognition. IEEE, 2014, pp. 1167-1172.
[13] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "Appearance-based gaze estimation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511-4520.
[14] L. Simon, J.-P. Tarel, and R. Brémond, "Alerting the drivers about road signs with poor visual saliency," in 2009 IEEE Intelligent Vehicles Symposium. IEEE, 2009, pp. 48-53.
[15] G. Underwood, K. Humphrey, and E. Van Loon, "Decisions about objects in real-world scenes are influenced by visual saliency before and during their inspection," Vision Research, vol. 51, no. 18, pp. 2031-2038, 2011.
[16] L. Fridman, P. Langhans, J. Lee, and B. Reimer, "Driver gaze region estimation without use of eye movement," IEEE Intelligent Systems, vol. 31, no. 3, pp. 49-56, 2016.
[17] S. Alletto, A. Palazzi, F. Solera, S. Calderara, and R. Cucchiara, "DR(eye)VE: a dataset for attention-based tasks with applications to autonomous and assisted driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 54-60.
[18] P. Cavanagh and G. A. Alvarez, "Tracking multiple targets with multifocal attention," Trends in Cognitive Sciences, vol. 9, no. 7, pp. 349-354, 2005.
[19] A. Palazzi, F. Solera, S. Calderara, S. Alletto, and R. Cucchiara, "Learning where to attend like a human driver," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 920-925.
[20] S. Mannan, K. Ruddock, and D. Wooding, "Fixation sequences made during visual examination of briefly presented 2D images," Spatial Vision, 1997.
[21] R. Groner, F. Walder, and M. Groner, "Looking at faces: Local and global aspects of scanpaths," in Advances in Psychology. Elsevier, 1984, vol. 22, pp. 523-533.
[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.
