
electronics

Article
Recognizing Students and Detecting Student Engagement
with Real-Time Image Processing
Mustafa Uğur Uçar 1 and Ersin Özdemir 2, *

1 Havelsan A.Ş. Konya Ofisi 3, Ana Jet Üs K.lığı, Konya 42160, Turkey; mucar@havelsan.com.tr
2 Department of Electrical and Electronics Engineering, Iskenderun Technical University,
Iskenderun 31200, Turkey
* Correspondence: ersin.ozdemir@iste.edu.tr

Abstract: With COVID-19, formal education was interrupted in all countries and the importance of
distance learning has increased. It is possible to teach any lesson with various communication tools,
but it is difficult to know to what extent the lesson reaches the students. In this study, we aim to
monitor the students in a classroom or in front of the computer with a camera in real time, recognize
their faces, estimate their head poses, and score their distraction to detect student engagement based
on their head poses and Eye Aspect Ratios. Distraction was determined by associating the students'
attention with looking towards the teacher or the camera. The success of the face recognition and
head pose estimation was tested using the UPNA Head Pose Database and, as a result of the
conducted tests, the most successful result in face recognition was obtained with the Local Binary
Patterns method, with a 98.95% recognition rate. In the classification of student engagement as
Engaged and Not Engaged, a support vector machine gave results with 72.4% accuracy. The
developed system will be used to recognize and monitor students in the classroom or in front of the
computer, and to determine the course flow autonomously.

Keywords: computer vision; machine learning; engagement detection; head pose estimation; eye
aspect ratio
Citation: Uçar, M.U.; Özdemir, E. Recognizing Students and Detecting Student Engagement with Real-Time Image Processing. Electronics 2022, 11, 1500. https://doi.org/10.3390/electronics11091500

Academic Editor: Byung Cheol Song

Received: 9 April 2022
Accepted: 4 May 2022
Published: 7 May 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
With the COVID-19 pandemic, there were disruptions in many sectors, including education. Young people in Turkey had to continue their education with courses given online and through television. In face-to-face lessons, teachers can easily monitor whether students follow the subject, and the course flow is directed accordingly. It is important to determine instantly how much of the lesson has reached the students and to direct the course accordingly. In this project, the aim was to recognize the student in e-learning, to measure whether the student concentrates on the lesson, and to determine and analyze the non-verbal behaviors of the students during the lesson through image processing techniques.

Just like the teacher observes the classroom, the e-learning system must also see and observe the student. In the field of computer vision, subjects such as object recognition, face detection, face tracking, face recognition, emotion recognition, human action recognition/tracking, classification of gender/age, driver drowsiness detection, gaze detection, and head pose estimation attract many researchers and are being studied extensively.

Detecting and analyzing human behavior with image processing techniques has also gained popularity in recent times, and significant studies have been carried out in different fields.

It is important for students to pay attention to the teacher and the lesson in order to learn the relevant lesson. One of the most important indicators showing that the student pays attention to the lesson and the teacher is that the student is looking towards the teacher.


In this study, it was aimed to find the real-time distraction rates and real-time recogni-
tion of students in the classroom or in front of the computer by using images taken from
the camera, image processing techniques, and machine learning algorithms. An algorithm
has been developed to organize the learning materials in a way to increase interest by
providing instant feedback from the obtained data to the teacher or autonomous lecture
system. Before going into the details of the study, it is useful to explain what student engagement
is and the symptoms of distraction, which are good indicators for detecting
student engagement.
The Australian Council of Educational Research defines student engagement as “stu-
dents’ involvement with activities and conditions likely to generate high quality learn-
ing” [1]. According to Kuh: “Student engagement represents both the time and energy
students invest in educationally purposeful activities and the effort institutions devote to
using effective educational practices” [2].
Student engagement is a significant contributor to student success. Several research
projects made with students reveal the strong relation between student engagement and
academic achievement [3,4]. Similarly, Trowler, who published a comprehensive literature
review on student engagement, stated that a remarkable amount of literature has estab-
lished robust correlations between student involvement and positive outcomes of student
success and development [5].
As cited by Fredricks et al. [6], student engagement has three dimensions: behavioral,
cognitive, and emotional engagement. Behavioral engagement refers to student’s participa-
tion and involvement in learning activities. Cognitive engagement corresponds to student’s
psychological investment in the learning process, such as being thoughtful and focusing
on achieving goals. On the other hand, emotional engagement can be understood as the
student’s relationship with his/her teacher and friends; feelings about learning process,
such as being happy, sad, bored, angry, interested, and disappointed.
Observable symptoms of distraction in students can be listed as looking away from
the teacher or the object to be focused on, dealing with other things, closing eyes, dozing
off, not perceiving what is being told. While these symptoms can be easily perceived by
humans, detecting them via computer is a subject that is still being studied and keeps up
to date. Head pose, gaze, and facial expressions are the main contributors to behavioral
and emotional engagement. Automated engagement detection studies are mostly based on
these features. In order to detect the gaze direction for determining distraction, methods
such as image processing techniques, following head pose and eye movements with the
help of a device attached to the person’s head and processing the EEG signals of the person,
were utilized in several studies.
In this study, we focus on behavioral engagement, since its symptoms, such as drowsiness
or looking at an irrelevant place, can be detected explicitly, and we prefer image
processing techniques for recognizing students and detecting student engagement due to
their easy applicability and low cost.
By processing the real-time images of the students in the classroom or in front of the
computer captured by a CMOS camera, with the help of image processing and machine
learning technologies, we obtain the following results:
1. Face recognition models of the students were created, students’ faces were recog-
nized using these models and course attendance was reported by the application
we developed;
2. Face detection and tracking were performed in real time for each student;
3. Facial landmarks for each student were extracted and used at the stage of estimating
head pose and calculating Eye Aspect Ratio;
4. Engagement result was determined both for the classroom and each student by
utilizing head pose estimation and Eye Aspect Ratio.
There are research works and various applications on the determination of distraction
in numerous fields, such as education, health, and driving safety. This study was carried
out to recognize a student in a classroom or in front of a computer and to estimate his/her
attention to the lesson.
The literature review was made under two main titles: studies conducted on face
recognition to identify the students participating in the lesson, and studies on determining
the distraction and gaze direction of the students taking the course.
Studies on Face Recognition:
Passwords, PINs, smart cards, and techniques based on biometrics are used in the
authentication of individuals in physical and virtual environment. Passwords, PINs can be
stolen, guessed or forgotten but biometric characteristics of a person cannot be forgotten,
lost, stolen, and imitated [7].
With the developments in computer technology, techniques that do not require inter-
action based on biometrics have come to the fore in the identification of individuals. While
characteristics, such as face, ear, fingerprint, hand veins, palm, hand geometry, iris, retina,
etc., are frequently used in the recognition based on physiological properties, character-
istics, such as walking type, signature, and the way of pushing a button, are used in the
recognition based on behavioral properties.
There are many identification techniques based on biometrics that make it possible to
identify the individuals or verify them. While fingerprint-based verification has reliable
accuracy rates, face recognition provides lower accuracy rates due to reasons, such as using
flash, image quality, beard, mustache, glasses, etc. However, while a fingerprint reader
device is needed for fingerprint recognition and also the person needs to scan his/her
fingers on this device, only a camera will be fine to capture images of people. That is why
face recognition has become widespread recently.
The first step in computer-assisted face recognition is to detect the face in the video
image (face detection). After the face is detected, the face obtained from the video image
is cropped and normalized by considering the factors such as lighting and picture size
(preprocessing). The features on the face are extracted (feature extraction) and training is
performed using these features. By classifying with the most suitable image among the
features in the database, face recognition is achieved.
Bledsoe, Chan, and Bisson carried out various studies on face recognition with a
computer in 1964 and 1965 [8–10].
Only a small portion of Bledsoe's work has been published, because an intelligence
agency sponsoring the project allowed only limited sharing of the results of the study. In
the study, images matching or resembling with an image questioned in a large database
formed a subset. It was stated that the success of the method can be measured in terms of
the ratio of the number of photos in the answer list to the total number of pictures in the
database [11].
Bledsoe reported that parameters in face recognition problem, such as head pose,
lighting intensity and angle, facial expressions, and age, affect the result [9]. In the study,
the coordinates associated with the feature set in the photos (center of the pupil, inner
corner of the eye, outer corner of the eye, mid-point of the hair-forehead line, etc.) were
derived by an operator and these coordinates were used by the computer. For this reason,
he named his project “man-machine”. A total of 20 different distances (such as lip width,
width of eyes, pupillary distance, etc.) were calculated from these coordinates. It was stated
that people doing this job can process 40 photos per hour. While forming the database,
distances related to a photograph and the name of the person in that photograph were
associated and saved to the computer. In the recognition phase, distance associated with
the photograph in question were compared with the distances in each photograph in the
database and the most compatible records were found.
In recent years, significant improvement has been made also in face recognition and
some groundbreaking methods have emerged in this field. Now, let us check some of these
methods and studies.
The Eigenfaces method uses the basic components of the face space and the projection
of the face vectors onto the basic components. It was used by Sirovich and Kirby for the
first time to express the face effectively [12]. This method is based on Principal Component
Analysis (PCA), a statistical approach also known as the Karhunen–Loève expansion. The
purpose of this method is to find the basic components affecting the variety in pictures
the most. In this way, images in the face space can be expressed in a subspace with lower
dimension. It is based on a holistic approach. Since this method is based on dimension
reduction, recognition and learning processes are fast.
In their study, Turk and Pentland made this method more stable for face recognition.
In this way, the use of this method has also accelerated [13].
The Fisherfaces method is a method based on a holistic approach that was found as an
alternative to the Eigenfaces method [14]. The Fisherfaces method is based on reducing the
dimension of the feature space in the face area using the Principal Component Analysis
(PCA) method and then applying the LDA method formulated by Fisher—also known as
the Fisher’s Linear Discriminant (FLD) method—to obtain the facial features [15].
Another method widely used in pattern recognition applications is the Local Binary
Patterns method. This method was first proposed by Ojala et al. [16,17]. This method
provides extraction of features statistically based on the local neighborhood; it is not a
holistic method. It is preferred because it is not affected by light changes. The success of
this method in face recognition has been demonstrated in the study conducted by [18].
Another method that has started to be used widely in recent years in face recognition is
deep learning. With the development of Backpropagation applied in the late 1980s [19,20],
the deep learning method emerged. Deep learning suggests methods that better model the
work of the human brain compared to artificial neural networks. Most of the algorithms
used in deep learning are based on the studies made by Geoffrey Hinton and fellow
researchers from the University of Toronto. Since these algorithms presented in the 1980s
required intensive matrix operations and high processing power to process big data, they
did not have a widespread application area in those years.
Convolutional Neural Networks (CNNs), a specialized architecture of deep
learning, give successful results especially in the image processing field and have been used more
commonly in recent times [21].
It can be said that the popularity of CNN constituting the basis of deep learning in
the science world started when the team including Alex Krizhevsky, Ilya Sutskever, and
Geoffrey E. Hinton won the “ImageNet Large Scale Visual Recognition Challenge-ILSVRC
2012” organized in 2012 by using this method [22,23].
In recent years, there have been successful studies using the YOLO V3 (You Only Look Once)
algorithm [24], which is artificial neural network-based, for face detection, and Microsoft
Azure, which provides a face API, for face recognition [24].
Today, many open-source deep learning libraries and frameworks, including TensorFlow,
Keras, Microsoft Cognitive Toolkit, Caffe, PyTorch, etc., are offered to users.

Studies on Determining Distraction and Gaze Direction


Correct determination of a person’s gaze direction is an important factor in daily
interaction [25,26]. Gaze direction comes first among the parameters used in determining
visual attention. The role of gaze direction in determining attention level was emphasized
in a study conducted by [25] in the 1970s and these findings were confirmed in later
studies by [27].
It is stated that the gaze direction defined as the direction the eyes point in space, is
composed of head orientation and eye orientation. That is, in order to determine the true
gaze direction, both directions must be known [28]. In this study, it was revealed that head
pose contributes to gaze direction by 68.9% and to the focus of attention by 88.7%. These
results show that head pose is a good indicator for detecting distraction.
In other words, it can be said that the visual focus of attention (VFoA) of a person is
closely related to his/her gaze direction and thus the head pose. Similarly, in the study
conducted by [29], head pose was used in determining the visual focus of attention of the
participants in the meetings.
When the head and eye orientations are not in the same direction, two kinds of
deviations in the perceived gaze direction were stated [30–33]. These two deviations in the
literature are named as the repulsive (overshoot) effect and the attractive effect.
The study conducted by Wollaston is one of the pioneering works in this field [34]. Starting
from a portrait in which the eyes point to the front and the head is directed to the left
from our perspective, Wollaston drew a portrait with eyes copied from the previous image
and the head directed to the right from our perspective and investigated the perceived
gaze directions in these two portraits. Although the eyes in the two portraits are exactly
the same, the person in the second portrait is perceived as looking to the right. Due to this
study, the perception of gaze direction towards the side of the head pose is also called the
Wollaston effect as well as the attractive effect [32].
Three important articles were published on this topic in the 1960s [35–37]. The results
of two of these studies were investigated and interpreted in different ways by many
authors. In the study conducted by Moors et al., the possible reasons for the confusion
that causes these two studies to be interpreted in different ways, were discussed [33]. In
order to eliminate these uncertainties and clarify the issue, they repeated the experiments
performed in both studies and found that there was a repulsive effect in both studies.
Although there are different interpretations and studies, it can be easily said that head
pose has a significant effect on the gaze direction.
In their study, Gee and Cipolla also preferred image processing techniques [38]. In the
method they used, five points on the face (right eye, left eye, right and left end points of the
mouth, and tip of the nose) were firstly determined. By using eye and mouth coordinates,
the starting point of the nose above the mouth is found, and the gaze direction was detected
with an acceptable accuracy by drawing a straight line (facial normal) from this point to the
tip of the nose. This method which does not require much processing power was applied
in real time in video images and the obtained results were shared.
Sapienza and Camilleri [39] developed the method of Gee and Cipolla [38] a little
further and used 4 points by taking the midpoint of the mouth rather than two points used
for the mouth. By using the Viola–Jones’ algorithm, face and facial features were detected,
the facial features were tracked with the normalized sum of squared difference method and
the head pose was estimated from the features followed using a geometric approach. This
head-pose estimation system they designed found the head pose with a deviation of about
6 degrees in 2 ms.
In a study conducted in 2017 by Zaletelj and Kosir, students in the classroom were
monitored with Kinect cameras, their body postures, gaze directions, and facial expres-
sions were determined, features were extracted through computer vision techniques, and
7 separate classifiers were trained with 3 separate datasets using machine learning algo-
rithms [40]. Five observers were asked to estimate the attention levels of the students in
the classroom for each second in their images and these data are included in the datasets
and used as a reference during the training phase. The trained model was tested with a
dataset of 18 people and an achievement of 75.3% was obtained in the automatic detection
of distraction. Again, in another study conducted in the same year by Monkaresi et al. [41],
a success of 75.8% was obtained when reports were used during activity and 73.3% success
was obtained when the reports after the activities were used in determining the attention
of students using the Kinect sensor.
In Raca’s doctoral thesis titled “Camera-based estimation of student’s attention in
class”, the students in the classroom were monitored with multiple cameras and the
movements (yaw) of the students’ head direction in the horizontal plane were determined.
The position of the teacher on the horizontal plane was determined by following his/her
movements along the blackboard. Then, a questionnaire was applied for each student with
10-min intervals during the lesson, to report their attention levels. According to the answers
of the questionnaire, it has been determined that engaged students are those whose heads
are towards the teacher’s location on the horizontal axis [42].
Krithika and Lakshmi Priya developed a system in which students can be monitored
with a camera in order to improve the e-learning experience [43]. This system processes the
image taken from the camera and uses the Viola–Jones and Local Binary Patterns algorithms
to detect the face and eyes. They tried to determine the concentration level according to the
detection of the face and eyes of the students. Three different concentration levels (high,
medium, and low) were determined. The image frames are marked as low concentration in
which the face could not be detected; medium concentration in which the face was detected
but one of the eyes could not be detected (in cases where the head was not exactly looking
forward); high concentration in which the face and both eyes were detected. The developed
system was tested by showing a 10-minute video course to 5 students and the test results
were shared. According to the concentration levels in test results, parts of course which
needed to be improved were determined.
Ayvaz and Gürüler also conducted a study and developed an application using
OpenCV and dlib image processing libraries aiming to increase the quality of in-class
education [44]. Similar to determining the attention level, it was aimed to determine the
students’ emotional states in real time by watching them with a camera. This system
performed face detection with the Viola–Jones algorithm, and then the emotional state
was determined according to the facial features. The training dataset used in determining
the emotional state was obtained from the face data of 12 undergraduate students (5 female
and 7 male students). The facial features of each student for 7 different emotional states
(happiness, sadness, anger, surprise, fear, disgust, and normal) were recorded in this train-
ing dataset. A Support Vector Machine is trained with this dataset and emotional states
of the students during the lesson were determined with an accuracy of 97.15%. Thanks
to this system, a teacher can communicate directly with the students and motivate them
when necessary.
In addition to the above studies, there are numerous studies on automatically
detecting student engagement and participation in class by utilizing image processing
techniques based on the gaze direction or on facial expressions [45–51]. In China, a system determining the students'
attentions based on their facial expressions was developed and this system was tested as a
pilot application in some schools [52].
In the study conducted by [53], the visual focus of attention of the person was tried
to be determined by finding his/her head pose using image processing techniques. For
this purpose, a system with two cameras was designed in which one camera was facing
to the user and the other was facing where the user looks. They developed an application
by using OpenCV and Stasm libraries. They tried to estimate head pose of the person in
vertical axis (right, left, front) from the images taken from the camera looking directly to
the person and which object the person paid attention by determining the objects in the
environment from the images taken from the other camera. In order to estimate head pose,
the point-based geometric analysis method was used and Euclidean distances between
the eye and the tip of the nose were utilized. In the determination of objects, the Sobel
edge detection algorithm and morphological image processing technique were used. It was
stated that satisfactory results were obtained in determining the gaze direction, detecting
objects and which objects were looked at, and the test results were shared.
One of the parameters used in determining the visual focus of attention of a person
is the gaze direction. Gaze detection can be made with image processing techniques or
eye tracking devices attached to the head. The use of image processing techniques is more
advantageous than other methods in terms of cost. However, getting successful results
from this method depends on having images of the eye area with sufficient resolution. On
the other hand, eye tracking devices are preferred because of their high sensitivity, although
they are expensive.
Cöster and Ohlsson conducted a study aiming to measure human attention using
the OpenCV library and the Viola–Jones face detection algorithm [54]. In this study, images in
which faces were detected and not detected were classified by the application developed
in this context. It is assumed that if a person is looking towards an object, that person’s
attention is focused on that object and images, in which no face can be detected because of
the head pose, are assumed to be inattentive. The face detection algorithm was tested
with a dataset in which each student is placed in a separate image and looking towards
different directions during the lesson. It is ensured with the dataset that the students in
the images are centered and there are no other objects in the background. In order to
make a more comprehensive evaluation, three different pre-trained face detection models
that come with OpenCV Library are used for face detection. The images in the dataset
were classified as attentive or inattentive according to the face detection results. It was
expressed that this experiment, which is conducted in a controlled environment, yielded
successful results.
Additionally, an important part of the studies on detecting distraction is within the
scope of driving safety. Driver drowsiness detection is one of the popular research topics in
image processing. Tracking the eye and head movements/direction is one of the mostly
used methods in the area of driver drowsiness detection. Although there are many studies
on this subject in the literature, Alexander Zelinsky is one of the leading names in this field.
He released the first commercial version of the driver support system called faceLAB in
2001 [55]. This application performs eye and head tracking with the help of two cameras
monitoring the driver and determining where the driver is looking, the driver’s eye close-
ness, and blinking. The driver’s attention status is calculated by analyzing all these inputs
and the driver is warned before his/her attention level reaches critical levels [56,57].

2. Materials and Methods


In this section, hardware and software requirements of our application and the meth-
ods used in this study are explained.

2.1. Hardware
While developing and running our application for this study, a computer with an Intel(R) Core(TM)
i7-4770K CPU @ 3.50 GHz, 16 GB RAM, a Gigabyte NVIDIA GeForce GTX 980 4 GB video card,
and a Logitech HD Webcam C525 was used.

2.2. Software Development Environment and Software Libraries


Windows 10 Pro 64-bit Operating System, Qt Creator 4.2.1 software development
environment, and OpenCV 3.3.1, dlib 19.8 software libraries are used.

2.3. Methods Used


In the application developed in this study, we used the following methods:
• Histogram of Oriented Gradients (HOG) method was used for face detection;
• Local Binary Patterns (LBP) method was used for face recognition;
• Head pose estimation was performed by solving the Perspective-n-Point (PnP) problem;
• For the classification of “Engaged” and “Not Engaged” students, a Support Vector
Machines (SVM) machine learning algorithm was used;
• Functions in dlib library were used for face detection and extracting the facial features
and algorithms in OpenCV library were used for face tracking, face recognition, and
head pose estimation.
The details of the methods used are explained in this section according to their
order in the algorithm.
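To make the flow of these steps concrete, the following minimal sketch strings them together per video frame using the Python bindings of dlib and OpenCV. It is an illustration under stated assumptions, not the original Qt/C++ application: the model file names ("shape_predictor_68_face_landmarks.dat", "students_lbph.yml") and the structure of the loop are assumptions made for the example.

```python
import cv2
import dlib

# HOG-based face detector and 68-point landmark predictor from dlib.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

# LBPH face recognizer from OpenCV's contrib module, assumed to be trained beforehand.
recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.read("students_lbph.yml")  # hypothetical trained model

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    for rect in detector(gray, 0):                          # 1. face detection (HOG)
        x, y = max(rect.left(), 0), max(rect.top(), 0)
        w, h = rect.width(), rect.height()
        label, dist = recognizer.predict(gray[y:y + h, x:x + w])   # 2. face recognition (LBP)
        shape = predictor(gray, rect)                       # 3. 68 facial landmarks
        # 4. Head pose (solvePnP) and the Eye Aspect Ratio would be computed from these
        #    landmarks and passed to the SVM engagement classifier (Sections 2.3.3-2.3.5).
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, "id=%d" % label, (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 1)

    cv2.imshow("SECS sketch", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # ESC to quit
        break

cap.release()
cv2.destroyAllWindows()
```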
2.3.1. Histogram of Oriented Gradients (HOG)

In this method, histograms prepared according to the density changes in the image are used as features [58]. In the method they proposed, an image defined as a 64 × 128 × 3 matrix is represented with a feature vector with 3780 elements [59].

In order to extract the HOG features of an image, the image is first scaled to the desired dimensions and then filtered with the masks shown in Figure 1, and the gradient images on the x- and y-axis are obtained. As an alternative to these masks, the Sobel operator can also be used for filtering.

Figure 1. Masks used to obtain the gradient images on the x- and y-axis.

As g_x and g_y refer to the images filtered on the x- and y-axis, the gradient magnitude (g) and its direction (θ) can be calculated as in Equation (1):

$g = \sqrt{g_x^2 + g_y^2}, \quad \theta = \arctan\!\left(\frac{g_y}{g_x}\right)$  (1)

The density changes of an image on the x- and y-axis (edges and corners) become apparent with the gradient of the image. Gradient images of a sample image on the x- and y-axis are given in Figure 2. It can be observed that the gradient on the x-axis makes the vertical lines clear, while the gradient on the y-axis makes the horizontal lines clear.

Figure 2. Original image (left), gradient image on x-axis (center), gradient image on y-axis (right).

After obtaining the gradient magnitudes (g) and directions (θ) for each pixel, the image is divided into blocks in the second step and the histograms of the gradients for each block are calculated. There are 9 elements (bins) in the histogram, including 0°, 20°, 40°, 60°, 80°, 100°, 120°, 140°, and 160°.
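The first steps above can be sketched as follows (an illustration, not the paper's code): the [-1, 0, 1] masks of Figure 1 are applied with cv2.filter2D, Equation (1) is evaluated with cv2.cartToPolar, and, for brevity, a single 9-bin orientation histogram is built instead of the per-block histograms described in the text. The file name "sample.jpg" is a placeholder.

```python
import cv2
import numpy as np

img = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # placeholder image

# Gradient images on the x- and y-axis using the [-1, 0, 1] masks of Figure 1
# (the Sobel operator could be used instead, as noted above).
kx = np.array([[-1.0, 0.0, 1.0]], dtype=np.float32)
gx = cv2.filter2D(img, cv2.CV_32F, kx)        # horizontal density changes
gy = cv2.filter2D(img, cv2.CV_32F, kx.T)      # vertical density changes

# Equation (1): gradient magnitude g and direction theta for every pixel.
g, theta = cv2.cartToPolar(gx, gy, angleInDegrees=True)
theta = theta % 180                           # unsigned orientations, as HOG uses

# One 9-bin orientation histogram (0, 20, ..., 160 degrees) weighted by magnitude.
hist, _ = np.histogram(theta, bins=9, range=(0, 180), weights=g)
print(hist)
```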

2.3.2. Local Binary Patterns (LBP)

The Local Binary Patterns method was first proposed by Ojala et al. [16,17]. This method is based on the local neighborhood of pixels and is commonly used in pattern recognition applications. Unlike Eigenfaces and Fisherfaces, which are other methods commonly used in face recognition, the LBP method is not a holistic method; it allows statistical extraction of the features. One of the strengths of LBP is that it is not affected too much by light changes.

In the original version of LBP, the pixel value in the center of each 3 × 3 pixel block of an image is set as the threshold value, and this value is compared with the adjacent pixels individually. When the adjacent pixel value is lower than the threshold value, the new value of the adjacent pixel is assigned as "0"; otherwise, it is assigned as "1". After this process is completed for the 8 adjacent pixels, the adjacent pixels are taken sequentially according to the determined direction (clockwise or counterclockwise) and an 8-digit binary number is obtained. The decimal equivalent of this binary number becomes the LBP code of the center pixel. An LBP image is obtained by processing all pixels in the image. An example thresholding process is shown in Figure 3.

Figure 3. Application of the Local Binary Patterns method.

As (x_m, y_m) is the coordinate of the pixel in the center of the 2-dimensional image, d_m is the value of the pixel in the center, and d_p is the value of the adjacent pixel processed, the operation explained above can be expressed as in Equation (2), where P = 8 is the number of adjacent pixels:

$LBP(x_m, y_m) = \sum_{p=0}^{P-1} 2^{p}\, f(d_p - d_m), \qquad f(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$  (2)
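A direct, unoptimized sketch of Equation (2) for the original 3 × 3 operator (for illustration only; the LBPH implementation in OpenCV performs this internally):

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP of Equation (2), computed for every interior pixel."""
    gray = gray.astype(np.int32)
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # Eight neighbours taken clockwise, starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = gray[y, x]
            code = 0
            for p, (dy, dx) in enumerate(offsets):
                # f(d_p - d_m): 1 if the neighbour is not below the centre, else 0.
                if gray[y + dy, x + dx] >= center:
                    code |= 1 << p
            out[y, x] = code
    return out
```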
It was observed that the original LBP method, processed with fixed 3 × 3 blocks, was successful in classifying textures, but it did not give successful results at different scales. In their study, Ahonen et al. developed this method by using it in face recognition applications and demonstrated the success of the method [18]. The new method, which is accepted as the improved version of LBP and is also known as Circular LBP, made the current method more stable with two new parameters, radius (r) and pixel count (n), to overcome the scaling problem. Neighborhood examples in the circular LBP are shown in Figure 4.

Figure 4. Circular LBP examples: (a) r = 1, n = 8; (b) r = 2, n = 16; (c) r = 2, n = 8.

Considering that the LBP radius is r, the LBP adjacent pixel count is n, the adjacent pixel to be processed is p, the coordinate of the center pixel is (x_m, y_m), and the coordinate of the adjacent pixel to be processed is (x_p, y_p) with p ∈ n, the coordinate of the adjacent pixel for Circular LBP can be found as in Equation (3):

$x_p = x_m + r \cos\!\left(\frac{2\pi p}{n}\right), \qquad y_p = y_m + r \sin\!\left(\frac{2\pi p}{n}\right)$  (3)

In case the coordinate found in this way does not correspond to an exact pixel on the 2-dimensional image, the value can be found by applying bilinear interpolation as in Equation (4):

$f(x, y) = \begin{bmatrix} 1 - x & x \end{bmatrix} \begin{bmatrix} f(0,0) & f(0,1) \\ f(1,0) & f(1,1) \end{bmatrix} \begin{bmatrix} 1 - y \\ y \end{bmatrix}$  (4)
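Equations (3) and (4) together can be sketched as a small helper that samples one circular neighbour with bilinear interpolation (a didactic, unvectorized version that assumes the 2 × 2 neighbourhood lies inside the image):

```python
import numpy as np

def circular_neighbour(gray, xm, ym, r, n, p):
    """Value of the p-th of n neighbours on a circle of radius r around (xm, ym)."""
    # Equation (3): coordinates of the neighbour on the circle.
    xp = xm + r * np.cos(2.0 * np.pi * p / n)
    yp = ym + r * np.sin(2.0 * np.pi * p / n)
    x0, y0 = int(np.floor(xp)), int(np.floor(yp))
    fx, fy = xp - x0, yp - y0
    # Equation (4): bilinear interpolation between the four surrounding pixels.
    patch = gray[y0:y0 + 2, x0:x0 + 2].astype(np.float64)
    wy = np.array([1.0 - fy, fy])
    wx = np.array([1.0 - fx, fx])
    return wy @ patch @ wx
```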
After finding the decimal LBP codes for the pixels in each block and forming the new LBP image, LBP histograms for each block in the new image are formed. By adding the value of each pixel in the block being processed, the histogram for that specific block is obtained. This process is repeated for all blocks in the image. A sample LBP image and the combined histogram of this image are shown in Figure 5.

Figure 5. Forming the LBP histogram; the image was taken with the permission of Kelvin Salton do Prado from his article titled "Face Recognition: Understanding LBPH Algorithm" [60].
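In practice, the LBP image, the per-block histograms, and the histogram matching described above are bundled in OpenCV's contrib module as the LBPH face recognizer, which matches the LBP-based face recognition used in this study. A minimal usage sketch with hypothetical file names and labels:

```python
import cv2
import numpy as np

# Hypothetical training data: grayscale face crops and integer student IDs.
faces = [cv2.imread(p, cv2.IMREAD_GRAYSCALE)
         for p in ["student0_1.png", "student0_2.png", "student1_1.png"]]
labels = np.array([0, 0, 1], dtype=np.int32)

# radius / neighbors correspond to the Circular LBP parameters (r, n) described above;
# grid_x / grid_y set how many blocks the LBP image is divided into for the histograms.
recognizer = cv2.face.LBPHFaceRecognizer_create(radius=1, neighbors=8, grid_x=8, grid_y=8)
recognizer.train(faces, labels)
recognizer.write("students_lbph.yml")

# Prediction returns the closest label and a distance (lower means more similar).
label, distance = recognizer.predict(faces[0])
print(label, distance)
```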

2.3.3. Support Vector Machines (SVM)

The Support Vector Machines machine-learning method is a supervised learning method that can perform classification and regression on a labeled training dataset. The elements that separate the classes and are closest to the decision boundary are called the support vectors.

After training, a new point given to the system is classified by the SVM model according to its position with respect to the decision boundary.

Our n-element training set that can be separated linearly can be expressed as $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$. Here, $\vec{x}_i$ indicates each element in our training set and $y_i$ indicates which class this element belongs to, which is either −1 or 1. $\vec{w}$ refers to the weight vector (the normal vector perpendicular to the decision boundary), b refers to the bias, and the decision boundary for a linear SVM can be expressed as in Equation (5):

$\vec{w} \cdot \vec{x} - b = 0$  (5)

Two parallel lines (the negative and positive hyperplanes) that are used to find the decision boundary and separate the classes from each other are as in Equation (6):

$\vec{w} \cdot \vec{x} - b = 1, \qquad \vec{w} \cdot \vec{x} - b = -1$  (6)

These explanations can be seen more clearly in Figure 6.

Figure 6. Decision boundary and support vectors in SVM.

The basic SVM has a linear kernel; however, it has been developed further so that it can also perform classification in cases where linear classification is not possible. In this case, the data are moved to a higher-dimensional space and classified using the "Kernel Trick" method. For this purpose, various kernel functions can be used with SVM. Commonly used ones are the Linear, Polynomial, Gaussian Radial Basis Function, and Sigmoid kernel functions.
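As a compact illustration of this classifier (not the paper's exact training code), the sketch below trains a linear SVM with OpenCV's ml module on invented placeholder features and classifies a new sample by its side of the decision boundary; in this work the real features are the head pose angles and the Eye Aspect Ratio described in the following subsections.

```python
import cv2
import numpy as np

# Placeholder training data: rows are feature vectors (e.g., yaw, pitch, EAR);
# labels are 1 = "Engaged" and -1 = "Not Engaged".
samples = np.array([[ 2.0,  5.0, 0.31],
                    [35.0, 10.0, 0.12],
                    [-3.0, -4.0, 0.29],
                    [41.0, 22.0, 0.10]], dtype=np.float32)
labels = np.array([1, -1, 1, -1], dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)   # an RBF or polynomial kernel could be set instead
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)

# A new observation is classified by its position relative to the decision boundary.
_, pred = svm.predict(np.array([[5.0, 3.0, 0.27]], dtype=np.float32))
print("Engaged" if int(pred[0, 0]) == 1 else "Not Engaged")
```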

2.3.4. Solving the PnP Problem and Finding Euler Angles

Perspective-n-Point, or simply the PnP problem, is the problem of predicting the pose of a calibrated camera if the coordinates of n points of an object are known in both 2-dimensional and 3-dimensional space. With the solution of this problem, the rotation and translation of objects in front of the camera can be estimated with respect to the camera. We can express the rotational movement of a 3-dimensional (3D) object relative to the camera with the rotation angles around the x-, y-, and z-axis. The same is valid for the translation. In order to know the translation of a 3D object with respect to the camera, it is necessary to know the translation amount along the x-, y-, and z-axis. The movement of an object with respect to the camera, which is the combination of the rotation and translation values, is called the pose in the literature and requires 6 values to be known.

In the image processing field, many methods have been proposed to solve this problem, which attracts the attention of researchers. Among these solutions, the solution of Gao et al. [61], known as P3P, the solution of Lepetit et al. [62], known as EPnP, and the PnP solutions proposed by Schweighofer and Pinz [63] stand out as non-iterative solutions. The solution proposed by Lu et al. [64] is one of the iterative solutions with high accuracy.

The solvePnP function that comes with the OpenCV library provides many of these solutions to researchers. In order to use the solvePnP() function, one requires:
• 3-dimensional coordinates of the points of the object;
• 2-dimensional coordinates of the same points in the image of the object;
• Parameters specific to the camera used.

After the solvePnP function is run, it produces a rotation vector and a translation vector as output parameters.

The Euler angles are named after Leonhard Euler. These angles, introduced in 1776 by Leonhard Euler, are the three angles describing the orientation of a rigid object with respect to a fixed coordinate system [65]. The Euler angles, also known as the pitch, yaw, and roll angles for a head example, are illustrated in Figure 7.

Figure 7. Euler angles.

In order to calculate the Euler angles, the Rodrigues rotation formula [66] is applied to the rotation vector obtained from the solvePnP function and the rotation matrix is obtained. For this transformation, the Rodrigues function that comes ready with the OpenCV library can be used. Using the obtained rotation matrix, the Euler angles corresponding to the rotation matrix are calculated by applying a transformation as described in [67].
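A hedged sketch of this step is given below. The six 3D model points, the approximate camera matrix, and the zero distortion coefficients are assumptions made for the example (they are not the calibration values of the paper); the flow — solvePnP for the rotation vector, Rodrigues for the rotation matrix, then Euler angles from that matrix — follows the description above.

```python
import numpy as np
import cv2

# Generic 3D reference points of a face (nose tip, chin, eye corners, mouth corners);
# these coordinates are illustrative assumptions, not the model used in the paper.
MODEL_POINTS_3D = np.array([
    (0.0, 0.0, 0.0),            # nose tip
    (0.0, -330.0, -65.0),       # chin
    (-225.0, 170.0, -135.0),    # left eye outer corner
    (225.0, 170.0, -135.0),     # right eye outer corner
    (-150.0, -150.0, -125.0),   # left mouth corner
    (150.0, -150.0, -125.0),    # right mouth corner
], dtype=np.float64)

def head_pose_euler(image_points_2d, frame_size):
    """Pitch, yaw and roll (degrees) from six 2D landmarks (array of shape (6, 2))."""
    h, w = frame_size
    focal = w  # rough approximation of the focal length in pixels
    camera_matrix = np.array([[focal, 0, w / 2.0],
                              [0, focal, h / 2.0],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion

    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS_3D, image_points_2d,
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None

    rot_mat, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    # Euler angles from the rotation matrix (x = pitch, y = yaw, z = roll).
    sy = np.sqrt(rot_mat[0, 0] ** 2 + rot_mat[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(rot_mat[2, 1], rot_mat[2, 2]))
    yaw = np.degrees(np.arctan2(-rot_mat[2, 0], sy))
    roll = np.degrees(np.arctan2(rot_mat[1, 0], rot_mat[0, 0]))
    return pitch, yaw, roll
```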
2.3.5. Detecting Eye Aperture

Eye aperture is calculated by the ratio known as the Eye Aspect Ratio in the literature. Soukupová and Čech successfully used the Eye Aspect Ratio to detect blink moments [68].

Eye aperture is an important parameter in order to determine whether the student is sleepy or distracted. When the eyes are narrowed or closed, the eye aspect ratio approaches 0. This situation is seen in Figure 8.

Figure 8. The states of the landmarks according to the open and closed position of the eyes.

By utilizing the Euclidean distance formula and the coordinates of the landmarks seen in Figure 8, the Eye Aspect Ratio can be calculated as in Equation (7):

$EAR = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert}$  (7)
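A small sketch of Equation (7); the landmark ordering p1–p6 follows Figure 8, and the dlib 68-point indices mentioned in the comments (36–41 and 42–47 for the two eyes) are an assumption of this example.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: array of six (x, y) landmarks ordered p1..p6 as in Figure 8 / Equation (7)."""
    a = np.linalg.norm(eye[1] - eye[5])   # ||p2 - p6||
    b = np.linalg.norm(eye[2] - eye[4])   # ||p3 - p5||
    c = np.linalg.norm(eye[0] - eye[3])   # ||p1 - p4||
    return (a + b) / (2.0 * c)

# With a dlib 68-point shape, the eyes occupy indices 36-41 and 42-47 (assumed ordering):
# landmarks = np.array([[pt.x, pt.y] for pt in shape.parts()], dtype=np.float64)
# ear = (eye_aspect_ratio(landmarks[36:42]) + eye_aspect_ratio(landmarks[42:48])) / 2.0
```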

2.4. Dataset Used

In order to test the success of face recognition, head pose estimation, and student engagement detection in the developed application, the UPNA Head Pose Database was used [69].

UPNA Head Pose Database

This database is composed of videos of 10 different people, including 6 men and 4 women. It was designed to be used in studies about head tracking and head pose. In the database, there are 12 video files for each person. People in these videos moved (translation) and rotated (rotation) their heads along the x-, y-, and z-axis. The videos in the database were recorded at 1280 × 720 resolution at 30 fps in MPEG-4 format and each video is composed of 10 s, that is, 300 frames [70].

Along with the database, text documents containing the measurement values obtained through the sensors for each video are also shared. Three types of text documents are shared for each video:
2-dimensional measurement values (*_groundtruth2D.txt): It consists of 300 rows, one row for each frame. Each row has 108 columns separated by the TAB character. In these 108 columns, there are the x and y coordinate values (x1 y1 x2 y2 … x54 y54) of 54 facial landmarks.
3-dimensional measurement values (*_groundtruth3D.txt): It consists of 300 rows, one row for each frame. Each row has 6 columns separated by the TAB character. In these columns, the translation amounts of the head along the x-, y-, and z-axis are given in mm and the rotation angles are given in degrees. It has been reported that these values are measured with high precision with the help of a sensor attached to the user’s head [70]. These measurement values are as follows:
a. Translation amount of the head along the x-axis (Tx);
b. Translation amount of the head along the y-axis (Ty);
c. Translation amount of the head along the z-axis (Tz);
d. Roll, which is the rotation angle along the z-axis (roll);
e. Yaw, which is the rotation angle along the y-axis (yaw);
f. Pitch, which is the rotation angle along the x-axis (pitch).
3-Dimensional face model (*_model3D.txt): This file contains the x, y, and z coordinates of 54 landmarks on the face for the corresponding person [70].
The sample images of the videos in the database are seen in Figure 9.
Figure 9. Sample images taken from the UPNA Head Pose Database.

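For reference, the ground-truth files described above can be read with a few lines of NumPy. The sketch below is a minimal example; the file name is hypothetical and the column order follows items a–f.

```python
import numpy as np

# Hypothetical file name; the database ships one *_groundtruth3D.txt per video.
GT3D_FILE = "User_01_video_01_groundtruth3D.txt"

# 300 rows (one per frame), 6 TAB-separated columns:
# Tx, Ty, Tz in mm and roll, yaw, pitch in degrees.
gt = np.loadtxt(GT3D_FILE)
tx, ty, tz, roll, yaw, pitch = gt.T

print(gt.shape)     # expected: (300, 6)
print(pitch[:5])    # pitch angle of the first five frames
```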
3. Developed Algorithm of Student Engagement Classification Software (SECS)
Student Engagement Classification Software is designed to determine distraction in real time according to the head poses of students and also to recognize them.
SECS software consists of two parts. These sections are:
1. Face Recognition Model Preparation Module;
2. Main module.

3.1. Face Recognition Model Preparation Module


The algorithm of the SECS software for preparing the face recognition model is developed as below.
As can be seen in the flowchart (Figure 10), before starting the recording process, a student name must be entered. Then, 50 photos are taken for each student at 5-s intervals through the webcam, and the student is asked to pose as differently as possible during the recording process (such as looking in different directions, rotating his/her head, posing with and without glasses, with eyes closed and opened, with different facial expressions, etc.). After the recording process is completed for all students, the face recognition model is created using the captured photos.

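A minimal sketch of this model preparation step is given below. It assumes one folder of captured grayscale, aligned face images per student and uses the LBPH recognizer from opencv-contrib-python; the folder layout, file names, and function name are illustrative.

```python
import os
import cv2
import numpy as np

def train_lbph_model(dataset_dir="students", model_path="lbph_model.yml"):
    """Train an LBP face recognition model from the captured student photos."""
    images, labels, names = [], [], {}
    for label, student in enumerate(sorted(os.listdir(dataset_dir))):
        names[label] = student
        student_dir = os.path.join(dataset_dir, student)
        for file_name in os.listdir(student_dir):
            img = cv2.imread(os.path.join(student_dir, file_name), cv2.IMREAD_GRAYSCALE)
            if img is not None:
                images.append(img)
                labels.append(label)
    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(images, np.array(labels, dtype=np.int32))
    recognizer.write(model_path)   # saved model is later loaded by the main module
    return names
```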
Figure 10. Flowchart of face recognition model preparation module of SECS.
3.2. Main Module
In the main part of SECS, face detection from the real-time images taken from the camera, facial landmark extraction, face recognition, head pose estimation, and student engagement classification processes are carried out.
In addition to these, if the student is not far from the camera and the eye area can be properly determined with image processing, eye aperture is also calculated to increase the success of student engagement classification. This parameter is added optionally so that the program can be used in e-learning environments.
The flow diagram of the main module of the SECS software is shown in Figure 11. The methods used in the flowchart are as follows: histogram of oriented gradients for face detection, local binary patterns for face recognition, PnP for head pose estimation, and SVM for classification.
The “Engaged” or “Not Engaged” labels that appear on the screenshots are the real-time student engagement classification results generated for each frame by the SVM machine learning algorithm used in the application.
In addition to the real-time tracking of the distraction results, it is also possible to record and report the results over a certain period of time. After the “Start Attention Tracker” button is clicked, the application recognizes the faces detected in each frame and records the student engagement classification results produced by SVM for the relevant people. In addition, the total distraction rates for the detected faces are displayed instantly on the screen. These total distraction results are shown in Figure 12.
Figure 11. Flowchart of main module of SECS.

Figure 12. SECS main module—instant total distraction results.

In the screenshot in Figure 12, the faces detected by the camera for 3 min and 56 s are seen to have been focused (engaged) for 62.5% of the specified time. Screenshots of the SECS software are given in Figure 13.

Figure 13. SECS main module—various screenshots.

The reporting process is stopped when the “Stop Attention Tracker” button is clicked, and the student engagement classification results are recorded in the results.csv file. This file includes the name of each recognized person and the number of frames in which that person is “Engaged” and “Not Engaged” during the follow-up period. In the case when face detection is performed but the student is not recognized, the student engagement classification results of that person are shown in a line named “default” instead of the student’s name. A sample .csv file (result.csv) opened in Excel is shown in Table 1.

Table 1. SECS main module—student engagement classification results.

PersonName      Engaged     NotEngaged
Ugur            4301        336
Berna           4214        423

4. Results
The results obtained from this application developed to classify student engagement during a course will be investigated under 3 categories. The results of face recognition, head pose estimation, and the evaluation of the whole system will be discussed, respectively.
4.1. Results Related to Face Recognition
In order to measure the success of the developed application on face recognition, the UPNA Head Pose Database was used. For this purpose, face alignment, conversion to grayscale, and resizing were applied to 50 face pictures of each of the 10 students, derived from the video images of each student in the database, and these photographs were labelled and recorded with the student’s name to be used in the preparation of the face recognition model. By training the LBP, Fisherfaces, and eigenfaces face recognition algorithms in the OpenCV library with the labelled photos, a face recognition model was created for each method. Then, using the trained LBP, Fisherfaces, and eigenfaces face recognition models, the face recognition process was performed in each frame of each video in the UPNA Head Pose Database and the predicted labels for each frame were recorded.
When the predicted labels are compared with the real label values, the LBP method achieved the highest recognition rate with 98.95%. The average face recognition accuracy rates for all people in the UPNA Head Pose Database were 98.69% for the Fisherfaces method and 94.57% for the eigenfaces method. The details are shown in Table 2. The LBP method, giving the best results, was therefore preferred as the face recognition algorithm in the developed application.

Table 2. The results of the studied face recognition algorithms as percentages.

LBP % Fisherface % Eigenfaces %


User_01 98.69 99.86 90.13
User_02 100 100 99.25
User_03 96.36 97.58 90.22
User_04 98.97 96.16 95.58
User_05 100 100 96.22
User_06 99.52 99.75 96.52
User_07 99.08 100 95.77
User_08 99.80 99.88 97.38
User_09 99.83 96.36 94.11
User_10 97.30 97.33 90.55
Average 98.95% 98.69% 94.57%

4.2. Results of Head Pose Estimation


Head pose estimation in the developed application was obtained with the solution of
the Perspective-n-Point problem explained previously. In order to determine the success of
the part related to the head pose estimation, head pose angles, calculated by the developed
application for each frame (each video is composed of 300 frames) of each video (there are
a total of 120 videos in the database), were estimated and recorded. The recorded results
are compared with the reference values of head pose angles (pitch, yaw, and roll) that come
with the UPNA Head Pose Database and were measured with sensors for each frame of
each video. By grouping the comparison results to the persons in the database, the mean
absolute deviations of Euler angles were determined and the mean absolute deviations
from ground-truth data in pitch, yaw, and roll angles were calculated as 1.34◦ , 4.97◦ , and
4.35◦ respectively. The details are shown in Table 3.
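The evaluation itself reduces to a mean absolute deviation between the estimated angles and the sensor-measured ground truth. A minimal sketch follows; the file names and array layout are hypothetical placeholders for the per-user results.

```python
import numpy as np

# Hypothetical arrays of shape (n_frames, 3) holding (pitch, yaw, roll) in degrees.
estimated = np.load("estimated_angles_user01.npy")    # produced by the application
ground_truth = np.load("sensor_angles_user01.npy")    # taken from *_groundtruth3D.txt

# Mean absolute deviation per angle over all frames of the user's videos.
mad = np.mean(np.abs(estimated - ground_truth), axis=0)
print("pitch / yaw / roll MAD (deg):", mad)
```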

Table 3. Comparison of the angles taken from the UPNA Head Pose Database with the angles
calculated by our algorithm.

Pitch◦ Yaw◦ Roll◦


User_01 1.65◦ 2.68◦ 1.94◦
User_02 1.10◦ 2.39◦ 1.52◦
User_03 1.09◦ 5.77◦ 2.94◦
User_04 1.40◦ 3.08◦ 1.95◦
User_05 1.24◦ 6.50◦ 4.36◦
User_06 1.76◦ 4.56◦ 4.52◦
User_07 1.24◦ 4.02◦ 2.85◦
User_08 1.33◦ 9.50◦ 6.24◦
User_09 0.96◦ 6.23◦ 5.70◦
User_10 1.59◦ 4.94◦ 11.44◦
Average 1.34◦ 4.97◦ 4.35◦

4.3. Results Related to the Overall System


Two separate test scenarios were conducted to measure the success of the SECS application. In the first scenario, it was examined whether a student engagement classification could be made independent of the machine learning algorithms. For this purpose, the developed application was modified and threshold values for the pitch, yaw, and roll angles were determined. Then, the student was accepted as “Not Engaged” when he/she did not look at the camera (i.e., when at least one of the pitch, yaw, and roll angles fell outside its threshold).
Since the webcam used has insufficient viewing angle, a 4-minute test was conducted
with 2 students at the same time. In each minute of these 4 min, the students followed
the instructions given to them. For example, when Student1 was asked to look at the
camera during the 3rd minute, Student 2 was asked to look in other directions. With the
instructions given to Student 1 and Student 2, the student engagement classification process
was initiated, and the process was finished at the end of the 4th minute. The instructions given (on the left) and the obtained results are shown in Table 4.
Table 4. Overall success of SECS—first scenario.

                 Student 1                     Student 2
                 Instruction    Result         Instruction    Result
1. Minute        100%           96%            100%           93%
2. Minute        50%            50%            50%            45%
3. Minute        100%           88%            0%             0%
4. Minute        70%            66%            35%            30%

In the second scenario, a test was carried out with the machine learning algorithms. For this purpose, we created a student engagement dataset by using the UPNA Head Pose Database. We extracted 100 random frames for each person from the UPNA Head Pose Database and created a dataset that consists of a total of 1000 images. Five human labelers annotated each image as (0)—“Not Engaged” or (1)—“Engaged”. We measured the consistency between the labelers based on Fleiss’ kappa measure [70], which is a statistical approach for assessing the interrater reliability of more than two raters. The percent agreement and Fleiss’ kappa values of our five labelers were calculated as 0.95 and 0.85, respectively. These values indicate that there is strong agreement between the labelers. Then, we labelled each image according to the majority decision of our labelers.
Sample images from our labelled dataset are shown in Figure 14.
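A short sketch of this labeler-consistency check is given below. It assumes a hypothetical file holding the raw 0/1 annotations (1000 images × 5 labelers) and uses the Fleiss' kappa implementation in statsmodels; the pairwise definition of percent agreement is one common choice and is stated here as an assumption.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical (1000, 5) array: one row per image, one column per labeler,
# with 0 = "Not Engaged" and 1 = "Engaged".
labels = np.loadtxt("labels.csv", delimiter=",", dtype=int)

# Percent agreement: average share of images on which each pair of labelers agrees.
n_raters = labels.shape[1]
pairs = [(i, j) for i in range(n_raters) for j in range(i + 1, n_raters)]
percent_agreement = np.mean([(labels[:, i] == labels[:, j]).mean() for i, j in pairs])

# Fleiss' kappa expects per-image category counts.
counts, _ = aggregate_raters(labels)
kappa = fleiss_kappa(counts, method="fleiss")

# Final label of each image: majority decision of the five labelers.
majority = (labels.sum(axis=1) >= 3).astype(int)
print(percent_agreement, kappa)
```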
Figure 14. Sample photos from the dataset formed to test the overall success of SECS (first line: “Engaged”; second line: “Not Engaged”).

After the labelling process is completed, head pose angles and eye aperture ratios in
each of these 1000 photos were calculated and recorded as described in Section 3. For each
photo saved, an attribute vector in the form of:

v = [pitch, yaw, roll, eye_aspect_ratio, label]

was obtained and half of the 1000 photos was determined as training set and the other half
was determined as the test set.
Random Forest, Decision Tree, Support Vector Machines, and K Nearest Neighbor
machine learning algorithms in the OpenCV library were trained with the prepared training
set, and these models were tested over the test set.
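For illustration, the sketch below trains and evaluates one of these classifiers (an OpenCV SVM) on such feature vectors. The input file names, the half/half split order, and the RBF kernel with default parameters are assumptions, not the exact configuration used in the study.

```python
import numpy as np
import cv2

# Hypothetical files: (1000, 4) float features [pitch, yaw, roll, eye_aspect_ratio]
# and (1000,) integer labels with 1 = "Engaged", 0 = "Not Engaged".
features = np.load("engagement_features.npy").astype(np.float32)
labels = np.load("engagement_labels.npy").astype(np.int32)

# Half of the photos for training, the other half for testing (order assumed shuffled).
train_x, test_x = features[:500], features[500:]
train_y, test_y = labels[:500], labels[500:]

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_RBF)
svm.train(train_x, cv2.ml.ROW_SAMPLE, train_y)

_, predictions = svm.predict(test_x)
accuracy = (predictions.ravel().astype(np.int32) == test_y).mean()
print(f"SVM accuracy: {accuracy:.1%}")
```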
The most successful result in student engagement classification was obtained with
Support Vector Machines with an accuracy rate of 72.4% and, thus, the use of SVM was
preferred in the student engagement classification part of the developed application. The
details regarding the obtained results are shown in Table 5.

Table 5. Overall success of SECS—second scenario (by using estimated Euler angles).

Accuracy Rate %
SVM 72.4%
K Nearest Neighbor 71.6%
Random Forest 70.6%
Decision Tree 70.0%

In addition, a second test for the same scenario was carried out on the training and
test set prepared by using Euler angles measured with sensors that come with the UPNA
Head Pose Database instead of the estimated Euler angles. As a result of this test, the most
successful results were obtained with the Random Forest machine learning algorithm with
78.6% accuracy rate. Details regarding the obtained results are shown in Table 6.

Table 6. Overall success of SECS—second scenario (by using Euler angles measured with sensors).

Accuracy Rate %
SVM 76.4%
K Nearest Neighbor 77.0%
Random Forest 78.6%
Decision Tree 77.8%

5. Conclusions
With the application developed for this study:
• A total of 50 random photo frames were taken for each person from videos of 10 people
in the UPNA Head Pose Database; these photos were saved in separate folders with
the names of those people and a face recognition model was prepared using these
photos. The prepared face recognition model was tested using all videos in the UPNA
Head Pose Database and a 98.95% face recognition success was achieved with the
LBP method.
• Head pose angles were estimated for 120 videos each of which consists of 300 frames
in the UPNA Head Pose Database using image processing techniques. The obtained
pitch, yaw, and roll angles were compared with the angle values measured precisely
by the sensors. The estimated pitch, yaw, and roll angles were obtained with mean
absolute deviation values of 1.34◦ , 4.97◦ , and 4.35◦ , respectively.
In the literature, 75.3% success was reported for a model trained with data obtained with the Kinect One sensor, which was used to determine the distraction of students, and 75.8% success was reported in another study.
In this study, using 1000 photos taken from the videos from the UPNA Head Pose
Database, a distraction database was prepared. Pitch, yaw, roll, and eye aperture rates
were calculated for each photo and feature vectors were formed. Each photo in the dataset
was labelled as “Engaged” or “Not Engaged” by 5 labellers. Fleiss’ kappa coefficient was
calculated as 0.85 (high reliability) in order to determine the reliability of the labelling
process. While assigning labels to each photo in the dataset, the majority of decisions were
taken as the basis. Half of the labelled dataset was used in the training of machine learning
algorithms and the other half was used in testing. In the conducted tests, the SVM machine
learning algorithm showed 72.4% success rate in determining distraction.
Considering the results of this study on student engagement classification, it is seen that a simple CMOS camera gave results similar to those of the studies conducted with the Kinect sensor. In addition, the developed system is more useful in terms of recognizing the student and associating the student engagement result with that student.

In this study, algorithms were developed for face detection and recognition, estimation
of head pose, and classification of student engagement in the lesson, and successful results
were obtained.
However, the viewing angle and working distance were limited by the camera used in the developed system, which particularly affects the success of the face recognition algorithm. In order for the system to provide more successful results in the classroom environment, it is planned to conduct in-class tests using more professional, wide-angle cameras and additional lighting.
A labelled dataset consisting of 1000 photos was prepared for the training and tests
of machine learning algorithms that will determine student engagement. Expanding this
dataset will make the results more successful.
The feature vector used in training student engagement classifiers consists of 5 param-
eters. In addition to these parameters, adding features, such as eye direction, emotional
state of the student, and body direction, is predicted to be effective in increasing the success
of student engagement classification.
We continue to work on algorithms that guide the teaching method of the course by
determining the course topics and weights, taking into account the characteristics of the
students. To give a brief detail, data from this study are one of the input parameters of
the fuzzy logic algorithm we use to determine the level of expression during the lesson.
Students in front of the computer or in the classroom who are engaged in the lesson are counted. The obtained value and the time spent in that lesson are entered into the fuzzy logic algorithm as input data. From these inputs, it is decided at which of the prepared levels the lecture of the course will be delivered, so that the student's interest is maintained during the lesson. Our work on this issue continues.

Author Contributions: Conceptualization, E.Ö.; methodology, E.Ö. and M.U.U.; software, M.U.U.;
validation, M.U.U.; formal analysis, E.Ö.; investigation, E.Ö. and M.U.U.; resources, M.U.U.; data
curation, M.U.U.; writing—original draft preparation, E.Ö.; writing—review and editing, E.Ö. and
M.U.U.; visualization, E.Ö. and M.U.U.; supervision, E.Ö.; project administration, E.Ö.; funding
acquisition. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: In this study, no data were collected from the students by
camera or any other way. In order to test the success of face recognition, head pose estimation and
student engagement detection in the developed application, the UPNA Head Pose Database was
used [64].
Informed Consent Statement: The person in the photograph in Figures 2, 10, 12 and 13 is my
graduate student, Mustafa Uğur Uçar, with whom we work in this study. That is why I attached the
informed consent form for subjects. The people in the photos in Figures 9 and 14 are the photos of
the people in the UPNA Head Pose Database. There are also photographs of these people on the
relevant website. This database is licensed under a Creative Commons Attribution-Non-Commercial-
ShareAlike 4.0 International License. For this reason, I could not add the informed consent form for
subjects in Figures 9 and 14.
Data Availability Statement: Data available on request due to privacy. The data presented in this
study are available on request from the corresponding author.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Coates, H. Engaging Students for Success: Australasian Survey of Student Engagement; Australian Council for Educational Research:
Victoria, Australia, 2008.
2. Kuh, G.; Cruce, T.; Shoup, R.; Kinzie, J.; Gonyea, R. Unmasking The Effects of Student Engagement on First Year College Grades
and Persistence. J. High. Educ. 2008, 79, 540–563. [CrossRef]
3. Günüç, S. The Relationships Between Student Engagement and Their Academic Achievement. Int. J. New Trends Educ. Implic.
(IJONTE) 2014, 5, 216–223.

4. Casuso-Holgado, M.J.; Cuesta-Vargas, A.I.; Moreno-Morales, N.; Labajos-Manzanares, M.T.; Barón-López, F.J.; Vega-Cuesta,
M. The association between academic engagement and achievement in health sciences students. BMC Med. Educ. 2013, 13, 33.
[CrossRef] [PubMed]
5. Trowler, V. Student Engagement Literature Review; Higher Education Academy: New York, NY, USA, 2010.
6. Fredricks, J.A.; Blumenfeld, P.C.; Paris, A.H. School Engagement: Potential of the Concept, State of the Evidence. Rev. Educ. Res.
2004, 74, 59–109. [CrossRef]
7. Jafri, R.; Arabnia, H. A Survey of Face Recognition Techniques. J. Inf. Process. Syst. 2009, 5, 41–68. [CrossRef]
8. Bledsoe, W.W. (a). Man-Machine Facial Recognition: Report on a Large-Scale Experiment; Technical Report PRI 22; Panoramic Research,
Inc.: Palo Alto, CA, USA, 1966.
9. Bledsoe, W.W. (b). Some Results on Multi Category Pattern Recognition. J. Assoc. Comput. Mach. 1966, 13, 304–316. [CrossRef]
10. Bledsoe, W.W.; Chan, H. A Man-Machine Facial Recognition System—Some Preliminary Results; Technical Report PRI 19A; Panoramic
Research, Inc.: Palo Alto, CA, USA, 1965.
11. Ballantyne, M.; Boyer, R.S.; Hines, L.M. Woody Bledsoe—His Life and Legacy. AI Mag. 1996, 17, 7–20.
12. Sirovich, L.; Kirby, M. Low-dimensional Procedure for the Characterization of Human Faces. J. Opt. Soc. Am. 1987, A4, 519–524.
[CrossRef]
13. Turk, M.A.; Pentland, A.P. Face Recognition using Eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, Maui, HI, USA, 3–6 June 1991; pp. 586–591.
14. Belhumeur, P.N.; Hespanha, J.P.; Kriegman, D.J. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection.
IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 711–720. [CrossRef]
15. Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188. [CrossRef]
16. Ojala, T.; Pietikäinen, M.; Harwood, D. Performance Evaluation of Texture Measures with Classification Based on Kullback
Discrimination of Distributions. In Proceedings of the 12th International Conference on Pattern Recognition (ICPR 1994),
Jerusalem, Israel, 9–13 October 1994; Volume 1, pp. 582–585. [CrossRef]
17. Ojala, T.; Pietikäinen, M.; Harwood, D. A Comparative Study of Texture Measures with Classification Based on Featured
Distributions. Pattern Recognit. 1996, 29, 51–59. [CrossRef]
18. Ahonen, T.; Hadid, A.; Pietikäinen, M. Face Description with Local Binary Patterns: Application to Face Recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [CrossRef] [PubMed]
19. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-propagating Errors. Nature 1986, 323, 533–536.
[CrossRef]
20. Le Cun, Y. A Theoretical Framework for Back-Propagation. In Connectionist Models Summer School, CMU; Touretzky, D., Hinton,
G., Sejnowski, T., Eds.; Morgan Kaufmann: Pittsburg, PA, USA, 1988; pp. 21–28.
21. İnik, Ö.; Ülker, E. Derin Öğrenme ve Görüntü Analizinde Kullanılan Derin Öğrenme Modelleri. Gaziosmanpaşa Bilimsel Araştırma
Derg. 2017, 6, 85–104.
22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of
the 25th International Conference on Neural Information Processing Systems—Volume 1 (NIPS’12), Lake Tahoe Nevada, CA, USA, 3–6
December 2012; Curran Associates Inc.: Nice, France, 2012; Volume 1, pp. 1097–1105.
23. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision
(ECCV); Springer: Cham, Switzerland, 2013; Volume 8689, pp. 818–833.
24. Khan, S.; Akram, A.; Usman, N. Real Time Automatic Attendance System for Face Recognition Using Face API and OpenCV.
Wireless Pers. Commun. 2020, 113, 469–480. [CrossRef]
25. Argyle, M.; Cook, M. Gaze and Mutual Gaze; Cambridge University Press: London, UK, 1976.
26. Emery, N.J. The Eyes Have it: The Neuroethology, Function and Evolution of Social gaze. Neurosci. Biobehav. Rev. 2000, 24,
581–604. [CrossRef]
27. Vertegaal, R.; Slagter, R.; Van Der Veer, G.; Nijholt, A. Eye Gaze Patterns in Conversations: There is more to Conversational
Agents than Meets the Eyes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Seattle, WA,
USA, 1 March 2001.
28. Stiefelhagen, R.; Zhu, J. Head Orientation and Gaze Direction in Meetings. In CHI ‘02 Extended Abstracts on Human Factors in
Computing Systems (CHI EA ‘02); ACM: New York, NY, USA, 2002; pp. 858–859.
29. Ba, S.O.; Odobez, J.M. Recognizing Visual Focus of Attention from Head Pose in Natural Meetings. IEEE Trans. Syst. Man Cybern.
Part B (Cybern.) 2009, 39, 16–33. [CrossRef]
30. Langton, S.R.; Watt, R.J.; Bruce, I.I. Do the Eyes Have It? Cues to the Direction of Social Attention. Trends Cogn. Sci. 2000, 4, 50–59.
[CrossRef]
31. Kluttz, N.L.; Mayes, B.R.; West, R.W.; Kerby, D.S. The Effect of Head Turn on the Perception of Gaze. Vis. Res. 2009, 49, 1979–1993.
[CrossRef]
32. Otsuka, Y.; Mareschal, I.; Calder, A.J.; Clifford, C.W.G. Dual-route Model of the Effect of Head Orientation on Perceived Gaze
Direction. J. Exp. Psychol. Hum. Percept. Perform. 2014, 40, 1425–1439. [CrossRef]
33. Moors, P.; Verfaillie, K.; Daems, T.; Pomianowska, I.; Germeys, F. The Effect of Head Orientation on Perceived Gaze Direction.
Revisiting Gibson and Pick (1963) and Cline (1967). Front. Psychol. 2016, 7, 1191. [CrossRef]
34. Wollaston, W.H. On the Apparent Direction of Eyes in a Portrait. Philos. Trans. R. Soc. Lond. 1824, 114, 247–256. [CrossRef]

35. Gibson, J.J.; Pick, A.D. Perception of Another Person’s Looking Behavior. Am. J. Psychol. 1963, 76, 386–394. [CrossRef] [PubMed]
36. Cline, M.G. The Perception of Where a Person is Looking. Am. J. Psychol. 1967, 80, 41–50. [CrossRef]
37. Anstis, S.M.; Mayhew, J.W.; Morley, T. The Perception of Where a Face or Television “Portrait is Looking”. Am. J. Psychol. 1969, 82,
474–489. [CrossRef]
38. Gee, A.; Cipolla, R. Determining the Gaze of Faces in Images. Image Vis. Comput. 1994, 12, 639–647. [CrossRef]
39. Sapienza, M.; Camilleri, K. Fasthpe: A Recipe for Quick Head Pose Estimation. Technical Report. 2011. Available online:
https://www.semanticscholar.org/paper/Fasthpe-%3A-a-recipe-for-quick-head-pose-estimation-Sapienza-Camilleri/cea2e0
9926e48f3682912dd1d849a5bb2f4570ac (accessed on 25 May 2020).
40. Zaletelj, J.; Kosir, A. Predicting Students’ Attention in the Classroom from Kinect Facial and Body Features. EURASIP J. Image
Video Process. 2017, 2017, 80. [CrossRef]
41. Monkaresi, H.; Bosch, N.R.; Calvo, A.; Dmello, S.K. Automated Detection of Engagement Using Video-Based Estimation of Facial
Expressions and Heart Rate. IEEE Trans. Affect. Comput. 2017, 8, 15–28. [CrossRef]
42. Raca, M. Camera-Based Estimation of Student’s Attention in Class; EPFL: Lausanne, Switzerland, 2015.
43. Krithika, L.B.; Lakshmi Priya, G.G. Student Emotion Recognition System (SERS) for E-Learning Improvement Based on Learner
Concentration Metric. Int. Conf. Comput. Model. Secur. 2016, 85, 767–776. [CrossRef]
44. Ayvaz, U.; Gürüler, H. Öğrencilerin Sınıf İçi Duygusal Durumlarının Gerçek Zamanlı Tespit Edilmesi. In Proceedings of the
25th Signal Processing and Communications Applications Conference (SIU), IEEE, Antalya, Turkey, 15–18 May 2017; pp. 1–4.
[CrossRef]
45. Bidwell, J.; Fuchs, H. Classroom Analytics: Measuring Student Engagement with Automated Gaze Tracking. 2011. Available
online: https://www.researchgate.net/publication/301956819_Classroom_Analytics_Measuring_Student_Engagement_with_
Automated_Gaze_Tracking (accessed on 6 May 2022).
46. D’Mello, S.; Olney, A.; Williams, C.; Hays, P. Gaze Tutor: A Gaze-Reactive Intelligent Tutoring System. Int. J. Hum.-Comput. Stud.
2012, 70, 377–398. [CrossRef]
47. Lee, H.-J.; Lee, D. Study of Process-Focused Assessment Using an Algorithm for Facial Expression Recognition Based on a Deep
Neural Network Model. Electronics 2021, 10, 54. [CrossRef]
48. Whitehill, J.; Serpell, Z.; Lin, Y.-C.; Foster, A.; Movellan, J. The Faces of Engagement: Automatic Recognition of Student
Engagement from Facial Expressions. IEEE Trans. Affect. Comput. 2014, 5, 86–98. [CrossRef]
49. Ngoc Anh, B.; Tung Son, N.; Truong Lam, P.; Phuong Chi, L.; Huu Tuan, N.; Cong Dat, N.; Huu Trung, N.; Umar Aftab, M.; Van
Dinh, T.A. Computer-Vision Based Application for Student Behavior Monitoring in Classroom. Appl. Sci. 2019, 9, 4729. [CrossRef]
50. Ayouni, S.; Hajjej, F.; Maddeh, M.; Al-Otaibi, S. A new ML-based approach to enhance student engagement in online environment.
PLoS ONE 2021, 16, e0258788. [CrossRef]
51. Toti, D.; Capuano, N.; Campos, F.; Dantas, M.; Neves, F.; Caballé, S. Detection of Student Engagement in e-Learning Systems
Based on Semantic Analysis and Machine Learning. In Advances on P2P, Parallel, Grid, Cloud and Internet Computing; Barolli, L.,
Takizawa, M., Yoshihisa, T., Amato, F., Ikeda, M., Eds.; 3PGCIC 2020; Lecture Notes in Networks and Systems; Springer: Cham,
Switzerland, 2021; p. 158. [CrossRef]
52. Connor. 2018. Available online: https://www.telegraph.co.uk/news/2018/05/17/chinese-school-uses-facial-recognition-
monitor-student-attention/ (accessed on 15 August 2021).
53. Afroze, S.; Hoque, M.M. Detection of Human’s Focus of Attention using Head Pose. In Proceedings of the International
Conference on Advanced Information and Communication Technology, Thai Nguyen, Vietnam, 12–13 December 2016.
54. Cöster, J.; Ohlsson, M. Human Attention: The Possibility of Measuring Human Attention using OpenCV and the Viola-Jones Face
Detection Algorithm (Dissertation); KTH Royal Institute of Technology: Stockholm, Sweden, 2015.
55. Apostoloff, N.; Zelinsky, A. Vision In and Out of Vehicles: Integrated Driver and Road Scene Monitoring. Springer Tracts Adv.
Robot. Exp. Robot. 2003, 23, 634–643. [CrossRef]
56. Morais, R. Seeing A Way To Safer Driving. Forbes. 2004. Available online: https://www.forbes.com/2004/01/23/cz_rm_0123
davosseeing.html (accessed on 11 September 2021).
57. Zelinsky, A. Smart Cars: An Ideal Applications Platform for Robotics and Vision Technologies; Colloquium on Robotics and Automation:
Napoli, Italy, 2006.
58. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
Available online: https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf (accessed on 11 September 2019).
59. Mallick, S. Histogram of Oriented Gradients. 2016. Available online: https://www.learnopencv.com/histogram-of-oriented-
gradients/ (accessed on 20 June 2019).
60. Do Prado, K.S. Face Recognition: Understanding LBPH Algorithm. 2017. Available online: https://towardsdatascience.com/
face-recognition-how-lbph-works-90ec258c3d6b (accessed on 30 August 2021).
61. Gao, X.S.; Hou, X.R.; Tang, J.; Cheng, H.F. Complete Solution Classification for the Perspective-Three-Point Problem. Pattern
Analysis and Machine Intelligence. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 930–943.
62. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An Accurate O(n) Solution to the PnP Problem. Int. J. Comput. Vis. 2009, 81, 155–166.
[CrossRef]

63. Schweighofer, G.; Pinz, A. Robust Pose Estimation from a Planar Target. Pattern Analysis and Machine Intelligence. IEEE Trans.
2006, 28, 2024–2030.
64. Lu, C.P.; Hager, G.D.; Mjolsness, E. Fast and Globally Convergent Pose Estimation from Video Images. Pattern Analysis and
Machine Intelligence. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 610–622. [CrossRef]
65. Euler, L. Novi Commentari Academiae Scientiarum Petropolitanae. Novi Comment. Acad. Sci. Imp. Petropol. 1776, 20, 189–207.
66. Wikipedia-RRF. Available online: https://en.wikipedia.org/wiki/Rodrigues%27_rotation_formula (accessed on 30 August 2021).
67. Slabaugh, G.G. Computing Euler Angles from a Rotation Matrix. 1999. Available online: http://eecs.qmul.ac.uk/~{}gslabaugh/
publications/euler.pdf (accessed on 6 May 2022).
68. Soukupová, T.; Cech, J. Real-Time Eye Blink Detection using Facial Landmarks. In Proceedings of the 21st Computer Vision
Winter Workshop, Rimske Toplice, Slovenia, 3–5 February 2016.
69. Ariz, M.; Bengoechea, J.; Villanueva, A.; Cabeza, R. A novel 2D/3D Database with Automatic Face Annotation for Head Tracking
and Pose Estimation. Comput. Vis. Image Underst. 2016, 148, 201–210. [CrossRef]
70. Gwet, K.L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement among Raters; Advanced
Analytics; LLC.: Gaithersburg, MD, USA, 2014.
