Thesis committee:
Tilburg University
Faculty of Humanities
Department of Communication and Information Sciences
Tilburg center for Cognition and Communication (TiCC)
Tilburg, the Netherlands
January 2012
PREFACE
"A lot of people give up just before they're about to make it. You know you never know when that next obstacle is going to be the last one."
Chuck Norris
I guess that's the way it goes with everything in life. Every goal has its obstacles, and overcoming them is part of reaching your goal. Life isn't always going to be easy or fast, and it often requires thought and effort to overcome the problems that lie ahead. For me, graduation was no exception.

This thesis marks the end of my time here at Tilburg University. I was proud when I received my Bachelor's degree in the Business Communication and Digital Media track, but becoming a Master of Arts in the Human Aspects of Information Technology track has been my goal from the very beginning. It took me 5.5 years to get to this point, slightly longer than most. I am proud to say that I finally achieved my goal of becoming a master! I could say that the delay was merely caused by obstacles and difficulties, and surely there were some, but let's be honest: most of it was caused by me. During my time at Tilburg University I may have been a little lazy, but most of all I've had a lot of fun. John Lennon once said: "Time you enjoy wasting, was not wasted," and let me tell you, I strongly agree with him!

I want to take this opportunity to thank all the people who helped me reach my goal of becoming a master. First of all I want to thank Eric Postma for bearing with me for the past couple of years. Guiding me through both my Bachelor's and my Master's thesis must have been quite a challenge. I also want to thank Bart Joosten for his technical support and constant willingness to help me during the writing of my thesis. Furthermore, during my time here at TiU two groups have played a vital role. First, I want to thank the people from the T-gang for the fun times we had together and the memories we created (Peace up T-town!). Second, I want to thank all the people from Chuck Norris HQ for all the useless and useful discussions we had. Finally, I want to thank my mom, dad and brothers for always pushing me towards my goal, even when I didn't want to be pushed. Now it is time to set a new goal, with new obstacles to overcome with the help of friends and family. Thank you all!
Sincerely,
Davy Verbeek
SUMMARY
Communication is one of the most important aspects of life. Faces play a central role in communication, so it is essential that humans can read and interpret facial expressions. In this thesis we focus on the emotional content of broadcast news items. Research conducted by Swerts and Krahmer (2010) revealed that humans can guess the emotional content of a news item (i.e., bad news versus good news) based solely on the facial expressions of newsreaders. State-of-the-art digital recognition methods, such as the facial feature recognition method proposed by Joosten (2011), may be able to replicate this ability.
The problem statement addressed in this thesis reads: Can we determine the emotional content of a
news item by analyzing the face of newsreaders using an automatic facial feature recognition
method?
To address this problem statement, an adapted version of the automatic facial feature recognition
method of Joosten (2011) has been applied to a collection of video clips of news items. In addition,
we searched for facial features that predict the emotional content of news items. We found the
vertical movements of the eyebrows to be predictive features. Using these features, we trained
classifiers to recognize the emotional content of news items. We trained four commonly used
classifiers: the k-nearest neighbor classifier, the decision tree classifier, the multilayer perceptron,
and the support vector machine. The classifiers were able to predict the emotional content of news
items with an accuracy of about 70%. The conclusion is that the emotional content of a news item
can be determined quite well using an automatic facial analysis method.
CONTENTS
Preface
Summary
Contents
Chapter 1: Introduction
1.1 Problem Statement and Research Questions
1.2 Related Work
1.3 Thesis Outline
Chapter 2: The FER Method
2.1 Face Detection
2.2 Facial Feature Extraction
2.2.1 Labeling Training Images
2.2.2 Creating the Shape and Appearance Models
2.2.3 Fitting Unseen Images
2.3 Determining the Feature
2.4 Classification
Chapter 3: Experimental Set-up
3.1 The Dataset
3.2 Experimental Procedure
3.2.1 Face Detection
3.2.2 Facial Feature Extraction
3.2.3 Classification
3.3 Evaluation
Chapter 4: The Results
4.1 Can we find a distinctive feature based on the movement of the eyebrows?
4.2 What classifier performs best in our classification task?
Chapter 5: General Discussion
5.1 The Task
5.2 The Experiment
Chapter 6: Conclusion and Future Work
6.1 Research Questions and Problem Statement
6.2 Application and Future Research
Literature
Chapter 1
Introduction
Human-computer interaction has become an important aspect of our daily lives and is still growing on a daily basis. The computer has earned a permanent place in the household and has become an important instrument in science. Recent developments attempt to take the next step: creating a computer with which we can have a real conversation. Even though computers are able to register everything we say using speech recognition, true communication will not be possible until computers possess the ability to give meaning to the registered words. Humans are highly capable of understanding and interpreting the signals that give words meaning. We use our hands to depict what we say, the pitch and melody of our voice to emphasize or clarify our words, and our face to express our attitude towards the information we are conveying (Cassell, 2001; Vinciarelli, Pantic, Bourlard, & Pentland, 2008). This is an aspect of communication with which the computer is not yet able to deal. To fully realize natural human-computer interaction, human-centered user interfaces are needed that can respond naturally when communicating with human users. These interfaces must possess the ability to recognize human social signals and social behaviour in order to accomplish this goal (Vinciarelli, Pantic, Bourlard, & Pentland, 2008; Zeng, Pantic, Roisman, & Huang, 2009; Pantic, Nijholt, Pentland, & Huang, 2008; Pantic & Rothkrantz, 2003). This field of research emerged approximately 15 years ago, when computer scientists started to use the power of computing to automatically analyze non-verbal behaviour.
The two most important categories of non-verbal communication are hand gestures and facial expressions. In this thesis we focus on facial expressions. Humans derive many different kinds of information from the facial features of a speaker during communication (Swerts & Krahmer, 2010). By using different facial expressions, one can convey a whole range of different meanings. Facial expressions have therefore become of great interest to the fields of computer vision and human-computer interaction (Pantic & Bartlett, 2007; Cassell, 2001). Research shows that facial expressions can reveal a person's emotional state, give information about the speaker's personality, and enhance, support, or possibly even replace speech (Cassell, 2001; Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999).
Swerts and Krahmer (2010) conducted an experiment in which human participants viewed a newsreader presenting the news. Their research showed that humans are capable of determining the emotional content of a video (whether a news item was positive or negative) based on the facial expressions of the newsreaders. Participants noted that newsreaders were generally more expressive in their facial expressions when conveying a positive news item. In this thesis we try to determine the emotional content of a video using an automatic facial analysis method. Knowing whether a message is positive or negative is an important step in understanding the message itself. If a computer is able to determine the emotional content of everything we say, it may take us a step closer to achieving natural communication with computers. We use a dataset consisting of videos similar to those used by Swerts and Krahmer (2010). The task consists of classifying the emotional content of a news item presented by a newsreader based on the movement of the eyebrows. We do not take other aspects of the face into account because we believe that the eyebrows are a good indicator of expressiveness. Our goal is to find a distinctive feature which can be analyzed and recognized automatically and which can be used by a classifier to predict the emotional content of a video message.
1.1 Problem Statement and Research Questions
This thesis focuses on the connection between the facial expressions of local newsreaders and the
message they are trying to convey. In the study by Swerts & Krahmer it appeared that human
observers are indeed capable of recognizing the emotional content of a message, based on the facial
expressions of news readers. We want to know if a computer is able to classify the emotional content
of a video using an automatic facial feature recognition method. For this purpose we need a facial
feature which provides a clear distinction between positive and negative videos. Furthermore we
want to know which classifier is the most suitable for the task proposed. The problem statement (PS)
and Research Questions (RQ) addressed in this thesis read as follows.
PS: Can we determine the emotional content of a video by analyzing the face of newsreaders
using an automatic facial feature recognition method?
We are searching for a feature which represents the clearest distinction between the two conditions.
As mentioned above the focus during our search is the movement of the eyebrows. If a distinctive
feature is found we will use this feature to carry out our classification task. For this purpose we will
use a total of four different classifiers. We will evaluate their performance and highlight the
differences between the classifiers. The problem statement will be answered based on answers to
the research questions.
Chapter 2
The FER Method
The automatic analysis of human faces can be described as the measuring of deformations of the
different facial components and their special relations (Chibelushi & Bourel, 2003). The difficult part
is to translate these deformations to meaningful features which can be measured or counted. In the
face there are numerous features which we can analyze. The features with the most interest are
changes in the eyebrows, the expression of the mouth, movement of the head and eye gazing. These
features are considered to be the social signals with the highest informational value in
communication (Cassell, 2001). To extract the features present in the face we need to extract their
coordinates. Several methods have been developed capable of this task. We will use a facial feature
recognition method based on the Active Appearance model. Joosten (2010) proposed a method
called the Facial Expression Recognition method which consists of three basic steps (Figure 1).
For the first step we use a method called the Viola-Jones detector. Research shows that this face
detector is highly effective and known to be computationally highly efficient and fast (Joosten, 2011)
(Viola & Jones, 2001). In the second step we will extract several facial features from the face
detected in an image. The Active Appearance Model is used to automatically fit a specified grid of
coordinates on unseen images. For a large number of training instances these coordinates have to be
specified by hand. This results in a training set with which the AAM can be trained (Matthews, 2004).
After fitting the grid on an unseen image, the coordinate values will be stored. Using only the
coordinates of the eyebrows we will begin our search for a distinctive feature. This feature can then
be used to classify the video fragments for our classification task. A classifier is a machine learning
method in which the class of an unseen sample is determined using a set of training data. As
mentioned in chapter 1 we will use a total of four different classifiers.
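The speed of the Viola-Jones detector comes largely from its use of integral images, which make the rectangle sums behind its Haar-like features constant-time operations. As an illustration, the following is a minimal sketch in Python of that underlying trick only, not of the detector itself, which cascades thousands of such features over trained thresholds:

```python
# Sketch of the integral-image trick underlying the Viola-Jones detector:
# any rectangular pixel sum (the building block of Haar-like features)
# is computed in constant time from four table lookups.

def integral_image(img):
    """Cumulative sum table: ii[y][x] = sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y), size w x h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
print(rect_sum(ii, 0, 0, 3, 3))  # 45, the sum of the whole image
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

A Haar-like feature is then simply the difference of two or more such rectangle sums, so each feature costs a handful of lookups regardless of its size.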
In the remainder of this chapter we discuss each stage in further detail. In section 2.1 we describe the Viola-Jones detector used to detect the location of the faces in our image dataset. Section 2.2 explains the Active Appearance Model and the preparatory work necessary for this method. In section 2.3 we describe the method used to search for our distinctive feature. Finally, in section 2.4, we describe the four different classifiers used for our classification task.
four variables called the Procrustes components (Joosten, 2011). An average model, called the mean face shape, is then calculated using these Procrustes components. The second step is to apply Principal Component Analysis (Jolliffe, 2002) to the set of aligned training grids. Principal Component Analysis (PCA) is a method for determining the components along which the training samples differ the most, and is used here to find the modes of shape variation (Joosten, 2011; Asthana, Saragih, Wagner, & Goecke, 2009). At this stage the shape model is complete, consisting of the mean face shape, the four Procrustes components and the modes of shape variation (the shape components). These components account for the range of facial variation to which this model can be fitted.

In creating the appearance model, the first step consists of warping the training images onto the mean face shape of the shape model. These warped images are considered the shape-normalized appearances. As in the shape model, the shape-normalized appearances are used to compute a mean face appearance. PCA is then applied to model the texture variation of the skin and face (yielding the appearance components). The appearance model thus consists of both the mean appearance and the appearance components. The Active Appearance Model consists of both the shape and the appearance model. Once the AAM is created, new instances of this model can be generated thanks to its capability of representing large variation in both shape and texture (Joosten, 2011; Asthana, Saragih, Wagner, & Goecke, 2009).
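To make the idea of a "mode of variation" concrete, the following sketch applies PCA to the simplest possible case: the (x, y) positions of a single landmark across training grids, where the leading eigenvector of the 2x2 covariance matrix gives the direction of greatest variation. The real shape model applies PCA to whole concatenated landmark vectors; the landmark positions below are invented for illustration:

```python
import math

# Toy illustration of a PCA "mode of variation": for one 2-D landmark
# observed across training grids, the leading eigenvector of the 2x2
# covariance matrix is the direction in which the landmark varies most.

def principal_axis(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form leading eigenvalue of a symmetric 2x2 matrix
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Corresponding eigenvector, handling the axis-aligned special case
    if abs(sxy) > 1e-12:
        vx, vy = lam - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Invented landmark positions that vary mostly vertically:
pts = [(10.0, 20.0), (10.1, 22.0), (9.9, 24.0), (10.0, 26.0)]
ax = principal_axis(pts)
print(ax)  # roughly (0.01, -1.0): the axis is essentially vertical (sign is arbitrary)
```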
distributions of calculated features we decided whether or not the distinction between categories
was strong enough to use in our classification task.
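The exact formula of the candidate features is not spelled out at this point, but a natural candidate consistent with the description elsewhere in the thesis is the variance of the eyebrow landmarks' vertical positions across the frames of a clip, as a proxy for eyebrow movement. A hedged sketch with invented frame data:

```python
from statistics import pvariance

# Hypothetical scalar feature for one video clip: the variance over time
# of the mean vertical (y) position of the eyebrow landmarks. The frame
# data below are invented; this is an illustration, not the thesis code.

def eyebrow_movement_feature(frames):
    """frames: list of per-frame lists of eyebrow (x, y) coordinates.
    Returns the variance over time of the mean eyebrow height."""
    heights = [sum(y for _, y in pts) / len(pts) for pts in frames]
    return pvariance(heights)

still = [[(1, 10.0), (2, 10.0)]] * 5                     # no eyebrow movement
raised = [[(1, 10.0), (2, 10.0)], [(1, 8.0), (2, 8.0)],
          [(1, 10.0), (2, 10.0)], [(1, 8.0), (2, 8.0)]]  # eyebrows moving

print(eyebrow_movement_feature(still))   # 0.0
print(eyebrow_movement_feature(raised))  # 1.0
```

A clip with expressive eyebrow movement yields a larger value than a clip in which the eyebrows stay still, which is the kind of separation between categories examined above.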
2.4 Classification
In this section we explain our choice of classifiers. It is well known that the performance of a classifier depends on the kind of data to be classified. Considering the no-free-lunch theorem of Wolpert and Macready (1997), we can also assume that no single classifier performs best on all available problems. To determine a suitable classifier for our classification task we therefore opted to use multiple classifiers and determine which one performs best. We chose four of the most commonly used classifiers in statistical classification. As mentioned in chapter 1, these are: k-nearest neighbor, decision tree, multilayer perceptron and support vector machine. In optimizing our results we limited our changes to the parameters which alter the complexity of the classifiers. Below we briefly describe each classifier and indicate the parameter used to adjust its complexity.
k-Nearest Neighbor
To classify a new video, a number of neighbors whose class is known are located. The majority rule is then used to determine the class of the unknown video according to its nearest neighbors (Witten & Frank, 2005). The k-nearest neighbor classifier is a robust classifier capable of dealing with noisy training data and large datasets. Its complexity can be decreased or increased by changing the k-value, i.e., the number of nearest neighbors used during classification. Figure 2 shows an example of the k-nearest neighbor classifier using five neighbors (k=5). The neighbors of the unseen sample (triangle) consist of two class-1 samples (circles) and three class-2 samples (squares). In this case the unseen sample is classified as a class-2 sample.
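The majority rule of the figure can be sketched in a few lines. The points below are illustrative, arranged so that the five nearest neighbors of the query are two class-1 and three class-2 samples, as in the example above:

```python
from collections import Counter
from math import dist

# Minimal k-nearest-neighbor classifier with a majority vote. The points
# are invented to mirror the figure: with k=5, the query's neighborhood
# contains two class-1 and three class-2 samples, so class 2 wins.

def knn_classify(train, query, k):
    """train: list of (point, label) pairs; returns the majority label
    among the k training points closest to query."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 1), ((0.0, 1.0), 1),                   # two class-1 points
         ((1.0, 0.5), 2), ((1.0, 1.0), 2), ((1.2, 0.0), 2),  # three class-2 points
         ((5.0, 5.0), 1), ((6.0, 6.0), 1)]                   # far-away class-1 points
print(knn_classify(train, (0.6, 0.5), k=5))  # 2
```

Changing k changes the complexity exactly as described: a small k follows the local structure of the data, a large k smooths the decision over more neighbors.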
Decision tree
A decision tree is the result of a divide-and-conquer learning approach and has a tree structure. Each video is passed along the tree and subjected to a set of rules, after which a class is decided (Witten & Frank, 2005). These rules can range from matching a certain characteristic to having a certain value. Besides the fact that decision trees require little computational effort, their real advantage is that a decision tree can be described as a set of rules which can be followed. The complexity of a decision tree depends on its number of branches. Figure 3 shows an example of a very simple decision tree which determines the class, in this example class A or class B, of an unseen sample (X).
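In code, such a tree is nothing more than nested threshold rules. The tree below is a made-up illustration with invented feature names and thresholds, not the J48 tree learned in the experiments (whose rules are not listed here):

```python
# Toy decision tree: an unseen sample X is passed along the branches
# until a class leaf is reached, mirroring the class-A / class-B figure.
# Feature names and thresholds are hypothetical, chosen for illustration.

def classify(x):
    """x: dict of feature values for the unseen sample."""
    if x["eyebrow_variance"] > 0.5:      # rule at the root
        if x["clip_length"] > 30:        # rule on one branch
            return "A"
        return "B"
    return "B"

print(classify({"eyebrow_variance": 0.8, "clip_length": 40}))  # A
print(classify({"eyebrow_variance": 0.2, "clip_length": 40}))  # B
```

The set of if-statements is exactly the "set of rules which can be followed" mentioned above, which is what makes trees easy to interpret.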
Multilayer Perceptron
The multilayer perceptron is a classifier that is capable of classifying non-linear point distributions. The decision boundary of a multilayer perceptron is built from several linear decision boundaries; combining these results in a non-linear decision boundary capable of classifying all sorts of data (Witten & Frank, 2005). The complexity of the multilayer perceptron corresponds to the number of linear decision boundaries used: to increase the complexity we increase the number of hidden layers and neurons. An example of a decision boundary calculated by a multilayer perceptron can be seen in figure 4. In this example the decision boundary encircles the class in the middle very well; everything within the decision boundary will be classified as a circle. Non-linear decision boundaries can take on forms different from the one shown in this example.
Support Vector Machine
A support vector machine (SVM) is a classifier that is highly efficient when dealing with high-dimensional data. To classify a new instance, an SVM transforms the data to a space in which the data are linearly separable. This transformation is done using a kernel; the type of kernel depends on the data used. A basic support vector machine can be represented as a point distribution of two different categories separated by a line with a margin (Figure 5). This line is oriented so that the margin between the two classes is maximized. The points that determine the width of the margin are called the support vectors. The complexity of a support vector machine can be changed by altering the C-value, which represents the trade-off between the training error and the number of support vectors (Witten & Frank, 2005).
Chapter 3
Experimental Set-up
Chapter 3 focuses on how we applied the method described in chapter 2 in our experiment. We start by describing the process of creating our dataset in section 3.1. In section 3.2 we describe the application of the FER method (described in chapter 2) to our dataset. How we applied the classifiers for our classification task is described in section 3.2.3.
component will be discussed. These two steps are incorporated in a single algorithm provided by
Laurens van der Maaten (Maaten & Hendriks, 2010). In section 3.2.3 we will describe how the
classifiers have been applied.
Model Generation
Our next step consists of generating the shape and the appearance models which are needed to fit
the determined coordinate grid on unseen images. The shape model and the appearance model
(described in chapter 2) together form the Active Appearance Model. These models are
automatically generated by our algorithm during the training phase. In our experiment we created two different AAMs: one for the male and one for the female newsreader. Each AAM is generated using 100 annotated training images.
Image Fitting
The first step in fitting our coordinate grid onto an unseen image is the detection of several important facial feature points. In the facial-extraction component of the algorithm, these are determined as the corners of the eyes and mouth and the tip of the nose. Estimates of their locations are used to calculate the values of the so-called Procrustes components (translation, rotation, rescaling and reflection). The shape model is then fitted to the new image and converged to an optimal solution using the values of the Procrustes components (Joosten, 2011). These values are stored and processed further into usable coordinate values, ready to be analyzed and used for our classification task.
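The Procrustes components mentioned above can be sketched with ordinary Procrustes alignment of two 2-D point sets, removing translation, scale and rotation (reflection is omitted for brevity). This is the generic textbook procedure, not the exact implementation used in the experiments:

```python
import math

# Standard ordinary Procrustes alignment: translate, scale and rotate
# one 2-D landmark set onto another. Illustrative sketch only.

def procrustes_align(src, dst):
    """Return src translated, scaled and rotated onto dst (lists of (x, y))."""
    n = len(src)
    csx = sum(x for x, _ in src) / n; csy = sum(y for _, y in src) / n
    cdx = sum(x for x, _ in dst) / n; cdy = sum(y for _, y in dst) / n
    s = [(x - csx, y - csy) for x, y in src]   # centered source
    d = [(x - cdx, y - cdy) for x, y in dst]   # centered destination
    # Optimal rotation angle and scale from the cross-terms
    a = sum(sx * dx + sy * dy for (sx, sy), (dx, dy) in zip(s, d))
    b = sum(sx * dy - sy * dx for (sx, sy), (dx, dy) in zip(s, d))
    theta = math.atan2(b, a)
    scale = math.hypot(a, b) / sum(sx * sx + sy * sy for sx, sy in s)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(scale * (sx * cos_t - sy * sin_t) + cdx,
             scale * (sx * sin_t + sy * cos_t) + cdy) for sx, sy in s]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# The same square, shifted, doubled in size and rotated 90 degrees:
moved = [(5, 5), (5, 7), (3, 7), (3, 5)]
aligned = procrustes_align(moved, square)
print(aligned)  # recovers the original square (up to floating-point rounding)
```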
3.2.3 Classification
In chapter 2 we briefly discussed the classifiers used in our experiment, namely the k-nearest neighbor, decision tree, multilayer perceptron and support vector machine. We now discuss how these classifiers were applied to our classification task. We applied them using Weka 3.6 (Waikato Environment for Knowledge Analysis). Weka is open-source machine learning software, freely available from http://www.cs.waikato.ac.nz/ml/weka/. It can be used for classifying, preprocessing, clustering or visualizing datasets; for our purpose we only used the classification function. Most commonly used classifiers are present in this software package. As the k-nearest neighbor classifier we used the IBk algorithm. For the decision tree classifier we used J48, an implementation of the C4.5 algorithm for generating decision trees, developed by Ross Quinlan. The multilayer perceptron function is readily available in Weka. Lastly, we chose the SMO function developed by John Platt (Witten & Frank, 2005). For our classification task we used the leave-one-out training method: every instance is used as a training example except the instance being classified. During classification we varied the complexity values as described in chapter 2. After determining the optimal complexity values for our classification task we stored the results, which can be found in chapter 4.
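The leave-one-out procedure can be sketched as follows, here with a trivial 1-nearest-neighbor rule on invented one-dimensional feature values rather than Weka's classifiers:

```python
# Leave-one-out evaluation: each clip is classified by a model trained
# on all the other clips, and accuracy is the fraction classified
# correctly. The classifier and feature values are illustrative only.

def nearest_label(train, x):
    """Label of the training instance whose feature is closest to x."""
    return min(train, key=lambda fl: abs(fl[0] - x))[1]

def leave_one_out_accuracy(data):
    """data: list of (feature, label) pairs; returns fraction correct."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]      # hold out instance i
        if nearest_label(train, x) == label:
            correct += 1
    return correct / len(data)

clips = [(0.1, "neg"), (0.2, "neg"), (0.3, "neg"),
         (0.8, "pos"), (0.9, "pos"), (1.0, "pos")]
print(leave_one_out_accuracy(clips))  # 1.0
```

With only 48 to 98 clips per dataset, leave-one-out makes maximal use of the available training data, which is presumably why it was preferred over a fixed train/test split.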
3.3 Evaluation
This section describes the way we evaluate the performance of our experiments. In chapter 4 we will
discuss the quality of the determined feature based on its distinctiveness. We will also reflect on the
performance of the classifiers in our classification task. In this thesis we label a classification as
successful if it reaches a correct classification rate of 65% or more. In chapter 5 we will discuss the
quality of our dataset and the possible effects this could have on the results.
Chapter 4
The Results
In this chapter we discuss the results of our experiments. They are divided into two sections, each covering a specific research question. Section 4.1 describes the results of our search for the optimal facial feature for our classification task, based on the movement of the eyebrows. In section 4.2 we discuss the results of the different classifiers and their optimal complexity settings.
Figure 8 Y-variance of the eyebrow coordinates (female) plotted against each other using MATLAB.
Figure 9 Plot of the distinctive feature in the female category.
Figure 11 Plot of the distinctive feature in the total category.
Figure 12 Correctly classified video percentage of the four classifiers on each dataset.
k-Nearest Neighbor
In our attempt to optimize the k-nearest neighbor classifier for our classification task we found an optimal k-value of 14 neighbors. Increasing or decreasing the complexity of the classifier any further resulted in lower performance for the male, female and total datasets alike. For the male dataset, 64% of the classifier's predictions were correct, with a negative predictive value (NPV) of 63% and a positive predictive value (PPV) of 65.2%. On the female dataset the classifier scored a total of 68.75% correct predictions, with an NPV of 64.5% and a PPV of 76.5%. On the combined dataset it reached a performance of 70.4%, with an NPV of 67.25% and a PPV of 75%. See table 1 for an overview of the results for the k-nearest neighbor classifier.
Decision Tree
For the J48 decision tree classifier, changing the default Weka settings resulted in lower performance. We therefore believe that the default values are most suitable for our classification task. This resulted in a performance of 70% for the male dataset, with an NPV of 100% and a PPV of 62.5%. On the female dataset the classifier scored a total of 75% correct predictions, with an NPV of 70% and a PPV of 88.2%. On the combined dataset it reached a performance of 63.25%, with an NPV of 60.7% and a PPV of 67.6%. See table 2 for an overview of the results for the decision tree classifier.
Decision Tree

                                      Confusion matrix
                                  (rows: actual, columns: predicted)
Dataset   % correct   Correct            Neg     Pos
Male      70%         35/50       Neg     10      15
                                  Pos      0      25
Female    75%         36/48       Neg     21       2
                                  Pos      9      15
Total     63.25%      62/98       Neg     37      12
                                  Pos     24      25

Table 2 Results for the decision tree classifier.
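The accuracy, NPV and PPV figures reported in this chapter follow directly from the confusion matrices. With rows as the actual class and columns as the predicted class, the following helper reproduces the male-dataset numbers of the decision tree (70% correct, NPV 100%, PPV 62.5%):

```python
# Derive accuracy, NPV and PPV from a 2x2 confusion matrix whose rows
# are the actual class and whose columns are the predicted class.

def metrics(tn, fp, fn, tp):
    """tn/fp: actual-negative row; fn/tp: actual-positive row."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    npv = tn / (tn + fn)   # fraction of predicted negatives that are correct
    ppv = tp / (tp + fp)   # fraction of predicted positives that are correct
    return accuracy, npv, ppv

# Male dataset, decision tree: actual-Neg row (10, 15), actual-Pos row (0, 25)
acc, npv, ppv = metrics(tn=10, fp=15, fn=0, tp=25)
print(acc, npv, ppv)  # 0.7 1.0 0.625
```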
Multilayer Perceptron
For the multilayer perceptron we used the default settings, which automatically determine the number of hidden layers and neurons. This resulted in 72% correct predictions for the male dataset, with an NPV of 73.9% and a PPV of 70.4%. On the female dataset the multilayer perceptron scored a total of 72.9% correct predictions, with an NPV of 69% and a PPV of 79%. On the combined dataset it reached a performance of 66.3%, with an NPV of 66% and a PPV of 66.7%. See table 3 for an overview of the results for the multilayer perceptron classifier.
Multilayer Perceptron (Lr = 0.1, TrT = 1000)

                                      Confusion matrix
                                  (rows: actual, columns: predicted)
Dataset   % correct   Correct            Neg     Pos
Male      72%         36/50       Neg     17       8
                                  Pos      6      19
Female    72.9%       35/48       Neg     20       4
                                  Pos      9      15
Total     66.3%       65/98       Neg     33      16
                                  Pos     17      32

Table 3 Results for the multilayer perceptron classifier.
Chapter 5
General Discussion
In this chapter we discuss the research described in this thesis. The results of our classification task are discussed in section 5.1, where we examine the validity and quality of our results and propose some improvements to increase the quality of our research. In section 5.2 we point out the strengths and weaknesses of our dataset and of the method applied to solve our classification task. We focus on the quality of our dataset, the importance of using real-world data and the applicability of our facial analysis method. Furthermore, we give suggestions about which aspects of our experiment could be improved.
Lastly, in our experiments we used four different classifiers. Using other classifiers, or even combining several classifiers, may result in higher accuracy; combining multiple classifiers has been shown to increase overall accuracy (Vinciarelli, Pantic, Bourlard, & Pentland, 2008).
The Dataset
One of the most important issues in social signal processing, and an aspect often neglected for the sake of experimental convenience, is the use of real-world data. Most datasets are collected in laboratories or other artificial settings. Even though the use of actors is very common in research on faces, it is likely to oversimplify real-world situations, and many aspects of real social behaviour may be missing (Vinciarelli, Pantic, Bourlard, & Pentland, 2008; Wilting, Krahmer, & Swerts, 2006). Unfortunately, recordings of genuine facial behaviour suitable for research are difficult to find (Pantic, 2009). In our experiment we focused on Dutch newsreaders. These newsreaders are expected to present the news in a neutral way, but they also serve as the face of the news channel and are expected to show a certain degree of emotion to attract the audience while preserving their neutral position (Tomascikova, 2010). Using newsreader data in research comes with both advantages and disadvantages.
First of all, even though newsreader data can be labeled as real-world data, the question remains whether this kind of discourse can be used to explain other kinds of discourse. It may even be the case that the expressions newsreaders use on-air differ from their natural expressions. This, however, is not necessarily so: the newsreaders used in our experiments mentioned that neither of them thinks about their facial expressions while presenting the news (Swerts & Krahmer, 2010). With this in mind we can assume that the facial expressions in our dataset are indeed natural facial behavior. While this may hold for our data, it cannot be guaranteed for other newsreader data. The biggest advantage of using newsreader data in social signal processing is the setting: the stable position of the head and the uniform background make newsreader data highly suitable for automatic facial analysis (Joosten, 2011).
A possible weakness of our research is the quality of our dataset. The quality of the video fragments used was generally lower than that of other datasets, and the preprocessing steps decreased it even further. As a result, our dataset consists of low-resolution images (150x150 pixels), with face-region resolutions ranging from 40x40 to 45x45 pixels. These low resolutions might influence the effectiveness of the FER method used in our experiments. Tian (2004) evaluated the performance of the most common facial analysis methods at several resolutions and showed that face detectors are able to locate faces with a face region of 36x48 pixels and larger. For the facial feature recognition step, it appears that for a face region of 36x48 or smaller it is best to use appearance features instead of geometric features; at resolutions below 36x48 it also becomes more difficult to recognize finely detailed expressions. A resolution of 40x40 thus seems sufficient, apart from the fact that we chose geometric features, which may have influenced the effectiveness of the FER method.
set is that the training examples have to be labeled manually. This manual labour is very time consuming and has a major influence on the efficiency of the FER method. Lucey, Lucey, and Cohn (2010) argue that about 60 or more landmarks are required for facial-expression recognition, which is much higher than the 38 landmarks used in our experiments. In an attempt to decrease the time needed for manual labor, we only chose the landmarks which seemed most relevant to us.
A more critical concern is the reliability of the manually labeled images. Cohn (2007) noted that in most cases approximately 20-30% of manually labeled images are inaccurate. This inaccuracy can have a large impact on the performance of our FER method. After visual inspection of the fitted images we concluded that a high degree of jittering (movement of the model while the face showed no movement) was present. This jittering can affect the correctness of the extracted features, which in turn could influence the results of our classification task (Joosten, 2011). It may be reduced by using a feature tracking method instead of the feature fitting method currently used. With feature tracking, the previous frame is taken into account during the fitting process, which results in a more natural movement of the landmark coordinates that is thus more likely to be correct.
Chapter 6
Conclusion and future work
In this chapter we reach a conclusion based on the results of our experiment. In section 6.1 we answer the research questions posed in chapter 1 and discuss the outcome of our problem statement. Section 6.2 gives some suggestions for improvement for future research.
We have provided answers to our research questions and can now focus on answering the problem
statement as stated in chapter 1:
Can we determine the emotional content of a video by analyzing the face of newsreaders using an
automatic facial feature recognition method?
Considering the performance of our classifiers, we can conclude that a computer is capable of
determining the emotional content of a video using an automatic facial feature recognition method.
Average performances of 67%, 72.4%, and 67.3% for the male, female, and total datasets show that
the features used in our experiments can be used to classify the emotional content of a video. Even
though we obtained these results for newsreaders only, we believe the approach can serve other
purposes as well. However, improvements are needed to increase the performance before it can be
used in real applications.
6.2 Application and Future Research
Research in the field of automatic facial analysis is of great importance if we want to achieve the
goal of natural communication between man and machine. For this we need to explore every possible
aspect of human-machine communication, including topics such as verbal and non-verbal
communication. Every study of these aspects is a small step towards the successful application of
natural human-computer interaction, an exploration of what we can achieve and how it can be
improved. This thesis is no exception, as it is an attempt to explore the possibilities of an
automatic facial analysis system. The problem statement answered in this thesis is an interesting
one, but it mostly serves as a small part of a much bigger picture. The results are not yet strong
enough to apply the method in a real-life setting, such as the automatic classification of news
fragments for cataloguing purposes. To achieve this, a much higher accuracy is needed and the
methods have to be improved so that the entire process can be automated. In this thesis, however,
our intention was not to improve the method itself but to use it to solve a classification task. We
therefore focus on what can be done to increase the accuracy of that task. First, increasing the
number of landmarks may increase the amount of information available; more landmarks, combined
with a greater variety of features, may greatly improve accuracy. Second, a higher-quality dataset
may improve the performance of the Active Appearance Model and may reduce the jittering of the
image fitting process, resulting in a better fit. Finally, using different classifiers, or combining
several classifiers, may improve the quality of the classification task, although it could also
result in a lower accuracy.
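As a minimal illustration of combining several classifiers, a simple majority vote over their predicted labels can be sketched as follows (the three classifier outputs and the positive/negative labels are hypothetical, not results from our experiments):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one
    instance by taking the most frequent label (ties are broken by
    the order in which labels first appear)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-instance outputs of three classifiers for four videos:
clf_a = ["positive", "negative", "positive", "negative"]
clf_b = ["positive", "positive", "positive", "negative"]
clf_c = ["negative", "positive", "positive", "negative"]

combined = [majority_vote(votes) for votes in zip(clf_a, clf_b, clf_c)]
print(combined)  # ['positive', 'positive', 'positive', 'negative']
```

Such a combination only helps when the individual classifiers make partly independent errors, which is why an ensemble can also end up below the best single classifier.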
Lastly, I want to encourage future research to focus on a wide range of classification
tasks. Such tasks teach us both what automatic facial analysis makes possible and which problems
may arise. This knowledge enables us to take a step forward towards natural communication between
man and machine.
Literature
Asthana, A., Saragih, J., Wagner, M., & Goecke, R. (2009). Evaluating AAM fitting methods for facial
expression recognition. 3rd International Conference on Affective Computing and Intelligent
Interaction and Workshops (pp. 1-8). IEEE.
Cassell, J. (2001). Nudge nudge wink wink: elements of face-to-face conversation for embodied
conversational agents. MIT Press.
Chibelushi, C. C., & Bourel, F. (2003). Facial Expression Recognition: A Brief Tutorial Overview.
CVonline: On-Line Compendium of Computer Vision, 9.
Cohn, J. (2007). Foundations of human computing: facial expression and emotion. Artificial
Intelligence for Human Computing, 1-16.
Cohn, J. (2010). Advances in Behavioral Science Using Automated Facial Image Analysis and
Synthesis. Signal Processing Magazine, IEEE, 27(6), 128-133.
Cootes, T., Edwards, G., & Taylor, C. (2001). Active appearance models. Pattern Analysis and Machine
Intelligence, 23(6), 681-685.
Cootes, T., Taylor, C., & others. (2004). Statistical models of appearance for computer vision. World
Wide Web Publication.
Donato, G., Bartlett, M., Hager, J., Ekman, P., & Sejnowski, T. (1999). Classifying facial actions. Pattern
Analysis and Machine Intelligence, 21(10), 974-989.
Edwards, G., Taylor, C., & Cootes, T. (1998). Interpreting face images using active appearance models.
Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998.
Proceedings. (pp. 300-305). IEEE.
Esposito, A. (2009). The Perceptual and Cognitive Role of Visual and Auditory Channels in Conveying
Emotional Information. Cognitive Computation, 1(3), 268-278.
Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! My name is... Buffy - Automatic Naming of
Characters in TV Video. Proceedings of the 17th British Machine Vision Conference, (pp. 889-
908).
Goodall, C. (1991). Procrustes methods in the statistical analysis of shape. Journal of the Royal
Statistical Society. Series B. Methodological, 53(2), 285-339.
Jolliffe, I. T. (2002). Principal Component Analysis. New York: Springer, second edition.
Maaten, L. V., & Hendriks, E. (2010). Capturing Appearance Variation in Active Appearance Models.
Computer Vision and Pattern Recognition Workshops, 34-41.
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of
Computer Vision, 60(2), 135-164.
Pantic, M. (2009). Machine analysis of facial behaviour: Naturalistic and dynamic behaviour.
Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 3505-3513.
Pantic, M., & Bartlett, M. (2007). Machine analysis of facial expressions. In K. Delac & M. Grgic
(Eds.), Face Recognition (pp. 377-416). I-Tech Education and Publishing.
Pantic, M., Nijholt, A., Pentland, A., & Huang, T. (2008). Human-Centred Intelligent Human-
Computer Interaction (HCI2): how far are we from attaining it? International Journal of
Autonomous and Adaptive Communications Systems, 1(2), 168-187.
Pantic, M., Pentland, A., Nijholt, A., & Huang, T. (2006). Human computing and machine
understanding of human behavior: a survey. Proceedings of the 8th international conference
on Multimodal interfaces (pp. 239-248). ACM.
Schmidt, K., Ambadar, Z., Cohn, J., & Reed, L. (2006). Movement differences between deliberate and
spontaneous facial expressions: Zygomaticus major action in smiling. Journal of Nonverbal
Behavior, 30(1), 37-52.
Schuller, B., Mueller, R., Hoernler, B., Hoethker, A., Konosu, H., & Rigoll, G. (2007). Audiovisual
recognition of spontaneous interest within conversations. Proceedings of the 9th
international conference on Multimodal interfaces (pp. 30-37). ACM.
Swerts, M., & Krahmer, E. (2010). Visual prosody of newsreaders: Effects of information structure,
emotional content and intended audience on facial expressions. Journal of Phonetics, 38(2),
197-206.
Tian, Y. (2004). Evaluation of face resolution for expression analysis. Conference on Computer Vision
and Pattern Recognition Workshop, 2004. CVPRW'04. IEEE.
Vinciarelli, A., Pantic, M., Bourlard, H., & Pentland, A. (2008). Social signal processing: state-of-the-art
and future perspectives of an emerging domain. MM '08: Proceedings of the 16th ACM
international conference on Multimedia (pp. 1061-1070). New York: ACM.
Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features.
Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition. 1, pp. 511-518. IEEE.
Williams, A. C. de C. (2002). Facial expression of pain: an evolutionary account. Behavioral and
Brain Sciences, 25(4), 439-455.
Wilting, J., Krahmer, E., & Swerts, M. (2006). Real vs. acted emotional speech. Ninth International
Conference on Spoken Language Processing.
Witten, I., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. Morgan
Kaufmann.
Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition methods: Audio,
visual, and spontaneous expressions. Pattern Analysis and Machine Intelligence, 31(1), 39-58.