
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

On improving visual-facial emotion recognition with audio-lingual and keyboard stroke pattern information

George A. Tsihrintzis, Maria Virvou, Ioanna-Ourania Stathopoulou, and Efthimios Alepis


Department of Informatics, University of Piraeus, Piraeus 185 34, Greece
{geoatsi,mvirvou,iostath,talepis}@unipi.gr

Abstract

In this paper, we investigate the possibility of improving the accuracy of visual-facial emotion recognition through the use of additional (complementary) information. The investigation is based on three empirical studies that we have conducted involving human subjects and human observers. The studies were concerned with the recognition of emotions from a visual-facial modality, audio-lingual information and keyboard-stroke information, respectively. They were inspired by the relative shortage of previous empirical work concerning the strengths and weaknesses of each modality, so that the extent can be determined to which the keyboard-stroke and audio-lingual information complements and improves the emotion recognition accuracy of the visual-facial modality. Specifically, our research focused on the recognition of six basic emotion states, namely happiness, sadness, surprise, anger and disgust, as well as the emotionless state, which we refer to as neutral. We have found that the visual-facial modality may allow the recognition of certain states, such as neutral and surprise, with sufficient accuracy. However, its accuracy in recognizing anger and disgust can be improved significantly if assisted by keyboard-stroke information.

Keywords: Facial expression analysis, visual-facial affect recognition, empirical studies, affective user modeling, multi-modal interfaces, human-computer interaction

1. Introduction

When mimicking human-to-human communication, human-computer interaction systems must determine the psychological state of a person, so that the computer can react accordingly. How people feel may play an important role in their cognitive processes as well [1]. Thus, the whole issue of human-computer interaction has to take into account users' feelings. Picard [2] points out that one of the major challenges in affective computing is to try to improve the accuracy of recognizing people's emotions. Images that contain faces are instrumental in the development of more effective and friendly methods in multimedia interactive services and human-computer interaction systems. Vision-based human-computer interactive systems assume that information about a user's identity, state and intent can be extracted from images, and that computers can then react accordingly. Similar information can also be used in security control systems or in criminology to uncover possible criminals.

In facial expression recognition, the system attempts to recognize the expression formed on a detected face. Such a task is quite challenging because faces are non-rigid and have a high degree of variability in size, shape, color and texture. Furthermore, variations in pose, facial expression, image orientation and imaging conditions add to the level of difficulty of the problem. The task is complicated further by the problem of pretence, i.e. the case of someone's facial expression not corresponding to his/her true psychological state.

Improving the accuracy of emotion recognition may require the combination of many modalities, other than the visual-facial one, in user interfaces. Indeed, human emotions are usually expressed in many ways. For example, as we articulate speech we usually move the head and exhibit various facial expressions [3]. Ideally, evidence from many modes of interaction should be combined by a computer system so as to generate hypotheses about users' emotions that are as valid as possible. This view has been supported by many researchers in the field of human-computer interaction [4], [5], [6]. However, progress in emotion recognition based on multiple modalities has been quite slow. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two and other modalities to improve the accuracy and robustness of emotion recognition systems [7].

In view of the above, it is our aim to improve the accuracy of visual-facial emotion recognition by combining it with other modalities, namely keyboard-stroke patterns and audio-lingual information. Towards building a facial expression recognition system, called NEU-FACES [8-10], we conducted a fairly intensive empirical study, which is presented in Section 2. A system that we have already constructed combining two modalities, namely the keyboard and the voice, is described briefly in [11]. In this paper, we focus on the improvement of emotion recognition by the visual-facial modality through the incorporation of the other two modalities. Towards combining the three modalities, we had to determine the extent to which these two different modalities can provide emotion recognition from the perspective of a human observer. Moreover, we had to specify the strengths and weaknesses of each modality.

These empirical studies provide the basis for the combination of modalities into the affective user modeling component of our tri-modal system and also provide evidence for other researchers to use, since there are not many results from such empirical studies available in the literature. Indeed, after an extensive search of the literature, we found that there is a shortage of empirical evidence concerning the strengths and weaknesses of these modalities. The most relevant research work is that of De Silva et al. [6], who performed an empirical study and reported results on human subjects' ability to recognize emotions. However, De Silva et al. focus on the audio signals of voice, concentrating on the pitch and volume of voice rather than on lingual keywords that convey affective information. In our research, on the other hand, we have included the lingual aspect of users' spoken words on top of the pitch and volume of voice, and we have compared the audio-lingual results with the results from the other two modes, so that we can see which modality conveys more information to human observers for six basic emotion states, namely happiness, sadness, surprise, anger and disgust, as well as the emotionless state, which we refer to as neutral.

In this paper, we present extensively the empirical study for visual-facial emotion recognition as well as emotion recognition by the other two modalities, and we discuss the comparison results. This paves the way for combination methods that take into account the human observers' performances in the recognition of emotions.

More specifically, the paper is organized as follows: In Section 2, we present and analyze our empirical studies on visual-facial emotion recognition. In Section 3, we present the empirical studies for audio-lingual and keyboard-evidence emotion recognition, which lead us to the combination of the three studies in Section 4. Finally, in Section 5, we conclude and point to future work.

2. Empirical Studies for Visual-Facial Emotion Recognition

Our study of facial expression classification by human observers consisted of three steps:

1. Observation of the user's reactions during a typical human-computer interaction session: From this step, we concluded that the facial expressions corresponding to the "neutral", "happy", "sad", "surprised", "angry", "disgusted" and "bored-sleepy" psychological states arose quite commonly in human-computer interaction sessions and, thus, form the corresponding classes for our classification task.

2. Data acquisition: We created our own database of facial expressions by photographing individuals aged 19-35 years old while they were forming various expressions. To ensure spontaneity, each subject was presented with pictures on a screen behind the camera. These pictures were expected to generate the emotional states that would map on the subject's face as the desired facial expression. For example, to have a subject assume a "happy" expression, we showed him/her a picture of funny content. We photographed the resulting facial expression and then asked him/her to classify this expression. If the image shown to him/her had resulted in the desired facial expression, the corresponding photographs were saved and labelled; otherwise, the procedure was repeated with other pictures. The final dataset consists of the images of 250 different individuals, each forming the seven expressions: "neutral", "happy", "sad", "surprised", "angry", "disgusted" and "bored-sleepy".

Studying this dataset, we identified differences between the "neutral" expression of a model and its deformation into other expressions. We quantified these differences into measurements of the face (such as size ratios, distance ratios, texture, or orientation), so as to convert pixel data into a higher-level representation of the shape, motion, color, texture and spatial configuration of the face and its components. Specifically, we locate and extract the corner points of specific regions of the face, such as the eyes, the mouth and the brows, and compute their variations in size, orientation or texture between the neutral and some other expression. This constitutes the feature extraction process and reduces the dimensionality of the input space significantly, while retaining essential information of high discrimination power and stability.
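
As an illustration of this kind of geometric feature extraction, the following minimal Python sketch computes neutral-relative distance ratios from corner-point coordinates. It is not the NEU-FACES implementation; the landmark names and point pairs are assumptions made for the example.

```python
# Illustrative sketch only (not the authors' NEU-FACES code): compute
# neutral-relative distance ratios from corner points of facial regions.
# The landmark names and point pairs are assumptions made for this example.
from math import dist

Landmarks = dict[str, tuple[float, float]]  # landmark name -> (x, y)

def distance_ratio_features(neutral: Landmarks, expression: Landmarks) -> dict[str, float]:
    """Ratio of characteristic distances in an expression image to the
    same distances in the subject's neutral image."""
    point_pairs = {
        "mouth_width":  ("mouth_left_corner", "mouth_right_corner"),
        "mouth_height": ("mouth_top", "mouth_bottom"),
        "eye_opening":  ("left_eye_top", "left_eye_bottom"),
        "brow_raise":   ("left_brow_center", "left_eye_top"),
    }
    features = {}
    for name, (a, b) in point_pairs.items():
        d_neutral = dist(neutral[a], neutral[b])
        d_expr = dist(expression[a], expression[b])
        features[name] = d_expr / d_neutral if d_neutral else 0.0
    return features
```

A mouth_height or eye_opening ratio well above 1.0 would, for instance, be consistent with a "surprised" face, while ratios close to 1.0 for all pairs suggest the neutral state.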

In order to validate these facial features and understand how they are used by humans to deduce someone's emotion from his/her facial expression, we developed questionnaires where the participants were asked to determine which facial features helped them in the expression recognition/classification task.

3. Questionnaires - empirical study by observers: In order to understand aspects of the process of facial expression recognition by human observers and set target error rates for automated systems, we conducted two relevant empirical studies based on two corresponding questionnaires, as described below. The first study was only preliminary and was based on a short ("preliminary") questionnaire. The purpose of this study was to obtain an overall idea and identify the general aspects of the facial expression recognition process in humans. The images used in this preliminary study were gathered from the Web and existing facial expression databases. The lack of a complete facial expression database, containing a sufficient number of all seven expressions of interest to us, forced us to create our own database of better quality images [9]. We also developed a "detailed" questionnaire, as described below, which used images of our own database. Then, we conducted a second, more detailed empirical study. Results from both empirical studies are presented in this paper and conclusions are drawn.

2.1. The detailed questionnaire structure

In order to understand how humans classify someone else's facial expression and set a target error rate for automated systems, we developed a questionnaire filled in by 300 study participants. These were not the same participants as those who filled in the preliminary questionnaire. Specifically, the questionnaire consisted of three parts:

1. In the first part of the questionnaire, each participant was asked to map the facial expressions that appeared in 14 images into facial emotions. Each participant could choose from the 7 common emotions "angry", "happy", "neutral", "surprised", "sad", "disgusted" and "bored-sleepy", or specify any other emotion that he/she thought appropriate. Next, the participant had to specify the degree (0-100%) of his/her confidence in the identified emotion. Finally, he/she had to indicate which features (such as the eyes, the nose, the mouth, the cheeks, etc.) had helped him/her make that decision. A typical print-screen of the first part of the questionnaire is depicted in Figure 1.

Figure 1: The first part of the detailed questionnaire

2. In the second part of the questionnaire, each participant had to classify the emotion from portions of the face. Specifically, we showed the participant the "neutral" facial image and an image of some facial expression of a subject. The latter image was cut into the corresponding facial portions, namely, the eyes, the mouth, the forehead, the cheeks, the chin and the brows. A typical print-screen of this part of the questionnaire is shown in Figure 2. Again, each participant could choose from the 7 emotions or specify any other emotion that he/she thought appropriate. Next, the participant had to specify the degree (0-100%) of his/her confidence in the identified emotion. Finally, he/she had to indicate which features (such as the eyes, the nose, the mouth, the cheeks, etc.) had helped him/her make that decision.

Figure 2: The second part of the detailed questionnaire

3. In the third part of the questionnaire, we collected background information (e.g. age, interests, etc.) about the study participants. Additional information provided by the participants at this stage included:

- The level of difficulty of the questionnaire, with regard to the facial expression classification task
- Which emotion they considered the most difficult to classify
- Which emotion they considered the easiest to classify
- The degree (0-100%) to which an emotion maps into a facial expression

2.2. The observer and subject backgrounds

A total of 300 participants took part in our study and filled in the detailed questionnaires. All participants and expression-forming subjects were Greek, so they were used to the Greek culture and the Greek ways of expressing emotions. The study participants were aged 19 to 45 years old and the majority of them were either undergraduate or graduate students at the University of Piraeus. A small number of the participants were employees of the University.

2.3. Statistical results

The participants' opinion regarding the easiest emotion to recognize coincides with the results of the questionnaire, as the "happy" and "surprised" emotions achieved the lowest error rates, of 17% and 7%, respectively. As for the most difficult emotion to recognize, our questionnaire showed that "disgusted" and "neutral" are the most difficult emotions to recognize. Error rates corresponding to all emotions are shown in Figure 3.

Figure 3: Error rates in recognizing the expressions in our detailed questionnaire (average error rates for each expression over the two parts of the questionnaire): Surprised 7%, Happy 17%, Angry 27%, Bored-Sleepy 36%, Sad 42%, Neutral 62%, Disgusted 84%.
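
For reference, per-expression error rates of this kind can be computed directly from (intended expression, expression chosen by the observer) pairs; the following is a minimal sketch of our own, with a hypothetical data layout, rather than the analysis code used in the study.

```python
# Minimal sketch (not the authors' analysis code): per-expression error rates
# from questionnaire answers, each given as (intended_label, chosen_label).
from collections import Counter

def error_rates(answers: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of answers per intended expression where a different label was chosen."""
    totals, errors = Counter(), Counter()
    for intended, chosen in answers:
        totals[intended] += 1
        if chosen != intended:
            errors[intended] += 1
    return {label: errors[label] / totals[label] for label in totals}

# e.g. error_rates([("happy", "happy"), ("happy", "surprised"), ("sad", "sad")])
#      -> {"happy": 0.5, "sad": 0.0}
```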

3. Empirical studies for audio-lingual and keyboard evidence emotion recognition

The empirical study involved 100 male and female users of varying educational backgrounds, ages and levels of computer experience. People's behavior while doing something may be affected by several factors concerning their personality, age, experience, etc. For example, experienced computer users may use a keyboard, as a mode of interaction, more often than novice users, while younger people may prefer different modes of interacting with computers, such as audio-lingual interaction, compared with older people. Thus, for the purpose of analyzing the results of our empirical study and taking into account important characteristics of the users, we categorized them into several groups. These groups are presented below.

Figure 4.a illustrates the distribution of participants in the empirical study in terms of age. In particular, 12.5% of participants were under the age of 18 and approximately 20% of participants were between the ages of 18 and 30. A considerable percentage of our participants was over the age of 40. Similarly, Figure 4.b illustrates the distribution of participants in the empirical study in terms of computer knowledge level.

Figure 4.a: Ages of participants (distribution over the age bands 14-18, 18-30, 30-40, 40-50 and 50+)

Figure 4.b: Computer knowledge level of the participants (distribution over the levels: not good, good, very good, excellent)

3.1. Keyboard mode analysis

The basic function of this experiment was to capture all the data inserted by the keyboard. The data was recorded into a database of video clips. A monitoring component recorded the actions of the users and stored the data into the database. After they had finished using the educational application, the participants were asked to watch the video clips concerning exclusively their personal interaction and to determine in which situations they were experiencing changes in their emotional state. They then associated each change in their emotional state with one of the six basic emotion states of our study and the data was recorded with a time stamp.

As a next step, the collected transcripts were given to 20 human expert-observers who were asked to perform emotion recognition with regard to the six emotion states, namely happiness, sadness, surprise, anger, disgust and neutral. All the human experts possessed a first and/or higher degree in Computer Science. In the case of the keyboard empirical study, the human expert-observers analyzed the data corresponding to the keyboard input only. Thus, they were asked to watch the video tape and were also given a printout of what the user had written, as well as the exact time of each event as it was captured by the user monitoring component. Correct answers, wrong answers and the consequences of these events were analyzed, as long as these events involved only the keyboard mode of interaction. Frequent use of the backspace and other basic keyboard buttons was also recorded and associated with specific emotional states.

The empirical study revealed that when participants were nervous, the possibility of making mistakes while typing increased rapidly. This was also the case when they had negative feelings. Mistakes in typing were followed by many backspace-key keyboard strokes and concurrent changes in the emotional state of the user in 82% of the cases. Furthermore, users under 20 years old, as well as users who were over 20 years old but had a low educational background, seemed to be more prone to making even more mistakes as a consequence of an initial mistake and to losing their concentration while interacting with the educational application (67%). From the analysis of the data it also became clear that, when the participants were angry, the rate of mistakes increased, the rate of their typing became slower (62%) (on the contrary, when they were happy they typed faster, 70%) and the keystrokes on the keyboard became harder (65%). The pressure of the user's fingers on the keyboard would give further help in recognizing the emotional states of users, but it was something that could not be measured without the appropriate hardware. Similar effects in the interaction via the keyboard were reported for the emotions of boredom and anger.

Finally, Table 1 illustrates the percentages of successful emotion recognition by the human experts concerning the participants' emotional states and the keyboard mode of interaction. These percentages result from the comparison of the emotional states recognized by the human experts with the actual emotional states as recorded by the participants themselves.

Table 1. Percentages of successful emotion recognition by human experts.

  Emotional state   Percentage of recognition by human experts (keyboard mode)
  Neutral           65%
  Happiness         60%
  Sadness           57%
  Surprise          5%
  Anger             74%
  Disgust           4%
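
As an illustration of how such keyboard evidence could be quantified, the following sketch derives the cues discussed above (backspace frequency, typing-rate change against a per-user baseline, mistake rate) from a logged keystroke transcript. This is our own sketch, not the monitoring component described in [11]; the event fields, baseline and interpretive comments are assumptions.

```python
# Illustrative sketch: derive keyboard-stroke cues (backspace frequency,
# typing-rate change vs. a neutral baseline, mistake rate) from a logged
# transcript. Field names and thresholds are assumptions for the example.
from dataclasses import dataclass

@dataclass
class KeyEvent:
    timestamp: float   # seconds since session start
    key: str           # e.g. "a", "BACKSPACE"
    is_mistake: bool   # marked by the monitoring component / observer

def keyboard_cues(events: list[KeyEvent], baseline_rate: float) -> dict[str, float]:
    """Summarize a window of keystrokes; baseline_rate is the user's typing
    speed (keys/sec) measured in an emotionally neutral session."""
    if len(events) < 2:
        return {"backspace_ratio": 0.0, "rate_change": 0.0, "mistake_ratio": 0.0}
    duration = (events[-1].timestamp - events[0].timestamp) or 1.0
    rate = len(events) / duration
    backspaces = sum(e.key == "BACKSPACE" for e in events)
    mistakes = sum(e.is_mistake for e in events)
    return {
        "backspace_ratio": backspaces / len(events),
        "rate_change": (rate - baseline_rate) / baseline_rate,  # <0: slower, >0: faster
        "mistake_ratio": mistakes / len(events),
    }
```

Under the findings above, a strongly negative rate_change combined with a high backspace_ratio would be weak evidence for anger or another negative state, whereas a positive rate_change with few mistakes would point towards happiness.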

3.2. Audio-lingual information

The participants were asked to use an application which incorporated a user monitoring component. The basic function of this application was to capture all the data inserted by the user orally. The data was recorded into a database and video clips. This component recorded the actions of the users from the microphone. The collected transcripts were then given to 10 human expert-observers who were asked to perform emotion recognition with regard to the six emotion states, namely happiness, sadness, surprise, anger, disgust and neutral. The human expert-observers analyzed the data corresponding to the audio-lingual input only.

The audio-lingual empirical study gave important results about the strengths and weaknesses of emotion recognition based on the audio-lingual modality. The human experts' recognition rates of the six emotions showed that some emotions, such as sadness, are easily recognized by the audio-lingual modality, while others, such as surprise, are more difficult to recognize by this modality. Knowledge of these strengths and weaknesses could make the application more accurate in recognizing emotions and, thus, the database that resulted from the empirical study could be made more accurate. In particular, the results of the empirical study could either be confirmed by the recognition of the users' emotions in real time (by using the user modeling component) or, in some cases, specific results could be reconsidered. Combining the two phases of the empirical study, we came up with a database constructed from the users' oral input, containing words, phrases and exclamations. At the same time, changes in voice volume and voice pitch were recorded in relation to the oral input.

Table 2 illustrates the results of the empirical study in terms of the audio-lingual input via the microphone and the six basic emotions (neutral, happiness, sadness, surprise, anger, disgust) recognized by the human expert-observers. For each emotion, we can remark the percentages of the users' oral reactions or the absence of audio input. Furthermore, we can consider the changes in the users' voice (volume and pitch) while saying a word or phrase or saying an exclamation in each emotional situation. For example, a user in surprise may have said an exclamation (58%) rather than having spoken a word (24%); this action was accompanied, in 66% of the cases, by an increase in the user's voice volume. In addition, the audio-lingual empirical study supplied us with a significant number of words, phrases and exclamations, which are used in the creation of an "active grammar" with specific words and phrases for emotion recognition.

Table 2. Basic Emotional States through Microphone - Empirical Study Results. For each way of reacting (saying an exclamation, saying a certain word or phrase, saying nothing), the first column gives the percentage of users reacting that way; the "volume" and "pitch" columns give the percentage of those reactions accompanied by changes in voice volume and pitch.

  Emotion    Exclamation  volume  pitch   Word/phrase  volume  pitch   Nothing
  Neutral         6%        45%    18%        22%        37%    12%      72%
  Happy          31%        40%    37%        45%        55%    25%      24%
  Sadness         8%        52%    34%        28%        44%    14%      64%
  Surprise       58%        66%    23%        24%        60%    21%      18%
  Anger          39%        62%    58%        41%        70%    62%      20%
  Disgust        50%        64%    43%        39%        58%    38%      11%

We may remark that the neutral emotion and sadness could not be recognized easily in this modality, because users did not express themselves orally when they were in these moods. Users expressing the emotions of surprise and disgust usually said an exclamation (58% and 50%, respectively), while happy users and users in anger would more likely say a word or phrase contained in our 'emotional database' of words and phrases (45% and 41%, respectively). Particularly in regard to the emotions of surprise and anger, users would increase the volume of their voice while saying something, while in anger and disgust we could remark the greatest changes in the users' voice pitch.
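
The following sketch illustrates one way audio-lingual evidence of this kind could be scored, by combining "active grammar" matches with volume and pitch changes. It is a hypothetical illustration rather than the authors' implementation; the grammar entries, thresholds and score increments are invented placeholders.

```python
# Illustrative sketch: score audio-lingual evidence by combining "active
# grammar" matches (words/phrases/exclamations) with changes in voice volume
# and pitch relative to the speaker's baseline. The grammar entries and
# weights below are placeholders, not the authors' actual grammar.
ACTIVE_GRAMMAR = {
    "exclamation": {"oh": ["surprise"], "ugh": ["disgust"], "oops": ["surprise"]},
    "phrase": {"i give up": ["sadness", "anger"], "great": ["happiness"]},
}

def score_audio_lingual(transcript: str, volume_change: float, pitch_change: float) -> dict[str, float]:
    """Return per-emotion evidence scores for one utterance. volume_change and
    pitch_change are relative deviations from the user's neutral speaking level
    (e.g. +0.3 means 30% louder / higher)."""
    scores = {e: 0.0 for e in
              ("neutral", "happiness", "sadness", "surprise", "anger", "disgust")}
    text = transcript.lower().strip()
    for entries in ACTIVE_GRAMMAR.values():
        for token, emotions in entries.items():
            if token in text:
                for emotion in emotions:
                    scores[emotion] += 0.5
    # Raised volume was observed mainly with surprise and anger, and large
    # pitch changes mainly with anger and disgust (Section 3.2).
    if volume_change > 0.2:
        scores["surprise"] += 0.3
        scores["anger"] += 0.3
    if abs(pitch_change) > 0.2:
        scores["anger"] += 0.2
        scores["disgust"] += 0.2
    if not text:
        scores["neutral"] += 0.3   # silence was most common in the neutral state
    return scores
```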

4. Combination of the results from the three modes of interaction

In the following Table 3, we present the results of our two empirical studies in a comparative manner: for each emotion, the rate of correct identification with the use of the visual-facial modality is shown next to the rates obtained from the keyboard-stroke pattern and audio-lingual information, together with the mean value of the latter two.

The aim of the paper is to present concisely empirical studies concerning emotion recognition (audio, visual and keyboard) and then to come up with results regarding the combination of these modes. Our purpose is to set the empirical basis for creating an improved tri-modal affective system. The empirical data from the three modes will eventually be combined using multi-criteria decision theory, where each mode will be a criterion. The percentages of emotion recognition for each emotion and for each mode will be analyzed and then used as weights in order to improve the accuracy of the system and determine the prevailing emotion, after data has been collected from the three modes of interaction, namely the audio mode, the visual mode and the interaction through the keyboard. A sketch of such a weight-based combination is given after Table 3.

The three modes of interaction are complementary to a high degree. In cases where all modes have high or low percentages of successful emotion recognition, we still gain from their combination, since we improve the probability of emotion recognition.

Table 3. Comparative presentation of empirical study results

  Facial       Visual-facial   Keyboard-stroke   Audio-lingual   Mean
  expression   modality        patterns          information     value
  Neutral      61.74%          65%               18%             41.50%
  Surprise     92.61%          5%                62%             33.50%
  Anger        72.92%          74%               79%             76.50%
  Happiness    82.57%          60%               46%             53%
  Sadness      58.33%          57%               48%             52.50%
  Disgust      16.19%          4%                57%             30.50%
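
As a sketch of the intended weight-based combination, the following example treats each modality's per-emotion recognition rate (taken from Table 3; the keyboard column coincides with Table 1) as the weight of that criterion and picks the prevailing emotion. This is only one possible reading of the multi-criteria scheme outlined above, not the authors' final decision model.

```python
# Illustrative sketch of weight-based fusion across the three modes. Each
# modality produces a score distribution over the six emotion states; its
# vote for each emotion is weighted by its empirical recognition rate for
# that emotion (Table 3). Not the authors' final multi-criteria model.
EMOTIONS = ("neutral", "happiness", "sadness", "surprise", "anger", "disgust")

WEIGHTS = {
    "visual":   {"neutral": 0.6174, "happiness": 0.8257, "sadness": 0.5833,
                 "surprise": 0.9261, "anger": 0.7292, "disgust": 0.1619},
    "keyboard": {"neutral": 0.65, "happiness": 0.60, "sadness": 0.57,
                 "surprise": 0.05, "anger": 0.74, "disgust": 0.04},
    "audio":    {"neutral": 0.18, "happiness": 0.46, "sadness": 0.48,
                 "surprise": 0.62, "anger": 0.79, "disgust": 0.57},
}

def fuse(mode_scores: dict[str, dict[str, float]]) -> str:
    """mode_scores: modality -> {emotion: score in [0, 1]}.
    Returns the prevailing emotion under recognition-rate weighting."""
    combined = {e: 0.0 for e in EMOTIONS}
    for mode, scores in mode_scores.items():
        for emotion in EMOTIONS:
            combined[emotion] += WEIGHTS[mode][emotion] * scores.get(emotion, 0.0)
    return max(combined, key=combined.get)

# Example: visual evidence weakly suggests disgust, keyboard strongly suggests anger.
print(fuse({"visual":   {"disgust": 0.6, "anger": 0.4},
            "keyboard": {"anger": 0.8},
            "audio":    {"anger": 0.5, "disgust": 0.5}}))   # -> "anger"
```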

5. Conclusions

In this paper, we have described and discussed three empirical studies that concern the audio-lingual, the visual-facial and the keyboard-stroke recognition of human users' emotions from the perspective of human observers. These modalities are complementary to each other and, thus, can be used in a multi-modal affective computer system that can perform affect recognition taking into account the strengths and weaknesses of each modality.

From the empirical studies, we found that certain emotion states, such as neutral and surprise, are more clearly recognized from the visual-facial modality than from the audio-lingual and keyboard-stroke pattern information. Other emotion states, such as anger and disgust, are more clearly recognized from the audio-lingual and the keyboard-stroke modalities, respectively, than from the visual-facial one.

There is ongoing research on the construction of an affective user interface that will use the different modalities as criteria for the recognition of emotions and will use the results of the performances for each modality as the basis for the specification of weights. This and other related work is going to be presented in a future publication.

6. References

[1] Goleman, D.: Emotional Intelligence, Bantam Books, New York (1995)
[2] Picard, R.W.: Affective Computing: Challenges, Int. Journal of Human-Computer Studies, Vol. 59, Issues 1-2 (2003) 55-64
[3] Graf, H.P., Cosatto, E., Strom, V., Huang, F.J.: Visual Prosody: Facial Movements Accompanying Speech, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (2002) 381-386
[4] Pantic, M., Rothkrantz, L.J.M.: Toward an Affect-Sensitive Multimodal Human-Computer Interaction, Proceedings of the IEEE, Vol. 91 (2003) 1370-1390
[5] Chen, L.S., Huang, T.S., Miyasato, T., Nakatsu, R.: Multimodal Human Emotion/Expression Recognition, in Proc. of Int. Conf. on Automatic Face and Gesture Recognition, IEEE Computer Soc., Nara, Japan (1998)
[6] De Silva, L., Miyasato, T., Nakatsu, R.: Facial Emotion Recognition Using Multimodal Information, in Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS'97) (1997) 397-401
[7] Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U., Narayanan, S.: Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proceedings of the 6th International Conference on Multimodal Interfaces, State College, PA, USA (2004) 205-211
[8] Stathopoulou, I.-O., Tsihrintzis, G.A.: Detection and Expression Classification Systems for Face Images (FADECS), 2005 IEEE Workshop on Signal Processing Systems (SiPS'05), Athens, Greece (2005)
[9] Stathopoulou, I.-O., Tsihrintzis, G.A.: Facial Expression Classification: Specifying Requirements for an Automated System, 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems, Bournemouth, United Kingdom (2006)
[10] Stathopoulou, I.-O., Tsihrintzis, G.A.: NEU-FACES: A Neural Network-Based Face Image Analysis System, 8th International Conference on Adaptive and Natural Computing Systems, Warsaw, Poland (2007)
[11] Alepis, E., Virvou, M., Kabassi, K.: Affective Student Modeling Based on Microphone and Keyboard User Actions, Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies (ICALT), Kerkrade, The Netherlands (2006) 139-141

