On Improving Visual-Facial Emotion Recognition With Audio-Lingual and Keyboard Stroke Pattern Information
the input space significantly, while retaining essential information of high discrimination power and stability. In order to validate these facial features and to understand how humans use them to deduce someone's emotion from his/her facial expression, we developed questionnaires in which the participants were asked to determine which facial features helped them in the expression recognition/classification task.

3. Questionnaires -- empirical study by observers:
In order to understand aspects of the process of facial expression recognition by human observers and to set target error rates for automated systems, we conducted two relevant empirical studies based on two corresponding questionnaires, as described below. The first study was only preliminary and was based on a short ("preliminary") questionnaire. Its purpose was to obtain an overall idea of, and identify the general aspects of, the facial expression recognition process in humans. The images used in this preliminary study were gathered from the Web and from existing facial expression databases. The lack of a complete facial expression database containing a sufficient number of all seven expressions of interest to us forced us to create our own database of better-quality images [9]. We also developed a "detailed" questionnaire, as described below, which used images from our own database. We then conducted a second, more detailed empirical study. Results from both empirical studies are presented in this paper and conclusions are drawn.

A typical print-screen of the first part of the questionnaire is depicted in Figure 1.

2. In the second part of the questionnaire, each participant had to classify the emotion from portions of the face. Specifically, we showed the participant the "neutral" facial image and an image of some facial expression of a subject. The latter image was cut into the corresponding facial portions, namely the eyes, the mouth, the forehead, the cheeks, the chin and the brows. A typical print-screen of this part of the questionnaire is shown in Figure 2.

Figure 1: The first part of the detailed questionnaire
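The per-emotion error rates reported in this paper are obtained by tallying the participants' answers against the true expression labels. A minimal sketch of such a tally (the data format and all names are our own illustration, not taken from the paper):

```python
from collections import defaultdict

def error_rates(responses):
    """Per-expression misclassification rate from (true_label, answer) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for true_label, answer in responses:
        totals[true_label] += 1
        if answer != true_label:
            errors[true_label] += 1
    return {label: errors[label] / totals[label] for label in totals}

# Hypothetical responses: (expression shown, participant's answer).
sample = [("happy", "happy"), ("happy", "surprised"),
          ("disgusted", "angry"), ("disgusted", "disgusted")]
print(error_rates(sample))  # {'happy': 0.5, 'disgusted': 0.5}
```

Averaging such rates over the two parts of the detailed questionnaire yields per-expression figures of the kind shown in Figure 3.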
x The level of difficulty of the questionnaire, with regard to the facial expression classification task
x Which emotion they considered the most difficult to classify
x Which emotion they considered the easiest to classify
x The degree (0-100%) to which an emotion maps into a facial expression

2.2. The observer and subject backgrounds

A total of 300 participants took part in our study and filled in the detailed questionnaires. All participants and expression-forming subjects were Greek, so they were accustomed to Greek culture and the Greek ways of expressing emotions. The study participants were aged 19 to 45 years old and the majority were either undergraduate or graduate students at the University of Piraeus. A small number of the participants were employees of the University.

2.3. Statistical results

The participants' opinion regarding the easiest emotion to recognize coincides with the results of the questionnaire, as the "happy" and "surprised" emotions achieved the lowest error rates, of 17% and 7%, respectively. As for the most difficult emotion to recognize, our questionnaire showed that "disgusted" and "neutral" are the most difficult emotions to recognize. Error rates corresponding to all emotions are shown in Figure 3.

Figure 3: Error rates in recognizing the expressions in our detailed questionnaire (average error rates for each expression over the two parts of the questionnaire: Surprised 7%, Happy 17%, Sad 42%, Disgusted 84%)

3. Empirical studies for audio-lingual and keyboard evidence emotion recognition

The empirical study involved 100 male and female users of varying educational backgrounds, ages and levels of computer experience. People's behavior while performing a task may be affected by several factors concerning their personality, age, experience, etc. For example, experienced computer users may use the keyboard, as a mode of interaction, more often than novice users, while younger people may prefer different modes of interacting with computers, such as audio-lingual interaction, compared with older people. Thus, for the purpose of analyzing the results of our empirical study and taking important user characteristics into account, we categorized the participants into several groups. These groups are presented below.

Figure 4.a illustrates the distribution of participants in the empirical study in terms of age. In particular, 12.5% of participants were under the age of 18 and approximately 20% of participants were between the ages of 18 and 30. A considerable percentage of our participants was over the age of 40. Similarly, Figure 4.b illustrates the distribution of participants in the empirical study in terms of computer knowledge level.

Figure 4.a: Distribution of participants by age, in five groups: 1) 14-18, 2) 18-30, 3) 30-40, 4) 40-50, 5) 50+ (bar values 12.5%, 20.8%, 22.9%, 27% and 16.6%)

Figure 4.b: Distribution of participants by computer knowledge level: 1) Not good, 2) Good, 3) Very good, 4) Excellent

3.1. Keyboard mode analysis
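The keyboard-mode analysis relies on simple keystroke statistics such as typing speed and backspace frequency. Purely as an illustration (the paper does not give an implementation, and all names here are hypothetical), such features could be extracted from a timestamped key log as follows:

```python
def keystroke_features(events):
    """events: chronological list of (timestamp_in_seconds, key) pairs.
    Returns the backspace rate and the mean inter-key interval
    (a larger interval means slower typing)."""
    keys = [key for _, key in events]
    backspace_rate = keys.count("BACKSPACE") / len(keys)
    times = [t for t, _ in events]
    gaps = [b - a for a, b in zip(times, times[1:])]
    mean_interval = sum(gaps) / len(gaps)
    return {"backspace_rate": backspace_rate, "mean_interval": mean_interval}

# Hypothetical log: frequent backspaces and slow typing may accompany negative affect.
log = [(0.0, "h"), (0.4, "e"), (1.1, "BACKSPACE"), (1.6, "e"), (2.0, "l")]
print(keystroke_features(log))  # backspace rate 0.2, mean interval of about 0.5 s
```

Trends in such features (e.g., a rising backspace rate) could then be correlated with the self-reported emotional states, much as the human experts did in the study described below.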
was recorded into a database of video clips. A monitoring component recorded the actions of users and stored the data in the database. After they had finished using the educational application, participants were asked to watch the video clips concerning exclusively their personal interaction and to determine in which situations they were experiencing changes in their emotional state. They then associated each change in their emotional state with one of the six basic emotion states of our study and the data was recorded with a time stamp.

As a next step, the collected transcripts were given to 20 human expert-observers who were asked to perform emotion recognition with regard to the six emotion states, namely happiness, sadness, surprise, anger, disgust and neutral. All the human experts possessed a first and/or higher degree in Computer Science. In the case of the keyboard empirical study, the human expert-observers analyzed the data corresponding to the keyboard input only. Thus, they were asked to watch the video tape and were also given a printout of what the user had written, as well as the exact time of each event as it was captured by the user monitoring component. Correct answers and wrong answers, as well as the consequences of these events, were analyzed as long as these events involved only the keyboard mode of interaction. Frequent use of backspace and other basic keyboard buttons was also recorded and associated with specific emotional states.

The empirical study revealed that when participants were nervous the probability of making mistakes while typing increased rapidly. This was also the case when they had negative feelings. Mistakes in typing were followed by many backspace-key strokes and concurrent changes in the emotional state of the user in 82% of cases. Moreover, users under 20 years old, as well as users over 20 years old with a low educational background, seemed to be more prone to making even more mistakes as a consequence of an initial mistake and to losing their concentration while interacting with the educational application (67%). From the analysis of the data it also became clear that when the participants were angry the rate of mistakes increased, their typing became slower (62%) (on the contrary, when they were happy they typed faster, 70%) and the keystrokes on the keyboard became harder (65%). The pressure of the user's fingers on the keyboard would be of further help in recognizing the emotional states of users, but was something that could not be measured without the appropriate hardware. Similar effects in the interaction via the keyboard were reported for the emotions of boredom and anger.

Finally, Table 1 illustrates the percentages of successful emotion recognition by the human experts concerning the participants' emotional states and the keyboard mode of interaction. These percentages result from the comparison of the emotional states recognized by the human experts with the actual emotional states as recorded by the participants themselves.

Table 1. Percentages of successful emotion recognition by human experts (keyboard mode).

  Emotional state   Recognition by human experts
  Neutral           65%
  Happiness         60%
  Sadness           57%
  Surprise           5%
  Anger             74%
  Disgust            4%

3.2. Audio-lingual information

The participants were asked to use an application which incorporated a user monitoring component. The basic function of this application was to capture all the data inserted by the user orally. The data was recorded to a database and video clips. This component recorded the actions of users from the microphone. The transcripts collected were then given to 10 human expert-observers who were asked to perform emotion recognition with regard to the six emotion states, namely happiness, sadness, surprise, anger, disgust and neutral. The human expert-observers analyzed the data corresponding to the audio-lingual input only.

The audio-lingual empirical study gave important results about the strengths and weaknesses of emotion recognition based on the audio-lingual modality. The human experts' recognition rates for the six emotions showed that some emotions, such as sadness, are easily recognized through the audio-lingual modality, while others, such as surprise, are more difficult to recognize through it. This made the application more accurate in recognizing emotions. Thus, the database that resulted from the empirical study could be made more accurate. In particular, the results from the empirical study could either be confirmed by the recognition of the users' emotions in real time (by using the user modeling component) or, in some cases, specific results could be reconsidered. Combining the two phases of the empirical study, we came up with a database constructed from the users' oral input and containing
words, phrases and exclamations. At the same time, changes in voice volume and voice pitch were recorded in relation to the oral input.

Table 2 illustrates the results of the empirical study in terms of the audio-lingual input via microphone and the six basic emotions (neutral, happiness, sadness, surprise, anger, disgust) recognized by the human expert-observers. For each emotion we can remark the percentages of the users' oral reaction or the absence of audio input. Furthermore, we can consider the changes in the users' voice (volume and pitch) while saying a word or phrase, or saying an exclamation, in each emotional situation. For example, a user in surprise may have said an exclamation (58%) rather than having spoken a word (24%). This action is recognised to a degree of 66% accompanied by an increase in the user's volume. In addition, the audio-lingual empirical study supplied us with a significant number of words, phrases and exclamations, which are used in the creation of an "active grammar", with specific words and phrases for emotion recognition.

Table 2. Basic emotional states through microphone - empirical study results. For each emotion: the percentage of users who said an exclamation, said a certain word or phrase, or said nothing, together with the percentages of accompanying changes in voice volume and pitch.

             Say an exclamation           Say a certain word or phrase   Say nothing
  Emotion     %    in volume  in pitch     %    in volume  in pitch        %
  Neutral     6%     45%       18%        22%     37%       12%           72%
  Happy      31%     40%       37%        45%     55%       25%           24%
  Sadness     8%     52%       34%        28%     44%       14%           64%
  Surprise   58%     66%       23%        24%     60%       21%           18%
  Anger      39%     62%       58%        41%     70%       62%           20%
  Disgust    50%     64%       43%        39%     58%       38%           11%

We may remark that the neutral emotion and sadness could not be recognized easily in this modality, because users did not express themselves orally when they were in these moods. Users expressing the emotions of surprise and disgust usually said an exclamation (58% and 50%, respectively), while happy users and users in anger would likely say a word or phrase contained in our 'emotional database' of words and phrases (45% and 41%, respectively). Particularly in regard to the emotions of surprise and anger, users would increase the volume of their voice while saying something, while in anger and disgust we could remark the greatest changes in the users' voice pitch.

4. Combination of the results from the three modes of interaction

In the following Table 3, we present the results of our two empirical studies in a comparative manner. The rates of correct identification of each emotion with use of the visual-facial modality and with the combination of the audio-lingual information and keyboard-stroke pattern information are shown in the left and right columns, respectively.

The aim of the paper is to present concisely empirical studies concerning emotion recognition (audio, visual and keyboard) and then to come up with results regarding the combination of these modes. Our purpose is to set the empirical basis for creating an improved tri-modal affective system. The empirical data from the three modes will eventually be combined using a multi-criteria decision theory, where each mode will be a criterion. The percentages of emotion recognition for each emotion and for each mode will be analyzed and then used as weights in order to improve the accuracy of the system and determine the prevailing emotion. Figure 3 illustrates the system emotion recognition process, after collecting data from the three modes of interaction, namely the audio mode, the visual mode and interaction through the keyboard.

The three modes of interaction are complementary to a high degree. In cases where all modes have high or low percentages of successful emotion recognition, we still gain from their combination, since we improve the probability of emotion recognition.

5. Conclusions

In this paper we have described and discussed three empirical studies that concern the audio-lingual, the visual-facial and the keyboard-based recognition of human users' emotions from the perspective of human observers. These modalities are complementary to each other and, thus, can be used in a multi-modal affective computer system that can perform affect recognition taking into account the strengths and weaknesses of each modality.

From the empirical studies we found that certain emotion states, such as neutral and surprise, are more clearly recognized from the visual-facial modality rather than from the audio-lingual and keyboard-stroke pattern information. Other emotion states, such as anger and disgust, are more clearly recognized from the
audio-lingual and the keyboard-stroke modalities, respectively, rather than from the visual-facial one.

Table 3. Comparative presentation of empirical study results. The mean value averages the keyboard-stroke pattern and audio-lingual percentages.

  Facial       Visual-facial   Keyboard-stroke   Audio-lingual      Mean value
  expression   modality (%)    patterns (%)      information (%)    (%)
  Neutral        61.74           65                18                 41.50
  Surprise       92.61            5                62                 33.50
  Anger          72.92           74                79                 76.50
  Happiness      82.57           60                46                 53.00
  Sadness        58.33           57                48                 52.50
  Disgust        16.19            4                57                 30.50

There is ongoing research on the construction of an affective user interface that will use the different modalities as criteria for the recognition of emotions and will use the performance results for each modality as the basis for the specification of weights. This and other related work is going to be presented in a future publication.

6. References

[1] Goleman, D.: Emotional Intelligence, Bantam Books, New York (1995)
[2] Picard, R.W.: Affective Computing: Challenges, Int. Journal of Human-Computer Studies, Vol. 59, Issues 1-2 (2003) 55-64
[3] Graf, H.P., Cosatto, E., Strom, V., Huang, F.J.: Visual Prosody: Facial Movements Accompanying Speech, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (2002) 381-386
[4] Pantic, M., Rothkrantz, L.J.M.: Toward an Affect-Sensitive Multimodal Human-Computer Interaction, Proceedings of the IEEE, Vol. 91 (2003) 1370-1390
[5] Chen, L.S., Huang, T.S., Miyasato, T., Nakatsu, R.: Multimodal Human Emotion/Expression Recognition, in Proc. of Int. Conf. on Automatic Face and Gesture Recognition, IEEE Computer Soc., Nara, Japan (1998)
[6] De Silva, L., Miyasato, T., Nakatsu, R.: Facial Emotion Recognition Using Multimodal Information, in Proc. IEEE Int. Conf. on Information, Communications and Signal Processing (ICICS'97) (1997) 397-401
[7] Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C.M., Kazemzadeh, A., Lee, S., Neumann, U., Narayanan, S.: Analysis of Emotion Recognition Using Facial Expressions, Speech and Multimodal Information, Proceedings of the 6th International Conference on Multimodal Interfaces, State College, PA, USA (2004) 205-211
[8] Stathopoulou, I.-O., Tsihrintzis, G.A.: Detection and Expression Classification Systems for Face Images (FADECS), 2005 IEEE Workshop on Signal Processing Systems (SiPS'05), Athens, Greece, November 2-4, 2005
[9] Stathopoulou, I.-O., Tsihrintzis, G.A.: Facial Expression Classification: Specifying Requirements for an Automated System, 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems, Bournemouth, United Kingdom, October 9-11, 2006
[10] Stathopoulou, I.-O., Tsihrintzis, G.A.: NEU-FACES: A Neural Network-based Face Image Analysis System, 8th International Conference on Adaptive and Natural Computing Systems, Warsaw, Poland, April 11-14, 2007
[11] Alepis, E., Virvou, M., Kabassi, K.: Affective Student Modeling Based on Microphone and Keyboard User Actions, Proceedings of the Sixth IEEE International Conference on Advanced Learning Technologies (ICALT), Kerkrade, The Netherlands (2006) 139-141
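As a closing illustration of the multi-criteria combination outlined in Section 4 (each mode a criterion, per-mode recognition rates as weights), one simple reading is a weighted vote over the modes' individual guesses. This is a sketch of ours only, with weights rounded from Table 3; none of the names come from the paper:

```python
# Per-mode recognition rates for three of the emotions, rounded from Table 3.
MODE_WEIGHTS = {
    "visual":   {"surprise": 0.93, "anger": 0.73, "disgust": 0.16},
    "keyboard": {"surprise": 0.05, "anger": 0.74, "disgust": 0.04},
    "audio":    {"surprise": 0.62, "anger": 0.79, "disgust": 0.57},
}

def prevailing_emotion(mode_guesses):
    """mode_guesses maps mode -> emotion guessed by that mode; each guess is
    weighted by the mode's recognition rate for that emotion."""
    scores = {}
    for mode, emotion in mode_guesses.items():
        scores[emotion] = scores.get(emotion, 0.0) + MODE_WEIGHTS[mode].get(emotion, 0.0)
    return max(scores, key=scores.get)

# The visual mode sees surprise, but keyboard and audio agree on anger,
# and together (0.74 + 0.79) they outweigh the single visual vote (0.93).
print(prevailing_emotion({"visual": "surprise", "keyboard": "anger", "audio": "anger"}))
# -> anger
```

An actual system would, of course, cover all six emotions and normalize the weights; the point here is only the weighting scheme.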