

Facial Expressions Recognition in Filipino Sign Language:
Classification using 3D Animation Units
Joanna Pauline Rivera
De La Salle University, Malate, Manila
joanna.rivera@dlsu.edu.ph

Clement Ong
De La Salle University, Malate, Manila
clem.ong@delasalle.ph

ABSTRACT
Filipino Sign Language (FSL) is a mode of communication for the Deaf in the Philippines. Though it has two components, namely manual and non-manual signals, most research works focus on manual signals only. However, non-manual signals play a significant role in Sign Language Recognition (SLR) as they can be mixed freely with manual signals, often changing the meaning of the signs [2].

Although it is possible to use FSL without facial expressions, it would be difficult to carry out a conversation, especially when telling a story or demonstrating the amount of pain, which both involve different types of facial expressions [9], [10]. Being able to recognize the non-manual signals, especially the facial expressions, will greatly improve the performance of SLR systems for FSL.

This study aims to recognize facial expressions in FSL using Animation Units (AU) based on peak facial expressions. It describes the different types of facial expressions in FSL. These types are used as labels for the 3D data gathered using Microsoft Kinect for Windows 2.0 [15]. It also discusses the collection, representation, and annotation of the data, and the machine learning process with the results and analysis.

CCS CONCEPTS
· Computing methodologies → Image representations; Supervised learning by classification;

KEYWORDS
Filipino Sign Language, facial expression recognition, animation units

ACM Reference Format:
Joanna Pauline Rivera and Clement Ong. 2018. Facial Expressions Recognition in Filipino Sign Language: Classification using 3D Animation Units. In Proceedings of the 18th Philippine Computing Science Congress (PCSC 2018).

1 INTRODUCTION
Several researches regarding Sign Language Recognition (SLR) have already been conducted. SLR intends to teach a computer how to precisely identify a series of signs and to understand their meaning through developing algorithms and methods. A common misconception in SLR is approaching the problem through Gesture Recognition (GR) alone [4].

Sign language is a multi-modal language that has two components: manual and non-manual signals. Manual signals are hand gestures, positions, and shapes which convey lexical information. Non-manual signals are facial expressions, head movements, and upper body posture and movements which express syntactic and semantic information. These two signals usually co-occur [17]. Hence, SLR systems or techniques would yield better results if non-manual signals are considered as well [21].

As researchers realize the significance of non-manual signals, some studies focusing on facial expression recognition in Sign Languages were conducted [21]. Unlike most facial expression recognition systems that focus on Affective Facial Expressions (AFE), specifically emotions, these studies focus more on Grammatical Facial Expressions (GFE) that convey semantic functions such as WH-question, Topic, and Assertion. GFE differ from AFE in terms of the facial muscles used [13] and their behavior, such as form and duration [16].

Most of the studies that focus on recognizing facial expressions in Sign Language mainly differ in data representation and machine learning. One similarity among them is that they all focused on GFEs that convey semantic functions such as WH-question, Topic, and Assertion. On the other hand, facial expressions in sign languages are not limited to those semantic functions. For instance, facial expressions in ASL are used to convey degrees of adjectives (e.g. color intensity), adverbial information (e.g. carelessly), and emotions as well [12].

In Filipino Sign Language (FSL), facial expressions are used in a similar way. However, very minimal research has been done regarding facial expressions and their integration with the manual signals [12], even though facial expressions play a significant role in conversations as they can be mixed freely with manual signals [2].

This study aims to recognize facial expressions in FSL using Animation Units based on peak facial expressions. The rest of the paper is organized as follows. In section 2, the components of FSL and the different types of facial expressions which are used as the annotation labels are described. In section 3, the data collection, representation, and annotation are discussed. In section 4, the machine learning technique applied is discussed. In section 5, the experimental setups are enumerated and described. In section 6, the results and analysis are explained for each category. In section 7, a comparison to other works is described. Lastly, conclusions and recommendations are listed in section 8.

2 FILIPINO SIGN LANGUAGE
Filipino Sign Language (FSL) is a mode of communication used by the Deaf individuals in the Philippines. It originally took root from American Sign Language (ASL) after the Spanish-American War in 1899 [7]. Nonetheless, it soon developed into a language separate from ASL as the Filipino Deaf used it in communication at school, passing on Filipino signs that emerged naturally through generations and adding new signs, especially those related to technology [1].

FSL involves one or both hands, the face, and other parts of the body. It has five components: hand shape, location, palm orientation, movement, and non-manual signal. Hand shape is the arrangement of the fingers and their joints. Location is the place where the hand/s is/are. Palm orientation is where the palm is facing. Movement is the change in hand shape and/or path of the hands. Non-manual signals are facial expressions and/or movements of the other parts of the body that go with the signing [3].

Non-manual signals are shown using the brows, eyes, nose, lips, cheek, mouth, mouthing, tongue, eyes and head, brows and head, teeth and head, head, whole face and shoulder [3]. There are two major categories of facial expressions in FSL: Affective Facial Expressions (AFE) and Grammatical Facial Expressions (GFE). AFEs are used to convey emotions or feelings. GFEs are used to convey lexical information, types of sentences, degrees of adjectives, and emotions [10].

Lexical information refers to words that are part of the lexicon of FSL such as 'thin' and 'fat'. The types of sentences that are part of FSL are question, statement, and exclamation. The degrees of adjectives have three basic levels: absence, presence, and high presence [9].

Table 1: Types of facial expressions and sample sentences

Degrees of Adjectives
  Absence:        1. My head is not painful. 2. I do not like you. 3. I am not tired. 4. You are not slow. 5. This is not hard.
  Presence:       1. My head is painful. 2. I like you. 3. I am tired. 4. You are slow. 5. This is hard.
  High Presence:  1. My head is very painful. 2. I like you. 3. I am tired. 4. You are so slow. 5. This is very hard.

Emotions
  Happy:                    1. Thank you. 2. The trip is amazing. 3. The show is amazing. 4. I am proud of you! 5. Our team won!
  Sad:                      1. I am sorry. 2. My dog died. 3. I am alone. 4. I am heartbroken. 5. I failed the exam.
  Stationary Danger:        1. I hate you! 2. You are disgusting! 3. I do not like you! 4. You are so slow! 5. Stay away from me!
  Fast-approaching Danger:  1. I am scared. 2. I am nervous. 3. I am worried. 4. I saw a ghost. 5. I am shocked!

3 DATA PREPARATION
3.1 Data Collection
As most of the studies regarding FSL focused on the manual signals, the existing datasets may not have all the types of facial expressions. Thus, data focusing on the facial expressions in FSL is gathered.

The data gathering is performed in a controlled environment using a Microsoft Kinect for Windows v2.0 sensor (Kinect sensor) to capture three-dimensional video. The Kinect sensor is a hardware device that has a depth sensor, a color camera, an infrared (IR) emitter, and a microphone array. These components allow the sensor to track the location, movement, and voice of people [22].

The participants are two male and three female third-year FSL students, ranging from 20-24 years old. They have been learning and using FSL for 2-4 years. Most of them have 1-3 months of learning experience with facial expressions at school, while some of them have 1-3 years of experience. Each of the participants signed five sentences for each type of facial expression, formulated with the help of their FSL professor. Refer to Table 1 for the complete list of sentences signed.

3.2 Data Representation
Features such as color images, depth images, audio input, and skeletal data are extracted using the Kinect sensor. To represent the data, these features are then processed with the help of the Microsoft Kinect for Windows Software Development Kit 2.0 (Kinect SDK) [15].

Most international research works have concluded that the eyes, eyebrows, mouth, nose, and head pose must be well-represented to achieve effective recognition. Thus, face orientation, Shape Units (SU), and Animation Units (AU) are used for this study. The face orientation is the computed center of the head, which is used to calculate the rotation angles of the head with respect to the optical center of the camera of the Kinect sensor (i.e. pitch, yaw, and roll). The SUs are weights that indicate the differences between the shape of the tracked face and the average shape of a person, derived using the Active Appearance Model (AAM) of [20]. These SUs are used to indicate the neutral shape of the face, which is derived from the first few frames. The AUs are the changes in the facial features of the tracked face relative to the neutral shape.
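To make this representation concrete, the sketch below assembles one sample from the three head rotation angles and the seventeen AU deltas. This is an illustrative Python sketch only; the `frame` dictionary and its field names are assumptions standing in for the values exported from the Kinect SDK, not the SDK's actual API.

```python
import numpy as np

# Order of the seventeen Animation Units used as features (see Table 2).
AU_NAMES = [
    "JawOpen", "LipPucker", "JawSlideRight", "LipStretcherRight",
    "LipStretcherLeft", "LipCornerPullerRight", "LipCornerPullerLeft",
    "LipCornerDepressorLeft", "LipCornerDepressorRight", "LeftCheekPuff",
    "RightCheekPuff", "LeftEyeClosed", "RightEyeClosed",
    "RighteyebrowLowerer", "LefteyebrowLowerer",
    "LowerlipDepressorLeft", "LowerlipDepressorRight",
]

def frame_to_feature_vector(frame):
    """Build the 20-dimensional feature vector for one tracked frame.

    `frame` is a hypothetical dict holding the exported values: the head
    rotation angles and the AU deltas relative to the neutral face that
    was estimated from the first few frames.
    """
    head_pose = [frame["pitch"], frame["yaw"], frame["roll"]]
    au_deltas = [frame["animation_units"][name] for name in AU_NAMES]
    return np.array(head_pose + au_deltas, dtype=float)
```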

The final list of features comprises the pitch, yaw, and roll angles, and the seventeen AUs shown in Table 2. Most of the values range from 0 to 1. A negative minimum value indicates a delta in the opposite direction. For example, if the delta for EyebrowLowerer is -1, the eyebrows are raised instead of lowered.

Table 2: Face Animation Units and their minimum and maximum weights

Animation Unit              Min Value   Max Value
JawOpen                      0          +1
LipPucker                    0          +1
JawSlideRight               -1          +1
LipStretcherRight            0          +1
LipStretcherLeft             0          +1
LipCornerPullerRight         0          +1
LipCornerPullerLeft          0          +1
LipCornerDepressorLeft       0          +1
LipCornerDepressorRight      0          +1
LeftCheekPuff                0          +1
RightCheekPuff               0          +1
LeftEyeClosed                0          +1
RightEyeClosed               0          +1
RighteyebrowLowerer         -1          +1
LefteyebrowLowerer          -1          +1
LowerlipDepressorLeft        0          +1
LowerlipDepressorRight       0          +1

3.3 Data Annotation
Most studies internationally focused on semantic functions such as WH-question, Topic, and Assertion. However, the degrees of adjectives and the emotions are also widely used in FSL, for instance when expressing the level of pain or when telling a story. Thus, in this study, the focus is on the degrees of adjectives and the emotions.

To scope down the number of classes, the emotions used in this study are the four basic emotions (i.e. happy, sad, fast-approaching danger, and stationary danger) of [8]. In their work, they have shown that there are only four basic emotions, which were only discriminated into six (i.e. happy, sad, fear, surprise, disgust, and anger) as time passed. Fear and surprise can be combined as they both belong to fast-approaching danger, while disgust and anger both belong to stationary danger.

The annotation labels are then limited to: absence, presence, high presence, happy, sad, fast-approaching danger, and stationary danger.

Supposedly, the annotation of each sentence is its intended type since the facial expressions are acted. However, due to the different experience levels of the signers (i.e. some have 1-3 months of experience, while some have 1-3 years of experience), it is possible that some of their acted facial expressions are unreliable. Thus, the videos are annotated by an FSL annotation expert to lessen the inaccuracy of the acted facial expressions.

After the annotation, co-occurrences of the different classes are discovered. The classes from the major categories co-occur in an instance. As a result, there is a maximum of two labels in an instance. For example, the facial expression for a specific sign can be a combination of presence (one of the degrees of adjectives) and sad (one of the emotions). Thus, individual experiments were conducted for Degrees of Adjectives and Emotions, each applying ANN and SVM as classification techniques.

4 MACHINE LEARNING
Before the data undergoes classification, some preprocessing tasks are performed. In particular, only samples based on peak facial expressions are selected, since some frames between the neutral and peak facial expressions showed hand occlusions on the face. This also ensures that rising and falling facial expressions are excluded from the samples. Afterwards, uniform undersampling is applied since the data for Degrees of Adjectives is imbalanced. Normalization through z-transformation is also applied due to the different ranges of feature values.

Then, feature selection is applied to determine the most effective features for each category. The Wrapper Subset Evaluation dominated in terms of improving the accuracy; however, it is computationally expensive [6]. Thus, a Genetic Algorithm is applied to reduce the amount of resources needed.

Some of the most commonly used classifiers in recent studies on Facial Expression Recognition in Sign Language are the Artificial Neural Network (ANN) and the Support Vector Machine (SVM) [11]. An ANN with a learning rate of 0.3 and an SVM with a radial basis function kernel are applied for this study.

Lastly, the validation technique used is 10-fold Cross-Validation. Machine learning was applied using RapidMiner [19].
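The study runs this process in RapidMiner; the sketch below is only a rough Python equivalent of the preprocessing and classification steps described above (uniform undersampling, z-transformation, an ANN with learning rate 0.3, an SVM with an RBF kernel, and 10-fold cross-validation), assuming feature vectors `X` and labels `y` built from the peak frames. The genetic-algorithm wrapper feature selection is omitted from the sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def undersample_uniform(X, y, seed=0):
    """Keep an equal number of samples per class (uniform undersampling)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def evaluate(X, y):
    """10-fold cross-validation for the two classifiers used in the study."""
    Xb, yb = undersample_uniform(X, y)
    models = {
        # z-transformation, then an SVM with a radial basis function kernel
        "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        # a small feed-forward ANN; learning_rate_init=0.3 follows the paper
        "ANN": make_pipeline(StandardScaler(),
                             MLPClassifier(learning_rate_init=0.3, max_iter=1000)),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        scores = cross_val_score(model, Xb, yb, cv=cv)
        print(f"{name}: mean accuracy {scores.mean():.2%}")
```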
5 EXPERIMENTAL SETUPS
There are a total of six experimental setups, which can be categorized into three: Partitioning-based, Feature-based, and Class-based. For each experimental setup, the classification techniques used are Support Vector Machine (SVM) and Artificial Neural Network (ANN).

5.1 Partitioning-based Setups
This category is based on the technique of splitting the dataset into training and test sets for the validation phase. There are two setups under this category: Partitioning by Participants, and Partitioning by Stratified Sampling. The purpose of this experiment is to test if introducing a new participant would make a difference. A sketch of both splitting schemes is given after the list.

(1) Partitioning by Participants
In this setup, there are five folds for the validation phase. In each fold, there are four participants in the training set while there is one participant in the test set. The purpose of this experiment is to gain insights into the performance when a new face structure is introduced.

(2) Partitioning by Stratified Sampling
Unlike Partitioning by Participants, the data for this setup is not segregated based on the participants but based on the sentences. In this setup, there are a total of 10 folds for the validation phase. In each fold, 90% of the sentences are in the training set, while 10% are in the test set.
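A minimal sketch of the two splitting schemes, assuming hypothetical arrays `X`, `y`, and `participants` (one signer ID per sample). Partitioning by Stratified Sampling is approximated here with a class-stratified 10-fold split; the paper stratifies over the sentences themselves.

```python
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

def partition_by_participants(X, y, participants):
    """Hold out one signer per fold; with five signers this yields five folds."""
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=participants):
        yield train_idx, test_idx

def partition_by_stratified_sampling(X, y, n_splits=10):
    """Ten folds: roughly 90% training and 10% testing, class-stratified."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        yield train_idx, test_idx
```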

5.2 Feature-based Setups
This category is based on the features used. Features are reduced in some setups, while features are added in others. There are two setups under this category: Classes as Features, and Feature Selection.

(1) Classes as Features
In this setup, classes from other categories are added as features. The purpose of this experiment is to represent the co-occurrences of different categories in one instance. Since all the classification techniques applied require numerical values, the Nominal to Numerical operator of RapidMiner is applied, where each class is converted into a binary feature. For example, each of the classes of Emotions is converted into an attribute, resulting in four attributes (i.e. Happy, Fast, Sad, and Stationary) with binary values. If an instance is labeled as Happy, the attributes will have values of 1, 0, 0, and 0 respectively (see the sketch after this list).

(2) Feature Selection
In this setup, features are selected with the aid of a Genetic Algorithm. The purpose of this experiment is to improve the performances and to determine which features best represent the classes.
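The 'Classes as Features' conversion mirrors what RapidMiner's Nominal to Numerical operator does: each co-occurring Emotion class becomes a binary attribute appended to the feature vector. A small illustrative sketch (the helper name and argument names are assumptions):

```python
import numpy as np

EMOTION_CLASSES = ["Happy", "Fast", "Sad", "Stationary"]

def add_classes_as_features(X, emotion_labels):
    """Append one binary column per Emotion class to the feature matrix.

    An instance labeled 'Happy' gets the extra columns [1, 0, 0, 0],
    mirroring the Nominal to Numerical conversion described above.
    """
    X = np.asarray(X, dtype=float)
    one_hot = np.array([
        [1.0 if label == c else 0.0 for c in EMOTION_CLASSES]
        for label in emotion_labels
    ])
    return np.hstack([X, one_hot])
```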
5.3 Class-based Setups
This category is based on the manipulation of the number of classes, or the classes themselves. There are two setups under this category: One versus Many, and One Class Removed. The purpose of these experiments is to pinpoint which classes are causing the confusions.

(1) One versus Many
For this setup, there are only two classes. One class is retained while the remaining classes are merged to form a new class. This experiment can show the distinction of a class versus the remaining classes, and the correlation of the merged classes.

(2) One Class Removed
For this setup, one class is removed for each category. This experiment can show the classes that are causing confusions.

6 RESULTS AND ANALYSIS
Generally, the performances are below average at first and only improve after reducing the classes for each category, reducing the features, and/or adding classes from other categories as features. These experiments have different effects on each category.

6.1 Degrees of Adjectives
The performances of the Partitioning-based setups do not differ much from one another, which suggests that introducing a new face to the classifiers does not have much impact. Despite the results being below average, a distinction between Absence and High Presence and a confusion of Presence with the other classes can be observed from the confusion matrices.

By removing Presence, the performances significantly improved, attaining the highest accuracy of 87.14% and kappa of 0.727 as shown in Table 3. This proves that there is a distinction between Absence and High Presence and that the classifiers cannot really distinguish Presence from the rest of the classes. This is because Absence and High Presence are polar opposites.

Table 3: Performances for Degrees of Adjectives using the different Experimental Setups

Setup                                      ANN       SVM
Partitioning by Participants               37.27%    48.71%
Partitioning by Stratified Undersampling   36.56%    48.22%
High Presence or Absence                   82.38%    87.14%
Adding Emotions                            51.22%    62.33%
Feature Selection                          43.11%    61.22%

Among the three classes, Presence stands for a neutral facial expression in the context of the Degrees of Adjectives. For example, the sentence 'You are slow' is not necessarily conveyed with a facial expression just to show the presence of the word 'slow', while the sentences 'You are not slow' and 'You are very slow' require a marker for the words 'not' and 'very' to convey the absence and high presence of the word 'slow' respectively. However, if Presence is mixed with an Emotion, Anger for example, the facial expression will change into a frowning face, which is similar to the characteristics of most sentences with Absence.

Based on the observations of the videos, High Presence can be differentiated from Presence based on the intensity of the facial expression of the sentence. For example, an extreme smiling face and an extreme frowning face must both be classified as High Presence. However, the head rotation angles and the seventeen AUs cannot represent different facial expressions with the same intensities.

The 'Classes as Features' setup intends to consider the co-occurrences of Sentences and Emotions with Degrees of Adjectives, and the intensities. The idea here is that the Degree of Adjective is Presence if the feature values are within the average given a co-occurring Emotion. It is High Presence if the feature values are higher than the average given a co-occurring Emotion. Using this setup, the inclusion of Emotions as part of the features has significantly improved the distinction of Presence from the other classes, reaching the highest accuracy of 62.33% and kappa of 0.435 using SVM. This proves that the co-occurrence of Emotions has an impact on the classification. Refer to Table 3 for the summary of performances.

Another observation from the videos is that Absence can also be differentiated from Presence by detecting the motion of head shaking, which involves yawing towards the left and right repeatedly. Although motions are captured using AUs, these AUs only represent the motion from the neutral face to the peak facial expression. Yawing towards the left or right can be captured; however, the whole motion of head shaking cannot be represented using AUs alone.

When Emotions are added as features and feature selection is applied, the highest accuracy reached 70.89% with a kappa of 0.562 using SVM. Nonetheless, Presence is still confused with Absence and High Presence as shown in Table 4.
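The accuracy, kappa, and confusion-matrix figures reported in this section can be reproduced from any of the cross-validation runs with standard utilities; a brief sketch, assuming `y_true` and `y_pred` arrays collected from one run:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

def summarize_run(y_true, y_pred, labels):
    """Print accuracy, Cohen's kappa, and the confusion matrix for one run."""
    print(f"accuracy: {accuracy_score(y_true, y_pred):.2%}")
    print(f"kappa:    {cohen_kappa_score(y_true, y_pred):.3f}")
    print(confusion_matrix(y_true, y_pred, labels=labels))
```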

Table 4: Confusion Matrix when Emotions are added as features and Feature Selection is applied on Degrees of Adjectives

                     true highpresence   true presence   true absence   class precision
pred. highpresence   23                  8               2              69.70%
pred. presence       3                   14              0              82.35%
pred. absence        5                   9               29             67.44%
class recall         74.19%              45.16%          93.55%

6.2 Emotions
Among the three categories, Emotions have the highest performances when the Partitioning-based setups are applied. However, unlike the other main categories, the 'Classes as Features' setup only improved the performances by a bit, as shown in Table 5, which implies that the co-occurrences of the other categories do not have much impact on Emotions.

Table 5: Performances for Emotions using the different Experimental Setups

Setup                                      ANN       SVM
Partitioning by Participants               51.00%    43.15%
Partitioning by Stratified Undersampling   53.55%    54.45%
Happy or not                               87.00%    85.33%
Fast or not                                68.50%    73.50%
Adding Sentences and Degrees               54.73%    58.18%
Feature Selection                          67.00%    72.91%

The highest performances are observed when the Class-based setups are applied, highlighting the distinction of Happy and Fast-approaching danger from the other classes, and the confusion between Sad and Stationary. Refer to Table 5 for the summary of performances.

Using feature selection with the One versus Many setup for Happy and Fast-approaching danger, higher performances were achieved. For Happy and Not Happy, the highest accuracy and kappa are 98.00% and 0.962 respectively using SVM. For Fast-approaching danger and Not Fast-approaching danger, the highest accuracy and kappa are 89.00% and 0.785 respectively using ANN.

Using feature selection without manipulating the classes, the highest performance reached 72.91% accuracy and 0.639 kappa. Still, Happy has the highest recall and precision, followed by Fast-approaching danger, as shown in Table 6.

Table 6: Confusion Matrix when Feature Selection is applied on Emotions

                   true happy   true stationary   true fast   true sad   class precision
pred. happy        26           0                 0           3          89.66%
pred. stationary   0            14                3           6          60.87%
pred. fast         2            1                 16          0          84.21%
pred. sad          0            11                2           19         59.38%
class recall       92.86%       53.85%            76.19%      67.86%

Based on the observation of the videos, Happy and Fast-approaching danger indeed have characteristics that are distinct from the other classes. Happy is mostly characterized by a smiling face, while Fast-approaching danger, which includes fear and surprise, is mostly characterized by eyes and/or mouth wide open. A smiling face is represented using RighteyebrowLowerer, LefteyeClosed, LipPucker, LipStretcherLeft, LipCornerPullerRight, LipCornerDepressorLeft, and LipCornerDepressorRight. Eyes and/or mouth wide open are represented using RighteyeClosed, JawOpen, LipPucker, LipStretcherRight, LipCornerDepressorLeft, and LowerlipDepressorRight. Since the distinct characteristics of Happy and Fast-approaching danger are represented using the seventeen AUs, these classes are easily distinguishable from the other classes. In addition, the Face Tracking SDK is commonly used to detect if the user is Happy, which is probably why Happy is well-represented by its output features.

On the other hand, Sad and Stationary danger are both highly characterized by a frowning face based on the observation of the videos. A frowning face is mostly composed of slightly closed eyes, lowered eyebrows, and stretched-down corners of the lips. These features are included in the seventeen AUs. However, the differences are very subtle for the computer to recognize. The reason why the annotator was able to recognize them is likely because they know the meaning of the gestures accompanying the facial expression. Take for example the sentence 'You are disgusting!'. Saying this would normally mean that you have the feeling of disgust or hate.
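The One versus Many runs reported above ('Happy or not', 'Fast or not') amount to collapsing all remaining classes into a single negative class before training; a minimal sketch, with the helper name being an assumption:

```python
import numpy as np

def one_versus_many(y, positive_class):
    """Relabel a multi-class target into `positive_class` vs. everything else."""
    y = np.asarray(y)
    return np.where(y == positive_class, positive_class, f"not-{positive_class}")

# Example: the 'Happy or not' and 'Fast or not' setups for the Emotions category.
# y_happy = one_versus_many(y_emotions, "Happy")
# y_fast = one_versus_many(y_emotions, "Fast")
```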

7 COMPARISON TO OTHER WORKS
Although there are no other studies on Facial Expressions in FSL, as far as this research is concerned, several studies can be found for other Sign Languages. However, only a few of them actually did classification of the types of facial expressions. The following researches are compared with this study:

(1) Recognition of Nonmanual Markers in American Sign Language (ASL) Using Non-Parametric Adaptive 2D-3D Face Tracking [14]
(2) Recognizing Continuous Grammatical Marker Facial Gestures in Sign Language Video [18]
(3) Tracking Facial Features under Occlusions and Recognizing Facial Expressions in Sign Language [17]

The studies listed above focused on American Sign Language (ASL) with 6 types of facial expressions that convey semantic functions: Negation, WH-questions, Yes/No Questions, Topic, Focus, Conditional and When. Studies [1] and [3] reduced the number of classes to 4 and 5 respectively, while study [2] added undefined facial expressions as part of the types. In this study, the focus is on the Degrees of Adjectives and Emotions. The major difference of this study from the other studies is that co-occurrences of different types of facial expressions are considered.

Comparing the accuracies shown in Table 7, studies [1] and [2] are better than this study, while this study has better results than study [3]. The other studies did not have many samples either, while the number of their participants is even smaller. This suggests that neither the number of samples nor the number of participants is the issue.

Table 7: Table of comparison to other works (1/2)

[1]  # of Classes: 5;  Samples per Class: 37-106;  Participants: -;  Features: Eyebrow height, Eye height, Head poses;  Performance: Accuracy 88.16%
[2]  # of Classes: 7;  Samples per Class: ~30;  Participants: 3;  Features: Eyebrow height, Eye height, Inner eye corners, Middle of the nose;  Performance: Accuracy 80.82%
[2]  # of Classes: 7;  Samples per Class: ~30;  Participants: 3;  Features: Eyebrow height, Eye height, Inner eye corners, Middle of the nose;  Performance: Accuracy 84.19%
[3]  # of Classes: 4;  Samples per Class: ~30;  Participants: 3;  Features: Eyebrow height and distance, Eye height, Eye inner corners distance, Motion vectors of eye inner corners, Motion vectors of nose corners;  Performance: Accuracy 63.00%
[4]  # of Classes: 3 for Degrees of Adjectives, 4 for Emotions;  Samples per Class: 15-169, 31-136, 21-28;  Participants: 5;  Features: Head rotation angles, 17 Animation Units representing eyebrows, eyes, mouth, cheeks and jaw, Classes from other categories;  Performance: Accuracy 70.89%, 72.91%

Looking at the feature detection methods, each study applied a different approach. The only similarity between them is that both methods focused on being robust to occlusions by the manual signals. On the other hand, the method used for this study, Kinect 2.0 HD Face Tracking, is focused on detecting the changes on a face by using the Active Appearance Model (AAM) of [20]. AAM is one of the face tracking models used to achieve global shape optimization, but it is sensitive to occlusions on the face [14]. The whole face shape is usually distorted even when partial hand occlusions occur. Hence, the frames with hand occlusions were removed for this study, resulting in the loss of potentially significant frames that could have been used for capturing motions.

Another significant difference of this study from the other works is the representation of temporal information. The other studies used several consecutive frames to represent motions, which was proven effective in the studies of [21], [5], and [18]. However, these approaches cannot be applied in this study due to the removal of frames with hand occlusions.

Therefore, the possible reasons why studies [1] and [2] are better than this study are the feature detection methods and the representation of temporal information. Nonetheless, this study performed better than study [3] despite the latter having the same dataset and feature detection method as study [2] and using several consecutive frames to represent motions. This is due to the classification technique that was applied, as stated by the authors.

Refer to Table 7 and Table 8 for the summary of comparison of this study with the other works. Study [4] refers to this research, while studies [1], [2], and [3] are the other works as enumerated previously.

8 CONCLUSIONS AND RECOMMENDATION
Being able to recognize the different types of facial expressions in FSL can alleviate the gap in the communication between signers and non-signers. Particularly, recognizing the Degree of Adjective can help doctors determine the amount of pain a signer feels, while recognizing the Emotions can help psychologists understand the emotional state of the signer [9].

In this research study, a system was developed to recognize Facial Expressions in Filipino Sign Language (FSL) using head rotation angles and Animation Units (AU) based on peak facial expressions. The head rotation angles include pitch, yaw, and roll, while the seventeen AUs are deltas of the different facial features from the neutral to the current facial expression. Individual experiments are conducted for Degrees of Adjectives and Emotions. The Degrees of Adjectives are Absence, Presence, and High Presence. The Emotions are Happy, Fast-approaching danger (i.e. fear and surprise), Sad, and Stationary danger (i.e. disgust and anger). The classification techniques used are Artificial Neural Network (ANN) and Support Vector Machine (SVM).

For the Partitioning-based setups, there are not many differences in the performances, which suggests that introducing a new face would not have much impact on the classification. This is because AUs are deltas and not the exact points on the face. Thus, AUs are effective in representing different facial structures as long as the participants are all expressive. For the Class-based setups, the highest performances for all categories are observed, which implies that the features are lacking to represent all the classes. This is because motions and context, which are significant for some of the classes, are not represented. For the Feature-based setups, adding classes from other categories as features showed better performances than when the features are left as is, except for Emotions. Thus, this approach can be used to represent the co-occurrences of the different categories in one instance. Consistently, Feature Selection has improved the performances for all the categories.

As concluded, in addition to head rotation angles and AUs, motions must be captured to represent the data better. In the studies of [21], [5], and [18], representations of motions through the inclusion of temporal information such as Sliding Window and Spatio-Temporal Pyramids improved their recognition rates. In the studies of [14] and [18], machine learning techniques that can handle temporal dynamics by making use of sequential data were applied, such as the Hidden Markov Model and the Conditional Random Field. Motion is not represented in this study since some frames were dropped due to hand occlusions. Based on the other works, removing the frames with hand occlusions is not the solution to the problem. Instead, feature detection techniques that can handle occlusions must be applied, such as the Kanade-Lucas-Tomasi (KLT) tracker with a Bayesian Feedback Mechanism and Non-parametric Adaptive 2D-3D Tracking in the studies of [14] and [18], instead of AAM-based methods. Aside from motions, intensities and co-occurrences must also be represented well. In this study, the attempt to represent these is adding classes from other categories as features. Significantly better performances were observed using this setup. However, this approach can still be improved. One way to recognize the intensities could be applying a Convolutional Neural Network (CNN) for classification, which is one of the state-of-the-art approaches of deep learning for images. Lastly, an integration of the gesture data can be explored as it contains context that might be significant in distinguishing the facial expressions. As the annotator and other FSL experts have mentioned, annotating the data without seeing the gestures is possible but it would be difficult.

Table 8: Table of comparison to other works (2/2)

[1]  Feature Detection: Non-Parametric Adaptive 2D-3D Face Tracking;  Temporal Information: Sliding Window = 5;  Classification Technique: HM-SVM
[2]  Feature Detection: KLT (Kanade-Lucas-Tomasi) Tracker with Bayesian Feedback Mechanism;  Temporal Information: Motion Vector (i.e. delta between current and previous);  Classification Technique: 2-layered CRF
[2]  Feature Detection: KLT (Kanade-Lucas-Tomasi) Tracker with Bayesian Feedback Mechanism;  Temporal Information: Motion Vector (i.e. delta between current and previous);  Classification Technique: 2-layered CRF
[3]  Feature Detection: KLT (Kanade-Lucas-Tomasi) Tracker with Bayesian Feedback Mechanism;  Temporal Information: Motion Vector (i.e. delta between current and previous);  Classification Technique: HMM-NN
[4]  Feature Detection: Kinect 2.0 HD Face Tracking;  Temporal Information: Animation Units (i.e. delta between current and neutral);  Classification Technique: SVM, and SVM respectively

ACKNOWLEDGMENTS
The authors would like to thank Ms. Maria Elena Lozada, Mr. Rey Alfred Lee and Mr. John Xander Baliza for imparting their knowledge on Filipino Sign Language.

The authors would also like to express their gratitude to the anonymous FSL students who agreed to participate in the data gathering.

REFERENCES
[1] Yvette S. Apurado and Rommel L. Agravante. 2006. The phonology and regional variation of Filipino Sign Language: considerations for language policy. In 9th Philippine Linguistic Congress.
[2] Ed Peter Cabalin, Liza B. Martinez, Rowena Cristina L. Guevara, and P. C. Naval. 2012. Filipino Sign Language recognition using manifold projection learning. In TENCON 2012 - 2012 IEEE Region 10 Conference. IEEE, 1–5.
[3] Philippine Deaf Research Center and Philippine Federation of the Deaf. 2004. An Introduction to Filipino Sign Language: Traditional and emerging signs. Philippine Deaf Resource Center, Incorporated.
[4] Helen Cooper, Brian Holt, and Richard Bowden. 2011. Sign language recognition. In Visual Analysis of Humans. Springer, 539–562.
[5] Fernando de Almeida Freitas, Sarajane Marques Peres, Clodoaldo Aparecido de Moraes Lima, and Felipe Venâncio Barbosa. 2014. Grammatical facial expressions recognition with machine learning. In FLAIRS Conference.
[6] Mark A. Hall and Geoffrey Holmes. 2003. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Trans. on Knowl. and Data Eng. 15, 6 (Nov. 2003), 1437–1447. https://doi.org/10.1109/TKDE.2003.1245283
[7] Hope M. Hurlbut. 2008. Philippine signed languages survey: a rapid appraisal. SIL Electronic Survey Reports 10 (2008), 16.
[8] Rachael E. Jack, Oliver G. B. Garrod, and Philippe G. Schyns. 2014. Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Current Biology 24, 2 (2014), 187–192.
[9] Rey Alfred Lee. 2016. Facial expressions in Filipino Sign Language. (March 2016). Personal communication.
[10] Maria Elena Lozada. 2016. Facial expressions in Filipino Sign Language. (Feb. 2016). Personal communication.
[11] X. Mao and Y. Xue. 2011. Human Emotional Interaction.
[12] Liza Martinez and Ed Peter Cabalin. 2008. Sign language and computing in a developing country: a research roadmap for the next two decades in the Philippines. In PACLIC. 438–444.
[13] Stephen McCullough and Karen Emmorey. 2009. Categorical perception of affective and linguistic facial expressions. Cognition 110, 2 (Feb. 2009), 208–221. https://doi.org/10.1016/j.cognition.2008.11.007
[14] Dimitris N. Metaxas, Bo Liu, Fei Yang, Peng Yang, Nicholas Michael, and Carol Neidle. 2012. Recognition of nonmanual markers in American Sign Language (ASL) using non-parametric adaptive 2D-3D face tracking. In LREC. Citeseer, 2414–2420.
[15] Microsoft. 2016. Kinect hardware requirements and sensor setup. (2016).
[16] Cornelia Muller. 2014. Body - Language - Communication. Walter de Gruyter GmbH & Co KG.
[17] Tan Dat Nguyen and Surendra Ranganath. 2008. Tracking facial features under occlusions and recognizing facial expressions in sign language. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on. IEEE, 1–7.

[18] Tan Dat Nguyen and Surendra Ranganath. 2010. Recognizing Continuous Grammatical Marker Facial Gestures in Sign Language Video. In Computer Vision - ACCV 2010. Springer, Berlin, Heidelberg, 665–676. https://doi.org/10.1007/978-3-642-19282-1_53
[19] O. Ritthoff, R. Klinkenberg, S. Fisher, I. Mierswa, and S. Felske. 2001. YALE: Yet Another Learning Environment. Technical Report 763. University of Dortmund, Dortmund, Germany. 84–92 pages.
[20] Nikolai Smolyanskiy, Christian Huitema, Lin Liang, and Sean Eron Anderson. 2014. Real-time 3D face tracking based on active appearance model constrained by depth data. Image and Vision Computing 32, 11 (Nov. 2014), 860–869. https://doi.org/10.1016/j.imavis.2014.08.005
[21] Ulrich Von Agris, Moritz Knorr, and Karl-Friedrich Kraiss. 2008. The significance of facial features for automatic sign language recognition. In Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on. IEEE, 1–6.
[22] Zhengyou Zhang. 2012. Microsoft Kinect Sensor and Its Effect. IEEE MultiMedia 19 (April 2012).
