


FACIAL ANIMATION PARAMETERS EXTRACTION AND EXPRESSION RECOGNITION USING HIDDEN MARKOV MODELS

Montse Pardàs, Antonio Bonafonte

Universitat Politècnica de Catalunya, C/ Jordi Girona 1-3, D-5, 08034 Barcelona, Spain
email: {montse,antonio}@gps.tsc.upc.es

The video analysis system described in this paper aims at facial expression recognition consistent with the MPEG-4 standardized parameters for facial animation (FAPs). For this reason, two levels of analysis are necessary: low level analysis to extract the MPEG-4 compliant parameters and high level analysis to estimate the expression of the sequence using these low level parameters. The low level analysis is based on an improved active contour algorithm that uses high level information based on Principal Component Analysis to locate the most significant contours of the face (eyebrows and mouth), and on motion estimation to track them. The high level analysis takes as input the FAPs produced by the low level analysis tool and, by means of a Hidden Markov Model classifier, detects the expression of the sequence.

1. INTRODUCTION

The critical role that emotions play in rational decision-making, in perception and in human interaction has sparked interest in giving computers the ability to recognize and reproduce emotions.

This work has been supported by the European project InterFace and by project TIC2001-0996 of the Spanish Government.

In [26] many applications which could benefit from this ability are explored. The importance of introducing non-verbal communication in automatic dialogue systems is highlighted in [4] and [5]. However, the research that has been carried out on understanding human facial expressions cannot be directly applied to a dialogue system, since it has mainly dealt with static images or with image sequences where the subjects show a specific emotion but there is no speech. Psychological studies have indicated that at least six emotions are universally associated with distinct facial expressions: happiness, sadness, surprise, fear, anger and disgust [32]. Several other emotions and many combinations of emotions have been studied but remain unconfirmed as universally distinguishable. Thus, most of the research up to now has been oriented towards detecting these six basic expressions. Some research has been conducted on pictures that capture the subject's expression at its peak. These pictures allow detecting the presence of static cues (such as wrinkles) as well as the position and shape of the facial features [32]. For instance, in [23] and [7] classifiers for facial expressions were based on neural networks. Static faces were presented to the network as projections of blocks from feature regions onto the principal component space generated from the image data set. These systems obtain around 86% accuracy in distinguishing the six basic emotions. Research has also been conducted on the extraction of facial expressions from video sequences. Most works in this area develop a video database from subjects making expressions on demand. Their aim has been to classify the six basic expressions mentioned above, and they have not been tested in a dialogue environment. Most approaches in this area rely on the Ekman and Friesen Facial Action Coding System (FACS) [10]. The FACS is based on the enumeration of all the Action Units of a face that cause facial movements. The combination of these Action Units results in a large set of

possible facial expressions. In [31] the directions of the rigid and non-rigid motions caused by human facial expressions are identified by computing the optical flow at the points with high gradient in each frame. The authors then propose a dictionary to describe the facial actions of the FACS through the motion of the features, and a rule-based system to recognize facial expressions from these facial actions. A similar approach is presented in [3], where a local parameterized model of the image motion in specific facial areas is used for recovering and recognising the non-rigid and articulated motion of the faces. The parameters of this detected motion are related to the motion of facial features during facial expressions. In [28] a radial basis function network architecture is developed that learns the correlation between facial feature motion patterns and human emotions. The motion of the features is also obtained by computing the optical flow at the points with high gradient. The accuracy of all these systems is also around 85%. Other approaches employ physically-based models of heads including skin and musculature [11], [12]. They combine this physical model with registered optical flow measurements from human faces to estimate the muscle actuations. Two approaches are proposed: the first one creates typical patterns of muscle actuation for the different facial expressions, and a new image sequence is classified by its similarity to these typical patterns of muscle actuation. The second one builds on the same methodology to generate, from the muscle actuation, the typical pattern of motion energy associated with each facial expression. In [19] a system is presented that uses facial feature point tracking, dense flow tracking and high gradient component analysis in the spatio-temporal domain to extract the FACS Action Units, from which expressions can be derived. A complete review of these techniques can be found in [24].

In this work we have developed an expression recognition technique consistent with the MPEG-4 standardized parameters for facial definition and animation, FDPs and FAPs.

Thus, the expression recognition process can be divided into two steps: facial parameter extraction and facial parameter analysis, to which we also refer as low level and high level analysis. The facial parameter extraction process is based on a feature point detection and tracking system which uses active contours. Conventional active contour (snake) approaches [16] find the position of the snake by finding a minimum of its energy, composed of internal and external forces. The external forces pull the contours toward features such as lines and edges. However, in many applications this minimization leads to contours that do not correctly represent the feature we are looking for. We propose in this paper to introduce some higher level information through a statistical characterization of the snaxels that should represent the contour. From the automatically produced initialization, the MPEG-4 compliant FAPs are computed. These FAPs will be used for the training of the expression recognition system. These techniques are developed in Section 2. Using the MPEG-4 parameters for the analysis of facial emotions has several advantages. First, the developed high level analysis can benefit from already existing low level analysis techniques for FDP and FAP extraction, as well as from any advances made in this area in future years. Besides, the low-level FAPs constitute a concise representation of the evolution of the expression of the face. From the training database, and using the available FAPs, spatio-temporal patterns for expressions will be constructed. Our approach for the interpretation of facial expressions will use Hidden Markov Models (HMMs) to recognize different patterns of FAP evolution. For every defined facial expression, an HMM will be trained with the extracted feature vectors. In the recognition phase, the HMM system will use as classification criterion the probability of the input sequence with respect to all the models. Tests are also performed on sequences composed of different emotions and with

periods of speech, using connected HMMs. This process will be explained in Section 3. Finally, Section 4 will summarize the conclusions of this work.

2. LOW LEVEL ANALYSIS

This Section describes the low level video processing required for the extraction of Facial Animation Parameters (FAPs). The first step in this analysis is face detection. Different techniques can be used for this aim, such as [17] or [30], and we will not go into the details of this step. After the face is detected, facial features have to be located and tracked. Two different approaches are possible: contour based or model based. The first approach localizes the most important contours of the face and tracks them in 2D (that is, it tracks their projection in the image plane). The second approach consists in the adaptation, in each frame, of a 3D wireframe model of a face to the image. Examples of this second approach can be found in [1], [6], [8] and [9]. In this work we will use a technique belonging to the first class because, as will be explained later on, we will work in a restricted environment. Facial Animation Parameters (FAPs) will be computed from the tracked contours or from the adapted model. First, in section 2.1 we will review the meaning of these parameters. The following subsections are devoted to presenting a 2D technique for the extraction of these parameters, which uses active contours. Section 2.2 will review the general framework of active contours, section 2.3 will apply this technique to facial feature detection and section 2.4 to facial feature tracking.

2.1 FACIAL ANIMATION PARAMETERS

Facial Animation Parameters (FAPs) are defined in the ISO MPEG-4 standard [21], together with the Facial Definition Parameters (FDPs), to allow the definition of a facial

shape and its animation reproducing expressions, emotions, and speech pronunciation. FDPs, illustrated in Figure 1 (Figures 1 and 2 have been provided by Roberto Pockaj, from the University of Genova), represent key points in a human face. All the feature points can be used for the calibration of a model to a face, while only those represented by black dots in the figure are also used for the animation (that is, there are FAPs describing their motion). The FAPs are based on the study of minimal facial actions and are closely related to muscle actions. They represent a complete set of basic facial actions, such as squeezing or raising the eyebrows and opening or closing the eyelids, and therefore allow the representation of most natural expressions. All FAPs involving translational movement are expressed in terms of the Facial Animation Parameter Units (FAPUs). These units aim at allowing the interpretation of the FAPs on any facial model in a consistent way, producing reasonable results in terms of expression and speech pronunciation. The FAPUs correspond to fractions of distances between some key facial features; their definitions are given in Figure 2. We will be interested in those FAPs which a) convey important information about the emotion of the face and b) can be reliably extracted from a natural video sequence. In particular, we have focused on those FAPs related to the motion of the contours of the eyebrows and the mouth. These FAPs are specified in Table 1.

Figure 1. Facial Definition Parameters

FAPU values:
IRISD = IRISD0 / 1024
ES = ES0 / 1024
ENS = ENS0 / 1024
MNS = MNS0 / 1024
MW = MW0 / 1024
AU = 10^-5 rad

Figure 2. Facial Animation Parameter Units

FAP  FAP name               FAP description                                                                          Unit
31   raise_l_i_eyebrow      Vertical displacement of left inner eyebrow                                              ENS
32   raise_r_i_eyebrow      Vertical displacement of right inner eyebrow                                             ENS
33   raise_l_m_eyebrow      Vertical displacement of left middle eyebrow                                             ENS
34   raise_r_m_eyebrow      Vertical displacement of right middle eyebrow                                            ENS
35   raise_l_o_eyebrow      Vertical displacement of left outer eyebrow                                              ENS
36   raise_r_o_eyebrow      Vertical displacement of right outer eyebrow                                             ENS
37   squeeze_l_eyebrow      Horizontal displacement of left eyebrow                                                  ES
38   squeeze_r_eyebrow      Horizontal displacement of right eyebrow                                                 ES
51   lower_t_midlip_o       Vertical top middle outer lip displacement                                               MNS
52   raise_b_midlip_o       Vertical bottom middle outer lip displacement                                            MNS
53   stretch_l_cornerlip_o  Horizontal displacement of left outer lip corner                                         MW
54   stretch_r_cornerlip_o  Horizontal displacement of right outer lip corner                                        MW
55   lower_t_lip_lm_o       Vertical displacement of midpoint between left corner and middle of top outer lip        MNS
56   lower_t_lip_rm_o       Vertical displacement of midpoint between right corner and middle of top outer lip       MNS
57   raise_b_lip_lm_o       Vertical displacement of midpoint between left corner and middle of bottom outer lip     MNS
58   raise_b_lip_rm_o       Vertical displacement of midpoint between right corner and middle of bottom outer lip    MNS
59   raise_l_cornerlip_o    Vertical displacement of left outer lip corner                                           MNS
60   raise_r_cornerlip_o    Vertical displacement of right outer lip corner                                          MNS

Table 1. FAPs used for the eyebrows and the mouth

To extract these FAPs in frontal faces, tracking the 2D contours of the eyebrows and mouth is sufficient. If the sequences present rotation and translation of the head, then a system based on 3D face model tracking would be more robust. In our case the FAPs are used to train the HMM classifier. For this reason, it is better to constrain the kind of sequences that are going to be analyzed. We will use frontal faces and a supervised tracking procedure, in order not to introduce any errors in the training of the system. Once the system has been trained, any FAP extraction tool can be used. If the results it produces are accurate, the high level analysis will have a higher probability of success. The low level analysis system that we propose for training can also be used for testing with other sequences, as long as the faces are frontal or a global motion estimation of the head is previously applied.

The detection and tracking algorithm that will be explained in the next section has been integrated in a Graphical User Interface (GUI) that supports corrections to the results produced by the automatic algorithm. From the detection of facial features, it computes the FAP units and then, from the results of the automatic tracking, it writes the MPEG-4 compliant FAP files. These FAP files will be used for the training of the expression recognition system. The algorithm is applied to the tracking of the eyebrows and the mouth, thus producing the MPEG-4 FAPs 31-38 for the eyebrows and 51-60 for the mouth, described in Table 1.
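As a rough illustration of this last step, the sketch below derives a translational FAP as the displacement of a tracked feature point from its neutral-frame position, normalized by the FAPUs of Figure 2. The point names, helper functions and sign handling are our own assumptions, not MPEG-4 identifiers.

```python
import numpy as np

# Sketch (our own point names, helpers and sign handling; not MPEG-4 identifiers) of
# how a translational FAP can be obtained: the displacement of a tracked feature point
# from its neutral-frame position, expressed in the FAPUs of Figure 2.

def compute_fapus(neutral):
    """FAPUs from neutral-frame measurements (pixel coordinates / distances)."""
    es0  = np.linalg.norm(neutral["l_eye"] - neutral["r_eye"])        # eye separation
    ens0 = abs(neutral["eye_line_y"] - neutral["nose_y"])             # eye-nose distance
    mns0 = abs(neutral["nose_y"] - neutral["mouth_y"])                # mouth-nose distance
    mw0  = np.linalg.norm(neutral["l_corner"] - neutral["r_corner"])  # mouth width
    return {"ES": es0 / 1024.0, "ENS": ens0 / 1024.0,
            "MNS": mns0 / 1024.0, "MW": mw0 / 1024.0}

def translational_fap(tracked_pt, neutral_pt, fapus, axis, unit):
    """FAP value: displacement along `axis` (0 = x, 1 = y) expressed in FAPU `unit`."""
    return (tracked_pt[axis] - neutral_pt[axis]) / fapus[unit]

# Toy example for FAP 31 (raise_l_i_eyebrow): vertical motion of the left inner
# eyebrow point in ENS units (Table 1).  With image coordinates (y grows downwards)
# an upward motion gives a negative value here; real FAP files follow the MPEG-4 signs.
neutral = {"l_eye": np.array([70.0, 120.0]), "r_eye": np.array([130.0, 120.0]),
           "eye_line_y": 120.0, "nose_y": 170.0, "mouth_y": 220.0,
           "l_corner": np.array([80.0, 220.0]), "r_corner": np.array([120.0, 220.0])}
fapus = compute_fapus(neutral)
brow_neutral, brow_tracked = np.array([80.0, 100.0]), np.array([80.0, 95.0])
print(round(translational_fap(brow_tracked, brow_neutral, fapus, axis=1, unit="ENS"), 1))
```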

2.2 ACTIVE CONTOURS

2.2.1 Introduction

Active contours were first introduced by Kass et al. [16]. They proposed energy minimization as a framework where low-level information (such as image gradient or image intensity) can be combined with higher-level information (such as shape, continuity of the contour or user interactivity). In their original work the energy minimization problem was solved using a variational technique. In [2] Amini et al. proposed Dynamic Programming (DP) as a different solution to the minimization problem. The use of the Dynamic Programming approach will allow us to introduce the concept of candidates for the control points of the contour, thus avoiding the risk of falling into local minima located near the initial position. In [25] we proposed a first approach for tracking facial features, based on dynamic programming, introducing a new term in the energy of the snake and selecting the candidate pixels for the contour (snaxels) using motion estimation. Currently the DP

approach has been extended to be able to automatically initialize the facial contours. In this case, the candidate snaxels are selected by a statistical characterization of the contour based on Principal Component Analysis (PCA). In this subsection we will review the basic formulation of active contours, while the next subsections explain how we apply them to facial feature detection (2.3) and facial feature tracking (2.4).

2.2.2 Active contours formulation

In the discrete formulation of active contour models a contour is represented as a set of snaxels v_i = (x_i, y_i) for i = 0, ..., N-1, where x_i and y_i are the x and y coordinates of snaxel v_i. The energy of the contour, which is going to be minimized, is defined by:

E_{snake}(v) = \sum_{i=0}^{N-1} \left[ E_{int}(v_i) + E_{ext}(v_i) \right]    (1)

We can use a discrete approximation of the second derivative to compute E_int.

E_{int}(v_i) = \| v_{i-1} - 2 v_i + v_{i+1} \|    (2)

This is an approximation of the curvature of the contour at snaxel v_i, provided the snaxels are equidistant. Minimizing this energy will produce smooth curves. This is only appropriate for snaxels that are not corners of the contour. More appropriate definitions for the internal energy will be proposed in Section 2.3.2 for the initialisation of the snake and in Section 2.4.2 for tracking. The purpose of the term E_ext is to attract the snake to desired feature points or contours in the image. In this work we have used the gradient of the image intensity, ∇I(x,y), along the contour from v_i to v_{i+1}. Thus, E_ext at snaxel v_i will depend only on the position of the snaxels v_i and v_{i+1}. That is,

E_{ext}(v_i) = E_{cont}(v_i, v_{i+1}) = f(\nabla I, v_i, v_{i+1})    (3)

However, a new term will also be added to the external energy, in order to be able to track contours that have stronger edges nearby.

2.2.3 Dynamic programming

We will use the DP approach to minimize the energy in Eq. (1). Let us express the energy of the snake making explicit the dependencies of its terms:

E_{snake}(v) = \sum_{i=0}^{N-1} \left[ E_{int}(v_{i-1}, v_i, v_{i+1}) + E_{ext}(v_i, v_{i+1}) \right] = \sum_{i=0}^{N-1} E_i(v_{i-1}, v_i, v_{i+1})    (4)

Although snakes can be open or closed, the DP approach can be applied directly only to open snakes. To apply DP to open snakes, the limits of Eq. (4) are adjusted to 1 and N-2, respectively. Now, as described in [2], this energy can be minimized via discrete DP by defining a two-element vector of state variables in the i-th decision stage: (v_{i+1}, v_i). The optimal value function is a function of two adjacent points on the contour, S_i(v_{i+1}, v_i), and can be calculated, for every pair of possible candidate positions for snaxels v_{i+1} and v_i, as:

S_i(v_{i+1}, v_i) = \min_{v_{i-1}} \left[ S_{i-1}(v_i, v_{i-1}) + E_i(v_{i-1}, v_i, v_{i+1}) \right]    (5)

S_0(v_1, v_0) is initialized to E_ext(v_0, v_1) for every possible candidate pair (v_0, v_1), and from this S_i can be computed iteratively from i = 1 up to i = N-2 for every candidate position of v_i. The total energy of the snake will be

E_{snake}(v) = \min_{v_{N-1}, v_{N-2}} S_{N-2}(v_{N-1}, v_{N-2})    (6)

Besides, at every step i we have to store a matrix M_i that keeps the position of v_{i-1} that minimizes Eq. (5), that is,

M_i(v_{i+1}, v_i) = v_{i-1} such that v_{i-1} minimizes Eq. (5).

By backtracking from the final energy of the snake and using the matrices M_i, the optimal position for every snaxel can be found. In the case of a closed contour, the solution proposed in [13] is to impose that the first and last snaxels are the same and to fix them to a given candidate for this position. The application of the DP algorithm will produce the best result under this restriction. Then this initial (and final) snaxel is successively changed to all the possible candidates, and the one that produces the smallest energy is selected. We use an approximation proposed in [14] that requires only two open contour optimisation steps.

2.2.4 Selection of candidates

Up to now, we have assumed that for every snaxel v_i there is a finite (and hopefully small) number of candidates, but we have omitted how to select these candidates. The computational complexity of each optimisation step is O(nm^3), where n is the number of snaxels and m the number of candidates for every snaxel. Thus, it is very important to keep m low. In [2] only a small neighbourhood around the previous position of the snaxel was considered. However, the algorithm was iteratively applied, starting from the obtained solution, until there was no change in the total energy of the snake. This method has several disadvantages. First, as in the approaches which use variational techniques for the minimization, the snake can fall into a local minimum. Second, the computational time can be very high if the initialisation is far from the minimum. In [13] and [14] a different set of candidates is considered for every snaxel. In particular, [13] establishes uncertainty lists for the high curvature points and defines a

search space between these uncertainty lists. In [14] the search zone is defined with two initial concentric contours. Each contour point is constrained to lie on a line joining these two initial contours. This approach gives very good results if the two concentric contours that contain the expected contour are available and the contour being tracked is the absolute minimum in this area. However, these concentric contours are not always available. In the next sections we will describe how we can select these candidates and how the snake energy is defined for facial feature point detection and tracking, respectively.
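The DP recursion of Eqs. (4)-(6), together with the backtracking matrices M_i, can be sketched as follows for an open snake. This is only an illustrative implementation under our own naming; the per-stage energy and the candidate sets are left as caller-supplied inputs, since they are defined in Sections 2.3 and 2.4.

```python
import numpy as np

# Illustrative open-snake DP (Eqs. (4)-(6)) under our own naming: candidates[i] holds
# the candidate positions of snaxel v_i, stage_energy plays the role of
# E_i(v_{i-1}, v_i, v_{i+1}), and the matrices in `back` correspond to M_i.

def dp_open_snake(candidates, stage_energy):
    """Return the optimal snaxel positions and the minimum snake energy."""
    N = len(candidates)
    # S[i] has shape (m_{i+1}, m_i): the optimal value function S_i(v_{i+1}, v_i).
    # S_0(v_1, v_0) is taken as zero here; external terms enter through stage_energy.
    S = {0: np.zeros((len(candidates[1]), len(candidates[0])))}
    back = {}
    for i in range(1, N - 1):
        m_cur, m_next = len(candidates[i]), len(candidates[i + 1])
        S[i] = np.empty((m_next, m_cur))
        back[i] = np.empty((m_next, m_cur), dtype=int)
        for a, v_next in enumerate(candidates[i + 1]):
            for b, v in enumerate(candidates[i]):
                costs = [S[i - 1][b, c] + stage_energy(v_prev, v, v_next, i)
                         for c, v_prev in enumerate(candidates[i - 1])]
                back[i][a, b] = int(np.argmin(costs))
                S[i][a, b] = costs[back[i][a, b]]
    # Eq. (6): total energy, then backtracking through the stored matrices.
    a, b = np.unravel_index(np.argmin(S[N - 2]), S[N - 2].shape)
    path = [int(a), int(b)]                       # indices of v_{N-1} and v_{N-2}
    for i in range(N - 2, 0, -1):
        path.append(int(back[i][path[-2], path[-1]]))
    path.reverse()                                # indices of v_0 ... v_{N-1}
    return np.array([candidates[i][path[i]] for i in range(N)]), float(S[N - 2].min())

# Toy usage: 5 snaxels, 3 candidates each, smoothness-only stage energy (Eq. (2)).
rng = np.random.default_rng(0)
cands = [rng.normal(loc=[10.0 * i, 0.0], scale=1.0, size=(3, 2)) for i in range(5)]
snaxels, energy = dp_open_snake(cands, lambda p, v, n, i: np.linalg.norm(p - 2 * v + n))
print(snaxels.shape, round(energy, 3))
```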

2.3 FACIAL FEATURE POINT DETECTION

2.3.1 Selection of candidates

We propose a new method that in a first step needs to fix the topology of the snakes. In our case, we are using a snake with 16 equally spaced snaxels for the mouth and a snake with 8 equally spaced snaxels for the eyebrows. To select the best candidates for each of these snaxels we compute what we call the v_i-eigen-snaxels, by extracting samples of them from a database. That is, after resizing the faces from our database to the same size (200x350), we extract for each snaxel v_i the 16x16 area around the snaxel in every image of the database. After an illumination normalization using a histogram specification for every snaxel, the extracted sub-images are used to form the training set of vectors for the snaxel v_i, and from them the eigenvectors (eigen-snaxels) are computed by classical PCA techniques. The first step for initialising the snakes in a new image is to roughly locate the face. Different techniques can be used for this aim [17], [30]. After size normalization of the face area, a large area around the rough position of every snaxel v_i is examined by

computing the reconstruction error with the corresponding v_i-eigen-snaxel. Those pixels leading to the smallest reconstruction error are considered as candidates for the snaxel v_i.

Principal Component Analysis (PCA)

The principle of PCA is to construct a low dimensional space with decorrelated features which preserves the majority of the variation in the training set [20]. PCA has been used in the past to detect faces (by means of eigen-faces) or facial features (by means of eigen-features). In this paper, we extend its use to detect the snake control points for specific facial features (by using eigen-snaxels). For each snaxel v_i, the vectors {x_t} are constructed by lexicographic ordering of the pixel elements of each sub-image from the training set. A partial KLT is performed on these vectors to identify the largest-eigenvalue eigenvectors.

Distance measure

To evaluate how suitable a given pixel is as a given snaxel, we first construct the vector x using the corresponding sub-image. Then we obtain a principal component feature vector y = \Phi_M^T \tilde{x}, where \tilde{x} = x - \bar{x} is the mean-normalized image vector, \Phi is the eigenvector matrix for snaxel v_i and \Phi_M is a sub-matrix of \Phi that contains the M principal eigenvectors. The reconstruction error is calculated as:

\epsilon^2(x) = \sum_{i=M+1}^{N} y_i^2 = \| \tilde{x} \|^2 - \sum_{i=1}^{M} y_i^2    (7)

We have used M = 5 in our experiments. The pixels producing the smallest reconstruction error are selected as candidate locations for the snaxel v_i.

2.3.2 Snake energy

External energy

A new term is added to the external energy function of Eq. (3) to take into account the result obtained in the PCA process. The energy will be lower in those positions where the reconstruction error is lower.

E_{ext}(v_i) = \lambda \, E_{cont}(v_i, v_{i+1}) + (1 - \lambda) \, \epsilon^2(v_i)    (8)

In our experiments the value of the weighting constant λ has been set to 0.5.

Internal energy

As mentioned, the energy term defined in Eq. (2) is only appropriate for snaxels that are not corners of the contour. In the case of corners the energy has to be low when the second derivative is high. For the energy of the snaxels belonging to the eyebrows and the mouth we use:

E_{int}(v_i) = \alpha_i \, \| v_{i-1} - 2 v_i + v_{i+1} \| + (1 - \alpha_i) \left( B - \| v_{i-1} - 2 v_i + v_{i+1} \| \right)    (9)

where α_i is set to 1 if v_i is not a corner point and to 0 if it is (note that we work with a fixed topology for the snakes), and B represents the maximum value that the approximation of the second derivative can take.
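A compact sketch of this candidate selection and of the detection energies of Eqs. (7)-(9) is given below. It assumes 16x16 patches and M = 5 as in the text; the function names and the symbols chosen for the weighting constants (lam for the external weight, alpha_i and B for the internal one) are our own.

```python
import numpy as np

# Sketch (our own naming) of the eigen-snaxel candidate score of Eq. (7) and of the
# detection energies of Eqs. (8)-(9), assuming 16x16 patches and M = 5 as in the text.

PATCH, M = 16, 5

def train_eigen_snaxel(patches):
    """patches: (num_samples, PATCH*PATCH) training vectors for one snaxel v_i.
    Returns the mean patch and the M principal eigenvectors (partial KLT via SVD)."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:M].T

def reconstruction_error(patch, mean, phi_m):
    """Eq. (7): energy of the mean-normalized patch not captured by the M components."""
    x_tilde = patch - mean
    y = phi_m.T @ x_tilde
    return float(x_tilde @ x_tilde - y @ y)

def external_energy(e_cont, eps2, lam=0.5):
    """Eq. (8); the weighting constant (written lam here) is set to 0.5."""
    return lam * e_cont + (1.0 - lam) * eps2

def internal_energy(v_prev, v, v_next, alpha_i, B):
    """Eq. (9): smoothness for non-corner snaxels, inverted behaviour at corners."""
    d2 = np.linalg.norm(v_prev - 2 * v + v_next)
    return alpha_i * d2 + (1.0 - alpha_i) * (B - d2)

# Toy usage with random training patches and one test patch.
rng = np.random.default_rng(1)
mean, phi_m = train_eigen_snaxel(rng.normal(size=(100, PATCH * PATCH)))
eps2 = reconstruction_error(rng.normal(size=PATCH * PATCH), mean, phi_m)
print(round(external_energy(1.0, eps2), 2))
```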

2.4 FEATURE POINT TRACKING

2.4.1 Selection of candidates

In the initialisation of the snake, the candidates for every snaxel were selected on the basis of a statistical characterization of the texture around them. However, in the case of

tracking, we have a better knowledge of this texture if we use the previous frame. We propose to find the candidates for every snaxel in the tracking process using motion estimation to select the search space for every snaxel. A small region around every snaxel is selected as the basis for the motion estimation. The shape of this region is rectangular and its size is application dependent. However, the region should be small enough so that its motion can be approximated by a translational motion. The motion compensation error (MCE) for all the possible displacements (dx, dy) of the block in a given range is computed as:
MCE_{v_i^0}(dx, dy) = \sum_{i=-R_x}^{R_x} \sum_{j=-R_y}^{R_y} \left| I_{t-1}(x_0 + i, y_0 + j) - I_t(x_0 + i + dx, y_0 + j + dy) \right|    (10)

where (x_0, y_0) are the x and y coordinates of the snaxel v_i in the previous frame, which we have called v_i^0. The region under consideration is centred at the snaxel and has size 2R_x in the horizontal dimension and 2R_y in the vertical dimension. The range of (dx, dy) determines the maximum displacement that a snaxel can undergo. The M positions with the minimum MCE_{v_i^0}(dx, dy) are selected as possible new locations for snaxel v_i.

2.4.2 Snake energy

External energy

For the external energy in the tracking procedure we use the same principle as for the initialisation. That is, it will be composed, as in Eq. (8), of two terms. The first one is the gradient along the contour, but the second one is slightly different, as in this case we can use the texture provided by the previous frame instead of the statistical characterization. Thus, the second term corresponds to the compensation error obtained in the motion estimation. In this way preference is given to those positions with the

smallest compensation error. That is, the energy will be lower in those positions whose texture is most similar to the texture around the position of the corresponding snaxel in the previous frame. Therefore, the expression for the external energy will be:

E_{ext}(v_i) = \lambda \, E_{cont}(v_{i-1}, v_i) + (1 - \lambda) \, MCE_{v_i^0}(v_i)    (11)

The constant λ is also set to 0.5.

Internal energy

The internal energy we use to track feature points has the same formulation as in Section 2.3.2. As before, we assume here that we know the geometry of the different face features, so that α_i in Eq. (9) can be correctly set to 1 or 0. To solve problems of snaxel grouping we also add, as in [18], another term to the internal energy that forces snaxels to preserve the distance between them from one frame to the next one:
E_{dist,i} = \left| u_i^{t+1} - u_i^t \right| + \left| u_{i+1}^{t+1} - u_{i+1}^t \right|    (12)

where u_i^t = \| v_i - v_{i-1} \| in frame t.
So when the distance is altered the energy increases proportionally.
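A brute-force sketch of this candidate selection is shown below; the block size, search range and number of candidates are illustrative values, and the naming is our own. The sketch assumes the block and the search range stay inside the image.

```python
import numpy as np

# Sketch (our own naming, illustrative sizes) of the MCE-based candidate selection of
# Eq. (10): the block around the previous snaxel position v_i^0 is compared against the
# current frame for every displacement in the search range, and the displacements with
# the smallest error become the candidate positions.

def mce_candidates(prev_frame, cur_frame, x0, y0, rx=4, ry=4, search=6, n_cand=5):
    """Return the n_cand displacements (dx, dy) with the smallest MCE."""
    scores = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            err = 0.0
            for j in range(-ry, ry + 1):
                for i in range(-rx, rx + 1):
                    err += abs(float(prev_frame[y0 + j, x0 + i]) -
                               float(cur_frame[y0 + j + dy, x0 + i + dx]))
            scores.append((err, (dx, dy)))
    scores.sort(key=lambda s: s[0])
    return [disp for _, disp in scores[:n_cand]]

# Toy usage: a textured frame shifted 2 pixels to the right between t-1 and t.
rng = np.random.default_rng(2)
prev = rng.uniform(0, 255, size=(40, 40))
cur = np.roll(prev, shift=2, axis=1)
print(mce_candidates(prev, cur, x0=20, y0=20)[0])   # best displacement: (2, 0)
```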

2.5 RESULTS

In Figures 3 and 4 we show some examples of automatic initialisation and tracking. To perform the tests, the algorithms have been integrated in a Graphical User Interface (GUI) that allows manual correction of the snaxel positions. In the tests performed, the

contours obtained with the automatic initialisation always correspond to the facial features, provided the face has been located accurately. However, as shown in Figure 4, sometimes some points have to be manually modified if more accuracy is required. The tracking has been performed on 276 sequences from the Cohn-Kanade facial expression database [15]. Of these sequences, 180 needed no manual correction, 48 needed corrections of 1, 2 or 3 points along the sequence, and 40 sequences had a major problem in at least one frame for the tracking of one feature. From the initialization, the Facial Animation Parameter Units (FAPUs) are computed. This means that the first frame needs to present a neutral expression. Then, from the output of the tracking, the FAPs for the sequence are extracted. These FAPs are going to be used in the second part of the system for the expression recognition.

Figure 3. First row: Initialization examples. Second and third row: tracking examples.

Figure 4. Automatic initialisation examples with minor errors

3. HIGH LEVEL ANALYSIS

The second part of the system is based on the modeling of the expressions by means of Hidden Markov Models. The observations to model are the MPEG-4 standardized Facial Animation Parameters (FAPs) computed in the previous step. The FAPs of a video sequence are first extracted and then analyzed using semi-continuous HMMs. The Cohn-Kanade facial expression database [15] has been selected as the basis for the training of the models and the evaluation of the system. The whole database has been processed in order to extract a subset of the Low Level FAPs and perform the expression recognition experiments. Different experiments have been carried out to

determine the best topology of the HMM to recognize expressions from the Low Level FAPs. HMMs are one of the basic probabilistic tools used for time series modelling. They are by far the most widely used model in speech recognition, and they are beginning to be used in vision, mainly because they can be learned from data and they implicitly handle time-varying signals by dynamic time warping. They have already been successfully used to classify the mouth shape in video sequences [22] and to combine different information sources (optical flow, point tracking and furrow detection) for expression intensity estimation [19]. HMMs were also used in [29] for expression recognition; in that work a spatio-temporally dependent shape/texture eigenspace is learned in an HMM-like structure. We will extend the use of HMMs, creating the feature vectors from the available low-level FAPs. In the following sections we review the basic concepts of HMMs (3.1), define the topology of the models that we will use (3.2) and explain how we estimate the parameters of these models (3.3). This first approach tries to recognize isolated expressions. That is, we take a sequence which contains the transition from a neutral face to a given expression and we have to decide which expression it is. All the sequences from the Cohn-Kanade facial expression database belong to this type. We will evaluate the system using this approach in section 3.4. In section 3.5 we will introduce a new model which will allow us to use the system in more complex environments; it will model the parts of the sequences where the person is talking. Finally, in section 3.6 a more complex experiment is presented, where the system classifies the different parts of the sequences using a one-stage dynamic programming algorithm.

3.1 HIDDEN MARKOV MODELS DEFINITION

HMMs model temporal series assuming a (hidden) temporal structure. For each emotion an HMM can be estimated from training examples. Once these models are trained we will be able to compute, for a given sequence of observations (the FAPs of the video sequence in our case), the probability that this sequence was produced by each of the models. Thus, the sequence of observations will be assigned to the expression whose HMM gives the highest probability of generating it. An HMM [27] is defined by a number N of states connected by transitions. The outputs of these states are the observations, which belong to a set of symbols. The time variation is associated with transitions between states. To completely define the models, the following parameters are used:

- The set of states S = (s_1, s_2, ..., s_N).
- The set of symbols that can be observed from the states, (O_1, ..., O_M); in our case, the FAPs from each frame.
- The probability of transition from state i to state j, defined by the transition matrix A, where a_ij = p(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N.
- The probability distribution function of the observations in state j, B, where b_j(O_i) = P(O_i | q = s_j), 1 ≤ j ≤ N.
- The initial state probabilities π_i = p(q_1 = s_i), 1 ≤ i ≤ N.

To train an HMM for an emotion (that is, to adjust its parameters), we will use as training sequences all those sequences that we select as representative of the given emotion. In our case, we use all the sequences from the Cohn-Kanade facial expression database manually classified as belonging to that emotion.

3.2 SELECTION OF THE HMM TOPOLOGY

The Baum-Welch algorithm [27] can estimate the values of A, B and π. However, better results can be obtained if the topology of the Markov chain is restricted using a priori knowledge. In our case, each considered emotion (sadness, anger, fear, joy, disgust and surprise) reflects a temporal structure: let us name it the start, middle and end of the emotion. This structure can be modeled using left-to-right HMMs. This topology is appropriate for signals whose properties change over time in a successive manner. It implies a_ij = 0 if i > j, and π_i = 1 for i = 1, π_i = 0 for i ≠ 1. As time increases, the state of each sequence either remains the same or advances to the next one. In our case, we have defined for each emotion a three-state HMM. To select this topology, different configurations have been tested. The experiments show that, using HMMs with only 2 states, the recognition rate is 4% lower. Using more than 3 states increases the complexity of the model without producing any improvement in the recognition results. The selected topology is represented in the following figure.

Figure 5. Topology of Hidden Markov Models
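To make the topology and the classification criterion concrete, the sketch below (our own, with arbitrary initial transition values and made-up observation log-probabilities) builds a 3-state left-to-right model per emotion and scores an observation sequence with the standard forward algorithm; the sequence is assigned to the model with the highest log-likelihood.

```python
import numpy as np

# Sketch (our own naming, arbitrary initial values) of the 3-state left-to-right
# topology of Figure 5 and of the classification criterion: the observation sequence
# is assigned to the emotion whose model gives the highest forward log-likelihood.

N_STATES = 3

def left_to_right_init():
    """pi and A respecting a_ij = 0 for i > j and pi_1 = 1: each state can only
    loop on itself or advance to the next state."""
    pi = np.array([1.0, 0.0, 0.0])
    A = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
    return pi, A

def forward_loglik(log_b, pi, A):
    """log P(O | model), with log_b of shape (T, N_STATES) holding the per-frame,
    per-state log observation probabilities log b_j(O_t)."""
    log_alpha = np.log(pi + 1e-300) + log_b[0]
    for t in range(1, log_b.shape[0]):
        m = log_alpha.max()                                  # log-sum-exp over states
        log_alpha = np.log(np.exp(log_alpha - m) @ A + 1e-300) + m + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

def classify(log_b_per_model, models):
    """Pick the emotion whose HMM maximizes the sequence log-likelihood."""
    scores = {name: forward_loglik(log_b_per_model[name], *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Toy usage with made-up observation log-probabilities for a 10-frame sequence.
rng = np.random.default_rng(3)
models = {e: left_to_right_init() for e in ["joy", "surprise", "anger"]}
log_b = {e: rng.normal(loc=-5.0, size=(10, N_STATES)) for e in models}
print(classify(log_b, models)[0])
```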

3.3 ESTIMATION OF THE OBSERVATION PROBABILITIES

Each state of each emotion models the observed FAPs using a probability function b_j(O), which can be either discrete or continuous. In the first case, each set of FAPs for a given image has to be discretized using vector quantization. In the continuous case, a parametric pdf is defined for each state. Typically, multivariate Gaussian or Laplacian pdf's are used. In the case of sparse data, as in our case, better results can be achieved if all the states of all the emotion models share the Gaussian mixtures (means and variances). Then, the estimation of the pdf for each state is reduced to the estimation of the contribution of each mixture to the pdf. Each set of FAPs has been divided into two subsets corresponding to the eyebrows and the mouth, following the MPEG-4 classification. At each frame, the probability of the whole set of FAPs is computed as the product of the probabilities of each subset (independence assumption). For each of these subsets, a set of 32 Gaussian mixtures has been estimated (means and variances) by applying a clustering algorithm to the FAPs of the training database. In order to select the number of Gaussian mixtures, different experiments have been performed. Using fewer than 16 mixtures produces a recognition rate more than 10% lower than that obtained with 16 mixtures. The use of 32 mixtures produces a slightly better recognition rate than 16, while more than 32 do not produce any improvement.
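The following sketch (our own naming and toy dimensions) illustrates how such a semi-continuous observation probability might be evaluated: a codebook of Gaussians shared by all states, state-specific mixture weights, and the log-probabilities of the eyebrow and mouth subsets added under the independence assumption.

```python
import numpy as np

# Sketch (our own naming, toy dimensions) of the semi-continuous observation model:
# one codebook of diagonal-covariance Gaussians per FAP subset, shared by all states of
# all emotion models, with only the mixture weights being state specific; the eyebrow
# and mouth log-probabilities are added under the independence assumption.

def log_gaussians(x, means, variances):
    """log N(x; mean_k, diag(variance_k)) for every codebook Gaussian k."""
    diff2 = (x - means) ** 2 / variances
    return -0.5 * (diff2.sum(axis=1) + np.log(2 * np.pi * variances).sum(axis=1))

def state_log_b(x, means, variances, weights):
    """log b_j(x) = log sum_k w_jk N(x; mu_k, sigma_k), with the Gaussians shared."""
    lg = log_gaussians(x, means, variances) + np.log(weights + 1e-300)
    m = lg.max()
    return m + np.log(np.exp(lg - m).sum())

def frame_log_b(fap_brow, fap_mouth, codebooks, state_weights):
    """Log observation probability of one state for one frame."""
    return (state_log_b(fap_brow, *codebooks["brow"], state_weights["brow"]) +
            state_log_b(fap_mouth, *codebooks["mouth"], state_weights["mouth"]))

# Toy usage: 32 shared Gaussians per subset, 8 eyebrow FAPs and 10 mouth FAPs.
rng = np.random.default_rng(4)
codebooks = {"brow":  (rng.normal(size=(32, 8)),  rng.uniform(0.5, 2.0, size=(32, 8))),
             "mouth": (rng.normal(size=(32, 10)), rng.uniform(0.5, 2.0, size=(32, 10)))}
weights = {"brow": np.full(32, 1 / 32), "mouth": np.full(32, 1 / 32)}  # one state
print(round(frame_log_b(rng.normal(size=8), rng.normal(size=10), codebooks, weights), 2))
```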

3.4 EVALUATION OF THE SYSTEM

To evaluate the system we use the Cohn-Kanade facial expression database, which consists of recordings from 90 subjects, each one with several basic expressions. The recording was done with a standard VHS video camera at a rate of 30 frames per second, with constant illumination, and only full-face frontal views were captured. Although the subjects were not previously trained in displaying facial expressions, they practiced the expressions with an expert prior to video recording. Each posed expression

begins from neutral and ends at peak expression. The number of sequences for each expression is the following:
             F    Sa   Su   J    A    D
# Seq        33   52   60   61   33   37

Table 2. Number of available sequences for each expression (F: Fear, Sa: Sad, Su: Surprise, J: Joy, A: Anger, D: Disgust).

The FAPs of these sequences were first extracted using the FAP extraction tool described in Section 2. The system trains, for every defined facial expression (sadness, anger, fear, joy, disgust and surprise), an HMM with the extracted FAPs. In the recognition phase, the HMM system computes the probability of the input sequence with respect to all the models and assigns to the sequence the expression whose model gives the highest probability. The system has been evaluated by first training the six models with all the subjects except one, and then testing the recognition with the subject that did not participate in the training. This process has been repeated for all the subjects, obtaining the following recognition results:
       F    Sa   Su   J    A    D    %Cor
F      26   3    0    3    0    1    78.7%
Sa     8    31   2    2    6    3    59.6%
Su     0    0    60   0    0    0    100%
J      4    0    0    57   0    0    93.4%
A      0    2    0    1    24   6    72.7%
D      0    0    0    0    1    36   97.3%
Table 3. Number of sequences correctly detected and number of confusions. The overall recognition rate obtained is 84%.

These results are consistent with our expectations: surprise, joy and disgust have the highest recognition rates, because they involve clear motion of the mouth or the eyebrows, while sadness, anger and fear have a higher confusability rate. Experiments have been

repeated involving only three emotions, obtaining a 98% recognition rate when joy, surprise and anger are involved, and 95% with joy, surprise and sadness. We have also studied which of the facial features conveys more information about the facial expressions, by obtaining the recognition results using only the FAPs related to the motion of the eyebrows, or only the FAPs related to the motion of the mouth. The overall recognition rate in the first case is 50%, while in the second it is 78%. This result shows that the mouth conveys most of the information, although the information of the eyebrows helps to improve the results. Although a comparison is difficult due to the usage of different databases, the recognition rates obtained using all the extracted information are of the same order as those obtained by other systems which use dense motion fields combined with physically-based models, showing that the MPEG-4 FAPs convey the necessary information for the extraction of these emotions.
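The leave-one-subject-out protocol used in these experiments might be organized as in the outline below; train_hmm and classify stand for the training and scoring steps of Sections 3.1-3.3 and are assumed helpers, not part of the description above.

```python
# Outline (our own sketch; train_hmm and classify are assumed helpers standing for the
# training and scoring steps of Sections 3.1-3.3) of the leave-one-subject-out protocol:
# for every subject, the six emotion HMMs are trained on all the other subjects and
# tested on the sequences of the held-out subject.

def leave_one_subject_out(sequences, emotions, train_hmm, classify):
    """sequences: list of (subject_id, emotion_label, fap_observation_matrix)."""
    correct = total = 0
    subjects = sorted({subj for subj, _, _ in sequences})
    for held_out in subjects:
        train = [s for s in sequences if s[0] != held_out]
        test = [s for s in sequences if s[0] == held_out]
        models = {e: train_hmm([obs for _, lab, obs in train if lab == e])
                  for e in emotions}
        for _, label, obs in test:
            correct += int(classify(obs, models) == label)
            total += 1
    return correct / total      # overall recognition rate
```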

3.5 INTRODUCTION OF THE TALK MODEL

An important objective of this work is to use sequences that represent a conversation with an agent, in order to consider a real situation where the system could be applied. The system developed should be able to classify expressions in silence frames as well as to detect important clues in speech frames. A first step has been to work on silence frames. However, we also want to separate the motion due to the expressions from the motion due to speech. One possibility is to combine audio analysis with video analysis. The other possibility, which we have developed, consists in the generation of a new model, in addition to those of the six emotions, to classify the speech sequences. In order to train this additional model, which we have designated as talk, we have selected 28 sequences that contain speech from the available data sequences. A few more

sequences corresponding to different expressions have also been added to our dataset. These sequences have been processed with the FAP extraction tool, in the same way as the emotion sequences had been previously analysed. Then, the recognition tests were repeated including these sequences. The performance of the recognition algorithm in this case is shown in the next table. The overall recognition rate is 81%, showing the possibility of distinguishing talk from emotion using this methodology.
       F    Sa   Su   J    A    D    T    %Cor
F      27   2    1    3    0    1    0    79.4%
Sa     6    29   3    2    7    1    5    54.7%
Su     0    0    64   0    0    0    0    100%
J      4    0    1    57   0    0    0    91.9%
A      0    1    0    2    22   7    2    64.7%
D      1    0    0    0    1    34   0    91.9%
T      0    3    0    0    4    0    21   75.0%
Table 4. Number of sequences correctly detected and number of confusions when adding the talk (T) sequences. The overall recognition rate obtained is 81%.

3.6 CONNECTED SEQUENCES RECOGNITION

As discussed before, the first experiments performed were oriented towards the extraction of an expression from the input sequence. However, in real situations we will have a continuous video sequence where the person is talking or just looking at the camera and, at a given point, will perform an expression. So, we want to explore the capability of the system to separate, from a video sequence, the different parts where different emotions occur. In this case the emotions can be decoded using the same HMMs by means of a one-stage dynamic programming algorithm. This algorithm is widely used in connected and continuous speech recognition. The basic idea is to allow, during the decoding, a transition from the final state of each HMM (associated with each emotion) to the first state of the other HMMs. This can be interpreted as a large HMM composed of the

emotion HMMs. The transitions along this large HMM indicate both the decoded emotions and the frames associated with each emotion. In order to recover the transitions, the algorithm needs to save some backtracking information, as is usually the case in dynamic programming. Because of the lack of a more appropriate database, the connected recognition algorithm has been tested by concatenating sequences of six FAP files extracted from the emotion and talk sequences. Results are given in the next table. The recognition rate for these sequences is 64%, showing a lower performance than the isolated sequences recognition.
       F    Sa   Su   J    A    D    T    %Cor
F      25   4    1    3    0    1    0    73.5%
Sa     5    25   4    2    8    2    7    47.1%
Su     2    2    37   0    6    4    9    57.8%
J      2    2    1    56   0    0    0    91.8%
A      0    2    1    2    16   4    8    47.0%
D      2    0    9    0    3    19   3    51.3%
T      1    8    4    3    6    1    56   70.9%
Table 5. Number of sequences correctly detected and number of confusions in the connected sequences experiment. The overall recognition rate obtained is 64%.
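The decoding scheme described in this section can be sketched as follows: the emotion and talk HMMs are stacked into one composite model whose final states may jump to the entry state of any model, and a Viterbi pass over this composite model labels every frame. The exit probability, state counts and model names below are illustrative assumptions of our own.

```python
import numpy as np

# Sketch (our own naming, illustrative values) of the one-stage DP decoding: the
# emotion and talk HMMs are stacked into one composite model whose final states may
# jump, with a small exit probability, to the entry state of any model; a Viterbi pass
# over the composite model then labels every frame with the model it belongs to.

def build_composite(models, exit_prob=0.05):
    """models: list of (name, A) left-to-right transition matrices, last state final."""
    sizes = [A.shape[0] for _, A in models]
    comp = np.zeros((sum(sizes), sum(sizes)))
    owner, entries, offset = [], [], 0
    for (name, A), n in zip(models, sizes):
        comp[offset:offset + n, offset:offset + n] = A
        owner += [name] * n
        entries.append(offset)
        offset += n
    for n, start in zip(sizes, entries):
        final = start + n - 1
        comp[final] *= (1.0 - exit_prob)             # keep most mass inside the model
        for e in entries:                            # allow a jump to any model's entry
            comp[final, e] += exit_prob / len(entries)
    return comp, owner, entries

def one_stage_decode(log_b, comp, entries):
    """Viterbi over the composite model; log_b has shape (T, total_states)."""
    T, S = log_b.shape
    logA = np.log(comp + 1e-300)
    delta = np.full(S, -np.inf)
    delta[entries] = 0.0                             # start in any model's entry state
    delta = delta + log_b[0]
    psi = np.zeros((T, S), dtype=int)                # backtracking information
    for t in range(1, T):
        scores = delta[:, None] + logA               # scores[j, k]: from state j to k
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy usage: two 3-state models and random observation log-probabilities.
rng = np.random.default_rng(5)
A = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
comp, owner, entries = build_composite([("joy", A), ("talk", A)])
frames = one_stage_decode(rng.normal(loc=-5.0, size=(20, comp.shape[0])), comp, entries)
print([owner[s] for s in frames])                    # decoded model label per frame
```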

4. CONCLUSIONS

We have presented a video analysis system that aims at expression recognition in two steps. In the first step, the Low Level Facial Animation Parameters are extracted. The second step analyses these FAPs by means of HMMs to estimate the expressions. For training purposes, a supervised active contour based algorithm is applied to extract these Low Level FAPs from a facial emotion database. This algorithm introduces a criterion for the selection of the candidate snaxels in a dynamic programming implementation of the active contour algorithm. Besides, an additional term is introduced in the external

energy, related to the specific texture around the snaxels, in order to prevent the contour from falling onto stronger edges that might lie near the one we are aiming at. This technique allows us to extract the Low Level FAPs that we will use for training the recognition system. Once the system has been trained, the recognition procedure can be applied to the FAPs extracted by any other method. The results obtained show that the FAPs convey the necessary information for the extraction of emotions. The system shows good performance for distinguishing isolated expressions and can also be used, with lower accuracy, to extract the expressions in long video sequences where speech is mixed with silence frames.

5. REFERENCES
[1] J. Ahlberg, An Active Model for Facial Feature Tracking, accepted for publication in EURASIP Journal on Applied Signal Processing, to appear in June 2002.
[2] A. Amini, T. Weymouth and R. Jain, Using Dynamic Programming for Solving Variational Problems in Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 9, September 1990.
[3] M. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, International Journal of Computer Vision, 25 (1), 1997, 23-48.
[4] J. Cassell, Embodied Conversation: Integrating Face and Gesture into Automatic Spoken Dialogue Systems, in Luperfoy (ed.), Spoken Dialogue Systems, Cambridge, MA: MIT Press.
[5] J. Cassell, K.R. Thorisson, The power of a nod and a glance: Envelope vs. Emotional Feedback in Animated Conversational Agents, Applied Artificial Intelligence 13: 519-538, 1999.
[6] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 681-685, June 2001.
[7] G.W. Cottrell and J. Metcalfe, EMPATH: Face, emotion, and gender recognition using holons, in Neural Information Processing Systems, Vol. 3, pp. 564-571, 1991.
[8] D. DeCarlo, D. Metaxas, Adjusting shape parameters using model-based optical flow residuals, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 6, pp. 814-823, June 2002.
[9] P. Eisert, B. Girod, Analyzing Facial Expression for Virtual Conferencing, IEEE Computer Graphics and Applications, Vol. 18, No. 5, 1998.
[10] P. Ekman and W. Friesen, Facial Action Coding System, Consulting Psychologists Press Inc., 577 College Avenue, Palo Alto, California 94306, 1978.
[11] I. Essa and A. Pentland, Facial Expression Recognition using a Dynamic Model and Motion Energy, Proc. of the International Conference on Computer Vision 1995, Cambridge, MA, May 1995.
[12] I. Essa, Analysis, Interpretation and Synthesis of Facial Expressions, Ph.D. Thesis, Massachusetts Institute of Technology (MIT Media Laboratory).
[13] D. Geiger, A. Gupta, L. Costa and J. Vlontzos, Dynamic Programming for Detection, Tracking, and Matching Deformable Contours, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 3, March 1995.
[14] S. Gunn and M. Nixon, Global and Local Active Contours for Head Boundary Extraction, International Journal of Computer Vision, 30 (1), pp. 43-54, 1998.
[15] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive Database for Facial Expression Analysis, Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 2000.
[16] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models, International Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988.
[17] C. Kervrann, F. Davoine, P. Perez, R. Forchheimer and C. Labit, Generalized likelihood ratio-based face detection and extraction of mouth features, Pattern Recognition Letters 18, pp. 899-912, 1997.
[18] C.L. Lam, S.Y. Yuen, An unbiased active contour algorithm for object tracking, Pattern Recognition Letters 19, pp. 491-498, 1998.
[19] J. Lien, T. Kanade, J. Cohn, C. Li, Subtly Different Facial Expression Recognition and Expression Intensity Estimation, in Proc. of the IEEE Int. Conference on Computer Vision and Pattern Recognition, pp. 853-859, Santa Barbara, CA, June 1998.
[20] B. Moghaddam, A. Pentland, Probabilistic Visual Learning for Object Detection, in 5th International Conference on Computer Vision, Cambridge, MA, June 1995.
[21] ISO/IEC MPEG-4 Part 2 (Visual).
[22] N. Oliver, A. Pentland, F. Berard, LAFTER: A Real-time Lips and Face Tracker with Facial Expression Recognition, in Proc. of IEEE Conf. on Computer Vision, Puerto Rico, 1997.
[23] C. Padgett, G. Cottrell, Identifying emotion in static face images, in Proc. of the 2nd Joint Symposium on Neural Computation, Vol. 5, pp. 91-101, La Jolla, CA, University of California, San Diego.
[24] M. Pantic, L. Rothkrantz, Automatic Analysis of Facial Expressions: The State of the Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, pp. 1424-1445, 2000.
[25] M. Pardàs, E. Sayrol, Motion estimation based tracking of active contours, Pattern Recognition Letters, Vol. 22/13, pp. 1447-1456, November 2001.
[26] R.W. Picard, Affective Computing, MIT Press, Cambridge, 1997.
[27] L. Rabiner, A tutorial on Hidden Markov Models and selected applications in Speech Recognition, Proceedings of the IEEE, pp. 257-284, February 1989.
[28] M. Rosenblum, Y. Yacoob, L.S. Davis, Human Expression Recognition from Motion Using a Radial Basis Function Network Architecture, IEEE Trans. on Neural Networks, 7 (5), 1121-1138, 1996.
[29] F. de la Torre, Y. Yacoob, L. Davis, A probabilistic framework for rigid and non-rigid appearance based tracking and recognition, Int. Conf. on Automatic Face and Gesture Recognition, pp. 491-498, 2000.
[30] V. Vilaplana, F. Marquès, P. Salembier, L. Garrido, Region Based Segmentation and Tracking of Human Faces, Proceedings of the European Signal Processing Conference (EUSIPCO-98), pp. 311-314, 1998.
[31] Y. Yacoob, L.S. Davis, Computing Spatio-Temporal Representation of Human Faces, IEEE Trans. on PAMI, 18 (6), 636-642.
[32] A. Young and H. Ellis (eds.), Handbook of Research on Face Processing, Elsevier Science Publishers, 1989.
