Real-Time Prosody-Driven Synthesis of Body Language
AN HONORS THESIS
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
OF STANFORD UNIVERSITY
Sergey Levine
Principal Adviser: Vladlen Koltun
Secondary Adviser: Christian Theobalt
May 2009
Abstract
Human communication involves not only speech, but also a wide variety of gestures and
body motions. Interactions in virtual environments often lack this multi-modal aspect of
communication. This thesis presents a method for automatically synthesizing body lan-
guage animations directly from the participants’ speech signals, without the need for addi-
tional input. The proposed system generates appropriate body language animations in real
time from live speech by selecting segments from motion capture data of real people in
conversation. The selection is driven by a hidden Markov model and uses prosody-based
features extracted from speech. The training phase is fully automatic and does not require
hand-labeling of input data, and the synthesis phase is efficient enough to run in real time
on live microphone input. The results of a user study confirm that the proposed method is
able to produce realistic and compelling body language.
Acknowledgements
First and foremost, I would like to thank Vladlen Koltun and Christian Theobalt for their
help and advice over the course of this project. Their help and support were invaluable at
the most critical moments, and this project would not have been possible without them. I
would also like to thank Jerry Talton for being there to lend a word of advice, and never
hesitating to point out the occasional glaring flaws that others were willing to pass over.
I would like to thank Stefano Corazza and Emiliano Gambaretto of Animotion Inc.,
Alison Sheets of the Stanford BioMotion Laboratory, and all of the staff at PhaseSpace
Motion Capture for the time and energy they spent helping me with motion capture pro-
cessing and acquisition at various stages in the project. I would also like to thank Sebastián
Calderón Bentin for his part as the main motion capture actor, and Chris Platz, for help with
setting up scenes for the figures and videos. I would also like to thank the many individuals
who reviewed my work and provided constructive feedback, including Sebastian Thrun,
Matthew Stone, Justine Cassell, and Michael Neff.
Finally, I would like to thank all of the participants in my evaluation surveys, as well as
all other individuals who were asked to evaluate the results of my work in some capacity.
The quality of body language can only be determined reliably by a human being, and the
human beings around me were often my first option for second opinions over the entire
course of this project. Therefore, I would like to especially thank all of my friends, who
remained supportive of me throughout this work despite being subjected to an unending
barrage of trial videos and sample questions.
Contents
Abstract
Acknowledgements
1 Introduction
1.1 Background and Motivation
1.2 Related Work
1.3 Gesture and Speech
1.4 System Overview
4.3 Parameterizing the Gesture Model
4.4 Estimating Model Parameters
5 Synthesis
5.1 Overview
5.2 Forward Probabilities
5.3 Most Probable and Coherent State
5.4 Early and Late Termination
5.5 Constructing the Animation Stream
6 Evaluation
6.1 Evaluation Methodology
6.2 Results
6.3 Video Examples
Bibliography
List of Tables
2.1 This table shows the number of segments and clusters identified for each body section in each training set. Training set I consisted of 30 minutes divided into four scenes. Training set II consisted of 15 minutes in eight scenes.
3.1 This table presents a summary of the audio features used by the system. For each continuous variable $X$, $\mu_X$ represents its mean value and $\sigma_X$ represents its standard deviation.
List of Figures
1.1 Training of the body language model from motion capture data with simultaneous audio.
2.1 The body parts which constitute each section are shown in (a) and the motion summary is illustrated in (b).
4.1 The hidden state space of the model consists of the index of the active motion, $A_t = s_i$ or $c_i$, and the fraction of that motion which has elapsed up to this point, $\tau_t$. Observations $V_t$ are paired with the active motion at the end of the next syllable.
5.1 Diagram of the synthesis algorithm for one observation. $A_t$ is the $t$th animation state, $V_t$ is the $t$th observation, $\vec{\alpha}_t$ is the forward probability vector, $\bar{\tau}_t$ is the vector of expected $\tau$ values, and $\vec{\gamma}_t$ is the direct probability vector.
5.2 The image shows the states selected over 40 observations with $\max_i \gamma_t(i)$ (red) and $\max_i \alpha_t(i) \times \gamma_t(i)$ (green), with dark cells corresponding to high forward probability. The graph plots the exact probability of the selections, as well as $\max_i \alpha_t(i)$.
Chapter 1
Introduction
contribution is a method of modeling the gesture formation process that is appropriate for
real-time synthesis, as well as an efficient algorithm that uses this model to produce an
animation from a live speech signal, such as input from a microphone.
Prosody is known to correspond well to emotional state [1, 46] and emphasis [49].
Gesture has also been observed to reflect emotional state [50, 32] and highlight empha-
sized phrases [30]. Therefore, the system generates the animation by selecting appropriate
gesture subunits from the motion capture training data based on prosody cues in the speech
signal. The selection is performed by a specialized hidden Markov model (HMM), which
ensures that the gesture subunits transition smoothly and are appropriate for the tone of the
current utterance. In order to synthesize gestures in real time, the HMM predicts the next
gesture and corrects mispredictions. The use of coherent gesture subunits, with appropriate
transitions enforced by the HMM, ensures a smooth and realistic animation. The use of
prosody for driving the selection of motions also ensures that the synthesized animation
matches the timing and tone of the speech.
Four user studies were conducted to evaluate the effectiveness of the proposed method.
The user studies evaluated two different training sets and the ability of the system to gen-
eralize to novel speakers and utterances. Participants were shown synthesized animations
for novel utterances together with the original motion capture sequences corresponding to
each utterance and an additional sequence of gestures selected at random from the training
set. The results of the study, presented in Chapter 6, confirm that the method produces
compelling body language and generalizes to different speakers.
By animating human characters from live speech in real time with no additional input,
the proposed system can seamlessly produce plausible body language for human-controlled
characters, thus improving the immersiveness of interactive virtual environments and the
fluidity of virtual conversations.
automatic methods have proposed synthesis of facial expressions using more sophisticated
morphing of video sequences [15], physical simulation of muscles [47], or hybrid
rule-based and data-driven methods [5].
Although speech-based synthesis of facial expressions is quite common, it does not
generally utilize vocal prosody. Since facial expressions are dominated by mouth move-
ment, many speech-based systems use techniques similar to phoneme extraction to select
appropriate mouth shapes. However, a number of methods have been proposed that use
speech prosody to model expressive human motion beyond lip movements. Albrecht et al.
[2] use prosody features to drive a rule-based facial expression animation system, while
more recent systems apply a data-driven approach to generate head motion from pitch [13]
and facial expressions from vocal intensity [19]. Incorporating a more sophisticated model,
Sargin et al. [42] use prosody features to directly drive head orientation with an HMM. Al-
though these methods only animate head orientation from prosody, Morency et al. [33]
suggest that prosody may also be useful for predicting gestural displays.
The proposed system selects appropriate motions using a prosody-driven HMM remi-
niscent of the above techniques, but with each output segment corresponding to an entire ges-
ture subunit, such as a “stroke” or “hold.” This effectively reassembles the training motion
segments into a new animation. Current techniques for assembling motion capture into
new animations generally utilize some form of graph traversal to select the most appro-
priate transitions [3, 24]. The proposed system captures this notion of a motion graph in
the structure of the HMM, with high-probability hidden-state transitions corresponding to
transitions which occur frequently in the training data. While facial animation synthesis
HMMs have previously used remapping of observations [8], or conditioned the output an-
imations on the input audio [25], the proposed method maps animation states directly to
the hidden states of the model. This allows a simpler system to be designed that is able
to deal with coherent gesture subunits without requiring them to directly coincide in time
with audio segments.
While the methods described above are able to synthesize plausible body language or
facial expression for ECAs, none of them can generate full-body animations from live
speech. Animating human-controlled characters requires real-time speeds and a predictive
model that does not rely on looking ahead in the audio stream. Such a model constitutes
semantic meaning, provided it occurs at the right time. Since such gestures often accom-
pany a central or emphasized idea, we can hypothesize that their occurrence may also be
signaled by prosody cues.
Since prosody correlates well to emphasis [49], it should correspond to the empha-
sized words that gestures highlight, making it useful for selecting gestures, such as beats
and metaphorics, with the appropriate timing and rhythm. In addition, there is evidence
that prosody carries much of the emotive content of speech [1, 44], and that emotional
state is often reflected in body language [50, 32]. Therefore, we can hypothesize that a
prosody-driven system would produce accurately timed and appropriate beats, as well as
more complex abstract or emotion-related gestures with the appropriate emotive content.
The system captures prosody using the three features that it is most commonly associ-
ated with: pitch, intensity, and duration. Pitch and intensity have previously been used to
drive expressive faces [13, 19], duration has a natural correspondence to the rate of speech,
and all of these aspects of prosody are informative for determining emotional state [43].
While semantic meaning also corresponds strongly to gesture, the system does not attempt
to interpret the utterance. This approach has some limitations, as discussed in Section 7.2,
but is more appropriate for online, real-time synthesis. Extracting semantic meaning from
speech to synthesize gesture without knowledge of the full utterance is difficult because it
requires a predictive speech interpreter, while prosody can be obtained efficiently without
looking ahead in the utterance.
Figure 1.1: Training of the body language model from motion capture data with simulta-
neous audio.
The correlation between speech and body language is inherently a many-to-many map-
ping: there is no unique animation that is most appropriate for a given utterance [8]. This
ambiguity makes validation of synthesis techniques difficult. Since the only way to judge
the quality of a synthesized body language animation is by subjective evaluation, surveys
were conducted to validate the proposed method. The results of these surveys are presented
in Chapter 6.
Chapter 2
Motion Capture Processing
joint orientations. Both the markerless data and the active marker data obtained from the
PhaseSpace system are processed to obtain an animated 14-joint skeleton. The data is then
segmented into gesture unit phases, henceforth referred to as gesture subunits. These seg-
ments are then clustered to identify recurrences of similar motions.
$$z = \sum_{j=1}^{14} u_j \, \|\omega_j\|_2\,.$$
A stroke is expected to have a high z value, while a hold will have a low value. Similarly,
a pause between a stroke and a retraction will be marked by a dip in z as the arm stops
Figure 2.1: The body parts which constitute each section are shown in (a) and the motion
summary is illustrated in (b).
and reverses direction. Therefore, segment boundaries are inserted when this sum crosses
an empirically determined threshold. To avoid creating small segments due to noise, the
system only creates segments which exceed either a minimum length or a minimum limit
on the integral of z over the segment. To avoid unnaturally long still segments during long
pauses, the algorithm places an upper bound on segment length. The motions of the head,
arms, and lower body do not always coincide, so these three body sections are segmented
separately and henceforth treated as separate animation streams. The joints which consti-
tute each of the sections are illustrated in Figure 2.1(a). The number of segments extracted
from the training sets is summarized in Table 2.1.
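As an illustration, a minimal version of this segmentation step might look as follows; the joint weights $u_j$, the threshold, and the length limits shown here are illustrative placeholders rather than the values used by the system:

```python
import numpy as np

def segment_motion(omega, weights, z_thresh=0.5, min_len=15,
                   min_integral=10.0, max_len=150):
    """Split one body section into segments by thresholding z, the weighted
    sum of joint angular-velocity magnitudes.

    omega:   array of shape (frames, joints, 3), per-frame angular velocities.
    weights: per-joint weights u_j.
    """
    # z_t = sum_j u_j * ||omega_{t,j}|| for every frame t.
    z = (weights * np.linalg.norm(omega, axis=2)).sum(axis=1)

    boundaries = [0]
    for t in range(1, len(z)):
        crossed = (z[t - 1] >= z_thresh) != (z[t] >= z_thresh)
        too_long = t - boundaries[-1] >= max_len
        if crossed or too_long:
            seg_len = t - boundaries[-1]
            seg_integral = z[boundaries[-1]:t].sum()
            # Only accept a boundary if the resulting segment is long enough or
            # contains enough motion; this suppresses noise-driven segments.
            if seg_len >= min_len or seg_integral >= min_integral:
                boundaries.append(t)
    if boundaries[-1] != len(z):
        boundaries.append(len(z))
    return list(zip(boundaries[:-1], boundaries[1:]))
```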
though a variety of methods may be appropriate. Ward hierarchical clustering merges two
clusters at every iteration so as to minimize the error sum-of-squares criterion (ESS) for
the new cluster. If we define $C_i$ to be the center of the $i$th cluster, and $S_i^{(j)}$ to be the $j$th
segment in that cluster, we may define $\mathrm{ESS}_i$ as

$$\mathrm{ESS}_i = \sum_{j=1}^{N_i} D(C_i, S_i^{(j)}),$$
where D is the distance metric. The algorithm is modified to always select one of the exist-
ing sample points as a cluster center (specifically, the point which minimizes the resulting
ESS value), thus removing the requirement that a Euclidean metric be used. This allows
the distance function to be arbitrarily complex.
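A sketch of this modified clustering procedure is given below, assuming only that $D$ is a distance function over segments; the stopping threshold is a hypothetical parameter, since the actual criterion for choosing the final number of clusters is not reproduced here:

```python
def cluster_segments(segments, D, max_ess_increase=1.0):
    """Greedy agglomerative clustering in the spirit of Ward's method, but with
    an arbitrary distance D and cluster centers restricted to actual segments
    (the member that minimizes the resulting ESS)."""
    def center_and_ess(members):
        center = min(members, key=lambda c: sum(D(c, s) for s in members))
        return center, sum(D(center, s) for s in members)

    clusters = [{'members': [s], 'center': s, 'ess': 0.0} for s in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                members = clusters[i]['members'] + clusters[j]['members']
                center, ess = center_and_ess(members)
                increase = ess - clusters[i]['ess'] - clusters[j]['ess']
                if best is None or increase < best[0]:
                    best = (increase, i, j,
                            {'members': members, 'center': center, 'ess': ess})
        increase, i, j, merged = best
        if increase > max_ess_increase:
            break  # merging any further would raise the ESS too much
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```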
Segments are compared according to a three-part distance function that takes into ac-
count length, starting pose, and a fixed-size “summary” of the dynamics of the segment.
The “summary” holds some measure of velocity and maximum and average displacement
along each of the axes for a few key body parts in each section, as in Figure 2.1. The head
section uses the rotation of the neck about each of the axes. The arm section uses the po-
sitions of the hands, which often determine the perceptual similarity of two gestures. The
lower body uses the positions of the feet relative to the pelvis, which capture motions such
as the shifting of weight. For the arms and lower body, velocity and maximum and average
displacement each constitute a six-dimensional vector (three axes per joint), while for the
head this vector has three dimensions. Each vector is rescaled to a logarithmic scale, since
larger gestures can tolerate greater variation while remaining perceptually similar. Using
six-dimensional vectors serves to de-emphasize the unused hand in one-handed gestures: if
instead a three-dimensional vector for each hand was rescaled, small motions in the unused
hand would be weighted too heavily on the logarithmic scale relative to the more important
large motion of the active hand. To compare two summaries, the algorithm uses the sum of
the distances between the three vectors.
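The comparison of two summaries could be sketched as follows, assuming each summary is stored as velocity, maximum-displacement, and average-displacement vectors; the exact rescaling formula and the eps constant are illustrative assumptions rather than the system's actual values:

```python
import numpy as np

def summary_distance(summary_a, summary_b, eps=1e-3):
    """Compare two motion summaries, each a dict with 'velocity', 'max_disp',
    and 'avg_disp' vectors (6-D for the arms and lower body, 3-D for the head)."""
    def log_scale(v):
        # Signed logarithmic rescaling, so that larger gestures tolerate
        # proportionally more variation; eps guards against log(0).
        v = np.asarray(v, dtype=float)
        return np.sign(v) * np.log1p(np.abs(v) / eps)

    # The total distance is the sum of the distances between the three vectors.
    return sum(np.linalg.norm(log_scale(summary_a[k]) - log_scale(summary_b[k]))
               for k in ('velocity', 'max_disp', 'avg_disp'))
```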
Although the “summary” is the most significant part of the distance function, it is im-
portant that all gestures in a cluster have roughly the same length and pose, as this allows
us to synthesize a smooth animation stream. In particular, still segments or “holds” (seg-
ments without much motion – e.g., a z value that is consistently below the threshold) tend
Table 2.1: This table shows the number of segments and clusters identified for each body
section in each training set. Training set I consisted of 30 minutes divided into four scenes.
Training set II consisted of 15 minutes in eight scenes.
to have very similar summaries, so the weight of the pose is increased on such segments
to avoid clustering all still segments together. The starting pose is far more important for
still segments, which remain in their starting pose for their duration, while active segments
quickly leave the starting pose.
Table 2.1 provides a summary of the number of clusters identified in each training
set. It should be noted that the lower body section generally contained very few unique
segments. In fact, in training set II, only two holds and one motion were identified. The
motion corresponded to a shifting of weight from one foot to the other. The larger training
set contained a few different weight shifting motions. The arms and head were more varied
by comparison, reflecting their greater prominence in body language.
Chapter 3
Audio Segmentation and Prosody
Audio Features

Feature         | Variable           | Value 0                  | Value 1                      | Value 2
Inflection up   | Pitch change dF0   | dF0 − µ_dF0 < σ_dF0      | dF0 − µ_dF0 ≥ σ_dF0          | —
Inflection down | Pitch change dF0   | dF0 − µ_dF0 > −σ_dF0     | dF0 − µ_dF0 ≤ −σ_dF0         | —
Intensity       | Intensity I        | I < C_silence            | I − µ_I < σ_I                | I − µ_I ≥ σ_I
Length          | Syllable length L  | L − µ_L ≤ −σ_L/2         | −σ_L/2 < L − µ_L < σ_L       | L − µ_L ≥ σ_L

Table 3.1: This table presents a summary of the audio features used by the system. For
each continuous variable X, µ_X represents its mean value and σ_X represents its standard
deviation.
aspects of prosody: pitch, intensity, and duration. The features are summarized in Table
3.1. Pitch is described using two binary features indicating the presence or absence of sig-
nificant upward or downward inflection, corresponding to one standard deviation above the
mean. Intensity is described using a trinary feature indicating silence, standard intensity,
or high intensity, corresponding to half a standard deviation above the mean. Length is
also described using a trinary feature, with “long” segments one deviation above the mean
and “short” segments half a deviation below. This asymmetry is necessitated by the fact
that syllable lengths are not normally distributed. The means and deviations of the training
sequence are established prior to feature extraction. During synthesis, the means and devi-
ations are estimated incrementally from the available audio data as additional syllables are
observed.
This set of discrete prosody-based features creates an observation state-space of 36
states, which enumerates all possible combinations of inflection, intensity, and duration. In
practice, the state space is somewhat smaller, since silent segments do not contain inflec-
tion, and are restricted to short or medium length. Long silent segments are truncated, since
they do not provide additional information over a sequence of medium segments, though
short segments are retained, as these usually correspond to brief pauses between words.
This effectively limits the state space to about 26 distinct observation states.
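As an illustration, the mapping from one syllable's prosody measurements to a discrete observation might be implemented roughly as follows; the running statistics and silence threshold are assumed to be maintained elsewhere, and the names are hypothetical:

```python
def observation_state(dF0, intensity, length, stats, c_silence):
    """Map a syllable's pitch change, intensity, and length to the discrete
    features of Table 3.1, packed into a single observation index.
    stats = {'dF0': (mu, sigma), 'I': (mu, sigma), 'L': (mu, sigma)}."""
    mu_f, sd_f = stats['dF0']
    mu_i, sd_i = stats['I']
    mu_l, sd_l = stats['L']

    up = int(dF0 - mu_f >= sd_f)            # significant upward inflection
    down = int(dF0 - mu_f <= -sd_f)         # significant downward inflection

    if intensity < c_silence:
        loud = 0                             # silence
    elif intensity - mu_i < sd_i:
        loud = 1                             # standard intensity
    else:
        loud = 2                             # high intensity

    if length - mu_l <= -sd_l / 2:
        dur = 0                              # short syllable
    elif length - mu_l < sd_l:
        dur = 1                              # medium syllable
    else:
        dur = 2                              # long syllable

    # 2 * 2 * 3 * 3 = 36 possible combinations, enumerated as one index.
    return ((up * 2 + down) * 3 + loud) * 3 + dur
```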
Chapter 4
Probabilistic Gesture Model
In Section 4.2, I formulate the hidden state space for the gesture model. A parametriza-
tion of the model which allows efficient inference and training is described in Section 4.3,
and the estimation of model parameters from training data is detailed in Section 4.4.
Figure 4.1: The hidden state space of the model consists of the index of the active motion,
At = si or ci , and the fraction of that motion which has elapsed up to this point, τt .
Observations Vt are paired with the active motion at the end of the next syllable.
$$\tau_t(i) = \tau_{t-1}(i) + \Delta\tau_t(i) \quad \text{when } A_t = A_{t-1} = m_i,$$

where $\Delta\tau_t(i)$ is the fraction of the motion $m_i$ which has elapsed since the previous observation. If instead the current motion terminates at observation $V_t$ and is succeeded by $A_t = m_j$, then we simply have $\tau_t(j) = 0$.

Let $\tau'_t(i) = \tau_{t-1}(i) + \Delta\tau_t(i)$; then $\tau_t$ is completely determined by $\tau'_t$ and $A_t$, except in the case when motion $m_i$ follows itself, and $A_t = A_{t-1} = m_i$ but $\tau_t(i) = 0$. To disambiguate this case, we introduce a start state $s_i$ and a continuation state $c_i$ for each motion cluster $m_i$, so that $A_t$ may take on either $s_i$ or $c_i$ (we will denote an arbitrary animation state as $a_j$). This approach is similar to the "BIO notation" (beginning, inside, outside) used in semantic role labeling [41]. Under this new formulation, $\tau_t(i) = \tau'_t(i)$ when $A_t = c_i$ and $A_{t-1} = c_i$
We can also assume that the continuation of $c_i$ depends only on $\tau'_t(i)$, since if $A_t = c_i$, $A_{t-1} = c_i$ or $s_i$, providing little additional information. Intuitively, this seems reasonable, since encountering another observation within a motion becomes less probable as the motion progresses. Therefore, we can construct the probabilities of transitions from $c_i$ as a linear combination of $T_{c_i}$ and $e_{c_i}$, the guaranteed transition into $c_i$, according to some function $f_i$ of $\tau'_t(i)$:

$$P(A_t \mid A_{t-1} = c_i) = \begin{cases} f_i(\tau'_t(i)) & A_t = c_i \\ \left(1 - f_i(\tau'_t(i))\right) T_{c_i, A_t} & \text{otherwise} \end{cases} \qquad (4.1)$$
We now define the vector $T_{s_i}$ as the distribution of transitions out of $s_i$, and construct the full transition matrix $T = [T_{s_1}, T_{c_1}, \ldots, T_{s_n}, T_{c_n}]$, where $n$ is the total number of motion clusters. The parameters of the HMM are then completely specified by the matrix $T$, a matrix of observation probabilities $O$ given by $O_{i,j} = P(v_i \mid a_j)$, and the interpolation functions $f_1, f_2, \ldots, f_n$.
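For concreteness, evaluating a transition probability under this parameterization could look roughly like the sketch below, where T is a lookup from state pairs to probabilities and f[i] is the interpolation function for motion cluster i; the data layout is an assumption made for illustration:

```python
def transition_prob(a_prev, a_next, tau_prime, T, f):
    """P(A_t = a_next | A_{t-1} = a_prev), following Equation 4.1 for
    continuation states and the raw transition matrix otherwise.
    States are tuples ('s', i) or ('c', i); tau_prime[i] is the elapsed
    fraction of motion i at the previous observation."""
    kind, i = a_prev
    if kind == 'c':
        if a_next == ('c', i):
            # Another syllable terminates inside the same motion.
            return f[i](tau_prime[i])
        # Otherwise leave the motion and follow the learned transitions.
        return (1.0 - f[i](tau_prime[i])) * T[(a_prev, a_next)]
    # Transitions out of a start state come directly from T_{s_i}.
    return T[(a_prev, a_next)]
```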
segment, the observed evidence would indicate that ci is the most probable state, which is
not the case. Instead, we would like to predict the next state, which is precisely the state
that is active at the end of the next syllable Vt+1 .
The value of an interpolation function, fi (τt (i)), gives the probability that another syl-
lable will terminate within motion mi after the current one, which terminated at point τt (i).
Since motions are more likely to terminate as τt (i) increases, we may assume that fi is
monotonically decreasing. From empirical observation of the distribution of τ values for
final syllables within training motions, I concluded that a logistic curve would be well
suited for approximating fi . Therefore, the interpolation functions are estimated by fitting
a logistic curve to the τ values for final syllables within each training motion cluster by
means of gradient descent.
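One plausible reading of this fitting step is a small logistic regression on the observed τ values, with a label indicating whether another syllable terminated later within the same motion; the sketch below makes that assumption explicit, and the learning rate and iteration count are placeholders:

```python
import numpy as np

def fit_interpolation_function(tau, another_syllable, lr=0.1, steps=2000):
    """Fit a decreasing logistic curve f(tau) = 1 / (1 + exp(a*tau + b)) by
    gradient descent on the log-loss. `tau` holds elapsed-fraction values for
    syllables in one motion cluster; `another_syllable` is 1 when a further
    syllable terminated within the same motion, 0 when the syllable was final."""
    tau = np.asarray(tau, dtype=float)
    y = np.asarray(another_syllable, dtype=float)
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(a * tau + b))   # current prediction f(tau)
        err = p - y                              # gradient of the loss w.r.t. the logit
        a -= lr * np.mean(-err * tau)            # the logit is -(a*tau + b), hence the signs
        b -= lr * np.mean(-err)
    return lambda t: 1.0 / (1.0 + np.exp(a * t + b))
```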
Chapter 5
Synthesis
5.1 Overview
The model assembled in the training phase allows the system to animate a character in real
time from a novel speech signal. Since the model is trained on animation segments which
follow audio segments, it is able to predict the most appropriate animation as soon as a new
syllable is detected. As noted earlier, gesture strokes terminate at or before the intensity
peak of the accompanying syllable [30]. The syllable segmentation algorithm detects a
syllable in continuous speech when it encounters the syllable peak of the next syllable, so
new gestures begin at syllable peaks. Consequently, the previous gesture ends at or before
a syllable peak.
To synthesize the most probable and coherent animation, the system computes a vector
of forward probabilities along with direct probabilities based on the previously selected
hidden state. The forward probabilities carry context from previous observations, while the
direct probabilities indicate likely transitions given only the last displayed animation and
current observation. Together, these two vectors may be combined to obtain a prediction
that is both probable given the speech context and coherent given the previously displayed
state, resulting in animation that is both smooth and plausible. The synthesis process is
illustrated in Figure 5.1.
Once the most probable and coherent motion cluster has been selected, the system
selects a single segment from this cluster which blends well with the current pose of the
Figure 5.1: Diagram of the synthesis algorithm for one observation. $A_t$ is the $t$th animation state, $V_t$ is the $t$th observation, $\vec{\alpha}_t$ is the forward probability vector, $\bar{\tau}_t$ is the vector of expected $\tau$ values, and $\vec{\gamma}_t$ is the direct probability vector.
character. This segment is then blended into the animation stream, as described in Section
5.5.
that using the expected value gives a good approximation of the distribution. This assump-
tion may not hold in the case of two or more probable timelines which transition into mi
at different points in the past, resulting in two or more peaks in the distribution of τt (i).
However, we assume that this case is not common.
As before, we define a vector $\bar{\tau}'_t(i) = \bar{\tau}_{t-1}(i) + \Delta\tau_t(i)$. Using this vector, computing the transition probabilities between any two states at observation $V_t$ is analogous to Equation 4.1:

$$P(A_t \mid A_{t-1}) = \begin{cases} f_i(\bar{\tau}'_t(i)) & A_t = A_{t-1} = c_i \\ \left(1 - f_j(\bar{\tau}'_t(j))\right) T_{c_j, A_t} & A_{t-1} = c_j \\ T_{A_{t-1}, A_t} & \text{otherwise} \end{cases}$$
With this formula for the transition between any two hidden states, we can use the usual update rule for $\vec{\alpha}_t$:

$$\alpha_t(i) = \eta \sum_{k=1}^{2n} \alpha_{t-1}(k)\, P(A_t = a_i \mid A_{t-1} = a_k)\, P(V_t \mid a_i), \qquad (5.1)$$
where η is the normalization value. Once the forward probabilities are computed, τ̄t must
also be updated. This is done according to the following update rule:
Since only $s_i$ and $c_i$ can transition into $c_i$, $P(A_t = c_i, A_{t-1}) = 0$ if $A_{t-1} \neq c_i$ and $A_{t-1} \neq s_i$. If $A_{t-1} = s_i$, this is the first event during this motion, so the previous value $\tau_{t-1}(i)$ must have been zero. Therefore, $E(\tau_t(i) \mid A_t = c_i, A_{t-1} = s_i) = \Delta\tau_t(i)$. If $A_{t-1} = c_i$, the previous expected value $\tau_{t-1}(i)$ is simply $\bar{\tau}_{t-1}(i)$, so the new value is $E(\tau_t(i) \mid A_t = c_i, A_{t-1} = c_i) = \bar{\tau}_{t-1}(i) + \Delta\tau_t(i) = \bar{\tau}'_t(i)$. Therefore, we may reduce Equation 5.2 to:
Once the forward probabilities and τ̄t are computed, the most probable state could sim-
ply be selected from αt (i). However, choosing the next state based solely on forward
probabilities would produce an erratic animation stream, since the most probable state may
alternate between several probable paths as new observations are provided.
Using $\vec{\gamma}_t$ directly, however, would quickly drive the synthesized animation down an improbable path, since context is not carried forward. Instead, $\vec{\gamma}_t$ and $\vec{\alpha}_t$ can be used together to select a state which is both probable given the sequence of observations and coherent given the previously displayed animation state. The final distribution for the current state is therefore computed as the normalized pointwise product of the two distributions, $\vec{\alpha}_t \times \vec{\gamma}_t$.
Figure 5.2: The image shows the states selected over 40 observations with $\max_i \gamma_t(i)$ (red) and $\max_i \alpha_t(i) \times \gamma_t(i)$ (green), with dark cells corresponding to high forward probability. The graph plots the exact probability of the selections, as well as $\max_i \alpha_t(i)$.

As shown in Figure 5.2, this method also generally selects states that have a higher forward probability than simply using $\vec{\gamma}_t$ directly. The green lines, representing states selected using the combined method, tend to coincide with darker areas, representing higher forward probabilities, though they deviate as necessary to preserve coherence.
To obtain the actual estimate, the most probable and coherent state could be selected as $\arg\max_i (\vec{\alpha}_t \times \vec{\gamma}_t)(i)$. However, always displaying the “optimal” animation is not necessary, since various gestures are often appropriate for the same speech segment. Instead, the
current state is selected by stochastically sampling the state space according to this prod-
uct distribution. This has several desirable properties: it introduces greater variety into the
animation, prevents grating repetitive motions, and makes the algorithm less sensitive to
issues arising from the specific choice of clusters.
To illustrate this last point, consider a case where two fast gestures receive probabilities
of 0.3 each, and one slow gesture receives a probability of 0.4. Clearly, a fast gesture is
more probable, but a highest-probability selection will play the slow gesture every time.
Randomly sampling according to the distribution, however, will properly display a fast
gesture six times out of ten.
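Putting the pieces of this section together, one update of the synthesis loop could be sketched roughly as follows; here the transition matrix T is assumed to have already been recomputed for the current observation from the expected τ values and the interpolation functions, which simplifies the presentation, and `rng` would be a NumPy random generator such as `np.random.default_rng()`:

```python
import numpy as np

def synthesis_step(alpha_prev, prev_state, obs, T, O, rng):
    """One synthesis update for a new observation `obs`.
    T[k, i] = P(A_t = a_i | A_{t-1} = a_k), already folded with the
    interpolation functions; O[v, i] = P(v | a_i).
    Returns the sampled hidden state and the updated forward vector."""
    # Forward probabilities: carry context from all previous observations.
    alpha = O[obs] * (alpha_prev @ T)
    alpha /= alpha.sum()

    # Direct probabilities: consider only the last displayed state.
    gamma = O[obs] * T[prev_state]
    gamma /= gamma.sum()

    # Combine both, then sample stochastically for variety rather than
    # always taking the arg max.
    combined = alpha * gamma
    combined /= combined.sum()
    state = rng.choice(len(combined), p=combined)
    return state, alpha
```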
and adjusts joint orientations to be as close as possible to the desired pose within this constraint. For a new rotation quaternion $r_t$, previous rotation $r_{t-1}$, desired rotation $d_t$, and the derivative in the source animation of the desired frame $\Delta d_t = d_t \bar{d}_{t-1}$, the orientation of a joint at frame $t$ is given by

$$r_t = \mathrm{slerp}\!\left(r_{t-1},\, d_t;\ \frac{\mathrm{angle}(\Delta d_t)}{\mathrm{angle}(d_t \bar{r}_{t-1})}\right),$$
where “slerp” is the quaternion spherical interpolation function [45], and “angle” gives
the angle of the axis-angle representation of a quaternion. The rotation of each joint can
be represented either in parent-space or world-space. Using a parent-space representation
drastically reduces the chance of “unnatural” interpolations (such as violation of anatomic
constraints), while using a world-space representation preserves the “feel” of the desired
animation, even when the starting pose is different, by keeping to roughly the same align-
ment of the joints. For example, a horizontal gesture performed at hip level would remain
horizontal even when performed at chest level, so long as world-space representation of the
joint rotations is used. In practice, the artifacts introduced by world-space blending are ex-
tremely rare, so all joints are blended in world-space with the exception of the hands, which
are blended in parent-space. Being farther down on the skeleton, the hands are particularly
susceptible to unnatural poses as a result of world-space blending.
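A minimal sketch of this blending rule for a single joint is given below, with quaternions stored as [w, x, y, z] arrays; the clamp on the interpolation parameter is an added safeguard for illustration, not part of the equation above:

```python
import numpy as np

def q_mul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def q_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def q_angle(q):
    """Rotation angle of the axis-angle form of a unit quaternion."""
    return 2.0 * np.arccos(np.clip(abs(q[0]), 0.0, 1.0))

def slerp(a, b, t):
    """Spherical linear interpolation between unit quaternions a and b."""
    dot = np.dot(a, b)
    if dot < 0.0:                      # take the shorter arc
        b, dot = -b, -dot
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    if theta < 1e-6:
        return a
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

def blend_joint(r_prev, d_prev, d_t):
    """Rotate the displayed orientation r_prev toward the desired orientation
    d_t by (at most) the amount the source animation itself moved this frame."""
    delta_d = q_mul(d_t, q_conj(d_prev))   # frame-to-frame motion of the source
    gap = q_mul(d_t, q_conj(r_prev))       # remaining offset to the desired pose
    if q_angle(gap) < 1e-6:
        return d_t                          # already aligned with the desired pose
    t = min(1.0, q_angle(delta_d) / q_angle(gap))
    return slerp(r_prev, d_t, t)
```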
The velocity-based method roughly preserves the “feel” of the desired animation even
when the starting pose is different, by maintaining the same velocity magnitude. One effect
of this is that the actual pose of still segments, or “holds,” is relatively unimportant, since
when $\Delta d_t \approx 0$, the last frame is held indefinitely. Due to noise and low-level motion, these
segments will still eventually “drift” toward the original segment pose, but will do so very
gradually.
Chapter 6
Evaluation
“We just, uh, you know ... we just hung around, uh ... went to the beach.”
Figure 6.1: An excerpt from a synthesized animation used in the evaluation study.
Training set I consisted of 30 minutes of motion capture data in four scenes, recorded from
a trained actor. Training set II consisted of 15 minutes in eight scenes, recorded from a
good speaker with no special training. For each training set, two surveys were conducted.
Surveys AI and AII sought to determine the quality of the synthesized animation compared
to motion capture of the same speaker, in order to ensure a reliable comparison without
interference from confounding variables, such as gesturing style. These surveys used ut-
terances from the training speakers that were not present in the training data. Surveys BI
and BII sought to determine how well the system performed on an utterance from a novel
speaker.
6.2 Results
After viewing each animation, participants rated each animation on a five-point Likert scale
for timing and appropriateness. To determine timing, participants were asked their agree-
ment on the statement “the movements of character X were generally timed appropriately.”
To determine the appropriateness of the motions for the current speech, they were asked
their agreement on the statement “the motions of character X were generally consistent
with what he was saying.” The questions and average scores for the surveys are presented
in Figure 6.2.
[Figure 6.2: mean-score bar charts for surveys AI (N = 45), BI (N = 37), AII (N = 40), and BII (N = 50), comparing the original motion capture, the random sequence, and the synthesized sequence for same-speaker (AI, AII) and novel-speaker (BI, BII) utterances.]

Figure 6.2: Mean scores on a 5-point Likert scale for the four evaluation surveys. Error bars indicate standard error.

In all surveys, the synthesized sequence outperformed the random sequence, with p < 0.05. In fact, with the exception of survey AI, the score given to the synthesized sequences remained quite stable. The relatively low scores of both the synthesized and randomly constructed sequences in survey AI may be accounted for by the greater skill of the trained
actor in set I, which is the reason that training set I is used in all other examples in this
thesis and the accompanying videos discussed in Section 6.3.
In the generalization surveys BI and BII, the original motion capture does not always
outperform the synthesized sequence, and in survey BII, its performance is comparable
to that of the random sequence. This indicates that the two speakers used in the general-
ization tests had body language that was not as compelling as the training data. This is
not unexpected, since skilled speakers were intentionally chosen to create the best possi-
ble training data, and different individuals may appear more or less comfortable with their
body language in conversation. It may be difficult for viewers to separate their notions of
“believable” and “effective” speakers (e.g., an awkward speaker will appear less “realistic,”
even though he is just as human as a more skilled speaker). Even the random sequence used
the gesture subunits from the training data, and the transitions in the random sequence were
smooth, due to the fact that motion segments were segmented at low-velocity boundaries
and the blending was velocity-based. Therefore, it is reasonable to conclude that even a
random sequence from a skilled speaker might appear more “believable” than the original
motion capture of a less confident speaker.
It is notable, however, that our system was able to synthesize gestures that were as
appropriate as those in the same-speaker tests. The stability of the scores for our system
across the surveys indicates that it was able to successfully transplant the training speakers’
more effective gesturing styles to the novel speakers.
Chapter 7
Discussion and Future Work
“Motorcycles become a fully immersive experience, where the sound of it, the vibration, the seating, it all matters.”
Figure 7.1: An excerpt from a synthesized animation.
in the final survey to avoid fatiguing the respondents, comments from the pilot study indi-
cated that both synchrony and proper selection were needed to synthesize plausible body
language. Respondents noted that the randomly animated character “felt a little forced” and
“didn’t move with the rhythm of the speech,” while the character which did not synchronize
with syllable peaks appeared “off sync” and “consistently low energy.” Example anima-
tions that use random selection with synchronization and HMM-driven selection without
synchronization may be viewed online, as discussed in Section 6.3. These findings suggest
that both the “synchrony rule” and prosody-based gesture selection are useful for gesture
synthesis.
7.2 Limitations
Despite the effectiveness of this method for synthesizing plausible animations, it has several
inherent limitations. Most importantly, relying on prosody alone precludes the system from
generating meaningful iconic gestures when they are not accompanied by emotional cues.
Metaphoric gestures are easier to select because they originate from a smaller repertoire and
are more abstract [30], and therefore more tolerant of mistakes, but iconic gestures cannot
be “guessed” without interpreting the words in an utterance. This fundamental limitation
may be addressed in the future with a method that combines rudimentary word recognition
with prosody-based gesture selection. Word recognition may be performed either directly
or from phonemes, which can already be extracted in real time [37].
A second limitation of this method is that it must synthesize gestures from information
that is already available to the listener, thus limiting its ability to provide supplementary
details. While there is no consensus in the linguistics community on just how much infor-
mation is conveyed by gestures [27], a predictive speech-based real-time method is unlikely
to impart to the listener any more information than could be obtained from simply listen-
ing to the speaker attentively, while real gestures often convey information not present in
speech [30]. Therefore, while there are clear benefits to immersiveness and realism from
compelling and automatic animation of human-controlled characters, such methods cannot
provide additional details without some form of additional input. This additional input,
however, need not necessarily be as obtrusive as keyboard or mouse controls. For example,
facial expressions carry more information about emotional state than prosody [1], which
suggests that more informative gestures may be synthesized by analyzing the speaker’s
facial expressions, for example through a consumer webcam.
“Which is also one of those very funny episodes that are in this movie.”
Figure 7.2: Synthesized body language helps create an engaging virtual interaction.
7.4 Conclusion
In this thesis, I present a new method for generating plausible body language for a variety of
utterances and speakers, such as the virtual interaction portrayed in Figure 7.2, or the story
being told by the character in Figure 7.1. The use of vocal prosody allows the system to
detect emphasis, produce gestures that reflect the emotional state of the speaker, and select
a rhythm that is appropriate for the utterance. Since the system generates compelling body
language in real time without specialized input, it is particularly appropriate for animating
human characters in networked virtual worlds.
Bibliography
[1] Ralph Adolphs. Neural systems for recognizing emotion. Current Opinion in Neuro-
biology, 12(2):169–177, 2002.
[2] Irene Albrecht, Jörg Haber, and Hans-Peter Seidel. Automatic generation of non-
verbal facial expressions from speech. In Computer Graphics International, pages
283–293, 2002.
[3] Okan Arikan and David Forsyth. Interactive motion generation from examples. ACM
Transactions on Graphics, 21(3):483–490, 2002.
[4] Jernej Barbič, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica K. Hodgins,
and Nancy S. Pollard. Segmenting motion capture data into distinct behaviors. In
Proceedings of Graphics Interface 2004, pages 185–194, 2004.
[5] Jonas Beskow. Talking Heads - Models and Applications for Multimodal Speech
Synthesis. PhD thesis, KTH Stockholm, 2003.
[6] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the
harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Pho-
netic Sciences, volume 17, pages 97–110, 1993.
[7] Paul Boersma. Praat, a system for doing phonetics by computer. Glot International,
5(9-10):314–345, 2001.
[8] Matthew Brand. Voice puppetry. In SIGGRAPH ’99: ACM SIGGRAPH 1999 Papers,
pages 21–28, New York, NY, USA, 1999. ACM.
[9] Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: driving vi-
sual speech with audio. In SIGGRAPH ’97: ACM SIGGRAPH 1997 Papers, pages
353–360, New York, NY, USA, 1997. ACM.
[11] Justine Cassell, Catherine Pelachaud, Norman Badler, Mark Steedman, Brett Achorn,
Tripp Becket, Brett Douville, Scott Prevost, and Matthew Stone. Animated conver-
sation: rule-based generation of facial expression, gesture & spoken intonation for
multiple conversational agents. In SIGGRAPH ’94: ACM SIGGRAPH 1994 Papers,
pages 413–420, New York, NY, USA, 1994. ACM.
[12] Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. BEAT: the be-
havior expression animation toolkit. In SIGGRAPH ’01: ACM SIGGRAPH 2001
Papers, pages 477–486, New York, NY, USA, 2001. ACM.
[13] Erika Chuang and Christoph Bregler. Mood swings: expressive speech animation.
ACM Transactions on Graphics, 24(2):331–347, 2005.
[14] David Efron. Gesture, Race and Culture. The Hague: Mouton, 1972.
[15] Tony Ezzat, Gadi Geiger, and Tomaso Poggio. Trainable videorealistic speech anima-
tion. In SIGGRAPH ’02: ACM SIGGRAPH 2002 Papers, pages 388–398, New York,
NY, USA, 2002. ACM.
[16] Ajo Fod, Maja J. Mataric, and Odest Chadwicke Jenkins. Automated derivation of
primitives for movement classification. Autonomous Robots, 12:39–54, 2002.
[17] Björn Hartmann, Maurizio Mancini, and Catherine Pelachaud. Formational parame-
ters and adaptive prototype instantiation for MPEG-4 compliant gesture synthesis. In
Proceedings on Computer Animation, page 111, Washington, DC, USA, 2002. IEEE
Computer Society.
[18] Carlos Jensen, Shelly D. Farnham, Steven M. Drucker, and Peter Kollock. The effect
of communication modality on cooperation in online environments. In Proceedings
of CHI ’00, pages 470–477, New York, NY, USA, 2000. ACM.
[19] Eunjung Ju and Jehee Lee. Expressive facial gestures from motion capture data.
Computer Graphics Forum, 27(2):381–388, 2008.
[20] Adam Kendon. Gesture. Cambridge University Press, New York, NY, USA, 2004.
[21] Michael Kipp, Michael Neff, and Irene Albrecht. An annotation scheme for conversa-
tional gestures: how to economically capture timing and form. Language Resources
and Evaluation, 41(3-4):325–339, December 2007.
[22] Michael Kipp, Michael Neff, Kerstin H. Kipp, and Irene Albrecht. Towards natural
gesture synthesis: Evaluating gesture units in a data-driven approach to gesture syn-
thesis. In 7th international conference on Intelligent Virtual Agents, pages 15–28,
Berlin, Heidelberg, 2007. Springer-Verlag.
[23] Stefan Kopp and Ipke Wachsmuth. Synthesizing multimodal utterances for conversa-
tional agents: Research articles. Computer Animation and Virtual Worlds, 15(1):39–
52, 2004.
[24] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. ACM Transac-
tions on Graphics, 21(3):473–482, 2002.
[25] Yan Li and Heung-Yeung Shum. Learning dynamic audio-visual mapping with input-
output hidden Markov models. IEEE Transactions on Multimedia, 8(3):542–549, June
2006.
[26] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: a two-level statisti-
cal model for character motion synthesis. In SIGGRAPH ’02: ACM SIGGRAPH 2002
Papers, pages 465–472, New York, NY, USA, 2002. ACM.
[27] Daniel Loehr. Gesture and Intonation. PhD thesis, Georgetown University, 2004.
[28] O. Maeran, V. Piuri, and G. Storti Gajani. Speech recognition through phoneme
segmentation and neural classification. In Proceedings of the IEEE Instrumentation and
Measurement Technology Conference (IMTC/97), volume 2, pages 1215–1220, May 1997.
[29] Anna Majkowska, Victor B. Zordan, and Petros Faloutsos. Automatic splic-
ing for hand and body animations. In Proceedings of the 2006 ACM SIG-
GRAPH/Eurographics symposium on Computer animation, pages 309–316, 2006.
[30] David McNeill. Hand and Mind: What Gestures Reveal About Thought. University
Of Chicago Press, 1992.
[31] David McNeill. Gesture and Thought. University Of Chicago Press, November 2005.
[32] Joann Montepare, Elissa Koff, Deborah Zaitchik, and Marilyn Albert. The use of body
movements and gestures as cues to emotions in younger and older adults. Journal of
Nonverbal Behavior, 23(2):133–152, 1999.
[33] Louis-Philippe Morency, Candace Sidner, Christopher Lee, and Trevor Darrell. Head
gestures for perceptual interfaces: The role of context in improving recognition. Ar-
tificial Intelligence, 171(8-9):568–585, 2007.
[34] Meinard Müller, Tido Röder, and Michael Clausen. Efficient content-based retrieval
of motion capture data. In SIGGRAPH ’05: ACM SIGGRAPH 2005 Papers, vol-
ume 24, pages 677–685, New York, NY, USA, 2005. ACM.
[35] Lars Mündermann, Stefano Corazza, and Thomas P. Andriacchi. Accurately measur-
ing human movement using articulated ICP with soft-joint constraints and a repository
of articulated models. In Computer Vision and Pattern Recognition, pages 1–6, June
2007.
[36] Michael Neff, Michael Kipp, Irene Albrecht, and Hans-Peter Seidel. Gesture model-
ing and animation based on a probabilistic re-creation of speaker style. ACM Trans-
actions on Graphics, 27(1):1–24, 2008.
[37] Junho Park and Hanseok Ko. Real-time continuous phoneme recognition system us-
ing class-dependent tied-mixture HMM with HBT structure for speech-driven lip-
sync. IEEE Transactions on Multimedia, 10(7):1299–1306, Nov. 2008.
[38] Ken Perlin. Layered compositing of facial expression. In SIGGRAPH ’97: ACM SIG-
GRAPH 97 Visual Proceedings, pages 226–227, New York, NY, USA, 1997. ACM.
[39] Ken Perlin and Athomas Goldberg. Improv: a system for scripting interactive actors
in virtual worlds. In SIGGRAPH ’96: ACM SIGGRAPH 1996 Papers, pages 205–216.
ACM, 1996.
[40] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[41] Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-
based learning. In Proceedings of the Third Workshop on Very Large Corpora, pages
82–94, 1995.
[43] Klaus R. Scherer, Rainer Banse, Harald G. Wallbott, and Thomas Goldbeck. Vocal
cues in emotion encoding and decoding. Motivation and Emotion, 15(2):123–148,
1991.
[44] Marc Schröder. Speech and Emotion Research: An overview of research frameworks
and a dimensional approach to emotional speech synthesis. PhD thesis, Phonus 7,
Research Report of the Institute of Phonetics, Saarland University, 2004.
[45] Ken Shoemake. Animating rotation with quaternion curves. In SIGGRAPH ’85: ACM
SIGGRAPH 1985 Papers, pages 245–254, New York, NY, USA, 1985. ACM.
[46] Marc Schröder. Speech and Emotion Research: An overview of research frameworks
and a dimensional approach to emotional speech synthesis. PhD thesis, Phonus 7,
Research Report of the Institute of Phonetics, Saarland University, 2004.
[47] Eftychios Sifakis, Andrew Selle, Avram Robinson-Mosher, and Ronald Fedkiw. Sim-
ulating speech with a physics-based facial muscle model. In Proceedings of the 2006
ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 261–270,
Aire-la-Ville, Switzerland, Switzerland, 2006. Eurographics Association.
[48] Matthew Stone, Doug DeCarlo, Insuk Oh, Christian Rodriguez, Adrian Stere, Alyssa
Lees, and Chris Bregler. Speaking with hands: creating animated conversational char-
acters from recordings of human performance. In SIGGRAPH ’04: ACM SIGGRAPH
2004 Papers, pages 506–513, New York, NY, USA, 2004. ACM.
[49] Jacques Terken. Fundamental frequency and perceived prominence of accented syl-
lables. The Journal of the Acoustical Society of America, 89(4):1768–1777, 1991.
[50] Harald G. Wallbott. Bodily expression of emotion. European Journal of Social Psy-
chology, 28(6):879–896, 1998.
[51] Jing Wang and Bobby Bodenheimer. Synthesis and evaluation of linear motion tran-
sitions. ACM Transactions on Graphics, 27(1):1:1–1:15, Mar 2008.
[52] Joe H. Ward. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58(301):236–244, 1963.
[53] Jianxia Xue, J. Borgstrom, Jintao Jiang, Lynne E. Bernstein, and Abeer Alwan.
Acoustically-driven talking face synthesis using dynamic Bayesian networks. IEEE
International Conference on Multimedia and Expo, pages 1165–1168, July 2006.
[54] Nick Yee. Motivations for play in online games. Cyberpsychology and Behavior,
9:772–775, 2006.