
ISSN 0280-526X

Lund University
Centre for Languages and Literature

General Linguistics
Phonetics

Working Papers
52. 2006
Proceedings from Fonetik 2006
Lund, June 7–9, 2006

Edited by Gilbert Ambrazaitis and Susanne Schötz



Working Papers
Department of Linguistics and Phonetics
Centre for Languages and Literature
Lund University
Box 201
S-221 00 LUND
Sweden
Fax +46 46 2224210
http://www.sol.lu.se/

This issue was edited by Gilbert Ambrazaitis and Susanne Schötz

© 2006 The Authors and the Department of Linguistics and Phonetics,
Centre for Languages and Literature, Lund University

ISSN 0280-526X

Printed in Sweden, Mediatryck, Lund, 2006



Preface
The Proceedings of the Nineteenth Swedish Phonetics Conference, Fonetik 2006, comprise
this volume of the Working Papers from the Department of Linguistics and Phonetics at Lund
University. Fonetik 2006, held at the Centre for Languages and Literature, June 7–9, 2006, is
one in a series of yearly conferences for phoneticians and speech scientists in Sweden which
regularly also attract participants from Denmark, Finland and Norway, and sometimes from
other countries as well. There are 38 contributions represented in this volume. A large variety
of topics is covered in the papers, and we think that the volume gives a representative
overview of current phonetics research in Sweden.
We would like to thank all contributors to the Proceedings. We would also like to
acknowledge the valuable support from The Swedish Phonetics Foundation (Fonetikstiftelsen)
and from The Centre for Languages and Literature and the Lund University Faculty of the
Humanities and Theology.

Lund, May 2006

The Organizing Committee

Gilbert Ambrazaitis, Gösta Bruce, Johan Frid, Per Lindblad, Susanne Schötz, Anders
Sjöström, Joost van de Weijer, Elisabeth Zetterholm

Previous Swedish Phonetics Conferences


(from 1986)
I 1986 Uppsala University
II 1988 Lund University
III 1989 KTH Stockholm
IV 1990 Umeå University (Lövånger)
V 1991 Stockholm University
VI 1992 Chalmers and Göteborg University
VII 1993 Uppsala University
VIII 1994 Lund University (Höör)
–– 1995 (XIIIth ICPhS in Stockholm)
IX 1996 KTH Stockholm (Nässlingen)
X 1997 Umeå University
XI 1998 Stockholm University
XII 1999 Göteborg University
XIII 2000 Skövde University College
XIV 2001 Lund University (Örenäs)
XV 2002 KTH Stockholm
XVI 2003 Umeå University (Lövånger)
XVII 2004 Stockholm University
XVIII 2005 Göteborg University

Contents
Emilia Ahlberg, Julia Backman, Josefin Hansson, Maria Olsson, and Anette Lohmander
Acoustic Analysis of Phonetically Transcribed Initial Sounds in Babbling Sequences
from Infants with and without Cleft Palate .............................................................................1
Gilbert Ambrazaitis and Gösta Bruce
Perception of South Swedish Word Accents............................................................................5
Jonas Beskow, Björn Granström, and David House
Focal Accent and Facial Movements in Expressive Speech ...................................................9
Ulla Bjursäter
A Study of Simultaneous-masking and Pulsation-threshold Patterns of a Steady-state
Synthetic Vowel: A Preliminary Report ................................................................................13
Petra Bodén and Julia Große
Youth Language in Multilingual Göteborg ...........................................................................17
Rolf Carlson, Kjell Gustafson, and Eva Strangert
Prosodic Cues for Hesitation ................................................................................................21
Frantz Clermont and Elisabeth Zetterholm
F-pattern Analysis of Professional Imitations of “hallå” in three Swedish Dialects ...........25
Una Cunningham
Describing Swedish-accented English ..................................................................................29
Wim A. van Dommelen
Quantification of Speech Rhythm in Norwegian as a Second Language ..............................33
Jens Edlund and Mattias Heldner
/nailon/ – Online Analysis of Prosody...................................................................................37
Olov Engwall
Feedback from Real & Virtual Language Teachers .............................................................41
Lisa Gustavsson, Ellen Marklund, Eeva Klintfors, and Francisco Lacerda
Directional Hearing in a Humanoid Robot...........................................................................45
Gert Foget Hansen and Nicolai Pharao
Microphones and Measurements ..........................................................................................49
Mattias Heldner and Jens Edlund
Prosodic Cues for Interaction Control in Spoken Dialogue Systems ...................................53
Pétur Helgason
SMTC – A Swedish Map Task Corpus ..................................................................................57
Snefrid Holm
The Relative Contributions of Intonation and Duration to Degree of Foreign Accent
in Norwegian as a Second Language....................................................................................61

Merle Horne
The Filler EH in Swedish ......................................................................................................65
Per-Anders Jande
Modelling Pronunciation in Discourse Context....................................................................69
Christian Jensen
Are Verbs Less Prominent?...................................................................................................73
Yuni Kim
Variation and Finnish Influence in Finland Swedish Dialect Intonation .............................77
Diana Krull, Hartmut Traunmüller, and Pier Marco Bertinetto
Local Speaking Rate and Perceived Quantity: An Experiment with Italian Listeners .........81
Jonas Lindh
A Case Study of /r/ in the Västgöta Dialect...........................................................................85
Jonas Lindh
Preliminary Descriptive F0-statistics for Young Male Speakers..........................................89
Robert McAllister, Miyoko Inoue, and Sofie Dahl
L1 Residue in L2 Use: A Preliminary Study of Quantity and Tense-lax...............................93
Yasuko Nagano-Madsen and Takako Ayusawa
Cross-speaker Variations in Producing Attitudinally Varied Utterances in Japanese ........97
Daniel Neiberg, Kjell Elenius, Inger Karlsson, and Kornel Laskowski
Emotion Recognition in Spontaneous Speech .....................................................................101
Susanne Schötz
Data-driven Formant Synthesis of Speaker Age .................................................................105
Rein Ove Sikveland
How do we Speak to Foreigners? – Phonetic Analyses of Speech Communication
between L1 and L2 Speakers of Norwegian ........................................................................109
Maria Sjöström, Erik J. Eriksson, Elisabeth Zetterholm, and Kirk P. H. Sullivan
A Switch of Dialect as Disguise ..........................................................................................113
Gabriel Skantze, David House, and Jens Edlund
Prosody and Grounding in Dialog......................................................................................117
Eva Strangert and Thierry Deschamps
The Prosody of Public Speech – A Description of a Project...............................................121
Katrin Stölten
Effects of Age on VOT: Categorical Perception of Swedish Stops
by Near-native L2 Speakers ................................................................................................125
Kari Suomi
Stress, Accent and Vowel Durations in Finnish..................................................................129
Bosse Thorén
Phonological Demands vs. System Constraints in an L2 Setting........................................133
Hartmut Traunmüller
Cross-modal Interactions in Visual as Opposed to Auditory Perception of Vowels ..........137

Marcus Uneson
Knowledge-light Letter-to-Sound Conversion for Swedish with FST and TBL ..................141
Sidney Wood
The Articulation of Uvular Consonants: Swedish...............................................................145
Niklas Öhrström and Hartmut Traunmüller
Acoustical Prerequisites for Visual Hearing ......................................................................149
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 1
Working Papers 52 (2006), 1–4

Acoustic Analysis of Phonetically Transcribed
Initial Sounds in Babbling Sequences
from Infants with and without Cleft Palate
Emilia Ahlberg, Julia Backman, Josefin Hansson,
Maria Olsson, and Anette Lohmander
Institute of Neuroscience and Physiology/Speech Pathology,
Sahlgrenska Academy at Göteborg University
{gusahemi|gusbacju|gushanjos}@student.gu.se, kanelstang@hotmail.com,
anette.lohmander@logopedi.gu.se

Abstract
The aim of this study was to compare acoustic analysis of initial sounds in babbling with the
corresponding phonetic transcriptions. Two speech pathologists had transcribed eight
babbling sequences with total disagreement about whether the initial phoneme was a vowel
or a plosive. After discussion, however, a consensus judgment was reached. To determine
the initial phoneme, an acoustic analysis was performed. Because of the deficient quality of
some of the recordings, the results of the acoustic analysis were not completely reliable.
However, the results were in relatively good agreement with the consensus judgments and
indicate that the two methods should be used as complements to each other.

1 Background and introduction


Perceptual judgment is the method most commonly used in the clinical practice of speech
pathology. However, what a human listener perceives depends on many different factors;
among these, the listener’s expectations are of great importance. Categorical perception refers
to the human tendency to classify speech sounds into discrete categories (Lieberman and
Blumstein 1988; Reisberg 2001). As a consequence, different listeners may assign a phone
that lies between categories to different phonemic groups. In addition, perception is always
subjective. Training with direct feedback is considered to be of great importance for creating
a common frame of reference between two different judges and for increasing perceptual
awareness (Shriberg, 1972).
Children with cleft palate often have difficulties attaining a sufficient intra-oral pressure
when producing plosives. In babbling, the phonetic features of the speech sounds are also less
distinctive than in later speech, since articulation is not yet fully mature. In the acoustic
analysis, plosives may therefore leave a less distinct mark in the spectrogram.
The purpose of this study was to examine the agreement between perceptual and acoustic
analysis in judging the initial sound of babbling sequences of infants with and without cleft
palate.

2 Method
2.1 Material
Babbling sequences at 18 months of age from 41 children with and without cleft palate had
been phonetically transcribed by two listeners. The transcriptions were made independently,
but in close connection with the transcriptions a consensus judgment was made. Consensus
rules had been designed in advance to keep the judgments as equal as possible. Eight of these
independent transcriptions by the two listeners had total disagreement regarding the initial
phoneme. These as well as the consensus transcription were chosen for acoustic analysis.

2.2 Acoustic analysis


Acoustic analysis of the eight babbling sequences was performed to establish the initial sounds
(vowel or plosive). The judges were unaware of whether or not the children had a cleft palate.
The analysis was made using the software Praat (Boersma & Weenink, 2005). Since specific
acoustic signs of a plosive could not always be found, the analysis also had to rely on more
subtle features. The formant transitions and the intensity (the darkness of the spectrogram)
were examined in detail. Noise reduction and, in some cases, frequency filtering were also applied.
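As an aside, this kind of inspection can be scripted; the sketch below uses the parselmouth
interface to Praat in Python, with a hypothetical file name and illustrative filter settings (the
authors' exact Praat settings are not reported):

    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("babble_sequence_01.wav")      # hypothetical file name
    # Praat's band-pass filter: lower edge, upper edge and smoothing, all in Hz (illustrative values)
    filtered = call(snd, "Filter (pass Hann band)", 300.0, 3500.0, 100.0)
    # Broadband spectrogram for inspecting formant transitions and intensity (darkness)
    spectrogram = filtered.to_spectrogram(window_length=0.005)
    # Formant tracks around the initial sound
    formants = filtered.to_formant_burg(time_step=0.01)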

2.3 Statistical method


To compare data from the judgments made by perceptual and acoustic analysis, the statistics
software SPSS was used. The agreement between the different methods, as well as between
the judges, was calculated using Cohen’s Kappa.

3 Results and discussion


The acoustic analyses are presented in Table 1. In Table 2 the results from the acoustic and
perceptual (phonetic transcriptions) analyses are presented.

Table 1. Description of the acoustic analysis of the initial sound from eight babbling
sequences.
1. Plosive Possibly approximant since there is no obvious burst.
2. Vowel Vowel. Woman and child are speaking at the same time. The woman, however, is
using “baby talk”, meaning that her F0 is higher, which makes her sound more
like the child. Still, no natural formant transitions for plosives are seen in the
utterance.
3. Plosive Rather distinct burst, where formant-transitions are seen.
4. Plosive Distinct burst and formant-transitions are seen.
5. Plosive Obvious weakening of the formants initially, which implies a more closed
mouth and could indicate a plosive.
6. Plosive At a detailed level, a trace of a plosive is seen. Could it also be an approximant?
The result is very uncertain since an adult and the child are speaking at the same
time.
7. Plosive A small formant transition appears after filtering of noise. Weak formants that
could signify a nasal vowel or a more closed mouth are observed. A click/bang
sound is heard at exactly the same time as the possible plosive. Uncertain result.
8. Vowel Nothing in the acoustic analysis indicates a plosive.

Table 2. The results of the three perceptual judgments (judge 1, judge 2 and consensus
judgment) and acoustic analysis. C = plosive (consonant), V = vowel.
Child CP/Control Judge 1 Judge 2 Consensus Acoustic analysis
1 CP C V C C
2 CP V C V V
3 Control C V C C
4 Control C V C C
5 Control C V C C
6 CP V C C C
7 Control C V V C
8 Control V C V V

A calculation of agreement between the consensus judgment and the acoustic analysis resulted
in a Kappa value of .71 (>.75 is considered good concordance) (Table 3).

Table 3. The values for agreement between acoustic analysis and the different listener
conditions. Cons = consensus judgment, J1 = judge 1 and J2 = judge 2, Negative value =
disagreement.
Judgments Cohen’s Kappa
Acous – Cons .71
J1 – J2 -.88
Acous – J1 .71
Acous – J2 -.56
Cons – J1 .47
Cons – J2 -.41
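
The Acous – Cons value can be reproduced by hand from the labels in Table 2. The snippet
below is a minimal check of Cohen's Kappa in Python; the authors used SPSS, so this is only
an illustration of the formula, not their procedure:

    # Initial-sound labels from Table 2, children 1-8 (C = plosive, V = vowel)
    consensus = ["C", "V", "C", "C", "C", "C", "V", "V"]
    acoustic  = ["C", "V", "C", "C", "C", "C", "C", "V"]

    n = len(consensus)
    p_o = sum(c == a for c, a in zip(consensus, acoustic)) / n      # observed agreement: 7/8
    p_e = sum((consensus.count(k) / n) * (acoustic.count(k) / n)    # agreement expected by chance
              for k in set(consensus) | set(acoustic))
    kappa = (p_o - p_e) / (1 - p_e)
    print(round(kappa, 2))                                          # 0.71, as in Table 3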

4 Conclusions
According to the statistical analysis, the agreement between the acoustic analysis and the
consensus judgment is relatively good. Even though the Kappa value is only .71, seven out of
the eight judgments agreed. The limited number of samples makes it difficult to draw firm
conclusions. However, the results show that the consensus judgment was reliable. They also
imply that judge 1 had better agreement with the acoustic analysis than judge 2. In fact, judge
1 had as good agreement with the acoustic analysis as the consensus judgment, although based
on different samples. According to the acoustic analysis, some initial sounds appeared more
like approximants than plosives, which can explain the uncertainty in the perceptual
judgments. Since there were only two possible options (plosive or vowel) in the acoustic
analysis, no consideration was given to the approximant-like signs in the spectrogram. This
also makes the perceptual judgments less reliable, as well as the estimates of concordance
between the perceptual judgments and the acoustic analysis.
Three out of the eight babbling sequences were produced by children with cleft palate.
Children with cleft palate have difficulties building up a sufficient intra-oral pressure, which
can result in a less distinct burst in the spectrogram. In the acoustic analysis, two specific
sounds were interpreted as plosives (possibly approximants). These sounds were produced by
children with CP, which explains their approximant-like appearance.
Since the result of the acoustic analysis is interpreted by a human, it is a subjective
judgment. The anatomy of the speaker, the design of the room and the recording method are
examples of factors that influence the analysis. In theory, the acoustic signs are well
described, but they are difficult to interpret when used clinically. Interfering sounds and noise
in the spectrogram are difficult to separate from the babbling. This also affects the results of
the perceptual judgments and could explain disagreements between perceptual judges.
The fact that the judges in this study disagree is not unique. In a study by Shriberg
(1972), the conclusion is drawn that training with a key is important for the concordance
between judges. In the present study the judges had experience of transcribing together, but
without direct feedback or a key.
In conclusion, the results from this study show that neither the perceptual nor the acoustic
judgment gave fully reliable answers. However, the two methods can be assumed to
complement each other. In order to increase the reliability, both perceptually and acoustically,
it is important that the recordings are of sufficient quality. This can be achieved by using
high-quality equipment, by carefully considering the placement of the microphone and by
using a recording room designed so that the risk of disturbing sounds is minimal (for example
by using soft toys when recording children). To obtain a more valid judgment it is also
suggested that utterances where competing sounds cannot be avoided be excluded.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.33)
[Computer program] Retrieved October 7, 2005, from http://www.praat.org/.
Lieberman, P. & S.E. Blumstein, 1988. Speech Physiology. Speech Perception and Acoustic
Phonetics. Cambridge: Cambridge University Press.
Lindblad, P., 1998. Talets Akustik och Perception. Kompendium, Göteborgs Universitet.
Reisberg, D., 2001. Cognition – Exploring the Science of the Mind. New York: W.W. Norton
& Company, Inc.
Shriberg, L.D., 1972. Articulation Judgments: Some Perceptual Considerations. Journal of
Speech and Hearing Research 15, 876-882.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 5
Working Papers 52 (2006), 5–8

Perception of South Swedish Word Accents


Gilbert Ambrazaitis and Gösta Bruce
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
{Gilbert.Ambrazaitis|Gosta.Bruce}@ling.lu.se

Abstract
A perceptual experiment concerning South Swedish word accents (accent I, accent II) is
described. By means of editing and resynthesis techniques the F0 pattern of a test word in a
phrase context has been systematically manipulated: initial rise (glide vs. jump) and final
concatenation (6 timing degrees of the accentual fall). The results indicate that both a gliding
rise and a late fall seem necessary for the perception of accent II, while there appear to be no
such specific, necessary cues for the perception of accent I.

1 Introduction
In the original Swedish intonation model (Bruce & Gårding, 1978) the two tonal word accents
(accent I and accent II) are assigned bitonal representations in terms of High plus Low (HL),
representing the accentual F0 fall. These Highs and Lows are timed differently, however, in
relation to the stressed syllable depending on dialect type. For all dialect types, the HL of
accent I precedes the HL of accent II. In South Swedish, the HL of accent I is aligned with the
stressed syllable, while the HL of accent II is instead aligned with the post-stress syllable.
A problem with the latter representation is that the stressed syllable in accent II words has
no direct tonal representation. Thus this modelling does not reflect what should be the most
perceptually salient part of the pitch pattern of accent II. Figure 1 shows prototypical F0
contours of the two word accents (minimal pair) in a prominent position of an utterance as
produced by a male speaker of South Swedish (the second author).
This particular problem of intonational modelling has been the starting-point of a phonetic
experiment aimed at examining what is perceptually relevant in the F0 contours of accent I
and accent II in the South Swedish dialect type. More specifically, our plan has been to run a
perceptual experiment, where the intention was to find out what are the necessary and
sufficient cues for the identification of both word accents.

Figure 1. Prototypical F0 contours of the two word accents in a prominent position of an


utterance as produced by a male speaker of South Swedish: Jag har sett anden i dimman. (‘I
have seen the duck/spirit in the fog.’) Thick line: acc. I (‘duck’); thin line: acc. II (‘spirit’).

2 Method
We asked subjects to judge whether they perceive the test word anden as either meaning ‘the
duck’ (accent I) or ‘the spirit’ (accent II), in naturally produced and synthetically manipulated
test utterances. We chose to put the test word in a non-final accented position of an utterance
containing two accented words (test word and context word; see Table 1), for several reasons.
First, we wanted to have the possibility of removing the accentual F0 fall of the test word
while maintaining an utterance-final falling pattern. Second, we chose two different context
words – one with accent I (drömmen, ‘the dream’), one with accent II (dimman, ‘the fog’) – in
order to provide a “dialectal anchor” for the listeners. Third, by having the test word non-
finally, we avoided phrase-final creaky voice on the test word, thus facilitating the editing of
F0. Regarding semantic factors, we tried to choose context words which would be as “neutral”
as possible, i.e. which would not bias the ratings of the test word. The test material was
recorded by a male speaker of South Swedish (the second author) in the anechoic chamber at
the Centre for Languages and Literature, Lund University.

Table 1. The structure of the test material, or the four recorded test utterances respectively.
(‘I have seen the duck/spirit in the dream/fog.’)
                Test word       Context word                           Used for
Jag har sett    anden (accI)    i drömmen (accI) / i dimman (accII).   A: control stimuli
                anden (accII)   i drömmen (accI) / i dimman (accII).   B: primary stimuli

2.1 Stimuli
We created 12 F0 contours and implemented them in two recorded utterances (B in Table 1),
by means of F0 editing and resynthesis using the PSOLA manipulation option in Praat
(<http://www.praat.org>). Figure 2 displays the 12 contours for one of the utterances
(dimman) as an example. The starting point was a stylization of the originally recorded F0
contours, i.e. with accent II on the test word (glide/dip4 in Figure 2). Based on this stylized
accent II contour, three contours with a successively later F0 fall were created (dip3, dip2,
dip1), each one aligned at successive segmental boundaries: in dip3, the fall starts at the
vowel onset of the post-stress syllable (schwa), in dip2 at the following /n/ onset, and in dip1
at the onset of /i/. Thus, a continuum of concatenations between the two accented words was
created. Two further steps were added to this continuum: one by completely removing the fall,
yielding a contour that exhibits a high plateau between the two accented words (dip0), and
one by shifting back the whole rise-fall pattern of the original accent II, yielding a typical
accent I pattern (dip5). For each dip position, we also created a contour that lacks the initial

gliding rise on /a(n)/, by simply transforming it into a “jump” from low F0 in sett to high F0
right at the onset of anden. It should be pointed out that the difference between glide and jump
is marginal for dip5 (i.e. accent I), and was implemented for the sake of symmetry only.

Figure 2. Stimulus structure, exemplified in the dimman context: 6 dip levels (0...5) x 2 rise
types (jump, glide). These 12 F0 contours were implemented in both recordings (dimman and
drömmen), yielding 24 stimuli.
Additionally, we generated 4 control stimuli which were based on the A-recordings (cf.
Table 1). These are, however, not further considered in this paper.
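The F0 editing and PSOLA resynthesis described above can be sketched in script form. The
example below uses the parselmouth interface to Praat in Python; the file name, pitch range
and the stylized time/F0 points are invented for illustration and are not the values used for the
actual stimuli:

    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("anden_accentII_dimman.wav")        # hypothetical recording
    manip = call(snd, "To Manipulation", 0.01, 60.0, 300.0)     # time step, pitch floor/ceiling (Hz)
    tier = call(manip, "Extract pitch tier")

    # Replace the original contour with a stylized one (times and F0 values are made up)
    call(tier, "Remove points between", 0.0, snd.get_total_duration())
    for t, f0 in [(0.55, 110.0), (0.80, 160.0), (1.05, 160.0), (1.20, 100.0)]:
        call(tier, "Add point", t, f0)

    call([tier, manip], "Replace pitch tier")
    stimulus = call(manip, "Get resynthesis (overlap-add)")     # PSOLA resynthesis
    call(stimulus, "Save as WAV file", "stimulus_glide_dip2.wav")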

2.2 Procedure
All 24+4=28 stimuli were rated 4 times. The whole list of 112 stimuli was randomized and
presented to the listeners in 8 blocks of 14 stimuli each, via headphones. The listeners heard
each stimulus only once and had to rate it as either referring to a duck (and) or a spirit (ande),
within 3 seconds, by marking it on a paper sheet. The whole test was included in a wav-file
and took 11:31 minutes. Instructions were given orally and in written form. A training test
with two blocks of 4 stimuli each was run before the actual experiment. 20 South Swedish
native speakers, 5 male, 15 female, aged 19-32, with no reported hearing impairments,
volunteered as subjects.

2.3 Data analysis


Based on the four repetitions of each stimulus, an accent II score in % (henceforth %accII)
was calculated per stimulus and subject. These %accII scores were used as raw data in the
analyses. Means and standard deviations, pooled over all 20 listeners, were calculated for
every stimulus. A three-way repeated-measures ANOVA was run for the 24 primary stimuli to
test for effects of the following factors: FINAL WORD (2 levels: drömmen, dimman), RISE
TYPE (2 levels: jump, glide), and CONCATENATION (6 levels: dip0…dip5).
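For reference, a repeated-measures ANOVA with these three within-subject factors could be
set up as follows, here sketched with statsmodels in Python (the column and file names are
hypothetical; the paper does not state which software was used for the ANOVA):

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # One row per subject and stimulus cell: subject, final_word, rise_type, concatenation, pct_accII
    df = pd.read_csv("accentII_scores.csv")                     # hypothetical data file
    res = AnovaRM(df, depvar="pct_accII", subject="subject",
                  within=["final_word", "rise_type", "concatenation"]).fit()
    print(res)                                                  # F and p values for main effects and interactions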

3 Results
The mean %accII ratings are displayed in Figure 3. The stimuli that were intended to represent
clear cases of accent I (dip5), and accent II (glide/dip4) were convincingly rated as expected.
The graphs for the two different contexts look very similar, and accordingly, FINAL WORD had
no significant effect (p>.8). Also, as would be expected from Figure 3, both RISE TYPE and
CONCATENATION have a significant effect (p<.001 each). However, the difference in rise type
is not reflected in a constant %accII difference, which is especially salient in dip5.
Accordingly, we also found a significant interaction between RISE TYPE and CONCATENATION (p<.001).

Figure 3. Mean accII-ratings in % for 2 final word conditions, 6 dip levels (concatenation),
and 2 rise types: glide (straight line) and jump (dotted line).

4 Discussion
Referring back to the issue about necessary and sufficient cues for word accent identification
(cf. Introduction), we will comment on a number of points in the light of our experiment.
Is the gliding rise through the stressed syllable necessary for the perception of accent II? –
Replacing this glide by an F0 jump up to the stressed syllable results in a sizeable decrease in
the votes for accent II (cf. glide/dip4 vs. jump/dip4). This suggests that the gliding rise is
necessary for the unambiguous perception of accent II.
Is a late-timed fall necessary for the perception of accent II? – Replacing the F0 fall
through the post-stress syllable by a high plateau in the target word yields a tendency towards
accent I (cf. glide/dip4 vs. glide/dip0). This suggests that the fall is necessary. However, the
fall must not be substantially earlier than in the original accent II word, since this would
correspond to accent I.
Thus, both the gliding rise and the late fall seem necessary for the unambiguous perception
of accent II. When one of them is removed, the ratings tend towards accent I. When both these
cues are absent, the tendency becomes rather strong (cf. jump/dip0).
What is necessary, and what is sufficient for the perception of accent I? – Accent I is most
convincingly represented by stimuli with an early fall (dip5). However, the discussion above
has already shown that this early fall cannot be a necessary cue, since a number of stimuli
lacking this early fall have received high accent I ratings (cf. also jump/dip1-2). Furthermore,
a high-starting stressed syllable (jump) favors accent I ratings, but cannot be regarded as
sufficient, since the absence of an accent II-like fall appears necessary (cf. dip3).
Thus, our results are most easily explained by the hypothesis that there are no specific
necessary cues for accent I at all, but that simply the absence of accent II cues is sufficient for
the perception of accent I. It is still remarkable, though, that the absence of just one accent II
cue (e.g. the late fall) results in more votes for accent I than for accent II.
Why does a glide followed by a plateau trigger a considerable number of votes for accent
I? – We do not have a definite answer to this question. One possibility is that the conditions of
phrase intonation play a role. In the early part of a phrase, the expectation is a rising pattern.
Thus a phrase-initial accent I may be realized as a rising glide even in South Swedish, as long
as there is no immediately following F0 fall. Another possibility is that the glide-plateau
gesture represents a typical accent I pattern of another dialect type (Svea or Stockholm
Swedish), even if the context word at the end has a stable South Swedish pattern.
What does our experiment tell us about the markedness issue? – From the perspective of
perceptual cues, accent II in South Swedish appears to be more “special” than accent I. This
lends some support to the traditional view of accent II being the marked member of the
opposition (cf. Elert, 1964; Engstrand, 1995; Riad, 1998).

Acknowledgements
Joost van de Weijer assisted us with advice concerning methodology and statistics.

References
Bruce, G. & E. Gårding, 1978. A prosodic typology for Swedish dialects. In E. Gårding et al.
(eds.), Nordic Prosody. Lund: Lund University, Department of Linguistics, 219-228.
Elert, C-C., 1964. Phonologic Studies of Quantity in Swedish. Stockholm: Almqvist &
Wiksell.
Engstrand, O., 1995. Phonetic interpretation of the word accent contrast in Swedish.
Phonetica 52, 171-179.
Riad, T., 1998. Towards a Scandinavian accent typology. In W. Kehrein & R. Wiese (eds.),
Phonology and Morphology of the Germanic Languages. Tübingen: Niemeyer, 77-109.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 9
Working Papers 52 (2006), 9–12

Focal Accent and Facial Movements in
Expressive Speech
Jonas Beskow, Björn Granström, and David House
Dept. of Speech, Music and Hearing, Centre for Speech Technology (CTT), KTH, Stockholm
{beskow|bjorn|davidh}@speech.kth.se

Abstract
In this paper, we present measurements of visual, facial parameters obtained from a speech
corpus consisting of short, read utterances in which focal accent was systematically varied.
The utterances were recorded in a variety of expressive modes including Certain, Confirming,
Questioning, Uncertain, Happy, Angry and Neutral. Results showed that in all expressive
modes, words with focal accent are accompanied by a greater variation of the facial
parameters than are words in non-focal positions. Moreover, interesting differences between
the expressions in terms of different parameters were found.

1 Introduction
Much prosodic information related to prominence and phrasing, as well as communicative
information such as signals for feedback, turn-taking, emotions and attitudes can be conveyed
by, for example, nodding of the head, raising and shaping of the eyebrows, eye movements
and blinks. We have been attempting to model such gestures in a visual speech synthesis
system, not only because they may transmit important non-verbal information, but also
because they make the face look alive.
In earlier work, we have concentrated on introducing eyebrow movement (raising and
lowering) and head movement (nodding) to an animated talking agent. Lip configuration and
eye aperture are two additional parameters that we have experimented with. Much of this
work has been done by hand-manipulation of parametric synthesis and evaluated using
perception test paradigms. We have explored three functions of prosody, namely prominence,
feedback and interrogative mode, which are useful in e.g. multimodal spoken dialogue systems
(Granström, House & Beskow, 2002).
This type of experimentation and evaluation has established the perceptual importance of
eyebrow and head movement cues for prominence and feedback. These experiments do not,
however, provide us with quantifiable data on the exact timing or amplitude of such
movements used by human speakers. Nor do they give us information on the variability of the
movements in communicative situations. This kind of information is important if we are to be
able to implement realistic facial gestures and head movements in our animated agents. In this
paper we will report on methods for the acquisition of visual and acoustic data, and present
measurement results obtained from a speech corpus in which focal accent was systematically
varied in a variety of expressive modes.

2 Data collection and corpus


We wanted to be able to obtain articulatory data as well as other facial movements at the same
time, and it was crucial that the accuracy in the measurements was good enough for
resynthesis of an animated head. The opto-electronic motion tracking system that we use, the
Qualisys MacReflex system, has an accuracy better than 1 mm with a temporal resolution of
60 Hz. The data acquisition and processing are similar to earlier facial measurements carried
out at CTT by e.g. Beskow et al. (2003). The set-up can be seen in Fig. 1, left picture.

Figure 1. Data collection setup with video and IR-cameras, microphone and a screen for
prompts (left) and a test subject with the IR-reflecting markers (right).

The subject could either pronounce sentences presented on the screen outside the window or
be engaged in a (structured) dialogue with another person as shown in the figure. By attaching
infrared (IR) reflecting markers to the subject’s face (see Fig. 1), the system is able to register
the 3D coordinates for each marker. We used a number of markers to register lip movements
as well as other facial movements such as eyebrows, cheek and chin.
The speech material used for the present study consisted of 39 short, content neutral
sentences such as “Båten seglade förbi” (The boat sailed by) and “Grannen knackade på
dörren” (The neighbor knocked on the door), all with three content words which could each
be focally accented. To elicit visual prosody in terms of prominence, these short sentences
were recorded with varying focal accent position, usually on the subject, the verb and the
object respectively, thus making a total of 117 sentences. The utterances were recorded in a
variety of expressive modes including Certain, Confirming, Questioning, Uncertain, Happy,
Angry and Neutral. This database is part of a larger database collected in the EU PF-Star
project (Beskow et al., 2004).

3 Measurement procedure
In the present database a total of 29 IR-sensitive markers were attached to the speaker’s face,
of which 4 markers were used as reference markers (on the ears and on the forehead). The
marker setup (as shown in Fig. 1) largely corresponds to the feature point (FP) configuration
of the MPEG-4 facial animation standard.
In the present study, we chose to base our quantitative analysis of facial movement on the
MPEG-4 Facial Animation Parameter (FAP) representation. Specifically, we chose a subset of
31 FAPs out of the 68 FAPs defined in the MPEG-4 standard, including only the ones that we
were able to calculate directly from our measured point data.
We wanted to obtain a measure of how (in what FAPs) focus was realised by the recorded
speaker for the different expressive modes. In an attempt to quantify this, we introduce the
Focal Motion Quotient, FMQ, defined as the standard deviation of a FAP parameter taken
over a word in focal position, divided by the average standard deviation of the same FAP in
the same word in non-focal position. This quotient was then averaged over all sentence-
triplets spoken with a given expressive mode.
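The FMQ definition can be written out in a few lines of code; the sketch below uses made-up
FAP trajectories and is only meant to make the quotient explicit, not to reproduce the actual
computation over the corpus:

    import numpy as np

    def fmq(fap_focal, fap_nonfocal_renditions):
        """Standard deviation of a FAP over the word in focal position, divided by the
        average standard deviation of the same FAP in the same word in non-focal position."""
        return np.std(fap_focal) / np.mean([np.std(r) for r in fap_nonfocal_renditions])

    # Illustrative FAP trajectories sampled at 60 Hz (values are invented)
    focal = np.array([0.0, 1.2, 2.6, 1.9, 0.4, -0.3])
    nonfocal = [np.array([0.0, 0.4, 0.8, 0.5, 0.1, 0.0]),
                np.array([0.1, 0.3, 0.6, 0.4, 0.2, 0.1])]
    print(fmq(focal, nonfocal))     # values above 1 mean more movement under focal accent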

4 Results and discussion


As a first step in the analysis, the FMQs for all the 31 measured FAPs were averaged across
the 39 sentences. These data are displayed in Fig. 2 for the analyzed expressive modes, i.e.
Angry, Happy, Confirming, Questioning, Certain, Uncertain and Neutral. As can be seen, the
FMQ mean is always above one, irrespective of which facial movement, FAP, is studied. This
means that a shift from a non-focal to a focal pronunciation on the average results in greater
dynamics in all facial movements for all expressive modes. It should be noted that these are
results from only one speaker and averages across the whole database. It is however
conceivable that facial movements will at least reinforce the perception of focal accent. The
mean FMQ taken over all expressive modes is 1.6. The expressive mode yielding the largest
mean FMQ is Happy (1.9) followed by Confirming (1.7), while Questioning has the lowest
mean FMQ value of 1.3. If we look at the individual parameters and the different expressive
modes, some FMQs are significantly greater, especially for the Happy expression, up to 4 for
parameter 34 “raise right mid eyebrow”.
[Figure 2 here: bar chart of FMQ (y-axis, 0 to 4.5) for each of the 31 measured FAPs (x-axis:
jaw, chin, cheek, lip, lip-corner, eyebrow and head-movement parameters, from 3: open jaw
to 50: head roll), with separate bars for the expressive modes Angry, Happy, Confirming,
Questioning, Certain, Uncertain and Neutral.]

Figure 2. The focal motion quotient, FMQ, averaged across all sentences, for all measured
MPEG-4 FAPs for several expressive modes (see text for definitions and details).

In order to more clearly see how different kinds of parameters affect the movement pattern, a
grouping of the FAPs is made. In Fig. 3 the “Articulation” parameters are the ones primarily
involved in the realization of speech sounds (the first 20 in Fig. 2). The “Smile” parameters
are the 4 FAPs relating to the mouth corners. “Brows” correspond to the eight eyebrow
parameters and “Head” are the three head movement parameters. The extent and type of
greater facial movement related to focal accent clearly varies with the expressive mode.
Especially for Happy, Certain and Uncertain, FMQs above 2 can be observed. The Smile
group is clearly exploited in the Happy mode, but also in Confirming, which supports the
finding in Granström, House & Swerts (2002) where Smile was the most prominent cue for
confirming, positive feedback, referred to in the introduction. These results are also consistent
with Nordstrand et al. (2004) which showed that lip corner displacement was more strongly
influenced by utterance emotion than by individual vowel features.

[Figure 3 here: bar chart of FMQ (y-axis, 0 to 2.5) for the four FAP groups articulation, smile,
brows and head, with one cluster of bars per expressive mode (Angry, Happy, Confirming,
Questioning, Certain, Uncertain, Neutral).]
Figure 3. The effect of focus on the variation of several groups of MPEG-4 FAP parameters,
for different expressive modes.

While much more detailed data on facial movement patterns is available in the database, we
wanted to show the strong effects of focal accent on basically all facial movement patterns.
Modelling the timing of the facial gestures and head movements relating to differences
between focal and non-focal accent and to differences between expressive modes promises to
be a fruitful area of future research.

Acknowledgements
This paper describes research in the CTT multimodal communication group including also
Loredana Cerrato, Mikael Nordenberg, Magnus Nordstrand and Gunilla Svanfeldt which is
gratefully acknowledged. Special thanks to Bertil Lyberg for making available the Qualisys
Lab at Linköping University. The work was supported by the EU/IST projects SYNFACE,
PF-Star and CHIL, and CTT, supported by VINNOVA, KTH and participating Swedish
companies and organizations.

References
Beskow, J., L. Cerrato, B. Granström, D. House, M. Nordstrand & G. Svanfeldt, 2004. The
Swedish PF-Star Multimodal Corpora. Proc. LREC Workshop, Multimodal Corpora:
Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and
Output Interfaces, Lisbon, 34-37.
Beskow, J., O. Engwall & B. Granström, 2003. Resynthesis of Facial and Intraoral
Articulation from Simultaneous Measurements. Proc. ICPhS 2003, Barcelona, 431-434.
Granström, B., D. House & J. Beskow, 2002. Speech and gestures for talking faces in
conversational dialogue systems. In B. Granström, D. House & I. Karlsson (eds.),
Multimodality in language and speech systems. Dordrecht: Kluwer Academic Publishers,
209-241.
Granström, B., D. House & M. Swerts, 2002. Multimodal feedback cues in human-machine
interactions. Proc. Speech Prosody 2002, Aix-en-Provence, 347-350.
Nordstrand, M., G. Svanfeldt, B. Granström & D. House, 2004. Measurement of articulatory
variation in expressive speech for a set of Swedish vowels. Speech Communication 44,
187-196.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 13
Working Papers 52 (2006), 13–16

A Study of Simultaneous-masking and
Pulsation-threshold Patterns of a Steady-state
Synthetic Vowel: A Preliminary Report
Ulla Bjursäter
Department of Linguistics, Stockholm University
ullabj@ling.su.se

Abstract
This study will be a partial remake of Tyler & Lindblom’s “Preliminary study of simultaneous-
masking and pulsation-threshold patterns of vowels” (1982), with the use of today’s
technology. A steady-state vowel as masker and pure tones as signals will be presented using
simultaneous-masking (SM) and pulsation-threshold (PT) procedures in an adjustment
method to collect the vowel masking pattern. Vowel intensity is changed in three steps of 15
dB. For SM, each 15 dB change is expected to result in about a 10-13-dB change in signal
thresholds. For PT, the change in signal thresholds with vowel intensity is expected to be
about 3-4 dB. These results would correspond with the results from the Tyler & Lindblom
study. Depending on technology outcome, further experiments can be made, involving
representations of dynamic stimuli like CV-transitions and diphthongs.

1 Introduction
This study is an attempt to partially replicate Tyler & Lindblom’s “Preliminary study of
simultaneous-masking and pulsation-threshold patterns of vowels” (1982). Their intention
was to investigate the effect of the two different masking types as well as the role of
suppression in the coding of speech spectra.
Suppression, or lateral inhibition, refers to the reduction in the reaction to one stimulus by
the introduction of a second (Oxenham & Plack, 1998). The ability of one tone to suppress the
activity of another tone of adjacent frequency has been thoroughly documented in auditory
physiology (Delgutte, 1990; Moore, 1978). In speech, suppression can be used to investigate
formant frequencies.
In the original article, the authors (Tyler & Lindblom, 1982) constructed an experiment
masking steady-state synthetic pure tones by simultaneous and pulsation-threshold patterns of
vowels. Their vowels were synthesized on an OVE 1b speech synthesizer (Fant, 1960) with
formant frequencies, bandwidths and intensities as approximate values for Swedish. In this
study only one of the vowels from the original experiment is synthesized, using Madde, a
singing synthesizer (<www.speech.kth.se/smptool/>) instead of OVE 1b.
In this experiment, the original vowel masking patterns will be used on the Swedish vowel
/y/, a vowel that, according to Tyler & Lindblom (1982), is particularly useful in testing the
role of suppression in speech as it has three closely spaced high frequency formants (F2, F3
and F4). F2 and F4 have about the same frequency as in the vowels /i/ and /e/, and a distinct
perception of these three vowels must depend on good frequency resolution of F3 (Carlson et
al., 1970; Bladon & Fant, 1978; Tyler & Lindblom, 1982).

Tyler, Curtis & Preece (1978) have shown that vowel masking patterns from a forward
masking (FM) procedure preserve the formants better than patterns obtained using
simultaneous masking (SM).

2 Method
2.1 Subject
All collected data will be from an experienced listener who will receive about 30 minutes of
practice with the test procedure.

2.2 Procedure
The subject’s hearing level will be checked and set in connection with the experiment in the
phonetics laboratory at the Department of Linguistics, SU, using Philips SBC HP890
headphones. The tone will be presented at different intensities to obtain a baseline point.
The tests are constructed in a graphic programming language, LabView (<www.ni.com/
labview/>).
The Swedish vowel /y/ is synthesized in Madde, an additive, real-time, singing synthesizer
(<www.speech.kth.se/smptool/>). The vowel formant frequencies and bandwidths used in this
study (Table 1) are the same as used in Tyler & Lindblom (1982), and defined in Carlson,
Fant & Granström (1975) and in Carlson, Granström & Fant (1970).

Table 1. Formant frequencies (F1, F2, F3 and F4), bandwidths and Q-values in Hertz for the
vowel /y/.
/y/ Frequency Bandwidth Q
F1 255 Hz 62.75 Hz 4.06
F2 1930 Hz 146.5 Hz 13.17
F3 2420 Hz 171.0 Hz 14.15
F4 3300 Hz 215.0 Hz 13.35

The procedures of the two simultaneous-masking (SM) and pulsation-threshold (PT) tests are
the same as in the Tyler & Lindblom (1982) study, although in this case with only one vowel
instead of three.
In the SM procedure, the vowels will be presented for 875 ms (between 50% points), repeated
with a 125-ms silent interval. Three pulses of the pure tone will appear within each masker
presentation. These pulses start 125 ms after the vowel onset, continue for 125 ms and are
separated by 125 ms. Rise/fall times (between 10% and 90% points) are 7 ms for both signal
and masker.
In the PT procedure the masking vowel and the pulsating signal alternate, with a duration of
125 ms each. In order to assist the subject in the task, every fourth signal (125 ms) is omitted.
Rise/fall times are 7 ms and the signal and the vowel are separated by 0 ms.
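The pulse timing of the two procedures, as specified above, can be laid out explicitly; the
sketch below only builds the schedules from the stated durations, not the actual audio:

    # SM: three 125-ms tone pulses inside the 875-ms vowel, the first starting 125 ms
    # after vowel onset and each separated by 125 ms.
    pulse_dur, pulse_sep, first_onset = 125, 125, 125
    sm_pulses = [(first_onset + i * (pulse_dur + pulse_sep),
                  first_onset + i * (pulse_dur + pulse_sep) + pulse_dur) for i in range(3)]
    print(sm_pulses)      # [(125, 250), (375, 500), (625, 750)], all within the 875-ms vowel

    # PT: vowel and tone alternate in 125-ms slots; every fourth tone slot is left silent.
    pt_slots = ["vowel" if i % 2 == 0 else
                ("silence" if (i // 2) % 4 == 3 else "tone") for i in range(16)]
    print(pt_slots)
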
F0 levels vary between 80 Hz, 120 Hz and 240 Hz. The vowel intensity changes
parametrically over a range of 45 dB in three steps of 15 dB. Intensity levels alternate between
55.5, 70.5 and 85.5 dB SPL, representing low-voiced, medium and strong speech. The
presentation order of the vowel’s fundamental frequency and intensity will be randomized.
The testing with PT-values and SM will alternate every 10th minute. Each condition will be
presented until five estimates are registered. The stimuli will be presented monaurally, to the
right ear. The subject will be instructed to adjust the level of the signal to a just noticeable
pulsation threshold level; all answers are automatically registered in the LabView programme.
All data will be analyzed in SPSS 14.0 and MS Excel 2002.

3 Results
The technical implementation of the test in LabView is currently under development. The results of
the tests are expected to correspond with the results from the Tyler & Lindblom (1982) study.
Some variation may occur due to individual variations between subjects.

4 Discussion
The expectation of this study is to obtain data that concur with the results from the Tyler &
Lindblom (1982) study.
The results from the Tyler & Lindblom (1982) study show that the masking pattern obtained
with the PT method delineates the vowel’s formant frequencies better than the pattern
obtained with the SM method. Suppression only occurs when two sounds are presented
simultaneously, as in the SM procedure, which seems to result in the signal needing a higher
intensity to be detected.
The difference between the SM and PT measurements was very small at low signal
frequencies and quite large at high signal frequencies (Tyler & Lindblom, 1982). One of the
explanations offered was that the high-intensity F1 suppressed the activity caused by the
higher formants, resulting in lower PT in the high-frequency regions.
Tyler & Lindblom (1982) also propose that the suggested suppression effects for steady-
state vowels also could occur for all speech sounds, although in natural speech, the duration
for which the vowel achieves its target is typically very short.
Depending on the outcome of the technology used in the PT and SM procedures, the
program used in the test can be extended to further investigations of the effects of the two
masking procedures on representations of dynamic stimuli, like CV-transitions and
diphthongs.

Acknowledgements
The LabView test is being constructed with the invaluable help of Ellen Marklund, Francisco
Lacerda and Peter Branderud, SU.

References
Bladon, A. & G. Fant, 1978. A two-formant model and the cardinal vowels. Speech
Transmission Laboratory Quarterly Progress Status Report (STL-QPSR 1), 1-8.
Carlson, R., G. Fant & B. Granström, 1975. Two-formant Models, Pitch and Vowel
Perception. In G. Fant & M.A.A. Tatham (eds.), Auditory Analysis and Perception of
Speech. London: Academic.
Carlson, R., B. Granström & G. Fant, 1970. Some studies concerning perception of isolated
vowels. Speech Transmission Laboratory Quarterly Progress Status Report (STL-QPSR
2/3), 19-35.
Delgutte, B., 1990. Physiological mechanisms of psychophysical masking: Observations from
auditory-nerve fibers. Journal of the Acoustical Society of America 87 (2), 791-809.
Fant, G., 1960. Acoustical Theory of Speech Production. The Hague: Mouton.
Moore, B.C.J., 1978. Psychophysical tuning curves measured in simultaneous and forward
masking. Journal of the Acoustical Society of America 63 (2), 524-532.
Oxenham, A.J. & C.J. Plack, 1998. Suppression and the upward spread of masking. Journal
of the Acoustical Society of America 104 (6), 3500-3510.
Tyler, R.S., J.F. Curtis & J.P. Preece, 1978. Formant preservation revealed by vowel masking
patterns. The Canadian Speech and Hearing Convention, Saskatoon, Saskatchewan.
Tyler, R.S. & B. Lindblom, 1982. Preliminary Study of Simultaneous-Masking and Pulsation-
Threshold Patterns of Vowels. Journal of the Acoustical Society of America 71 (1), 220-
224.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 17
Working Papers 52 (2006), 17–20

Youth Language in Multilingual Göteborg


Petra Bodén¹ and Julia Große²
¹ Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
petra.boden@ling.lu.se
² Institute of Swedish as a Second Language, Dept. of Swedish, Göteborg University
julia.grosse@svenska.gu.se

Abstract
In this paper, the results from a perception experiment about youth language in multilingual
Göteborg are presented and discussed.

1 Introduction
1.1 Language and language use among young people in multilingual urban settings
The overall goal of the research project ‘Language and language use among young people in
multilingual urban settings’ is to describe and analyze a Swedish variety (or set of varieties)
hereafter called SMG (Lindberg, 2006). SMG stands for ‘Swedish on Multilingual Ground’
and refers to youth varieties like “Rinkeby Swedish” and “Rosengård Swedish”. In the present
paper, we address two of the project’s research questions: SMG’s relation to foreign accent
and how SMG is perceived by the adolescents themselves.

1.2 Purpose of the perception experiment


In the perception experiment, Göteborg students are asked to listen for examples of
“gårdstenska” (the SMG spoken in Göteborg) in recordings from secondary schools. The
purpose is to identify speakers of SMG for future studies and to test the hypotheses that 1)
monolingual speakers of Swedish can speak SMG and 2) speakers of SMG can code-switch to
a more standardized form of Swedish. Foreign accent, defined here as the result of negative
interference from the speaker’s L1 (first language), cannot occur in the Swedish that is spoken
by persons who have Swedish as their (only) L1, nor can foreign accent be switched off in
certain situations.

2 Method
Stimuli were extracted from the research project’s speech database and played once (over
loudspeakers) to a total of 81 listeners. The listeners were asked to answer two questions
about each stimulus: Does the speaker speak what is generally called gårdstenska? (yes or
no), and How confident are you about that? (confident, rather confident, rather uncertain or
uncertain). The listeners were also asked to answer a few questions about who they believed
typically speaks gårdstenska. The 19 stimuli used in the experiment were approximately
30-second-long sections that had been extracted from spontaneous (unscripted) recordings made
at secondary schools in Göteborg. The listeners in the experiment were students from the
same two schools as the speakers.
After the answer sheets had been collected, a general discussion on SMG was held in each
class.

3 Results and discussion


3.1 Listeners’ views on gårdstenska
80 of the 81 listeners answered the questions about who typically speaks gårdstenska. All 80
answered that adolescents are potential speakers of gårdstenska. 54% (43) answered that
children, too, can speak gårdstenska and 15% (12) that adults are potential speakers. Almost half of
the listeners (37) claimed that only adolescents speak gårdstenska. Only a third of the listeners
(25) believed that gårdstenska can be spoken by persons without an immigrant background.
Most of them, 23, answered that gårdstenska is also spoken by first and second generation
immigrants. One listener, however, answered that only persons without immigrant
background and first generation immigrants speak gårdstenska. The listener was herself a
second generation immigrant. One listener in a similar experiment undertaken in Malmö
(Hansson & Svensson, 2004) answered in the same fashion, i.e. excluding persons with the
same background as the listener herself. Finally, 69% (55) of the listeners answered that only
persons with an immigrant background speak gårdstenska. The majority, 30, regards both first
and second generation immigrants as potential speakers of gårdstenska, whereas 15 only
include first generation immigrants and 10 only second generation immigrants, see Figure 1.
[Figure 1 here: bar chart of number of answers per response category (only speakers without
immigrant background; speakers born in Sweden, regardless of background; only second
generation immigrants; everybody, regardless of background and place of birth; second and
first generation immigrants; only first generation immigrants), with bars broken down by
listener background (first generation immigrants, second generation immigrants, listeners
with one Swedish-born parent, listeners without immigrant background).]

Figure 1. The listeners’ answers to the question: Who speaks what is generally called
gårdstenska?

3.2 Listeners’ classification of the stimuli


A statistically significant majority of the listeners regarded the stimuli P05a, P11a, P19, P10a,
P08, P10b, P05b, P35, P11b and P47 as examples of gårdstenska (p<.05). Between 36% and
80% of those listeners felt confident in their classification of the stimuli as gårdstenska
whereas the corresponding percentages for the listeners classifying the same stimuli as ‘not
gårdstenska’ varied from 27 to 57. Stimuli P38, P25, S40, S08, S15 and S30b were judged as
examples of something else than gårdstenska by a majority of the listeners (p<.05). Between
39% and 73% of the listeners felt confident in their classification of the stimuli as something
else than gårdstenska whereas the corresponding percentages for the listeners classifying the
stimuli as gårdstenska varied from only 10 to 50. Stimuli P49, S43, S30a and S23 were
perceived as gårdstenska by about half of the listeners (p>.05). Of the listeners that classified these stimuli as gårdstenska, 27% to 45% reported feeling confident, as did 23% to 42% of those that classified them as not gårdstenska. Both the listeners’ classifications of these four stimuli and their reported uncertainty indicate that the stimuli in question contain speech that cannot unambiguously be classified as either SMG or something other than SMG.
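The paper does not state which statistical test underlies these majority judgements; the following is a minimal sketch, in Python, of one plausible way to check whether the classification split for a single stimulus deviates significantly from chance, using a two-sided binomial (sign) test against p = 0.5. The listener counts are hypothetical.

from scipy.stats import binomtest

n_listeners = 80          # hypothetical number of listeners judging one stimulus
n_smg_votes = 60          # hypothetical number of 'gårdstenska' classifications

result = binomtest(n_smg_votes, n=n_listeners, p=0.5, alternative="two-sided")
print(f"majority = {n_smg_votes / n_listeners:.0%}, p = {result.pvalue:.4f}")
# p < .05 would indicate that the split deviates significantly from a 50/50 chance split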

3.3 Foreign accent or language variety?


Two hypotheses were tested in the experiment: 1) that monolingual speakers of Swedish can
speak gårdstenska and 2) that speakers of gårdstenska can code-switch to a more standardized
form of Swedish. Table 1 shows the relationship between the speakers’ background (if they
have an immigrant background or not) and language use (SMG or not). An immigrant
background is neither necessary nor sufficient for a speaker to be classified as a speaker of
SMG. Three monolingual speakers of SMG were identified.
Speech produced by speakers P11, P05 and P10 was used in two different types of stimuli:
a) talking to friends and b) to a project member/researcher. Unlike in the Malmö experiment
(Hansson & Svensson, 2004), no speaker was classified as a speaker of SMG in one stimulus
but not in the other. P11a was perceived as a speaker of gårdstenska by a statistically
significant majority of the listeners in situation a (93%, p<.05) but not unambiguously
classified in situation b (69% SMG classifications, p>.05). P05 and P10 also got larger
proportions of SMG classifications in the a stimuli than the b stimuli, but in both types of
stimuli they were classified as speakers of SMG (p<.05).

Table 1. Speakers’ background and classification by the listeners in the experiment.


Classification according    Born in Sweden, at least     Born in Sweden, parents    Not born in Sweden,          Not born in Sweden,
to the listeners (p<.05)    one parent born in Sweden    not born in Sweden         to Sweden before age 6       to Sweden at age 6 or later
SMG                         P47, P35, P08                P10                        P19                          P11, P05
Not SMG                     S15                          S30, P25, P38              S08                          S40

3.4 Differences in awareness of and attitude towards gårdstenska


One thing that should be mentioned is that the term gårdstenska used in the survey (and in this
paper) did not seem to be as widely accepted as we thought. Initially there was some
uncertainty among the listeners about what kind of language use we were referring to. However,
when we described it as a “Göteborg version of Rinkeby Swedish”, the listeners seemed to
understand what they were asked to listen for. Since we were interested in the listeners’
attitudes towards, and their awareness of gårdstenska, we tried to initiate a discussion about
the subject matter after having completed the experiment. The observations described below
are based on field notes and recollections and are not to be seen as results of the experiment
but rather as overall impressions.
When asked on what grounds they had categorized the speakers in the experiment most
listeners seemed to agree that the use of certain words was crucial for their decision. Some
students mentioned pronunciation, prosody and word order as typical features. In two of the
five classes most of the time was spent listing words and phrases typical for gårdstenska. In
the other three classes the discussion topic varied from typical linguistic features to more
socio-linguistic aspects of multi-ethnic youth language. Several students in different classes
made an explicit distinction between gårdstenska and foreign accent, and in one class a
discussion developed about the function of multi-ethnic youth language as an identity marker
used by adolescents who aim to underline their non-Swedish identity. The student who was
the most active in this part of the discussion also emphasized the difference between multi-
ethnic youth varieties and foreign accents and drew parallels to regional varieties of Swedish.
Concerning students’ attitudes towards gårdstenska, there appeared to be considerable differences between some of the classes. From this angle the discussion was particularly
interesting in one of the classes. Only one male student in this class seemed to identify with
speakers of gårdstenska (or “invandriska” as he himself called it). This student said that he
would not use what we referred to as gårdstenska in class because his classmates would laugh
at him. He refused to name any typical words or features of gårdstenska in class but
volunteered to hand in a word list, which only we as researchers were allowed to look at. This
student made it clear that “invandriska” was a language he used with his friends outside his
class and never in the classroom. Interestingly, this was the same class we mentioned above, where some students talked about gårdstenska as an identity marker, whereas others were quite determined in their opinion that this kind of language use was due to low proficiency in Swedish. Within the other classes the subject seemed less controversial. We can, of course,
only speculate about the cause for these differences between the classes. One impression was
that there was less controversy about the issue in those classes where more students seemed to
identify with speakers of gårdstenska; these were also the classes that were more heterogeneous with regard to the students’ linguistic and cultural backgrounds.

3.5 Listeners’ awareness of sociolinguistic variation


After visiting the five school classes in two of Göteborg’s multilingual areas, the overall impression was that many of the students showed at least some awareness of sociolinguistic aspects of language use. Some students, as mentioned above, explicitly discussed
aspects of language and identity, showing great insight and strong opinions on the issue.
Overall most students seemed to acknowledge that gårdstenska is spoken in certain groups
(i.e. among friends but not with teachers or parents) and in certain situations and not in others.
Thus the listeners showed some awareness of register variation, even though there were
different opinions on the question to what extent speakers make a conscious linguistic choice
or unconsciously adapt their language when code-switching between gårdstenska and other
varieties of Swedish. There was, however, a minority of listeners who categorized what they
heard in some of the stimuli as interlanguage of individuals lacking proficiency in Swedish.

4 Future work
The monolingual speakers of SMG support the hypothesis of SMG being a variety of Swedish
rather than foreign accent. From discussions with adolescents we have learnt that SMG is
primarily used among friends and not with e.g. teachers and parents. Therefore it is interesting
that some speakers in the experiment were perceived as speaking SMG (albeit to a lesser
degree) even in dialogues with adults. Future work includes investigating whether some features of SMG (e.g. the foreign-sounding pronunciation) are kept even in situations where other features (e.g. the SMG vocabulary) are not used, and whether these features are possibly kept later in life, when the speakers no longer use a youth language.

Acknowledgements
The research reported in this paper has been financed by the Bank of Sweden Tercentenary
Foundation. The authors would like to thank Roger Källström for fruitful discussions on the
experiment’s design and much appreciated help!

References
Hansson, P. & G. Svensson, 2004. Listening for “Rosengård Swedish”. Proceedings
FONETIK 2004, 24-27.
Lindberg, I., 2006. Språk och språkbruk bland ungdomar i flerspråkiga storstadsmiljöer
2000–2006. Institute of Swedish as a Second Language, Göteborg University.
http://hum.gu.se/institutioner/svenska-spraket/isa/verk/projekt/pag/pg_forsk2

Prosodic Cues for Hesitation


Rolf Carlson¹, Kjell Gustafson¹,², and Eva Strangert³*
¹ Department of Speech, Music and Hearing, KTH, {rolf|kjellg}@speech.kth.se
² Acapela Group Sweden AB, Solna, kjell.gustafson@acapela-group.com
³ Department of Comparative Literature and Scandinavian Languages, Umeå University, eva.strangert@nord.umu.se
* names in alphabetical order

Abstract
In our efforts to model spontaneous speech for use in, for example, spoken dialogue systems,
a series of experiments have been conducted in order to investigate correlates to perceived
hesitation. Previous work has shown that it is the total duration increase that is the valid cue
rather than the contribution by either of the two factors pause duration and final lengthening.
In the present experiment we explored the effects of F0 slope variation and the presence vs.
absence of creaky voice in addition to durational cues, using synthetic stimuli. The results
showed that variation of both F0 slope and creaky voice did have perceptual effects, but to a
much lesser degree than the durational increase.

1 Introduction
Disfluencies of various types are a characteristic feature of human spontaneous speech. These
can occur for reasons such as problems in lexical access or in the structuring of utterances or
in searching feedback from a listener. The aim of the current work is to gain a better
understanding of what features contribute to the impression of hesitant speech on a surface
level. One of our long term research goals is to build a synthesis model which is able to
produce spontaneous speech including disfluencies. Apart from increasing our understanding
of the features of spontaneous speech, such a model can be explored in spoken dialogue
systems, both to increase the naturalness of the synthesized speech (Callaway, 2003) and as a paralinguistic signal of, for example, uncertainty in a dialogue. The current work deals
with the modelling of one type of disfluency, hesitations. The work has been carried out
through a sequence of experiments using Swedish speech synthesis.
If we are to model hesitations in a realistic way in dialogue systems, we need to know more
about what phonetic features contribute to the impression that a speaker is being hesitant. A
few studies have shown that hesitations (and other types of disfluencies) very often go
unnoticed in normal conversation, even during very careful listening, but scientific studies
have in the past concentrated much more on the production than on the perception of hesitant
speech. Pauses and retardations have been shown to be among the acoustic correlates of
hesitations (Eklund, 2004). Significant patterns of retardation in function words before
hesitations have been reported (Horne et al., 2003). A recent perception study (Lövgren & van
Doorn, 2005) confirms that pause insertion is a salient cue to the impression of hesitation, and
the longer the pause, the more certain the impression of hesitance.
With a few exceptions, relatively little effort has so far been spent on research on
spontaneous speech synthesis with a focus on disfluencies. In recent work (Sundaram &
Narayanan, 2003) new steps are taken to predict and realize disfluencies as part of the unit
selection in a synthesis system. In Strangert & Carlson (2006) an attempt to synthesize
hesitation using parametric synthesis was presented. The current work is a continuation of this
effort.

2 Experiment
Synthetic versions of a Swedish utterance were presented to listeners who had to evaluate if
and where they perceived a hesitation. The subjects, regarded as naive users of speech
synthesis, were 14 students of linguistics or literature from Umeå University, Sweden.
The synthetic stimuli were manipulated with respect to duration features, F0 slope and
presence vs. absence of creaky voice to invoke the impression of a hesitation. A previous
study (Carlson et al., 2006) showed the total increase in duration at the point of hesitation to
be the most important cue rather than each of the factors pause and final lengthening
separately. Therefore, pause and final lengthening were now combined in one “total duration
increase” feature. The parameter manipulation was done in two different sentence positions as
in the previous study. However, in the current experiment the manipulations were similar in
the two positions, whereas different parameter settings were used in the previous one.
The stimuli were synthesized using the KTH formant based synthesis system (Carlson &
Granström, 1997), giving full flexibility for prosodic adjustments. 160 versions of the
utterance were created covering all feature combinations in two positions: A hesitation was
placed either in the first part (F) or in the middle (M) of the utterance “I sin F trädgård har
Bettan M tagetes och rosor.” (English word-by-word translation: “In her F garden has Bettan
M tagetes and roses.”) In addition, there were stimuli without inserted hesitations.
The two positions were chosen to be either inside a phrase (F) or between two phrases (M).
The hesitation points F and M were placed in the unvoiced stop consonant occlusion and were
modelled using three parameters: a) total duration increase combining retardation before the
hesitation point and pause, b) F0 slope variation and c) presence/absence of creak.
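As an illustration of the factorial design, the sketch below (Python) enumerates parameter combinations of the three features and the two positions. The F0 and creak levels are taken from sections 2.2 and 2.3; the duration-increase levels are placeholders chosen only to show the structure, since the exact steps are not listed here.

from itertools import product

positions = ["F", "M"]                         # inside a phrase / between two phrases
f0_slopes_hz = [40, 20, 0, -20, -40]           # rising, flat and falling contours (section 2.2)
creak = [False, True]                          # presence/absence of creak (section 2.3)
duration_increase_ms = [0, 25, 50, 100, 200, 400, 600, 800]   # assumed illustrative levels

stimuli = [{"position": p, "f0_slope": s, "creak": c, "dur_increase": d}
           for p, s, c, d in product(positions, f0_slopes_hz, creak, duration_increase_ms)]
print(len(stimuli), "parameter combinations in this sketch")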

2.1 Retardation and pause


The segment durations in our test stimuli were set according to the default duration rules in
the TTS system. The retardation adjustment was applied on the VC sequence /in/ in “sin” and
/an/ in “Bettan” before the hesitation points F and M, respectively, and the pausing was a
simple lengthening of the occlusion in the unvoiced stop. All adjustments were done with an
equal retardation and pause contribution following our earlier results in Carlson et al. (2006).

Figure 1. a) F0 shapes for the two possible hesitation positions F and M. D=Retardation +
Pause. b) Illustration of intonation contours for the two extreme cases in position F.

2.2 F0 slope variation


The intonation was modelled by the default rules in the TTS system. At the hesitation point
the F0 was adjusted to model slope variation in 5 shapes: rising contours (+20, +40 Hz), a flat contour (0) and falling contours (-20, -40 Hz). The pivot point before the hesitation was
placed at the beginning of the last vowel before the hesitation, see Figure 1a. Figure 1b shows
spectrograms with intonation curves for the two extreme cases in the F position.
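A minimal sketch (Python, not the actual TTS rules) of this manipulation: the F0 contour is ramped linearly from the pivot at the start of the last pre-hesitation vowel to an endpoint shifted by 0, ±20 or ±40 Hz at the hesitation point.

import numpy as np

def apply_f0_slope(f0, pivot_idx, hes_idx, delta_hz):
    """f0: frame-wise F0 contour in Hz; pivot_idx/hes_idx: frame indices of pivot and hesitation point."""
    f0 = np.asarray(f0, dtype=float).copy()
    ramp = np.linspace(0.0, delta_hz, hes_idx - pivot_idx + 1)   # 0 Hz at pivot, delta at hesitation
    f0[pivot_idx:hes_idx + 1] += ramp
    return f0

contour = np.full(50, 120.0)                   # hypothetical flat 120-Hz contour, one value per frame
rising = apply_f0_slope(contour, pivot_idx=30, hes_idx=45, delta_hz=40)
falling = apply_f0_slope(contour, pivot_idx=30, hes_idx=45, delta_hz=-40)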

2.3 Creak
Creaky voice was set to start three quarters into the last vowel before the hesitation and to
reach full effect at the end of the vowel. The creak was modelled by changing every other
glottal pulse in time and amplitude (Klatt & Klatt, 1990).
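A sketch (Python) of the every-other-pulse idea on a symbolic level: given glottal pulse epochs and amplitudes, alternate pulses are shifted in time and attenuated, which approximates the diplophonic quality of creak. The perturbation factors are assumptions for illustration, not the values used in the synthesis.

import numpy as np

def creakify(pulse_times, pulse_amps, time_shift=0.25, amp_scale=0.6):
    """Shift and attenuate every other glottal pulse (cf. Klatt & Klatt, 1990)."""
    times = np.asarray(pulse_times, dtype=float).copy()
    amps = np.asarray(pulse_amps, dtype=float).copy()
    periods = np.diff(times, prepend=times[0])          # local pitch periods
    times[1::2] += time_shift * periods[1::2]           # delay every second pulse
    amps[1::2] *= amp_scale                             # and reduce its amplitude
    return times, amps

t = np.arange(0.0, 0.1, 0.008)                          # hypothetical ~125-Hz pulse train, 100 ms
creaky_times, creaky_amps = creakify(t, np.ones_like(t))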

[Figure 2: two panels plotted against duration increase (ms; 1–1000). Panel a: hesitation perception (%) for the First (F) and Mid (M) positions. Panel b: hesitation perception increase (%) due to creak, for the F and M positions.]

Figure 2. a) Distribution of hesitation responses. b) Distribution of hesitation perception increase due to addition of creak. Data separated according to position of hesitation.

3 Results and discussion


The results of the experiment are summarized in Figure 2. In Figure 2a, hesitation perception
is plotted as a function of total duration increase. The strong effect is similar to and confirms
the previous result that the combined effect of pause and retardation is a very strong cue to
hesitation. In Figure 2b, the increase in hesitation perception due to the addition of creak is
plotted against total duration increase. Here, a compensatory pattern is revealed, in particular
in the first position; when the duration adjustment is at the categorical border (at a total
duration increase of about 100 ms, cf. Figure 2a), creak has a strengthening effect, favouring
the perception of a hesitation. In a similar way, falling F0 contours made perception of
hesitation easier at the categorical border for duration, compensating for weak duration cues.
These results support the conclusion that duration increase, achieved by the combined
effects of retardation and pause, is an extremely powerful cue to perceived hesitation. F0 slope
variation and creak play a role, too, but both are far less powerful, functioning as supporting
rather than as primary cues. Their greatest effects apparently occur at the categorical border,
when the decision hesitation/no hesitation is the most difficult.
The results further indicate that subjects are less sensitive to modifications in the middle
position (M) than in the first position (F). We relate this to the difference in syntactic
structure: in the F position the hesitation occurs in the middle of a noun phrase (“I sin F
trädgård”), whereas in the M position it occurs between two noun phrases, functioning as
subject and object respectively. A reasonable assumption is that the subjects expected some
kind of prosodic marking in the latter position and that therefore a greater lengthening was
required in order to produce the percept of hesitation.

This assumption is strengthened by the subjects’ reaction to the other two features
investigated. Both intonation and creaky voice have the capacity to signal an upcoming
boundary so that they are more likely to facilitate the detection of a hesitation in a phrase-
internal position, where a boundary is unexpected, than between two grammatical phrases.
This dependence on syntax is not unexpected: vast numbers of production studies have shown
the strength of prosodic signalling to depend on the strength of the syntactic boundary.
In conclusion, our results indicate that the perception of hesitation is strongly influenced by
deviations from an expected temporal pattern. In addition, different syntactic conditions have
an effect on how much changes in prosodic features like the F0 contour and retardation and
the presence of creaky voice contribute to the perception of hesitation. In view of this, the
modelling of hesitation in speech technology applications should take account of the
supporting roles that F0 and creak can play in achieving a realistic impression of hesitation.
An important step in the modelling of spontaneous speech would be to include predictions
of different degrees of hesitations depending on the utterance structure. To do this, data are
required of the distribution of hesitations, see e.g. Strangert (2004). Our long-term goal is to
build a synthesis model which is able to produce spontaneous speech on the basis of such
data. An even more long-term goal is to include other kinds of disfluencies as well, and to
integrate the model in a conversational dialogue system, cf. Callaway (2003).

Acknowledgements
We thank Jens Edlund, CTT, for designing the test environment, and Thierry Deschamps,
Umeå University, for technical support in performing the experiments. This work was
supported by The Swedish Research Council (VR) and The Swedish Agency for Innovation
Systems (VINNOVA).

References
Callaway, C., 2003. Do we need deep generation of disfluent dialogue? In AAAI Spring
Symposium on Natural Language Generation in Spoken and Written Dialogue, Tech. Rep.
SS-03-07. Menlo Park, CA: AAAI Press.
Carlson, R. & B. Granström, 1997. Speech synthesis. In W.J. Hardcastle & J. Laver (eds.),
The Handbook of Phonetic Science. Oxford: Blackwell Publ., 768-788.
Carlson, R., K. Gustafson & E. Strangert, 2006. Modelling Hesitation for Synthesis of
Spontaneous Speech. Proc. Speech Prosody 2006, Dresden.
Eklund, R., 2004. Disfluency in Swedish human-human and human-machine travel booking
dialogues. Dissertation 882, Linköping Studies in Science and Technology.
Horne, M., J. Frid, B. Lastow, G. Bruce & A. Svensson, 2003. Hesitation disfluencies in
Swedish: Prosodic and segmental correlates. Proc. 15th ICPhS, Barcelona, 2429-2432.
Klatt, D. & L. Klatt, 1990. Analysis, synthesis and perception of voice quality variations
among female and male talkers. JASA 87, 820-857.
Lövgren, T. & J. van Doorn, 2005. Influence of manipulation of short silent pause duration on
speech fluency. Proc. DISS2005, 123-126.
Strangert, E., 2004. Speech chunks in conversation: Syntactic and prosodic aspects. Proc.
Speech Prosody 2004, Nara, 305-308.
Strangert, E. & R. Carlson, 2006. On modelling and synthesis of conversational speech. Proc.
Nordic Prosody IX, 2004, Lund, 255-264.
Sundaram, S. & S. Narayanan, 2003. An empirical text transformation method for
spontaneous speech synthesizers. Proc. Interspeech 2003, Geneva.

F-pattern Analysis of Professional Imitations of “hallå” in three Swedish Dialects

Frantz Clermont and Elisabeth Zetterholm
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
{frantz.clermont|elisabeth.zetterholm}@ling.lu.se

Abstract
We describe preliminary results of an acoustic-phonetic study of voice imitations, which is
ultimately aimed towards developing an explanatory approach to similar-sounding voices.
Such voices are readily obtained by way of imitations, which were elicited by asking an adult-
male, professional imitator to utter two tokens of the Swedish word “hallå” in a telephone-answering situation and in three Swedish dialects (Gothenburg, Stockholm, Skania). Formant-
frequency (F1, F2, F3, F4) patterns were measured at several landmarks of the main phonetic
segments (‘a’, ‘l’, ‘å’), and cross-examined using the imitator’s token-averaged F-pattern and
those obtained by imitation. The final ‘å’-segment seems to carry the bulk of differences
across imitations, and between the imitator’s patterns and those of his imitations. There is
however a notable constancy in F1 and F2 from the ‘a’-segment nearly to the end of the ‘l’-
segment, where the imitator seems to have had fewer degrees of articulatory freedom.

1 Introduction
It is an interesting but all the same challenging fact in forensic voice identification that
certain voices should sound similar (Rose & Duncan, 1995), even though they originate from
different persons with differing vocal-tract structures and speaking habits. It is also a familiar
observation (Zetterholm, 2003) that human listeners can associate an imitated voice with the
imitated person. However, there are no definite explanations for similar-sounding voices, and
thus there is still no definite approach for understanding their confusability. Nor are there any
systematic insights into the degree of success that is achievable in trying to identify an
imitator’s voice from his/her imitations. Some valiant attempts have been made in the past to
characterise the effects of disguise on voice identification by human listeners. More recently,
there have been some useful efforts to evaluate the robustness of speaker identification
systems (Zetterholm et al., 2005). The results are however consistent in that “it is possible to
trick both human listeners and a speaker verification system” (Zetterholm et al., 2005: p. 254),
and that there are still no clear explanations.
Overall, the knowledge landscape around the issue of similarity of voices appears to be
quite sparse, yet this issue is at the core of the problem of voice identification, which has
grown pressing in dealing with forensic-phonetic evaluation of legal and security cases. Our
ultimate objective, therefore, is to use acoustic, articulatory and perceptual manifestations of
imitated voices as pathways for developing a more explanatory approach to similar-sounding
voices than available to date.
The present study describes a preliminary step in the acoustic-phonetic analysis of
imitations of the word “hallå” in three dialects of Swedish. The formant-frequency patterns
obtained are enlightening from a phenomenological and a methodological point of view.

2 Imitations of the Swedish word “hallå” – the speech material


The material gathered thus far consists of auditorily-validated imitations of the Swedish word
“hallå”. An adult-male, professional imitator was asked to first produce the word in his own
usual way. The imitator is a long-term resident of an area close to Gothenburg and, therefore,
his speaking habits are presumed to carry some characteristics of the Gothenburg dialect. He
was asked to also produce imitations of “hallå” in situations such as: (i) answering the
telephone, (ii) signalling arrival at home, and (iii) greeting a long-lost friend, all in 5 Swedish
dialects (Gothenburg, Stockholm, Skania, Småland, Norrland). The 2 tokens obtained for the
first 3 dialects in situation (i) were retained for this preliminary study. The recordings took
place in the anechoic chamber recently built at Lund University. The analogue signals were
sampled at 44 kHz, and then down-sampled by a factor of 4 for formant-frequency analyses.

3 Formant-frequency parameterisation
3.1 Formant-tracking procedure
The voiced region of every waveform was isolated using a spectrographic representation,
concurrently with auditory validation. Formants were estimated using Linear-Prediction (LP)
analyses through Hanning-windowed frames of 30-msec duration, by steps of 10 msecs, and a
pre-emphasis of 0.98. For 25% of the data used for this study, the LP-order had to be
increased to 18 from a default value of 14. For each voiced interval, the LP-analyses yielded a
set of frame-by-frame poles, among which F1, F2, F3 and F4 were estimated using a method
(Clermont, 1992) based on cepstral analysis-by-synthesis and dynamic programming.
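A minimal sketch (Python) of the frame-wise LP step described above: pre-emphasis of 0.98, 30-ms Hanning-windowed frames at 10-ms steps, order-14 autocorrelation LP, and conversion of pole angles into formant-candidate frequencies. The actual F1–F4 selection from these candidates, which uses cepstral analysis-by-synthesis and dynamic programming (Clermont, 1992), is not reproduced here, and the input signal is a placeholder.

import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, order):
    """Autocorrelation-method LP coefficients a[1..order] (prediction s[n] ~ sum a_k * s[n-k])."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

def formant_candidates(frame, fs, order=14):
    a = lp_coefficients(frame * np.hanning(len(frame)), order)
    poles = np.roots(np.concatenate(([1.0], -a)))          # roots of A(z) = 1 - sum a_k z^-k
    poles = poles[np.imag(poles) > 0]                       # keep one pole per conjugate pair
    freqs = np.sort(np.angle(poles) * fs / (2 * np.pi))     # pole angle -> frequency in Hz
    return freqs[freqs > 90]                                # discard near-DC candidates

fs = 11000                                                  # 44 kHz down-sampled by a factor of 4 (section 2)
signal = np.random.randn(fs)                                # placeholder for a voiced interval
pre = np.append(signal[0], signal[1:] - 0.98 * signal[:-1])           # pre-emphasis of 0.98
frame_len, hop = int(0.030 * fs), int(0.010 * fs)                     # 30-ms frames, 10-ms steps
tracks = [formant_candidates(pre[i:i + frame_len], fs)
          for i in range(0, len(pre) - frame_len, hop)]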

3.2 Landmark selection along the time axis


The expectedly-varying durations amongst the “hallå” tokens raise the non-trivial problem of
mapping their F-patterns onto a common time base. We sought a solution to this problem by
looking at the relative durations of the main phonetic segments (‘a’, ‘l’, ‘å’), which were
demarcated manually. The token-averaged durations for imitated and imitator’s segments are
superimposed in Fig. 1, together with the overall mean per segment.

Figure 1. Segmental durations: Mean ratio of ~3 to 1 for ‘a’, ~5 to 1 for ‘å’, relative to ‘l’.

Interestingly, the durations for the imitator’s ‘a’- and ‘å’-segments are closer to those
measured for his Gothenburg imitations, and smaller than those measured for his Skanian and
Stockholm imitations. Fig. 1 also indicates that the medial ‘l’-segment has a duration that is
tightly clustered around 50 msecs and, therefore, it is a suitable reference to which the other
segments can be related. On the average, the duration ratio relative to the ‘l’-segment is about
3 to 1 for ‘a’, and 5 to 1 for ‘å’. A total of 45 landmarks were thus selected such that, if 5 are
arbitrarily allocated for the ‘l’-segment, there are 3 times as many for the ‘a’-segment and 5
times as many for the ‘å’-segment. The method of cubic-spline interpolation was employed to
generate the 45-landmark, F-patterns that are displayed in Fig. 2 and subsequently examined.
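A sketch (Python) of the landmark normalisation under the allocation above: each segment’s formant track is resampled with a cubic spline onto 15, 5 and 25 landmarks for ‘a’, ‘l’ and ‘å’ respectively, giving a 45-point F-pattern. The input tracks are placeholders.

import numpy as np
from scipy.interpolate import CubicSpline

LANDMARKS = {"a": 15, "l": 5, "å": 25}         # 3x, 1x and 5x the 'l' allocation (45 in total)

def resample_segment(times, values, n_points):
    spline = CubicSpline(times, values)
    return spline(np.linspace(times[0], times[-1], n_points))

def landmark_f_pattern(segments):
    """segments: {'a': (times, formant values), 'l': ..., 'å': ...} -> 45-landmark pattern."""
    return np.concatenate([resample_segment(t, v, LANDMARKS[name])
                           for name, (t, v) in segments.items()])

# hypothetical F1 track sampled every 10 ms within each segment
example = {"a": (np.linspace(0.00, 0.15, 16), np.linspace(650, 620, 16)),
           "l": (np.linspace(0.15, 0.20, 6), np.linspace(620, 400, 6)),
           "å": (np.linspace(0.20, 0.45, 26), np.linspace(400, 480, 26))}
pattern = landmark_f_pattern(example)           # shape (45,)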

4 F-pattern analysis
4.1 Inter-token consistency
It is known that F-patterns exhibit some variability because of the measurement method used,
and of one’s inability to replicate sounds in exactly the same way. Consequently, the spread
magnitude about a token-averaged F-pattern should be useful for gauging measurement
consistency, and intrinsic variability to some degree. Table 1 lists spread values that mostly lie
within difference-limens for human perception, and are therefore deemed to be tolerable. The
spread in F3 for the imitator’s “hallå” is relatively large, especially by comparison with his
other formants. However, the top left-hand panel of Fig. 2 does show that there is simply
greater variability in the F3 of his initial ‘a’-segment. Overall, there appear to be no gross
measurement errors that prevent a deeper examination of our F-patterns.

Table 1. Inter-token spreads (=standard deviations in Hz) averaged across all 45 landmarks.
F1 F2 F3 F4
IMITATOR (SELF) 33 68 136 72
STOCKHOLM (STK) 42 68 28 79
GOTHENBURG (GTB) 23 55 71 75
SKANIA (SKN) 34 58 36 50
Mean (spread) with IMITATOR: 32 (8) 62 (7) 68 (49) 69 (13)
Mean (spread) without IMITATOR: 33 (10) 60 (7) 45 (23) 68 (16)

4.2 Overview of F-pattern behaviours


For both the imitator’s “hallå” and his imitations, there is less curvilinearity in the formant
trajectories for the ‘a’- and ‘l’-segments than in those for the final ‘å’-segment, which behaves
consistently like a diphthong. The concavity of the F2-trajectory for the Skanian-like ‘å’-
segment seems to set this dialect apart from the other dialects. Quite noticeably for the ‘a’-
and ‘l’-segments, F1- and F2-trajectories are relatively flatter, and numerically closer to one
another than the higher formants. Interestingly again, the F-patterns for the Gothenburg-like
“hallå” seem to be more aligned with those corresponding to the imitator’s own “hallå”.

Figure 2. Landmark-normalised F-patterns: Imitator & his imitations of 3 Swedish dialects.



4.3 Imitator versus imitations – a quantitative comparison


The ‘a’- and ‘l’-segments examined above seem to retain the strongest signature of the
imitator’s F1- and F2-patterns. To obtain a quantitative verification of this behaviour, we
calculated landmark-by-landmark spreads (Fig. 3) of the F-patterns with all data pooled
together (left panel), and without the Skania-like data (right panel). The left-panel data
highlight a large increase of the spread in F1 and F2 for the final ‘å’-segment, thus confirming
a major contrast with the other dialectal imitations. The persistently smaller spread in F1 and
F2 for the two initial segments raises the hope of being able to detect some invariance in
professional imitations of “hallå”. The relatively larger spreads in F3 and F4 cast some doubt
on these formants’ potency for de-coupling our imitator’s “hallå” from his imitations.

Figure 3. Landmark-by-landmark spreads: (left) all data pooled; (right) Skania-like excluded.

5 Summary and ways ahead


The results of this study are prima facie encouraging, at least for the imitations obtained from
our professional imitator. It is not yet known whether the near-constancy observed through F1
and F2 of the initial segments of “hallå” will be manifest in other situational tokens, and
whether a similar behaviour should be expected with different imitators and phonetic
contexts. We have looked at formant-frequencies one at a time but, as shown by Clermont
(2004) for Australian English “hello”, there are deeper insights to be gained by re-examining
these frequencies systemically. The ways ahead will involve exploring all these possibilities.

Acknowledgements
We express our appreciation to Prof. G. Bruce for his auditory evaluation of the imitations.
We thank Prof. Bruce and Dr D.J. Broad for their support, and the imitator for his efforts.

References
Clermont, F., 1992. Formant-contour parameterisation of vocalic sounds by temporally-
constrained spectral matching. Proc. 4th Australian Int. Conf. Speech Sci. & Tech., 48-53.
Clermont, F., 2004. Inter-speaker scaling of poly-segmental ensembles. Proc. 10th Australian
Int. Conf. Speech Sci. & Tech., 522-527.
Rose, P. & S. Duncan, 1995. Naïve auditory identification and discrimination of similar
sounding voices by familiar listeners. Forensic Linguistics 2, 1-17.
Zetterholm, E., 2003. Voice imitation: A phonetic study of perceptual illusions and acoustic
successes. Dissertation, Lund University.
Zetterholm, E., D. Elenius & M. Blomberg, 2005. A comparison between human perception
and a speaker verification system score of a voice imitation. Proc. 10th Australian Int. Conf.
Speech Sci. & Tech., 393-397.

Describing Swedish-accented English


Una Cunningham
Department of Arts and Languages, Högskolan Dalarna
uca@du.se

Abstract
This paper is a presentation of the project Swedish accents of English which is in its initial
stages. The project attempts to make a phonetic and phonological description of some
varieties of Swedish English, or English spoken in Sweden, depending on the status attributed
to English in Sweden. Here I show some curious results from a study of acoustic correlates of
vowel quality in the English and Swedish of young L1 Swedish speakers.

1 Introduction
1.1 Background
The aim of the proposed project is to document the phonetic features of an emerging variety
of English, i.e. the English spoken by young L1 speakers of Swedish. At a time when the
relative positions of Swedish and English in Sweden are the stuff of Government bills
(Regeringen, 2005), the developing awareness of the role English has as an international
language in Sweden is leading to a rejection of native speaker targets for Swedish speakers of
English. Throughout what Kachru (1992) called the expanding circle, learners of English are
no longer primarily preparing for communication with native speakers of English but with
other non-native speakers. In a recent article, Seidlhofer (2005) called for the systematic study
of the features of English as a lingua franca (ELF), that is communication that does not
involve any native speakers, in order to free ELF from native-speaker norms imposed upon it.
She would prefer to see ELF alongside native speaker varieties rather than constantly being
monitored and compared to them. The point is that there are features of native speaker pronunciation which impede communication, and features of non-native pronunciation which do not disturb it; rather than teaching learners to be as native-like as possible, communication would therefore be optimised by concentrating on the non-native listener rather than the native listener.
Some young people are British-oriented in their pronunciation, either from RP/BBC
English or another accent, others have general American as a clear influence, while another
group is not clearly influenced by any native speaker norm. A full phonetic description of
these accents of English does not as yet exist, and is of interest as a documentation of an
emerging variety of English, at a time when previously upheld targets for the pronunciation of
English by Swedish learners have been abandoned and English is growing in importance
(Phillipson, 1992; Skolverket, 2006).

1.2 Previous studies


The distinction between English as a Foreign Language (EFL) and English as an International
Language (EIL) or English as a Lingua Franca (ELF) is important here. The number of non-
native speakers of English increasingly exceeds the number of native speakers, and the native
speaker norm as the “given and standard measure” (Jenkins, 2004) for English learners must
be questioned. There is a clear distinction between those learners who aspire to sound as
native-like as possible and those who wish to be as widely understood as possible. McArthur
(2003) makes a distinction between English in its own right and English in its global role and
argues that the distinction between English as a second language and English as a foreign
language is becoming less useful, as people in a range of countries, including those in
Scandinavia routinely use the language. Seidlhofer (2005) called for the description of English
as a Lingua Franca (a term rejected by some writers because of the associations of lingua
franca with pidgins and mixed forms of language). Those who use this term usually want to
indicate the same as those who use English as an international language, i.e. a “core” of
English stripped of the less useful features of native speaker varieties, such as weak forms of
function words, typologically unusual sounds such as the interdental fricatives etc. (e.g.
Jenkins, 2004). As corpora of non-native English (both non-native to native and non-native to
non native) are being developed, such as the English as a lingua franca in academic settings
(ELFA) corpus (Mauranen, 2003) and the general Vienna-Oxford International Corpus of
English (VOICE) (Seidlhofer, 2005), this is now possible although few studies have been
made of pronunciation.

2 Methodological thoughts
The project aim is to make a thorough study of some phonetic and phonological features of
Swedish accents of English in two groups of informants. The first group is young adults
(those who are currently at upper secondary school or have left upper secondary education in the past 5 years, and are thus 16-24 years of age). These speakers have not usually received any
pronunciation teaching. The second group of Swedish speakers of English is university
teachers who are over 40 years of age and who have not spent long periods in native English-
speaking environments or studied English at university level, but who do use English
regularly. Although there is a difference in the stability of a learner variety compared to an
established variety of a language (Tarone, 1983), there is certainly a set of features
characteristic of a Swedish accent of English. It should be possible to make interesting
generalisations.
A first step will be to establish the phoneme inventories of the English of each informant.
The acoustic quality of vowels produced in elicited careful speech (reading words in citation
form and texts) as well as in spontaneous speech (dialogues between non-native informants)
will be investigated. Within-speaker variation is of interest here to capture variable
production, as well as between-speaker variation. The realisations of the vowel phonemes of
Swedish-English will be charted and examined. There are hypothesised to be some consonant
phonemes of native varieties of English that are missing from Swedish English (as in any ELF variety, cf. Seidlhofer, 2005) – voiced alveolar and palatoalveolar fricatives and affricates are
candidates. The realisation of the English alveolar consonants will also be closely studied as
will various kinds of allophonic variation such as dark and light /l/, rhoticism, phonotactic
effects, assimilation, vowel reduction, rapid speech phenomena etc.
Flege, Schirru & MacKay (2003) established that the two phonetic systems of Italian-
speaking learners of English interact. Their study showed that L2 vowels are either
assimilated to L1 vowels or dissimilated from them (i.e. are made more different than the
corresponding vowels produced by monolinguals speaking English or Italian), depending on
the usage patterns of the individual learners. A similar phenomenon was seen in bilingual
speakers of Swedish and English (Cunningham, 2004) where there was a dissimilation in
timing. An attempt will be made to detect any similar patterns (i.e. instances of the Swedish
accent having greater dissimilation between categories than native varieties in the speech of
the Swedish-English speakers being studied).
The differences between Swedish and English timing and the way bilingual individuals do
not usually maintain two separate systems for organising the temporal relationship of vowels
and consonants have been the subject of earlier research (Cunningham, 2003). The timing of
Swedish-accented English will be studied in the data collected in this project. The way the
learners deal with post-vocalic voicing and the relationship between vowel quality and vowel
and consonant quantity are particularly interesting as regards their consequences for
comprehensibility, as the perceptive weight of quantity appears to be different in Swedish and
English (c.f. e.g. McAllister, Flege & Piske, 2002). The consequences of the timing solutions
adopted by Swedish speakers of English for their comprehensibility to native speakers of
English, Swedish speakers of English and other non-native speakers of English could be
investigated at a later stage.

3 Early results
Recordings of sixteen young Swedish speakers (12 female, 4 male, with Swedish as their only home and heritage language) have been made. They were all in their first year of upper
secondary education at the time of recording (around 16 years old). Figures 1 and 2 show the
relationships between the first and second formant frequencies for high front vowels in
elicited citation form words for one of these speakers (known as Sara for the purpose of this
study). Sara’s English high front vowels appear to be qualitatively dissimilated while her
Swedish high front vowels are not clearly qualitatively distinguished using the first two
formants. Sara’s English high vowels are apparently generally higher than these Swedish high
vowels. Notice the fronting found for Sara’s English in two instances of /u:/ in the word
choose. This particular word has been pronounced with fronting for other speakers too. Might
this be a case of a feature of Estuary English making its way into the English spoken in
Sweden?
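For readers unfamiliar with the plotting convention used in Figures 1 and 2, the sketch below (Python, with made-up formant values) shows how such an F1/F2 chart is drawn: both axes are reversed so that vowel height and frontness are oriented as in an articulatory vowel diagram.

import matplotlib.pyplot as plt

tokens = {"i:": (2400, 320), "I": (2100, 420), "u:": (1500, 380), "U": (1200, 430)}   # hypothetical (F2, F1) in Hz

fig, ax = plt.subplots()
for label, (f2, f1) in tokens.items():
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1))
ax.set_xlim(2900, 900)                        # reversed F2 axis
ax.set_ylim(1000, 200)                        # reversed F1 axis
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
plt.show()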

[Figure 1: “Sara English” – F1 (Hz, 200–1000) against F2 (Hz, 2900–900), both axes reversed, for the vowels i:, I, u: and U, with reference values for English i and u.]

Figure 1. A 16-year-old female Swedish speaker’s high vowels from a read list of English words. Reference values from Ladefoged’s material http://hctv.humnet.ucla.edu/departments/linguistics/VowelsandConsonants/vowels/chapter3/english.aiff.

[Figure 2: “Sara Swedish” – F1 (Hz, 200–1000) against F2 (Hz, 2900–900), both axes reversed, for the speaker’s Swedish high vowels, with reference values for Swedish i and u.]

Figure 2. Some of the same speaker’s high vowels from a read list of Swedish words. Reference values from Eklund & Traunmüller (1997).

References
Cunningham, U., 2003. Temporal indicators of language dominance in bilingual children.
Proceedings from Fonetik 2003, Phonum 9, Umeå University, 77-80.
Cunningham, U., 2004. Language Dominance in Early and Late Bilinguals. ASLA, Södertörn.
Eklund, I. & H. Traunmüller, 1997. Comparative study of male and female whispered and
phonated versions of the long vowels of Swedish. Phonetica 54, 1-21.
Flege, J.E., C. Schirru & I.R.A. MacKay, 2003. Interaction between the native and second
language phonetic subsystems. Speech Communication 40, 467-491.
Jenkins, J., 2004. Research in teaching pronunciation and intonation. Annual Review of
Applied Linguistics 24, 109-125.
Kachru, B. (ed.), 1992. The Other Tongue (2nd edition). Urbana and Chicago: University of
Illinois Press.
Mauranen, A., 2003. Academic English as lingua franca – a corpus approach. TESOL Quarterly 37, 513-527.
McAllister, R., J.E. Flege & T. Piske, 2002. The influence of L1 on the acquisition of Swedish
quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30(2),
229-258.
McArthur, T., 2003. World English, Euro-English, Nordic English? English Today 73(19),
54-58.
Phillipson, R., 1992. Linguistic Imperialism. Oxford: Oxford Univ. Press.
Regeringen, 2005. Bästa språket – en samlad svensk språkpolitik. Prop. 2005/06:2.
Seidlhofer, B., 2005. English as a lingua franca. ELT Journal 59(4), 339-341.
Skolverket, 2006. Förslag till kursplan.
Tarone, E., 1983. On the variability of interlanguage systems. Applied Linguistics 4, 142-163.

Quantification of Speech Rhythm in Norwegian as a Second Language

Wim A. van Dommelen
Department of Language and Communication Studies, NTNU
wim.van.dommelen@hf.ntnu.no

Abstract
This paper looks into the question of how to quantify rhythm in Norwegian spoken as a
second language by speakers from different language backgrounds. The speech material for
this study was taken from existing recordings from the Language Encounters project and
consisted of sentences read by natives and speakers from six different L1s. Measurements of
syllable durations and speech rate were made. Seven different metrics were calculated and
used in a discriminant analysis. For the five utterances investigated, statistical classification
was to a large degree in congruence with L1 group membership. The results therefore suggest
that L2 productions differed rhythmically from Norwegian spoken as L1.

1 Introduction
During the last few years a number of attempts have been made to classify languages
according to rhythmical categories using various metrics. To investigate rhythm characteris-
tics of eight languages, Ramus, Nespor & Mehler (1999) calculated the average proportion of
vocalic intervals and standard deviation of vocalic and consonantal intervals over sentences.
Though their metrics appeared to reflect aspects of rhythmic structure, considerable overlap was also found. Grabe's Pairwise Variability Index (PVI; see section 2.2) is a measure of
differences in vowel duration between successive syllables and has been used by, e.g., Grabe
& Low (2002), Ramus (2002) and Stockmal, Markus & Bond (2005). In order to achieve
more reliable results Barry, Andreeva, Russo, Dimitrova & Kostadinova (2003) proposed to
extend existing PVI measures by taking consonant and vowel intervals together. The present
paper takes an exploratory look into the question of how to quantify speech rhythm in
Norwegian spoken by second language users. Seven metrics will be used, five of which are based on syllable durations. Two metrics are related to speech rate, and the last one is Grabe's
normalized Pairwise Variability Index with syllable duration as a measure.

2 Method
2.1 Speech material
The speech material used for this study was chosen from existing recordings made for the
Language Encounters project. These recordings were made in the department's sound-
insulated studio and stored with a sampling frequency of 44.1 kHz. Five different sentences
were selected consisting of 8, 10, 11, 11, and 15 syllables, respectively. There were six second
language speaker groups with the following L1s (number of speakers in parentheses): Chinese
(7), English (4), French (6), German (4), Persian (6) and Russian (4). Six native speakers of
Norwegian served as a control group. The total number of sentences investigated was thus 37
x 5= 185.

2.2 Segmentation and definition of metrics


The 185 utterances were segmented into syllables and labeled using Praat (Boersma &
Weenink, 2006). Syllabification of an acoustic signal is not a trivial task. It was guided
primarily by the consideration to achieve consistent results across speakers and utterances. In
words containing a sequence of a long vowel and a short consonant in a context like V:CV
(e.g., fine [nice]) the boundary was placed before the consonant (achieving fi-ne), after a short
vowel plus long consonant, as in minne (memory), after the consonant (minn-e). Only when the intervocalic consonant was a voiceless plosive was the boundary always placed after the consonant (e.g. in mat-et [fed]).
To compare temporal structure of the L2 with the L1 utterances, seven different types of
metrics were defined. In all cases calculations were related to each of the seven groups of
speakers as a whole. The first metric was syllable duration averaged over all syllables of each
utterance, yielding one mean syllable duration for each sentence and each speaker group.
Second, the standard deviation for the syllable durations pooled over the speakers of each
group was calculated for each of the single utterances' syllables. The mean standard deviation
was then taken as the second metric, thus expressing mean variation of syllable durations
across each utterance.
For the definition of the third and fourth metric let us look at Figure 1. In this figure, closed
symbols depict mean syllable durations in the sentence To barn matet de tamme dyrene (Two
children fed the tame animals) produced by six native speakers. The syllables are ranked
according to their increasing durations. Similarly, the open symbols give the durations for the
same syllables produced by the group of seven Chinese speakers. Note that the order of the
syllables is the same as for the Norwegian natives. Indicated are regression lines fitted to the
two groups of data points. The correlation coefficient for the relation between syllable
duration and the rank number of the syllables as defined by the Norwegian reference is the
third metric in this study (for the Chinese speaker group presented in the figure r= 0.541).
Further, the slope of the regression line was taken as the fourth metric (here: 18.7). The
vertical bars in Figure 1 indicate ± 1 standard deviation. The mean of the ten standard
deviation values represents the second metric defined above (for Norwegian 27 ms; for
Chinese 63 ms).
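A sketch (Python, with placeholder durations) of how metrics 3 and 4 can be computed: the L2 syllable durations are ordered by the duration rank of the corresponding syllables in the native productions, and the correlation coefficient and the slope of a regression line against that rank are taken as the two measures (the text above reports r = 0.541 and a slope of 18.7 for the Chinese group).

import numpy as np
from scipy.stats import linregress

# token-averaged syllable durations (ms) per syllable, same syllable order in both arrays
native_means = np.array([300, 140, 170, 110, 230, 130, 150, 340, 260, 200])   # hypothetical
l2_means = np.array([390, 200, 280, 190, 310, 260, 240, 360, 300, 250])       # hypothetical

order = np.argsort(native_means)              # syllable ranking defined by the L1 reference
ranks = np.arange(1, len(native_means) + 1)
fit = linregress(ranks, l2_means[order])

metric3_correlation = fit.rvalue              # metric 3: correlation with the native ranking
metric4_slope = fit.slope                     # metric 4: slope of the regression line (ms per rank step)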
[Figure 1: syllable duration (ms, 0–600) against syllable rank (1–10) for the syllables e, re, de, et, ne, dy, to, tamm, mat, barn, with regression lines for the Norwegian natives and the Chinese group.]

Figure 1. Mean duration of syllables in a Norwegian utterance ranked according to increasing duration for six native speakers (closed symbols with regression line). Open symbols indicate mean durations for a group of seven Chinese subjects with syllable rank as for the L1 speakers. Vertical bars indicate ± 1 standard deviation.

As metric number five speech rate was chosen, calculated as the number of (actually
produced) phonemes per second. The standard deviation belonging to the mean number of
phonemes served as the sixth metric. In both cases, there was one single value per utterance
and speaker group. Finally, the seventh metric was the normalized Pairwise Variability Index
(nPVI) as used by Grabe & Low (2002):

(1)   nPVI = 100 × [ Σ_{k=1}^{m−1} |d_k − d_{k+1}| / ((d_k + d_{k+1}) / 2) ] / (m − 1)

In this calculation the difference of the durations (d) of two successive syllables is divided by
the mean duration of the two syllables. This is done for all (m-1) successive syllable pairs in
an utterance (m= the number of syllables). Finally, by dividing the sum of the (m-1) amounts
by (m-1) a mean normalized difference is calculated and expressed as percent.
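A direct transcription of formula (1) in Python, with placeholder syllable durations:

import numpy as np

def npvi(durations):
    """Normalized Pairwise Variability Index over syllable durations, in percent."""
    d = np.asarray(durations, dtype=float)
    pairwise = np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2)   # |d_k - d_{k+1}| / mean of the pair
    return 100 * pairwise.mean()                             # averaged over the m-1 pairs

print(npvi([180, 240, 150, 300, 210]))                       # hypothetical syllable durations in ms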

3 Results
3.1 Mean syllable duration
Since the main temporal unit under scrutiny is the syllable, let us first see whether and to what
extent the various speaker groups produced different syllable durations. As can be seen from
Table 1, mean syllable durations vary substantially. Shortest durations were found for the
natives (178 ms), while the subjects with a Chinese L1 produced the longest syllables
(285 ms). The other groups have values that are more native-like, in particular the German
speakers with a mean of 200 ms. For all speaker groups the standard deviations are quite
large, which is due to both inter-speaker variation and the inclusion of all the different types
of syllables. (Note that the standard deviation described here is different from the second
metric; see 2.2.) According to a one-way analysis of variance, the overall effect of speaker
group on syllable duration is statistically significant (F(6, 2029)= 40.322; p< .0001).
Calculation of a Games-Howell post-hoc analysis resulted in the following homogeneous
subsets (level of significance p= 0.05): (Chinese); (English, French, German, Russian);
(French, English, Persian, Russian); (German, Norwegian, English, Russian); (Persian,
French); (Russian, English, French, German); (Norwegian, German). It is thus obvious that
syllable durations overlap considerably and do not really distinguish the speaker groups.

Table 1. Mean syllable durations and standard deviations in ms for six groups of L2 speakers
and a Norwegian control group. Means are across five utterances and all speakers in the
respective speaker groups (see 2.2).
Chinese English French German Persian Russian Norwegian
mean 285 227 238 200 255 224 178
sd 115 107 98 91 102 111 84
n 387 220 330 220 329 220 330

3.2 Discriminant analysis


In order to investigate whether rhythmical differences between utterances from the different
speaker groups can be captured by the seven metrics defined above, a discriminant analysis
was performed. It appears from the results that in the majority of cases the L2-produced
utterances were correctly classified (Table 2). The overall correct classification rate amounts
to 94.3%. Only one utterance produced by the Chinese speaker group was classified as Persian
and one utterance from the French group was confused with the category Russian.

Table 2. Predicted L1 group membership (percent correct) of five utterances according to a discriminant analysis using seven metrics (see section 2.2).
Predicted L1 group membership
L1 group Chinese English French German Persian Russian Norwegian
Chinese 80 0 0 0 20 0 0
English 0 100 0 0 0 0 0
French 0 0 80 0 0 20 0
German 0 0 0 100 0 0 0
Persian 0 0 0 0 100 0 0
Russian 0 0 0 0 0 100 0
Norwegian 0 0 0 0 0 0 100

The results of the discriminant analysis further showed that three of the six discriminant
functions reached statistical significance, cumulatively explaining 96.4% of the variance. For
the first function, the metrics with most discriminatory power were slope (metric 4), speech
rate (metric 5) and mean syllable duration (metric 1). The second discriminant function also had slope and speech rate as important variables, but additionally the standard deviations for speech rate (metric 6) and for syllable duration (metric 2), and nPVI (metric 7). Finally, of highest
importance for the third function were metrics 5, 3 (correlation coefficient), 4, and 7, in that
order.
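A sketch (Python) of the classification step: each utterance is represented by its seven metric values and classified into an L1 group with linear discriminant analysis. The feature matrix below is random placeholder data, and the exact discriminant variant and validation scheme used in the paper may differ.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
groups = ["Chinese", "English", "French", "German", "Persian", "Russian", "Norwegian"]
X = rng.normal(size=(35, 7))                  # 7 groups x 5 utterances, 7 metrics per utterance
y = np.repeat(groups, 5)

lda = LinearDiscriminantAnalysis()
predicted = lda.fit(X, y).predict(X)
print("correct classification rate:", (predicted == y).mean())
print("variance explained per discriminant function:", lda.explained_variance_ratio_)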

4 Conclusion
The present results suggest that the utterances spoken by the second language users differed in
rhythmical structure from those produced by the native speakers. It was shown that it is
possible to quantify rhythm using direct and indirect measures. Though the statistical analysis
yielded promising results, it should be kept in mind that the number of utterances investigated
was relatively small. Therefore, more research will be needed to confirm the preliminary
results and to refine the present approach.

Acknowledgements
This research is supported by the Research Council of Norway (NFR) through grant
158458/530 to the project Språkmøter. I would like to thank Rein Ove Sikveland for the
segmentation of the speech material.

References
Barry, W.J., B. Andreeva, M. Russo, S. Dimitrova & T. Kostadinova, 2003. Do rhythm
measures tell us anything about language type? Proceedings 15th ICPhS, Barcelona, 2693-
2696.
Boersma, P. & D. Weenink, 2006. Praat: doing phonetics by computer (Version 4.4.11)
[Computer program]. Retrieved February 23, 2006, from http://www.praat.org/.
Grabe, E. & E.L. Low, 2002. Durational variability in speech and the rhythm class hypothesis.
In C. Gussenhoven & N. Warner (eds.), Laboratory Phonology 7. Berlin: Mouton, 515-546.
Ramus, F., 2002. Acoustic correlates of linguistic rhythm: Perspectives. Proceedings Speech
Prosody 2002, Aix-en-Provence, 115-120.
Ramus, F., M. Nespor & J. Mehler, 1999. Correlates of linguistic rhythm in the speech signal.
Cognition 73, 265-292.
Stockmal, V., D. Markus & D. Bond, 2005. Measures of native and non-native rhythm in a
quantity language. Language and Speech 48, 55-63.

/nailon/ – Online Analysis of Prosody


Jens Edlund and Mattias Heldner
Department of Speech, Music and Hearing, KTH, Stockholm
{edlund|mattias}@speech.kth.se

Abstract
This paper presents /nailon/ – a software package for online real-time prosodic analysis
that captures a number of prosodic features relevant for interaction control in spoken
dialogue systems. The current implementation captures silence durations; voicing, intensity,
and pitch; pseudo-syllable durations; and intonation patterns. The paper provides detailed
information on how this is achieved.

1 Introduction
All spoken dialogue systems, no matter what flavour they come in, need some kind of
interaction control capabilities in order to identify places where it is legitimate to begin to talk
to a human interlocutor, as well as to avoid interrupting the user. Most current systems rely
exclusively on silence duration thresholds for making such interaction control decisions, with
thresholds typically ranging from 500 to 2000 ms (e.g. Ferrer, Shriberg & Stolcke, 2002).
Such an approach has obvious drawbacks. Users generally have to wait longer for responses
than in human-human interactions, but at the same time they run the risk of being interrupted
by the system. This is where /nailon/ – our software for online analysis of prosody and the
main focus of this paper – enters the picture.

2 Design criteria for practical applications


In order to use prosody in practical applications, the information needs to be available to the
system, which places special requirements on the analyses. First of all, in order to be useful in
live situations, all processing must be performed automatically, in real-time and deliver its
results with minimal latency (cf. Shriberg & Stolcke, 2004). Furthermore, the analyses must
be online in the sense of relying on past and present information only, and cannot depend on
any right context or look-ahead. There are other technical requirements: the analyses should
be sufficiently general to work for many speakers and many domains, and should be
predictable and constant in terms of memory use, processor use, and latency. Finally,
although neither a strict theoretical nor a technical requirement, it is highly desirable to use
concepts that are relevant to humans. In the case of prosody, measurements should be made
on psychoacoustic or perceptually relevant scales.

3 /nailon/
The prosodic analysis software /nailon/ was built to meet the requirements and to capture
silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation
patterns. It implements high-level methods accessible through Tcl/Tk, and the low-level
audio processing is handled by the Snack sound toolkit, with pitch-tracking based on the
ESPS tool get_f0. /nailon/ differs from Snack in that its analyses are incremental with
relatively small footprints and can be used for online analyses. The implementation is real-
time in the sense that it performs in real time, with small and constant latency, on a standard
PC. It is a key feature that the processing is online – in fact, /nailon/ is a phonetic anagram
of online. On the acoustic level, this goes well with human circumstances as humans rarely
need acoustic right context to make decisions about segmentation. The requirements on
memory and processor usage are met by using incremental algorithms, resulting in a system
with a small and constant footprint and flexible processor usage. The generality requirements
are met by using online normalisation and by avoiding algorithms relying on ASR. The
analysis is in some ways similar to that used by Ward & Tsukahara (2000), and is performed
in several consecutive steps. Each step is described in detail below.

3.1 Audio acquisition


The audio signal is acquired through standard Snack object methods from any audio device.
Each new frame is pushed onto a fixed-length buffer of predetermined size, henceforth the
current buffer. The buffer size is a multiple of the processing unit size. Note that the processing unit
size is not the inverse of the sampling frequency (which defaults to 16 kHz, giving a sample period
of 1/16 ms); rather, it should be larger by orders of magnitude to ensure smooth processing. The default
processing unit size is 10 ms, and the default current buffer size is 40 such units, or 400 ms.
The current buffer, then, is in effect a moving window with a length of less than half a
second. As far as the processing goes, sound that is pushed out on the left side of the buffer is
lost, as the Snack object used for acquisition is continuously truncated. The current buffer is
updated every time sufficient sound to fill another processing unit has been acquired –
100 times per second given the default settings.
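
As an illustration of this buffering scheme, the following minimal Python sketch (not /nailon/ itself, which is implemented in Tcl/Tk on top of Snack) maintains a current buffer of 40 processing units of 10 ms each and reports how many complete units each batch of incoming samples adds; the class and method names and the 16 kHz default are assumptions made for the example.

from collections import deque

SAMPLE_RATE = 16000                                 # assumed default sampling frequency (Hz)
UNIT_MS = 10                                        # processing unit size
UNIT_SAMPLES = SAMPLE_RATE * UNIT_MS // 1000        # 160 samples per processing unit
BUFFER_UNITS = 40                                   # current buffer = 40 units = 400 ms

class CurrentBuffer:
    """Moving window over the most recent audio, advanced one processing unit at a time."""
    def __init__(self):
        self.units = deque(maxlen=BUFFER_UNITS)     # the oldest unit is dropped automatically
        self.pending = []                           # samples not yet forming a complete unit

    def push_samples(self, samples):
        """Append raw samples; return the number of complete processing units added."""
        self.pending.extend(samples)
        added = 0
        while len(self.pending) >= UNIT_SAMPLES:
            self.units.append(self.pending[:UNIT_SAMPLES])
            self.pending = self.pending[UNIT_SAMPLES:]
            added += 1                              # 100 such updates per second at the defaults
        return added

With these default values the window spans 400 ms, matching the figures given above.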

3.2 Preprocessing
In many cases, the online requirement makes it impractical or impossible to use filters directly
on the Snack sound object used for acquisition. Instead, /nailon/ provides access to the raw
current audio buffer, so that filters can be applied to it before any other processing takes
place. Filters are applied immediately before each get_f0 extraction (see the next section).
Using filters in this manner causes /nailon/ to duplicate the current audio buffer in order to
have a raw, unfiltered copy of the buffer available at all times.

3.3 Voicing, pitch, and intensity extraction


Voicing, pitch, and intensity are extracted from the current buffer using the Snack/ESPS
get_f0 function. This process is made incremental by repeating it over the current buffer as
the buffer is updated. The rate at which extraction takes place is managed externally, which
facilitates robust handling of varying processor load caused by other processes. In an ideal
situation, the update takes place every time a new processing unit has been pushed onto the
current buffer, in which case only the get_f0 results for the very last processing unit of the
buffer are used. If this is not possible due to processor load, then a variable number N of
processing units will have been added to the buffer since the last F0 extraction took place, and
the last N results from get_f0 will be used, where N is a number smaller than the length of the
current buffer in processing units. In this case, we introduce a latency of N processing units to
the processing at this stage. The /nailon/ configuration permits a maximum update rate to be
given in order to put a cap on the processing requirements of the analysis. The default setting is
to process every time a single processing unit has been added, which provides smooth
processing on a regular PC at a negligible latency. Each time a get_f0 extraction is performed,
/nailon/ raises an event for each of the new get_f0 results produced by the extraction, in
sequence. These events, called ticks, trigger each of the following processing steps.
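
The update logic can be summarised in the following hypothetical Python sketch: an externally scheduled update runs a pitch/intensity extractor (standing in for the Snack/ESPS get_f0 call) over the current buffer and raises one tick per processing unit added since the last update; the function and callback names are invented for the example.

def update(buffer_units, n_new_units, extract, on_tick):
    """Re-run the extractor over the whole current buffer, but only emit results
    for the processing units added since the last update."""
    results = extract(buffer_units)        # one (voicing, f0, intensity) triple per unit
    n = min(n_new_units, len(results))     # N must stay smaller than the buffer length
    if n > 0:
        for triple in results[-n:]:        # keep only the last N results, in order
            on_tick(triple)                # each tick triggers the subsequent processing steps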

3.4 Filtering
Each tick triggers a series of event driven processing steps. These steps are generally optional
and can be disabled to save processing time. The steps described here are the ones used by
default. The first step is a filter containing a number of reality checks. Pitch and intensity are
checked against preset minimum and maximum thresholds, and set to an undefined value if
they fail to meet these. Similarly, if voicing was detected, this is removed if the pitch is out of
bounds. Correction for octave errors is planned to go here as well, but is not currently
implemented. Note that removing values at this stage does not put an end to further
processing – consecutive processes may continue by extrapolation or other means. Median
filtering can be applied to the pitch and intensity data. If this is done, a delay of half the total
time of the number of processing units used in the filtering is introduced at this point. By
default, a median filter of seven processing units is used. In effect, this causes all consecutive
processes to focus on events that took place 3 processing units back, causing a delay of less
than 40 ms. On the other hand, the filter makes the analysis more robust. Finally, the resulting
pitch and intensity values are transformed into semitones and dB, respectively.
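
The reality checks, median filtering and unit conversions could look roughly as follows in Python; the pitch bounds, reference values and assembled function names are assumptions for the sketch and are not taken from the /nailon/ source, while the filter width of seven units follows the description above.

import math
from collections import deque

F0_MIN, F0_MAX = 50.0, 500.0               # illustrative pitch bounds, not /nailon/'s defaults

def reality_check(f0, intensity, voiced):
    """Set out-of-bounds pitch to undefined and drop voicing when the pitch is out of bounds."""
    if f0 is not None and not (F0_MIN <= f0 <= F0_MAX):
        f0, voiced = None, False
    return f0, intensity, voiced

class MedianFilter:
    """Running median over the last seven processing units (a delay of three units, < 40 ms)."""
    def __init__(self, width=7):
        self.window = deque(maxlen=width)
    def __call__(self, value):
        self.window.append(value)
        defined = sorted(v for v in self.window if v is not None)
        return defined[len(defined) // 2] if defined else None

def to_semitones(f0_hz, ref_hz=100.0):
    """F0 in semitones relative to an arbitrary reference frequency."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def to_db(power, ref=1.0):
    """Linear intensity (power) in dB relative to an arbitrary reference."""
    return 10.0 * math.log10(power / ref)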

3.5 Range normalisation of F0 and intensity


/nailon/ implements algorithms for calculation of incremental means and standard
deviations. Each new processing unit causes an update in mean and standard deviation of both
pitch and intensity, provided that it was judged to contain voiced speech by the previous
processing stage. The dynamic mean and standard deviation values are used as a model for
normalising and categorising new values. The stability of the model is tracked by determining
whether the standard deviation is generally decreasing. Informal studies show that the model
stabilises after less than 20 seconds of speech has been processed, given a single speaker in a
stable sound environment. Currently, /nailon/ may either cease updating means and
standard deviation when stability is reached, or continue updating them indefinitely, with
ever-decreasing likelihood that they will change. A possibility to reset the model is also
available. A decaying algorithm which will permit us to fine-tune how stable or dynamic the
normalisation should be has been designed, but has yet to be implemented. The mean and
standard deviation are used to normalise the values for pitch and intensity with regard to the
preceding speech by expressing them as the distance from the mean expressed in standard
deviations.
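
The incremental statistics can be obtained with Welford's online algorithm; the paper does not name the algorithm actually used, so the Python sketch below is only one standard way of computing running means and standard deviations together with the normalisation described above.

class OnlineNormaliser:
    """Running mean and standard deviation (Welford's algorithm) for pitch or intensity,
    used to express new values as distances from the mean in standard deviations."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        """Update the model with one value from a voiced processing unit."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

    def normalise(self, x):
        """Distance from the running mean, expressed in standard deviations."""
        return (x - self.mean) / self.std if self.std > 0 else 0.0

    def reset(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0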

3.6 Silence detection


Many of the analyses we have used /nailon/ for to date are refinements of or additions to
speech/silence decisions. For this reason a simplistic speech activity detection (SAD) is
implemented. Note, however, that /nailon/ would work equally well or better together with
an external SAD. /nailon/ uses a simple intensity threshold which is recalculated
continuously and is defined as the valley following the first peak in an intensity histogram.
/nailon/ signals a change from silence to speech whenever the threshold is exceeded for a
configurable number of consecutive processing units, and vice versa. The default number is
30, resulting in a latency of 300 ms for speech/silence decisions. Informal tests show no
decrease in performance if this number is lowered to 20, but it should be noted that the system
has only been used on sound with a very good signal-to-noise ratio.
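
A Python sketch of the two ingredients is given below, under the assumption that intensity values arrive once per 10 ms processing unit; since the way /nailon/ locates the valley in the histogram is not specified, the peak and valley search here is only one plausible reading.

import numpy as np

def histogram_threshold(intensities_db, bins=50):
    """Speech/silence threshold taken as the valley following the first peak
    of an intensity histogram (one plausible implementation of the idea)."""
    counts, edges = np.histogram(intensities_db, bins=bins)
    peak = next((i for i in range(1, bins - 1)
                 if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]),
                int(np.argmax(counts)))
    valley = peak + int(np.argmin(counts[peak:]))            # lowest bin after the first peak
    return 0.5 * (edges[valley] + edges[valley + 1])

class SpeechActivityDetector:
    """Signal a state change after N consecutive units on the other side of the threshold."""
    def __init__(self, threshold_db, hangover_units=30):      # 30 units = 300 ms latency
        self.threshold = threshold_db
        self.hangover = hangover_units
        self.speaking = False
        self.run = 0

    def __call__(self, intensity_db):
        disagrees = (intensity_db > self.threshold) != self.speaking
        self.run = self.run + 1 if disagrees else 0
        if self.run >= self.hangover:
            self.speaking, self.run = not self.speaking, 0
            return 'speech' if self.speaking else 'silence'    # report the state change
        return None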

3.7 Psyllabification
/nailon/ keeps a copy of pitch, intensity and voicing information for the last seen
consecutive stretch of speech at all times. Whenever silence is encountered, the intensity
values of the record are searched backwards (last processing unit first) for a convex hull
(loosely based on Mermelstein, 1975) contained in it. A hull in the intensity values of speech
is assumed to correspond roughly to a syllable, thus providing a pseudo-syllabification, or
psyllabification. By searching backwards, the hull that occurred last is found first. Currently,
processing ceases at this point, since only the hulls directly preceding silence have been of
interest to us so far. A convex hull in /nailon/ is defined as a stretch of consecutive value
triplets ordered chronologically, where the centre value is always above or on a line drawn
between the first and the last value. As this definition is very sensitive to noisy data, it is
relaxed by allowing a limited number of values to drop below the line between first and last
value as long as the area between that line and the actual values is less than a preset threshold.
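
One possible reading of this definition is sketched below in Python: the trailing intensity stretch is extended backwards for as long as its values stay on or above the straight line between its first and last value, tolerating dips whose summed area remains below a threshold. The area limit and the exact interpretation of the triplet condition are assumptions, since the paper leaves both open.

def trailing_hull_start(intensity, area_limit=3.0):
    """Search the intensity record of the last stretch of speech backwards
    (last processing unit first) and return the start index of the hull that
    directly precedes the silence. area_limit is a hypothetical default."""
    end = len(intensity) - 1
    best_start = end
    for start in range(end - 1, -1, -1):        # extend the candidate stretch to the left
        first, last = intensity[start], intensity[end]
        span = end - start
        dip_area = 0.0                          # area of values falling below the line
        for k in range(start + 1, end):
            line = first + (last - first) * (k - start) / span
            if intensity[k] < line:
                dip_area += line - intensity[k]
        if dip_area > area_limit:
            break                               # relaxation exhausted: stop extending
        best_start = start
    return best_start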

3.8 Classification
The normalised pitch, intensity, and voicing data extracted by /nailon/ over a psyllable are
intended for classification of intonation patterns. Each silence-preceding hull is classified into
HIGH, MID, or LOW depending on whether the pitch value is in the upper, mid or lower third of
the speaker’s F0 range described by mean and standard deviation, and into RISE, FALL, or
LEVEL depending on the shape of the intonation pattern. Previous work has shown that the
prosodic information provided by /nailon/ can be used to improve the interaction control in
spoken human-computer dialogue compared to systems relying exclusively on silence
duration thresholds (Edlund & Heldner, 2005).
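
A minimal Python sketch of this classification step follows; the thresholds at plus or minus one standard deviation for the range thirds, and the slope limit for rises and falls, are illustrative assumptions rather than /nailon/'s actual settings.

def classify_hull(pitch_z, slope_st_per_s, slope_limit=1.0):
    """Classify a silence-preceding hull from its normalised pitch (distance from the
    speaker's mean in standard deviations) and its F0 slope in semitones per second."""
    if pitch_z > 1.0:
        level = 'HIGH'                 # upper third of the speaker's F0 range
    elif pitch_z < -1.0:
        level = 'LOW'                  # lower third
    else:
        level = 'MID'
    if slope_st_per_s > slope_limit:
        shape = 'RISE'
    elif slope_st_per_s < -slope_limit:
        shape = 'FALL'
    else:
        shape = 'LEVEL'
    return level, shape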

4 Discussion
In this paper, we have presented /nailon/, an online, real-time software package for prosodic
analysis capturing a number of prosodic features liable to be relevant for interaction control.
Future work will include further development of /nailon/ in terms of improving existing
algorithms – in particular the intonation pattern classification – as well as adding new
prosodic features. For example, we are considering evaluating the duration of psyllables as an
estimate of final lengthening or speaking rate effects, and using intensity measures to capture
the different qualities of silent pauses resulting from different vocal tract configurations
(Local & Kelly, 1986).

Acknowledgements
This work was carried out within the CHIL project. CHIL is an Integrated Project under the
European Commission’s sixth Framework Program (IP-506909).

References
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62, 215-
226.
Ferrer, L., E. Shriberg & A. Stolcke, 2002. Is the speaker done yet? Faster and more accurate
end-of-utterance detection using prosody in human-computer dialog. Proceedings ICSLP
2002, Denver, 2061-2064.
Local, J.K. & J. Kelly, 1986. Projection and ‘silences’: Notes on phonetic and conversational
structure. Human Studies 9, 185-204.
Mermelstein, P., 1975. Automatic segmentation of speech into syllabic units. Journal of the
Acoustical Society of America 58, 880-883.
Shriberg, E. & A. Stolcke, 2004. Direct Modeling of Prosody: An Overview of Applications
in Automatic Speech Processing. Proceedings Speech Prosody 2004, Nara, 575-582.
Ward, N. & W. Tsukahara, 2000. Prosodic features which cue back-channel responses in
English and Japanese. Journal of Pragmatics 32, 1177-1207.

Feedback from Real & Virtual Language Teachers


Olov Engwall
Centre for Speech Technology, KTH
engwall@kth.se

Abstract
Virtual tutors, animated talking heads giving the student computerized training of a foreign
language, may be a very important tool in language learning, provided that the feedback
given to the student is pedagogically sound and effective. In order to set up criteria for good
feedback from a virtual tutor, human language teacher feedback has been explored through
interviews with teachers and students, and classroom observations. The criteria are presented
together with an implementation of some of them in the articulation tutor ARTUR.

1 Introduction
Computer assisted pronunciation training (CAPT) may contribute significantly to second
language learning, as it gives the students access to private training sessions, without time
constraints or the embarrassment of making errors in front of others. The success of CAPT is
nevertheless still limited. One reason is that the detection of mispronunciations is error-prone
and that this leads to confusing feedback, but Neri et al. (2002) argue that successful CAPT is
already possible, as the main flaw lies in the lack of pedagogy in existing CAPT software
rather than in technological shortcomings. They conclude that if only the learners’ needs,
rather than technological possibilities, are put into focus during system development,
pedagogically sound CAPT could be created with available technology.
One attempt to answer this pedagogical need is to create virtual tutors, computer programs
where talking head models interact as human language teachers. An example of this is
ARTUR – the ARticulation TUtoR (Bälter et al., 2005), who gives detailed audiovisual
instructions and articulatory feedback. Refer to www.speech.kth.se/multimodal/ARTUR for a
video presenting the project. In such a virtual tutor system it becomes important not only to
improve the pedagogy of the given feedback, but to do it in such a way that it resembles
human feedback, in order to benefit from the social process of learning.
To test the usability of the system at an early stage, we are conducting Wizard of Oz
studies, in which a human judge detects the mispronunciations, diagnoses the cause and
chooses what feedback ARTUR should give from a set of pre-generated audiovisual
instructions (Bälter et al., 2005). The children practicing with ARTUR did indeed like it, but
the feedback was sometimes inadequate, e.g. when the child repeated the same error several
times; when the error was of the same type as before, but the pronunciation had been
improved; or when the student started to lose motivation because the virtual tutor's feedback
was too detailed. One conclusion was hence that more varied feedback was needed for it to be
effective. The aim of this study is to investigate how the feedback of the virtual tutor
could be improved by studying feedback strategies of human language teachers in
pronunciation training and assess which of them could be used in ARTUR. Interviews with
language teachers and students, and classroom observations were used to explore when
feedback should be given, how to indicate an error, which errors should be corrected, and how
to promote student motivation.

2 Feedback in pronunciation training


Lyster & Ranta (1997) classified feedback given by language teachers as
1. Explicit correction: the teacher clearly states that what the student said was incorrect and
gives the correct form, e.g. as “You should say: ...”
2. Recasts: the teacher reformulates the student's utterance, removing the error.
3. Repetition: the teacher repeats the student utterance with the error using the intonation to
indicate the error. Repetitions may also be used as positive feedback on a correct utterance.
4. Clarification requests: urging the student to reformulate the utterance.
5. Metalinguistic feedback: information or questions about an error used to make the students
reflect upon and find the error themselves using the provided information.
6. Elicitation: encourage students to provide the correct pronunciation, by open-ended
questions or fill-in-the-gap utterances.
Recasts were by far the most common type, but learners often perceive recasts as another way
to say the same thing, rather than a correction (Mackey & Philip, 1998). Carroll & Swain
(1993) found that all groups receiving feedback, explicit or implicit, improved significantly
more than the control group, but the group given explicit feedback outperformed the others.
As explicit feedback may be intrusive and affect student self-confidence if given too
frequently, it is however not evident that it should always be used.

3 Data collection
Six language teachers participated in the study, four in a focus group and two in individual
interviews using a semi-structured protocol (Rubin, 1994) with open-ended questions. Five
students were interviewed, three of them in a focus group and two individually. The teacher
and student groups were intentionally heterogeneous with respect to target language and
student level, in order to capture general pedagogical strategies. Classroom observations were
made in three beginner level courses, where the languages taught were close to, moderately
different from and very different from Swedish, respectively.

4 Results
4.1 When should errors be corrected?
There was a large consensus among teachers and students about the importance of never
interrupting the students' utterances, reading or discussions with feedback, even if it means
that errors are left uncorrected. This strategy was also observed in the classrooms.

4.2 How should errors be corrected?


This section summarizes how the teachers (T) or students (S) described how feedback should
be given, and the feedback observed during classes (O).
1. Recasts were the most common feedback in the classroom and were also advocated by the
students, as they considered that it was often enough to hear the correct pronunciation.
Contrary to the finding by Mackey & Philip (1998) that recasts were not perceived as
corrections, the students tried to repair after recasts (T, S, O).
2. Implicit (e.g. “Sorry?”) and explicit (e.g. “Could you repeat that?”) elicitation for the
student to self-correct was used frequently (O).
3. Increasing feedback. One teacher described a strategy going from minimal implicit
feedback towards more explicit, when required. In the most minimal form, the teacher
indicates that an error was produced by a questioning look or an attention-catching sound,
giving the students the opportunity to identify and self-correct the error. If the student is
unable to repair, a recast would be used. If needed, the recast would be repeated again
(turning it into an explicit correction). The last step would be an explicit explanation of the
difference between the correct and erroneous pronunciation (T).
4. Articulatory instructions. Several teachers thought that formal descriptions and sketches
on place of articulation are of little use, since the students are unaccustomed to thinking
about how to produce different sounds. Some teachers did, however, use articulatory
instructions and one student specifically requested this type of feedback (T, S, O).
5. Sensory feedback, e.g. letting the students place their hands on their neck to feel the
vibration of voiced sounds or in front of the mouth to feel aspiration (T, O).
6. Comparisons to Swedish phonemes, as an approximation or reminder (T, S, O).
7. Metalinguistic explanations, used to reinforce the feedback or to motivate why it is important
to get a particular pronunciation right (T).
8. General recommendations rather than feedback on particular errors, e.g., “You should try
reading aloud by yourself at home”, to encourage additional training (T, O).
9. Contrasting repeat-recast, to illustrate the difference between the student utterance and the
correct or between minimal pairs (T, S).

4.3 Which errors should be corrected?


The teachers ventured several criteria for which errors should be corrected:
1. Comprehensibility: if the utterance could not be correctly understood.
2. Intelligibility: if the utterance could not be understood without effort.
3. Frequency: if the student repeats the same (type of) error several times.
4. Social impact: if the listener gets a negative impression of a speaker making the error.
5. Proficiency: a student with a better overall pronunciation may get corrective feedback on
an error for which a student with a poorer pronunciation does not.
6. Generality: if the error is one that is often made in the L2 by foreign speakers.
7. Personality: a student who appreciates corrections receives more than one who does not.
8. Commonality: an error that is common among native speakers of the L2 language is
regarded as less grave than errors that a native speaker would never make.
9. Exercise focus: feedback is primarily given on the feature targeted by the exercise.

None of the students thought that all errors should be corrected, only the “worst”. When
probed further, the general opinion was that this signified mispronunciations that lead to
misunderstandings or deteriorated communication. Other criteria stated were if the error
affected the listener’s view of the speaker negatively, or if it was a repeated error. Apart from
this, the students thought that it should depend on the student’s ambition. These opinions
hence correspond to the first five criteria given by the teachers.
In the classes, the amount and type of feedback given depended on the type of exercise
(practicing one word, reading texts, speaking freely), the L2 language (for the L2 language
that was most different from Swedish, significantly more detailed feedback was given),
generality (errors that several students made were given more emphasis) and proficiency.

4.4 Motivation
To avoid negative feelings about feedback, the teachers or students suggested:
1. Adapt the feedback to the students’ self-confidence (criteria 5 & 7 in section 4.3).
2. Make explicit corrections impersonal, by expanding to a general error and using “When one
says...” rather than “When you say...”
3. Insert non-problematic pronunciations among the more difficult ones.
4. Acknowledge difficulties (e.g. “Yes, this is a tricky pronunciation”).
5. Never get stuck on the same pronunciation for too long.
6. Promote the students’ willingness to speak, by making the student feel that the teacher is
interested in what the student has to say and not only in how it is said.
7. Provide positive feedback when the student has made an effort or when progress is made.
8. Adapt to the exercise. Use explicit feedback sparingly if implicit feedback is enough.
9. Give feedback only on the focus of the session. If other pronunciation problems are
discovered, these should be left uncorrected, but noted and addressed in another session.

5 Feedback management in ARTUR


Some aspects of the feedback strategies proposed above have been implemented in a Wizard-
of-Oz version of ARTUR that will be demonstrated at the conference. The focus of the
exercise is to teach speakers of English the pronunciation of the Swedish sound “sj”, using the
tongue twister “Sju själviska sjuksköterskor stjäl schyst champagne”.
The feedback consisted of instructions and animations on how to position the tongue,
showing and explaining the difference between the user's pronunciation and the correct one.
The user could further listen to his/her previous attempt to compare it with the target.
A first new feature is that each user can individually control the amount of feedback given.
One reason for this is affective: students should be able to choose a level that they are
comfortable with. Another is that it puts the responsibility and initiative with the student,
who can decide how much advice he or she requires from the tutor.
A second new feature is that several feedback categories have been added to the standard positive (for a
correct pronunciation) and corrective (incorrect): minimal (correct pronunciation, only
implicit positive feedback given, in order not to interrupt the flow of the training), satisfactory
(the pronunciation is not entirely correct, but it is pedagogically sounder to accept it and move
ahead), augmented (for a repeated error, more detailed feedback given), vague (a general hint
is given, rather than explicit feedback) and encouragement (encouraging the student and
asking for a new try). The two latter categories may be used when the system is
uncertain of the error, when the error does not fit the predefined mispronunciation categories, or when
more explicit feedback is pedagogically unsound.
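
The paper does not describe how a category is chosen in a given situation; purely as an illustration of how the categories listed above might be combined with the user-controlled feedback level, a hypothetical selection rule in Python could look like this (all parameter names are invented):

def choose_feedback(correct, error_type, previous_errors, diagnosis_confident, detail_level):
    """Hypothetical mapping to the feedback categories described above.
    detail_level: 0 = minimal, 1 = normal, 2 = full feedback, as chosen by the user."""
    if correct:
        return 'minimal' if detail_level == 0 else 'positive'
    if not diagnosis_confident or error_type is None:
        # uncertain diagnosis, or an error outside the predefined categories
        return 'encouragement' if detail_level == 0 else 'vague'
    if detail_level == 0:
        return 'satisfactory'          # accept and move on rather than interrupt the flow
    if error_type in previous_errors:
        return 'augmented'             # repeated error: give more detailed feedback
    return 'corrective'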

Acknowledgements
This research is carried out within the ARTUR project, funded by the Swedish research
council. The Centre for Speech Technology is supported by VINNOVA (The Swedish Agency
for Innovation Systems), KTH and participating Swedish companies and organizations. The
author would like to thank the participating teachers and students.

References
Bälter, O., O. Engwall, A-M. Öster & H. Kjellström, 2005. Wizard-of-oz test of ARTUR – a
computerbased speech training system with articulation correction. Proceedings of the 7th
International ACM SIGACCESS Conference on Computers and Accessibility, 36–43.
Carroll, S. & M. Swain, 1993. Explicit and implicit negative feedback: An empirical study of
the learning of linguistic generalizations. Studies in Second Lang. Acquisition 15, 357–386.
Lyster, R. & L. Ranta, 1997. Corrective feedback and learner uptake. Studies in Second Lang.
Acquisition 20, 37–66.
Mackey, A. & J. Philip, 1998. Conversational interaction and second language development:
Recasts, responses, and red herrings? Modern Language Journal 82, 338–356.
Neri, A., C. Cucchiarini & H. Strik, 2002. Feedback in computer assisted pronunciation
training: When technology meets pedagogy. Proceedings of CALL professionals and the
future of CALL research, 179–188.
Rubin, J. (ed.), 1994. Handbook of Usability Testing. New York: John Wiley & Sons Inc.

Directional Hearing in a Humanoid Robot


Evaluation of Microphones Regarding HRTF and Azimuthal Dependence

Lisa Gustavsson, Ellen Marklund, Eeva Klintfors, and Francisco Lacerda


Department of Linguistics/Phonetics, Stockholm University
{lisag|ellen|eevak|frasse}@ling.su.se

Abstract
As a first step of implementing directional hearing in a humanoid robot two types of
microphones were evaluated regarding HRTF (head related transfer function) and azimuthal
dependence. The sound level difference between a signal from the right ear and the left ear is
one of the cues humans use to localize a sound source. In the same way this process could be
applied in robotics where the sound level difference between a signal from the right
microphone and the left microphone is calculated for orienting towards a sound source. The
microphones were attached as ears on the robot-head and tested regarding frequency
response with logarithmic sweep-tones at azimuth angles in 45º increments around the head.
The directional type of microphone was more sensitive to azimuth and head shadow and
probably more suitable for directional hearing in the robot.

1 Introduction
As part of the CONTACT project1 a microphone evaluation regarding head related transfer
function (HRTF), and azimuthal2 dependence was carried out as a first step in implementing
directional hearing in a humanoid robot (see Figure 1). Sound pressure level by the robot ears
(microphones) as a function of frequency and azimuth in the horizontal plane was studied.
The hearing system in humans has many features that together enable fairly good spatial
perception of sound, such as timing differences between left and right ear in the arrival of a
signal (interaural time difference), the cavities of the pinnae that enhance certain frequencies
depending on direction and the neural processing of these two perceived signals (Pickles,
1988). The shape of the outer ears is indeed of great importance in localization of a sound
source, but as a first step of implementing directional hearing in a robot, we want to start
by investigating the effect of a spherical head shape between the two microphones and the
angle in relation to the sound source. So this study was done with reference to the interaural
level difference (ILD)3 between two ears (microphones, no outer ears) in the sound signal that
is caused by the distance between the ears and HRTF or head shadowing effects (Gelfand,
1998). This means that the ear furthest away from the sound source will to some extent be
blocked by the head in such a way that the shorter wavelengths (higher frequencies) are
reflected by the head (Feddersen et al., 1957). Such frequency-dependent differences in
intensity associated with different sound source locations will be used as an indication to the
robot to turn his head in the horizontal plane. The principle here is to make the robot look in
the direction that minimizes the ILD4. Two types of microphones, mounted on the robot head,
were tested regarding frequency response at azimuth angles in 45º increments from the sound
source (Shaw & Vaillancourt, 1985; Shaw, 1974).
1 "Learning and development of Contextual Action", European Union NEST project 5010
2 Azimuth = angles around the head
3 The abbreviation IID can also be found in the literature and stands for Interaural Intensity Difference.
4 This is done using a perturbation technique. The robot’s head orientation is incrementally changed in order to detect the direction
associated with a minimum of ILD.
The study reported in this paper was carried out by the CONTACT vision group (Computer
Vision and Robotics Lab, IST Lisbon) and the CONTACT speech group (Phonetics Lab,
Stockholm University) assisted by Hassan Djamshidpey and Peter Branderud. The tests were
performed in the anechoic chamber at Stockholm University in December 2005.
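
To make the ILD principle concrete, the following Python sketch (not part of the CONTACT implementation) computes the level difference between the two microphone signals and performs one step of a simple perturbation strategy; the RMS-based level estimate and the step size are assumptions made for the example.

import numpy as np

def ild_db(left, right):
    """Interaural level difference in dB: RMS level of the right channel minus the left."""
    rms = lambda x: np.sqrt(np.mean(np.square(np.asarray(x, dtype=float))))
    return 20.0 * np.log10(rms(right) / rms(left))

def orient_step(current_angle, measure_ild, step_deg=5.0):
    """One iteration of a perturbation search: try a small head turn in each direction
    and keep the orientation with the smallest absolute ILD. measure_ild(angle) is
    assumed to return the ILD measured with the head at that orientation."""
    candidates = (current_angle - step_deg, current_angle, current_angle + step_deg)
    return min(candidates, key=lambda a: abs(measure_ild(a)))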

2 Method
The microphones evaluated in this study were wired Lavalier
microphones of the Microflex MX100 model by Shure. These
microphones were chosen because they are small electret condenser
microphones designed for speech and vocal pickup. The two types tested
were omni-directional (360º) and directional (cardioid 130º). The
frequency response is 50 to 17000 Hz and its max SPL is 116 dB (omni-
directional), 124 dB (directional) with a s/n ratio of 73 dB (omni-
directional), 66 dB (directional). The robotic head was developed at
Computer Vision and Robotics Lab, IST Lisbon (Beira et al., 2006).
Figure 1. Robot head.
2.1 Setup
The experimental setup is illustrated in Figure 2a. The robot-head is attached to a couple of
ball bearing arms (imagined to correspond to a neck) on a box (containing the motor for
driving head and neck movements). The microphones were attached and tilted by about 30
degrees towards the front, with play-dough in the holes made in the skull for the future ears of
the robot. The wires run through the head and out to the external amplifier. The sound source
was a Brüel&Kjær 4215, Artificial Voice Loud Speaker, located 90 cm away from the head in
the horizontal plane (Figures 2a and 2b). A reference microphone was connected to the
loudspeaker for audio level compression (300 dB/sec).

Figure 2a and 2b. a) Wiring diagram of experimental setup (left). b) Azimuth angles in
relation to robot head and loudspeaker (right).
2.2 Test
Sweep-tones were presented through the loud-speaker at azimuth angles in 45º increments
obtained by turning the robot-head (Figure 2b). The frequency range of the tone was between
20 Hz5 and 20 kHz with a logarithmic sweep control and writing speed of 160mm/sec
(approximate averaging time 0.03 sec). The signal response of the microphones was registered
and printed in dB/Hz diagrams (using Brüel&Kjær 2307, Printer/Level Recorder) and a back-
up recording was made on a hard-drive. The dB values as a function of frequency were also
plotted in Excel diagrams for a better overview of superimposed curves of different azimuths
(and for presentation in this paper).
5 Because the compression was unstable up until about 200 Hz, the data below 200 Hz will not be reported here. Furthermore, the lower
frequencies are not affected that much in terms of ILD.

3 Results
The best overall frequency response of both microphones was at angles 0º, -45º and -90º, that
is, when the (right) microphone is to some extent directed towards the sound source. The sound level
decreases as the microphone is turned away from the direction of the sound source. The omni-
directional microphones have an overall more robust frequency response than the directional
microphones. As expected, the difference in sound level between the azimuth angles is most
significant at higher frequencies, since the head as a blockage will have a larger impact on
shorter wavelengths than on longer wavelengths. An example of ILD for the directional
microphones is shown in Figure 3, where sound level as a function of frequency is plotted for
the ear near the sound source and ear furthest away from the sound source at azimuth 45º, 90º
and 135º. While the difference ideally should be zero6 at azimuth 0 it is well above 15 dB at
many higher frequencies in azimuth 45º, 90º and 135º.

Figure 3. Signal response of the directional microphones. Sound level as a function of
frequency and azimuth 45º, 90º and 135º for the ear near the sound source and the ear furthest
away from the sound source.

4 Discussion
In line with the technical description of the microphones our results show that the directional
microphones are more sensitive to azimuth than the omnidirectional microphones and will
probably make the implementation of sound source localization easier. Also, disturbing sound
of motors and fans inside the robot’s head might be picked up more easily by an omnidirectional
microphone. A directional type of microphone would therefore be our choice of ears for the
robot. However, decisions like this are not made without some hesitation since we do not
want to manipulate the signal response in the robot hearing mechanism beyond what we find
motivated in terms of the human physiology of hearing. Deciding upon what kind of pickup
angle the microphones should have forces us to consider what implications a narrow versus a
wide pickup angle will have in further implementations of the robotic hearing. At this moment
we see no problems with a narrow angle but if problems arise we can of course switch to wide
angle cartridges.
The reasoning in this study holds for locating a sound source only to a certain extent. By
calculating the ILD the robot will be able to orient towards a sound source in the frontal
horizontal plane. But if the sound source is located straight behind the robot the ILD would
also equal zero and according to the robot’s calculations he is then facing the sound source.
Such front-back errors are in fact seen also in humans since there are no physiological
attributes of the ear that in a straightforward manner differentiate signals from the front and
rear. Many animals have the ability to localize a sound source by wiggling their ears, humans
can instead move themselves or the head to explore the sound source direction (Wightman &
Kistler, 1999). As mentioned earlier the outer ear is however of great importance for locating
a sound source, the shape of the pinnae does enhance sound from the front in certain ways but
it takes practice to make use of such cues. In the same way the shape of the pinnae can be of
importance for locating sound sources in the medial plane (Gardner & Gardner, 1973;
Musicant & Butler, 1984). Subtle movements of the head, experience of sound reflections in
different acoustic settings and learning how to use pinnae related cues are some solutions to
the front-back-up-down ambiguity that could be adopted also by the robot. We should not
forget though, that humans always use multiple sources of information for on-line problem
solving and this is most probably the case also when locating sound sources. When we hear a
sound there is usually an event or an object that caused that sound, a sound source that we
easily could spot with our eyes. So the next question we need to ask is how important vision is
in localizing sound sources or in the process of learning how to trace sound sources with our
ears and how vision can be used in the implementation of directional hearing of the robot.

5 Concluding remarks
Directional hearing is only one of the many aspects of human information processing that we
have to consider when mimicking human behaviour in an embodied robot system. In this
paper we have discussed how the head has an impact on the intensity of signals at different
frequencies and how this principle can be used also for sound source localization in robotics.
The signal responses of two types of microphones were tested regarding HRTF at different
azimuths as a first step of implementing directional hearing in a humanoid robot. The next
steps are designing outer ears and formalizing the processes of directional hearing for
implementation and on-line evaluations (Hörnstein et al., 2006).

References
Beira, R., M. Lopes, C. Miguel, J. Santos-Victor, A. Bernardino, G. Metta et al., in press.
Design of the robot-cub (icub) head. IEEE ICRA.
Feddersen, W.E., T.T. Sandel, D.C. Teas & L.A. Jeffress, 1957. Localization of High-
Frequency Tones. Journal of the Acoustical Society of America 29, 988-991.
Gardner, M.B. & R.S. Gardner, 1973. Problem of localization in the median plane: effect of
pinnae cavity occlusion. Journal of the Acoustical Society of America 53, 400-408.
Gelfand, S., 1998. An introduction to psychological and physiological acoustics. New York:
Marcel Dekker, Inc.
Hörnstein, J., M. Lopes & J. Santos-Victor, 2006. Sound localization for humanoid robots –
building audio-motor maps based on the HRTF. CONTACT project report.
Musicant, A.D. & R.A. Butler, 1984. The influence of pinnae-based spectral cues on sound
localization. Journal of the Acoustical Society of America 75, 1195-1200.
Pickles, J.O., 1988. An Introduction to the Physiology of Hearing. (Second ed.) London:
Academic Press.
Shaw, E.A.G., 1974. Transformation of sound pressure level from the free field to the
eardrum in the horizontal plane. Journal of the Acoustical Society of America 56.
Shaw, E.A.G. & M.M. Vaillancourt, 1985. Transformation of sound-pressure level from the
free field to the eardrum presented in numerical form. Journal of the Acoustical Society of
America 78, 1120-1123.
Wightman, F.L. & D.J. Kistler, 1999. Resolution of front—back ambiguity in spatial hearing
by listener and source movement. Journal of the Acoustical Society of America 105, 2841-
2853.

Microphones and Measurements


Gert Foget Hansen1 and Nicolai Pharao2
1Department of Dialectology, University of Copenhagen
gertfh@hum.ku.dk
2Centre for Language Change in Real Time, University of Copenhagen
nicolajp@hum.ku.dk

Abstract
This paper presents the current status of an ongoing investigation of differences in formant
estimates of vowels that may come about solely due to the circumstances of the recording of
the speech material. The impact of the interplay between type and placement of microphone
and room acoustics are to be examined for adult males and females across a number of vowel
qualities. Furthermore, two estimation methods will be compared (LPC vs. manual). We
present the pilot experiment that initiated the project along with a brief discussion of some
relevant articles. The pilot experiment as well as the available results from other related
experiments seem to indicate that different recording circumstances could induce apparent
formant differences of a magnitude comparable to differences reported in some investigations
of sound change.

1 Introduction
1.1 Purpose
The study reported here arose from a request to evaluate different types of recording
equipment for the LANCHART Project, a longitudinal study of language change with Danish
as an example. One aim of the assignment was to ensure that the LANCHART corpus would
be suitable for certain acoustic phonetic investigations.

1.2 Pilot experiments – choosing suitable microphones for on-location recordings


Head mounted microphones were compared to the performance of a lapel-worn microphone
and a full-size directional microphone placed in a microphone stand in front of the speaker
(hereafter referred to as a studio microphone). The following four factors were considered in
the evaluation of the suitability of the recordings provided by the microphones: 1) ease of
transcription and 2) segmentation of the recordings as well as estimation of 3) fundamental
frequency and 4) formants using LPC analysis.
Simultaneous recordings of one speaker using all three types of microphones formed the
basis for a pilot investigation. Primarily, the results indicated that the lapel-worn microphone
was clearly inferior to the other two types with regard to the first 3 criteria, since it is more
prone to pick up background noise. The head mounted and studio microphones also showed
some differences with regard to these 3 criteria; in particular the recordings made with the
head mounted microphone provided clearer spectrograms. Furthermore, apparent differences
emerged in the LPC analysis of the vowels in the three recordings.
To explore this further we recorded 6 different pairs of microphone and distance
combinations using a two channel hard disk recorder. Microphones compared were
Sennheiser ME64, Sennheiser MKE2 lavallier and VT600 headset microphone, positioned
either as indicated by type, or as typical for ME64 (i.e. at a distance of about 30 cm).
One speaker producing various sustained vowels was recorded, and subsequently we
measured the formant values at the same 3 randomly chosen points in each vowel in the two
channels and compared the values. Of course we expected some random variation, but our
naïve intuition was that if a formant value would for some reason bounce upwards in one
recording it should also do so in a synchronous recording made with a different microphone
setup. We were wrong. In fact, vowels of all heights and tongue positions seemed to exhibit
quite dramatic differences, but the differences appeared to be more prominent for some
vowels. Some of these differences are likely to be the result of mistracings of formant values
in one or both channels, but some of the large differences were found for high front non-
rounded vowels like [i] and [e] where first and second formants are not often confused.
Furthermore, when we compared average values of the three points in each vowel, 37 out of
252 values differed between 5 and 10%, while 31 differed more than 10%. Closer inspection
of the two channels revealed that a substantial number of the differences could not simply be
attributed to spurious values, but were indeed a result of the LPC algorithm producing
consistently different results, although the average differences were of a smaller magnitude.
Since all other factors were held constant in these pairwise comparisons the apparent
differences could only be an effect of the type and placement of the microphone. The question
remains which recording to trust.
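
The pairwise comparison amounts to computing relative differences between synchronous formant estimates; a small Python sketch is given below, where expressing the difference in percent of the mean of each pair is our assumption, since the exact reference used is not stated above.

def percent_differences(channel_a, channel_b):
    """Relative differences between paired formant estimates (e.g. the averages over
    the three measurement points per vowel) from two synchronous recordings."""
    return [100.0 * abs(a - b) / ((a + b) / 2.0) for a, b in zip(channel_a, channel_b)]

# e.g. tallying the 5-10% and >10% bands reported above (variable names are hypothetical):
# d = percent_differences(f1_mic1, f1_mic2)
# print(sum(5 <= x < 10 for x in d), sum(x >= 10 for x in d))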

2 Previous investigations
In previous investigations of the usability of portable recording equipment for phonetic
investigations and the reliability of LPC-based formant measurements made on such
recordings, the main focus seems to have been on the recording devices, notably the consequences
of using digital recorders that employ some sort of psychoacoustic encoding such as MiniDisc
and mp3 recorders, rather than on the role of the microphone used. Below is a brief summary
of the articles we have found which deal with the influence the microphone exerts.
Though the goal for van Son (2002) is also to investigate the consequences of using audio-
compression, interestingly, van Son uses the difference in estimation values that results from
switching from one particular microphone to another as a yardstick against which the errors
introduced by the compression algorithms are compared. Comparing a Sennheiser MKH105
condenser microphone against a Shure SM10A dynamic headset microphone he finds
differences between the two recordings larger than 9 semitones (considered “jumps”) in
slightly less than 4% of the estimates of F1 and about 2% for F2. When these jumps are
excluded the remaining measurements show an increased RMS error of about 1.2 to 1.7
semitones as a result of switching microphones. Unfortunately it is not possible to see the
values for the individual vowel qualities.
Plichta (2004) also examines formant estimates of vowels from three simultaneous
recordings. Comparing three combinations of microphone and recording equipment (thereby
not separating characteristics of the microphones and the recording equipment), he shows
significant differences in F1 values and bandwidths between all three recording conditions.
His material is limited to non-high non-rounded front vowels, plus the diphthong [ai].
Thus there is evidence that recordings made with different microphones (and/or recording
equipment differing in other respects) can lead to significantly different formant estimates.
Apart from these investigations there are two studies of the spectral consequences of
differences in microphone placement by the acoustician Eddy Bøgh Brixen (1996; 1998)
which are of particular relevance to our investigation. He provides evidence that the
placement of the microphone relative to the speaker in and of itself can lead to substantial
differences in the recorded power spectrum, notably when microphones are placed very close
to (or on) the body or head of the speaker, as is the case with lavallier and headband
microphones.

3 The experiment
3.1 Research questions
As we have seen, recordings will be affected by a number of factors which interact in complex
ways, making for a source of error of unknown impact on formant estimates. Now the
interesting question is: how big is the problem? Is it large enough to have practical
consequences for the use of LPC-based formant estimation as an analysis tool? This overall
question led us to these research questions:
a) How accurate can LPC-based formant estimates be expected to be?
b) How much does the microphone and its placement contribute to the inaccuracy?
c) How much does the room contribute to the inaccuracy?
d) Is this only a concern for LPC-based formant estimates, or are estimates made by hand also
affected?

3.2 Experimental design


As an attempt to answer these questions, a more comprehensive experiment was designed. It
seems to us that what we need is some sort of neutral reference recording and knowledge
about the consequences for formant estimation as we deviate from this ideal. Thus we planned
to compare formant estimates of recordings made in four locations with very different
acoustic characteristics using four different microphones simultaneously. In total the recorded
material covers: 4 microphones (see table below), 4 locations: Anechoic chamber, recording
studio, two private rooms, 2 male and 2 female adult speakers.
The subjects read short sentences producing 6-18 renditions of 8 vowel qualities at each
location. In addition 6 repetitions of sustained vowels with f0-sweeps of 6 vowel qualities
have been recorded in the recording studio and in the anechoic chamber by four speakers.
These were meant to facilitate a more accurate manual estimation of the formant values. All
material was recorded using synchronized Sound Devices SD722 hard disk recorders at 24
bit/48 kHz.

Table 1. Microphones compared and their position relative to the subjects


Microphone           Position relative to speaker                   Directional sensitivity
Brüel & Kjær 4179    80 cm directly in front of speaker’s mouth     omnidirectional
Sennheiser MKH40     40 cm at a 45 degree angle                     cardioid
DPA 4066             2 cm from corner of mouth, head worn           omnidirectional
VT 700               2 cm from corner of mouth, head worn           omnidirectional

We would suggest using the B&K 4179 with a (certified) flat frequency response in the
anechoic chamber at a distance of 80 cm as the reference. The distance is perhaps somewhat
arbitrary, but it appears from Brixen (1998) that the effect of changing the distance decreases
rapidly as the distance increases. On-axis, the spectrum at 80 cm deviates less than +/- 2dB
from the spectrum at 1 m.

4 Current status and preliminary results


All planned recordings have been made, and the analysis phase has commenced. We have
started with the sustained vowels as they should be the simplest to analyse (since there are no
transitions to be aware of) and as they are also the most suitable for manual inspection. Two
PRAAT scripts have been produced for the analysis. One is a formant analysis tool that
enables simultaneous analysis of the four recordings to ensure that measurements are made at
points that – as far as possible – provide trustworthy formant values for all recordings. The
other is a script which, by tracing the intensity variation in each partial as the f0 changes, can
be used to determine when a given partial crosses a formant. By measuring f0 at this point and
counting the number of the partial we are able to estimate the formant frequency. We assume
that this approach will be more accurate than judging the formant frequencies by visual
inspection alone.
It is obvious that the “f0-sweep” approach we use to determine formant values manually is
not without flaws as we are relying heavily on a number of assumptions: First we expect our
speakers to be able to produce the same vowel quality independent of pitch. As vowel quality
and pitch are known to be interrelated in real speech it may both be difficult for our speakers
to live up to this expectation, and difficult for us to verify auditorily whether they do. Even if
the speakers may succeed in ‘freezing’ the oral cavities during the sweep, differences may
arise due to movement of the larynx as the pitch is changed, as well as due to changes in voice
quality associated with the pitch. Notably the voice often seemed to get more breathy and
hypofunctional towards the lower end of the pitch range. The method of determining the time
of the maximum energy for the partial may also be affected by overall changes in intensity
that have nothing to do with the interaction between the partial and the formant. This would
mostly affect estimates of F1 as the transition of partials through higher formants happens
much faster, and since there are often more partials crossing through the formant thus giving
more estimates. Finally the accuracy of course depends on the accuracy of the f0 tracing, and
more so the higher the partial. Despite the potential shortcomings of the method it does seem
to provide reliable results, and is particularly helpful in determining the formant frequencies
in the lower region of the spectrum.
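
The core of the f0-sweep method can be summarised in a few lines of Python; this is a sketch of the idea rather than the PRAAT script itself, and it assumes that an f0 track and an intensity track for the partial in question (e.g. obtained with a narrow band-pass filter following the partial) are already available.

import numpy as np

def formant_from_sweep(f0_track, partial_intensity, partial_number):
    """Estimate a formant frequency from a sustained vowel with an f0 sweep:
    the partial is taken to cross the formant where its intensity peaks, and the
    formant frequency is the partial number times f0 at that point."""
    i_max = int(np.argmax(partial_intensity))      # time index of maximum partial energy
    return partial_number * f0_track[i_max]
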
Our ongoing analyses of the data have so far only confirmed the usefulness of carrying out
the larger investigation. We hope to be able to ensure that our colleagues at the LANCHART
Project need not end up reporting as sound changes what might merely be the results of
microphone changes...

References
Brixen, E.B., 1996. Spectral Degradation of Speech Captured by Miniature Microphones
Caused by the Microphone Placing on Persons’ Head and Chest. Proceedings AES 100th
Convention.
Brixen, E.B., 1998. Near Field Registration of the Human Voice: Spectral Changes due to
Positions. Proceedings AES 104th Convention.
Plichta, B., 2004. Data acquisition problems. In B. Plichta, Signal acquisition and acoustic
analysis of speech. Available at: http://bartus.org/akustyk/signal_aquisition.pdf.
van Son, R.J.J., 2002. Can standard analysis tools be used on decompressed speech?
Available at: http://www.fon.hum.uva.nl/Service/IFAcorpus/SLcorpus/
AdditionalDocuments/CoCOSDA2002.pdf.

Prosodic Cues for Interaction Control in


Spoken Dialogue Systems
Mattias Heldner and Jens Edlund
Department of Speech, Music and Hearing, KTH, Stockholm
{mattias|edlund}@speech.kth.se

Abstract
This paper discusses the feasibility of using prosodic features for interaction control in
spoken dialogue systems, and points to experimental evidence that automatically extracted
prosodic features can be used to improve the efficiency of identifying relevant places at which
a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system
response times.

1 Introduction
All spoken dialogue systems, no matter what flavour they come in, need some kind of
interaction control capabilities in order to identify places where it is legitimate to begin to talk
to a human interlocutor, as well as to avoid interrupting the user. Most current systems rely
exclusively on silence duration thresholds for making such interaction control decisions, with
thresholds typically ranging from 500 to 2000 ms (Ferrer, Shriberg & Stolcke, 2002; Shriberg
& Stolcke, 2004). Such an approach has several drawbacks, both from the point of view of the
user and that of the system. Users generally have to wait longer for responses than in human-
human interactions; at the same time they run the risk of being interrupted by the system,
since people frequently pause mid-speech, for example when hesitating or before semantically
heavy words (Edlund & Heldner, 2005; Shriberg & Stolcke, 2004); and using silent pauses as
the sole information for segmentation of user input is likely to impair the system’s speech
understanding, as unfinished or badly segmented utterances often are more difficult to
interpret (Bell, Boye & Gustafson, 2001).
Humans are very good at discriminating the places where their conversational partners have
finished talking from those where they have not – accidental interruptions are rare in
conversations. Apparently, we use a variety of information to do so, including numerous
prosodic and gestural features, as well as higher levels of understanding, for example related
to (in)completeness on a structural level (e.g. Duncan, 1972; Ford & Thompson, 1996; Local,
Kelly & Wells, 1986).
In light of this, the interaction control capabilities of spoken dialogue systems would likely
benefit from access to more of this variety of information – more than just the duration of
silent pauses. Ultimately, spoken dialogue systems should of course be able to combine all
relevant and available sources of information for making interaction control decisions.
Attempts have been made at using semantic information (Bell, Boye & Gustafson, 2001;
Skantze & Edlund, 2004), prosodic information and in particular intonation patterns (Edlund
& Heldner, 2005; Ferrer, Shriberg & Stolcke, 2002; Thórisson, 2002), and visual information
(Thórisson, 2002) to deal with (among other things) the problems that occur as a result of
interaction control decisions based on silence only.

2 Prosodic cues for interaction control


Previous work suggests that a number of prosodic or phonetic cues are liable to be relevant for
interaction control in human-human dialogue. Ultimately, software for improving interaction
control in practical applications should capture all relevant cues.
The phenomena associated with turn-yielding include silent pauses, falling and rising
intonation patterns, and certain vocal tract configurations such as exhalations (e.g. Duncan,
1972; Ford & Thompson, 1996; Local, Kelly & Wells, 1986). Turn-yielding cues are typically
located somewhere towards the end of the contribution, although not necessarily on the final
syllable. Granted that human turn-taking involves decisions above a reflex level, evidence
suggests that turn-yielding cues must occur at least 200-300 ms before the onset of the next
contribution (Ward, 2006; Wesseling & van Son, 2005).
The phenomena associated with turn-keeping include level intonation patterns, vocal tract
configurations such as glottal or vocal tract stops without audible release, as well as a
different quality of silent pauses as a result of these vocal tract closures (e.g. Caspers, 2003;
Duncan, 1972; Local & Kelly, 1986). Turn-keeping cues are also located near the end of the
contribution. As these cues are not intended to trigger a response, but rather to inhibit one,
they may conceivably occur later than turn-yielding cues.
There are also a number of cues (in addition to the silent pauses mentioned above) that have
been observed to occur with turn-yielding as well as with turn-keeping. Examples of such
cues include decreasing speaking rate and other lengthening patterns towards the end of
contributions. The mere presence (or absence) of such cues cannot be used for making a turn-
yielding vs. turn-keeping distinction, although the amount of final lengthening, for example,
might provide valuable guidance for such a task (cf. Heldner & Megyesi, 2003).

3 Prosodic cues applied to interaction control


In previous work (Edlund & Heldner, 2005), we explored to what extent the prosodic features
extracted with /nailon/ (Edlund & Heldner, forthcoming) could be used to mimic the
interaction control behaviour in conversations among humans. Specifically, we analysed one
of the interlocutors in order to predict the interaction control decisions made by the other
person taking part in the conversation. These predictions were evaluated with respect to
whether there was a speaker change or not at that point in the conversation, that is, with
respect to what the interlocutors actually did.
Each unit ending in a silent pause in the speech of the interlocutor being analysed was
classified into one out of three categories: turn-keeping, turn-yielding, and don’t know. Units
with low pitch patterns were classified as suitable places for turn-taking (i.e. turn-yielding); mid
and level pitch patterns were classified as unsuitable places (i.e. turn-keeping); all other patterns,
including high or rising, ended up in the garbage category don’t know. This tentative
classification scheme was based on observations reported in the literature (e.g. Caspers, 2003;
Thórisson, 2002; Ward & Tsukahara, 2000), but it was in no way optimised or adapted to suit
the speech material used.
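The following sketch spells out this tentative three-way scheme. The pitch-pattern labels are assumed to come from some prior pitch analysis, and the mapping is a simplification of the description above, not the /nailon/ implementation.

def classify_unit(pitch_pattern):
    """Map the pitch pattern of a unit ending in a silent pause to a decision."""
    if pitch_pattern == "low":
        return "turn-yielding"        # suitable place to start speaking
    if pitch_pattern in ("mid", "level"):
        return "turn-keeping"         # do not interrupt
    return "don't know"               # high, rising, or anything else

for pattern in ("low", "level", "rising"):
    print(pattern, "->", classify_unit(pattern))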
This experiment showed that interaction control based on extracted features avoided 84%
of the places where a system using silence duration thresholds only would have interrupted its
users, while still recognizing 40% of the places where it was suitable to say something (cf.
Edlund & Heldner, 2005). Interaction control decisions using prosodic information can
furthermore be made considerably faster than in silence only systems. The decisions reported
here were made after a 300-ms silence to be compared with silences ranging from 500 to 2000
ms in typical silence only systems (Ferrer, Shriberg & Stolcke, 2002).

4 Discussion
In this paper, we have discussed a number of prosodic features liable to be relevant for
interaction control. We have shown that automatically extracted prosodic information can be
used to improve the interaction control in spoken human-computer dialogue compared to
systems relying exclusively on silence duration thresholds.
Future work will include further development of the automatic extraction in terms of
improving existing algorithms as well as adding new prosodic features. In a long-term
perspective, we would want to combine prosodic information with other sources of
information, such as semantic completeness and visual interaction control cues, as well as to
relate interaction control to other conversation phenomena such as grounding, error handling,
and initiative.

Acknowledgements
This work was carried out within the CHIL project. CHIL (Computers in the Human
Interaction Loop) is an Integrated Project under the European Commission’s Sixth Framework
Program (IP-506909).

References
Bell, L., J. Boye & J. Gustafson, 2001. Real-time handling of fragmented utterances.
Proceedings NAACL Workshop on Adaptation in Dialogue Systems, Carnegie Mellon
University, Pittsburgh, PA, 2-8.
Caspers, J., 2003. Local speech melody as a limiting factor in the turn-taking system in Dutch.
Journal of Phonetics 31, 251-276.
Duncan, S., Jr., 1972. Some signals and rules for taking speaking turns in conversations.
Journal of Personality and Social Psychology 23(2), 283-292.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62(2-4),
215-226.
Edlund, J. & M. Heldner, forthcoming. /nailon/ – a tool for online analysis of prosody.
Proceedings 9th International Conference on Spoken Language Processing (Interspeech
2006), Pittsburgh, PA.
Ferrer, L., E. Shriberg & A. Stolcke, 2002. Is the speaker done yet? Faster and more accurate
end-of-utterance detection using prosody in human-computer dialog. Proceedings ICSLP
2002, Denver, Vol. 3, 2061-2064.
Ford, C.E. & S.A. Thompson, 1996. Interactional units in conversation: syntactic,
intonational, and pragmatic resources for the management of turns. In E. Ochs, E.A.
Schegloff & S.A. Thompson (eds.), Interaction and grammar. Cambridge: Cambridge
University Press, 134-184.
Heldner, M. & B. Megyesi, 2003. Exploring the prosody-syntax interface in conversations.
Proceedings ICPhS 2003, Barcelona, 2501-2504.
Local, J.K. & J. Kelly, 1986. Projection and 'silences': Notes on phonetic and conversational
structure. Human Studies 9, 185-204.
Local, J.K., Kelly, J. & W.H.G. Wells, 1986. Towards a phonology of conversation: turn-
taking in Tyneside English. Journal of Linguistics 22(2), 411-437.
Shriberg, E. & A. Stolcke, 2004. Direct Modeling of Prosody: An Overview of Applications
in Automatic Speech Processing. Proceedings Speech Prosody 2004, Nara, 575-582.
Skantze, G. & J. Edlund, 2004. Robust interpretation in the Higgins spoken dialogue system.
COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in
Conversational Interaction, Norwich.
Thórisson, K.R., 2002. Natural turn-taking needs no manual: Computational theory and
model, from perception to action. In B. Granström, D. House & I. Karlsson (eds.),
Multimodality in language and speech systems. Dordrecht: Kluwer Academic Publishers,
173-207.
Ward, N., 2006. Methods for discovering prosodic cues to turn-taking. Proceedings Speech
Prosody 2006, Dresden.
Ward, N. & W. Tsukahara, 2000. Prosodic features which cue back-channel responses in
English and Japanese. Journal of Pragmatics 32, 1177-1207.
Wesseling, W. & R.J.J.H. van Son, 2005. Early preparation of experimentally elicited
minimal responses. Proceedings Sixth SIGdial Workshop on Discourse and Dialogue,
ISCA, Lisbon, 11-18.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 57
Working Papers 52 (2006), 57–60

SMTC – A Swedish Map Task Corpus


Pétur Helgason
Department of Linguistics and Philology, Uppsala University
petur.helgason@lingfil.uu.se

Abstract
A small database of high quality recordings of 4 speakers of Central Standard Swedish is
being made available to the speech research community under the heading Swedish Map
Task Corpus (SMTC). The speech is unscripted and consists mostly of conversations elicited
through map tasks. In total, the database contains approximately 50 minutes of word-labelled
conversations, comprising nearly 8000 words. The material was recorded at the Stockholm
University Phonetics Lab. This paper describes the recording method, the data elicitation
procedures and the speakers recruited for the recordings. The data will be made available on-
line to researchers who put in a request with the author (cf. section 7 below).

1 Introduction
The data being made available under the heading Swedish Map Task Corpus (SMTC) were
originally recorded as part of the author’s doctoral dissertation project (Helgason, 2002). The
data have already proved useful for several other research projects, e.g. Megyesi (2002),
Megyesi & Gustafson-Čapková (2002) and Edlund & Heldner (2005). As it seems likely that
future projects will want to make use of the data, and as the data are not described in much
detail elsewhere, an account of the recording procedure and elicitation method is called for.
At the same time, the data are hereby made available for download by researchers.

2 Recording set-up
The data were recorded in the anechoic room at the Stockholm University Phonetics Lab. The
subjects were placed facing away from one
another at opposite corners of the room (see
Figure 1). The “head-to-head” distance be-
tween the subjects was approximately two
meters. The reason for this placement of the
subjects was partly to minimize cross-
channel interference, and partly to prevent
them from consulting one another’s maps
(see the following section). The recording
set-up was therefore in accordance with the
nature of the data elicitation method.
The data were recorded using a Technics
SV 260 A DAT recorder and two Sennheiser
MKE2 microphones. Each microphone was
mounted on a headset and placed in such a
way that it extended approximately 2.5 cm out and to the side of the corner of the subject's
mouth. The recording device and an experimenter were placed in between the subjects, within
the anechoic room.

Figure 1. The placement of subjects and experimenter during the recording.

The subjects were recorded on separate channels, so that their speech could be kept apart
even when they were speaking simultaneously. The absorption of sound
energy in the anechoic room proved to be quite effective. The difference in average RMS be-
tween speakers on a channel was approximately 40 dB. Thus, for example, the average RMS
for the intended right channel speaker was, on average, 40 dB higher than for the interfering
(left-channel) speaker. This means that at normal listening levels (and provided the intended
speaker is silent), the interfering speaker can be detected only as a faint background murmur.
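The separation figure can be checked with a simple calculation of the kind sketched below; the file name and the use of NumPy are assumptions, and the interval should be chosen so that only one subject is speaking.

import wave
import numpy as np

def channel_rms_db(path, start_s, end_s):
    """RMS level in dB of the left and right channel over a given interval."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        w.setpos(int(start_s * rate))
        frames = w.readframes(int((end_s - start_s) * rate))
    # 16-bit stereo, as described for the corpus; interleaved left/right samples.
    samples = np.frombuffer(frames, dtype=np.int16).reshape(-1, 2).astype(float)
    rms = np.sqrt((samples ** 2).mean(axis=0) + 1e-12)
    return 20 * np.log10(rms)

# Hypothetical file name; over a stretch where only the right-channel subject
# speaks, the difference should be on the order of 40 dB.
left_db, right_db = channel_rms_db("pair1_task1.wav", 10.0, 12.0)
print(right_db - left_db)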

3 Data elicitation – the map tasks


Most of the data were elicited by having the subjects perform map tasks. Map tasks have
previously been used successfully for eliciting unscripted spoken data, perhaps most notably
in the HCRC Map Task Corpus (Anderson et al., 1991). A map task involves two participants,
an instruction giver and an instruction follower (henceforth Giver and Follower). For each
map task, the experimenter prepares two maps with a set of landmarks (symbols or drawings),
and to a large extent the landmarks on the two maps are the same. However, some differences
in landmarks are incorporated by design, so that the maps are not quite identical. The Giver’s
map has a predetermined route drawn on it, the Follower’s map does not. Their task is to
cooperate through dialogue, so that the route on the giver’s map is reproduced on the
follower’s map. The Giver and Follower are not allowed to consult one another’s maps. In the
SMTC recordings, the subjects were told at the beginning of the task that the maps differed,
but it was left up to them to discover the ways in which they differed.

Figure 2. One of the “treasure hunt” map pairs used for data elicitation. On the left is a
Giver’s map, and on the right is a Follower’s map. The path on the Giver’s map is reproduced
in grey here, but when the subjects performed the tasks it was marked in green.

For the SMTC recordings, four map tasks were prepared, each consisting of a Giver
and Follower map pair. An example of a
map task is given in Figure 2. All the maps
had a set of basic common features. They all
depicted the same basic island shape, the
contour of which had several easily recog-
nizable bays and peninsulas. The island also
had mountains and hills, as well as a lake
and a river. Finally, each map had a simple
compass rose in the bottom left corner.
Two of the map tasks had a “treasure
hunt” theme. The landmarks on these maps
included an anchor (the starting point), a
key (an intermediate goal), and a treasure
chest (the final goal). However, most of the
symbols depicted animals (some of which
were prehistoric) and vegetation.
The remaining two map tasks had a
“tourist” theme. On these maps the land-
marks consisted entirely of various symbols
typically used in tourist brochures and maps.
In order to familiarize the subjects with
these symbols, they were asked to go
through a list of such symbols with a view to deciding how to refer to them if they occurred
in a map task. This interaction was recorded, and is included in the SMTC database under the
heading "Symbol task". This symbol list is reproduced here in Figure 3.

Figure 3. The symbol list used to familiarize the subjects with the symbols on the "tourist
theme" maps.

The subjects’ goal in the tourist maps was to trace a predetermined route around the island
from an airport and back to the same airport.

4 The subjects
The subjects, one male and three females, were recruited from the staff at the Stockholm
University Linguistics Department. They are referred to as FK, FT, FS (all female) and MP
(male). FK and MP were in their thirties and FT and FS in their forties. All speakers were of
normal hearing. As regards dialect, all speakers identified themselves as speakers of Central
Standard Swedish and had lived for most or all of their lives in or around Stockholm. They
were paid a moderate fee for their participation.
The subjects were arranged in two pairs, FS and MP as one pair and FK and FT as
another. Each pair began the session by navigating through a “treasure hunt” map task. This
was followed by a discussion of the symbol list. The pair then continued with a “tourist map”
followed by another “treasure map”, and finally, if time allowed, one more “tourist map”.

5 The extent of the database


The data from both subject pairs comprise a total of approximately 50 minutes of conversa-
tion. (There exist additional map-task recordings of these as well as other subjects which
await word-labelling, but these are not included in the present database). This represents a
total of 35 minutes of uninterrupted speech from the four subjects. (What is referred to here as
uninterrupted speech is the total speaking time for a subject, excluding any and all pauses.)
For FT, approximately 4.3 minutes of uninterrupted speech are available, comprising a total of
870 words; for FK 9.5 minutes comprising 2045 words; for MP 10.8 minutes comprising
2554 words; and for FS 10.3 minutes comprising 2401 words.

6 Some remarks on the transliteration provided with the recordings


The data are provided with a word-level transliteration (word labelling). The transliteration
was performed by the author, a non-native (albeit competent) speaker of Swedish. Researchers
who wish to make use of the data may use this transliteration, for example as the basis for
searches or as input to automatic text processing. Therefore, the rationale behind
the transliteration conventions will be outlined here.
The aim of the transliteration was to facilitate lexical look-ups rather than to indicate or
reflect the segmental content. For instance, the function word det is always indicated simply
as “det” in the transliteration, without regard for any variability in its production (e.g. [deːt],
[deː], [dɛ̝], [ɾeː] or [d̥e]). This approach was also applied in the labelling of minimal responses
and lexical fillers. For example, lexical fillers of the “eh” or “er” type are indicated with a
semicolon ; in the transliteration, irrespective of their segmental content (schwa-like, [e]-like,
[œ]-like, creaky, nasalized, etc.).
A prominent feature of the transliteration is that contiguous pieces of speech (i.e. stretches
of speech which contain no silence pauses) are demarcated at the onset and offset with a
period (full stop) symbol. Thus the transliteration does not attempt to reflect the syntactic
structure of an utterance, but instead only the presence of silence pauses. Note, also, that the
transliteration provides no evaluation of or amendment to the grammaticality of an utterance.
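Given these conventions, the transliteration can be segmented into contiguous stretches of speech with a few lines of code. The sketch below assumes that the word labels have already been read into a flat, time-ordered list of tokens; the example token sequence is invented.

def contiguous_stretches(tokens):
    """Group word tokens into stretches bounded by the '.' delimiter."""
    stretch, stretches = [], []
    for tok in tokens:
        if tok == ".":
            if stretch:
                stretches.append(stretch)
                stretch = []
        else:
            stretch.append(tok)
    if stretch:
        stretches.append(stretch)
    return stretches

# ";" marks a lexical filler, "." demarcates contiguous speech (see above).
tokens = [".", "det", "är", ";", "en", "ö", ".", ".", "sen", "går", "du", "."]
for s in contiguous_stretches(tokens):
    print(s, "fillers:", s.count(";"))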

7 Format and availability


The sound files are 16-bit stereo (with one speaker on each channel) and have a sampling rate
of 16 kHz. The files are provided in the uncompressed Wave PCM format (i.e. *.wav). The
word label files are provided as text files in WaveSurfer format. The data are made available
as is, with no guarantee of groundbreaking research results. To obtain the data, please e-mail a
request to the author, who will provide a web address from which the data can be downloaded.
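The sketch below shows one way of reading such a label file, assuming the common WaveSurfer transcription layout of one "start end label" line per word with times in seconds. The file name, the exact field layout and the labelling of silent stretches are assumptions and should be checked against the downloaded material.

def read_labels(path):
    """Return a list of (start, end, word) tuples from a word label file."""
    labels = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(None, 2)
            if len(parts) == 3:
                start, end, word = parts
                labels.append((float(start), float(end), word.strip()))
    return labels

# Hypothetical file name; total speaking time excluding pauses, assuming that
# silent stretches carry an empty or "." label.
words = read_labels("FK_map1.lab")
speech_time = sum(end - start for start, end, w in words if w not in (".", ""))
print(round(speech_time / 60, 1), "minutes of speech")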

Acknowledgements
The author would like to thank Sven Björsten, Peter Branderud and Hassan Djamshidpey for
their assistance with the recording of the data.

References
Anderson, A.H., M. Bader, E.G. Bard, E. Boyle, G. Doherty, S. Garrod, S. Isard, J. Kowtko, J.
McAllister, J. Miller, C. Sotillo, H. Thompson & R. Weinert, 1991. The HCRC Map Task
Corpus. Language and Speech 34(4), 351–366.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62(2–4),
215–226.
Helgason, P., 2002. Preaspiration in the Nordic Languages: Synchronic and Diachronic
Aspects. Ph.D. Thesis, Stockholm University.
Megyesi, B., 2002. Data-Driven Syntactic Analysis – Methods and Applications for Swedish.
Ph.D. Thesis, Department of Speech, Music and Hearing, KTH, Stockholm.
Megyesi, B. & S. Gustafson-Čapková, 2002. Production and Perception of Pauses and their
Linguistic Context in Read and Spontaneous Speech in Swedish. Proceedings of the 7th
ICSLP, Denver.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 61
Working Papers 52 (2006), 61–64

The Relative Contributions of Intonation and
Duration to Degree of Foreign Accent in
Norwegian as a Second Language
Snefrid Holm
Department of Language and Communication Studies,
Norwegian University of Science and Technology (NTNU)
snefrid.holm@hf.ntnu.no

Abstract
This study investigates the relative contributions of global intonation and global segment
durations to degree of foreign accent in Norwegian. Speakers of Norwegian as a second
language (N2) from different native languages (L1s) plus one native Norwegian (N1) speaker
are recorded reading the same sentence. The N2 utterances’ global intonation and global
segment durations are manipulated to match the N1 pronunciation. In this way every N2
speaker provides four utterance versions: the original, a duration corrected version, an
intonation corrected version and a version with both features corrected. N1 listeners judge
the degree of foreign accent between each speaker’s four utterance versions. The results show
that a) the combined correction of both features reduces the degree of foreign accent for all
speakers, b) each correction by itself reduces the degree of foreign accent for all but two of
the investigated L1 groups and c) some L1 groups benefit more from intonation correction
whereas others benefit more from duration correction.

1 Introduction
When learning a second language after early childhood the resulting speech will normally be
foreign accented (e.g. Flege, Munro & Mackay, 1995). The phenomenon of foreign accent is
complex and comprises issues regarding the nature of the foreign accent itself as well as the
foreign accent’s various effects on listeners, for instance regarding social acceptance or the
ability to make oneself understood.
A foreign accent may not in itself hinder communication. Although degree of foreign accent
is often confounded with degree of intelligibility, a growing body of evidence supports the
view that even heavily accented speech may sometimes be perfectly intelligible (Derwing &
Munro, 1997; Munro & Derwing, 1995). The relationship between a deviating pronunciation
on the one hand and its effect on listener dimensions like intelligibility or perceived degree of
foreign accent on the other hand is not clear. There is however a general belief that prosodic
deviations are more important than segmental ones, at least for intelligibility, although there
are rather few studies to support this view (Munro & Derwing, 2005).
This study aims to establish which of the two pronunciation features global intonation and
global segment durations contributes most to perceived degree of foreign accent in Norwegian
as spoken by second language learners.
The present paper reports on a study which is part of a larger work where the next step will
be to investigate the effect of the same two pronunciation features upon intelligibility. In this
way I hope to shed some light upon the relationship between non-native pronunciation, degree
of accent, and intelligibility.

2 Experimental procedure
2.1 Recordings
The speakers were 14 adult learners of Norwegian as a second language with British English,
French, Russian, Chinese, Tamil, Persian and German as their L1s. There were two speakers
from each of the L1s. The speakers were of both sexes. In addition one male native
Norwegian speaker from the South East region was recorded in order to provide an N1
template.
The speech material was recorded in a sound-treated studio using a Milab LSR 1000
microphone and a Fostex D-10 digital recorder. Files were digitized with a sampling rate of
44.1 kHz and later high-pass filtered over 75 Hz. Speech analyses and manipulations were
carried out with the Praat program (Boersma & Weenink, 2006).
The N2 speakers and the N1 speaker all read the same Norwegian sentence: Bilen kjørte
forbi huset vårt (=The car drove past our house).

2.2 Stimuli
Global intonation and global segment durations in the N2 utterances were computer
manipulated to match the N1 production of the same sentence. The intonation was
manipulated by replacing the intonation contour of each N2 utterance with the stylized
intonation contour of the N1 utterance. Because of durational differences between the N1
utterance and the various N2 utterances, the intonation contour had to be manually corrected
in the time domain. Because of pitch level differences between the speakers, especially
between the sexes, the intonation contour also had to be manually shifted in frequency so as to
suit the individual N2 speaker’s voice. Manipulation of segment durations required a
phonemic segmentation of both the N1 and the N2 utterances. All segment durations were
measured and the N2 phonemes were lengthened or shortened to match the segment durations
of the N1 utterance.
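The manipulations were carried out in Praat; the sketch below merely indicates how a comparable intonation transplantation could be scripted through the parselmouth Python interface to Praat. The file names, the 75-400 Hz analysis range and the resynthesis settings are assumptions for illustration, and the stylization of the contour as well as the manual corrections in time and frequency described above are omitted.

import parselmouth
from parselmouth.praat import call

n1 = parselmouth.Sound("n1_bilen.wav")   # hypothetical file names
n2 = parselmouth.Sound("n2_bilen.wav")

# Build manipulation objects and copy the N1 pitch contour onto the N2 utterance.
n1_manip = call(n1, "To Manipulation", 0.01, 75, 400)
n2_manip = call(n2, "To Manipulation", 0.01, 75, 400)
n1_pitch_tier = call(n1_manip, "Extract pitch tier")
call([n2_manip, n1_pitch_tier], "Replace pitch tier")

# Resynthesize the intonation-corrected version of the learner utterance.
intonation_corrected = call(n2_manip, "Get resynthesis (overlap-add)")
intonation_corrected.save("n2_intonation_corrected.wav", "WAV")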
Three manipulated versions of each speaker’s original utterance were generated: one
intonation corrected utterance version, one duration corrected utterance version and one
utterance version with both features corrected.
The stimuli thus consisted of four utterance versions for each speaker: the original utterance
and three manipulated versions. These four versions were put together as pairs. Each pair was
put in a separate sound file with a two-second pause in between. These stimulus pairs enabled
the direct comparison of each speaker’s four utterance versions. Note that each stimulus pair
consists of two utterance versions from the same speaker so that utterance versions are always
compared within speaker.

2.3 Experiment
13 native Norwegian listeners evaluated the stimulus pairs. None reported out-of-the-ordinary
experience with N2 speech, and none reported poor hearing. There were 8 listeners from low-
tone dialects and 5 listeners from high-tone dialects. The listeners were paid for their
participation.
The listener was seated in a sound-treated studio and the sound was presented through
loudspeakers. The listener’s task was to judge which of the two utterance versions in each
stimulus pair sounded less foreign accented than the other. They also had the option to rate the
two utterances as equally foreign accented. All stimulus pairs were presented in random order,
and they were presented 10 times each.
The listeners were not told that some of the utterances they would hear were altered through
computer manipulation. The participants seemed to find the test design comprehensible.

3 Results
The listeners’ responses were subjected to statistical testing. However, no statistics will be
presented in this paper. The main findings will be presented and discussed in the following.

3.1 Intonation vs. duration


The results show that when both global intonation and global segment durations are
manipulated, this correction reduces the amount of perceived foreign accent in the N2 speech.
This effect is statistically significant for all N2 speakers across the different L1s.
When the listeners are exposed to speech where only one pronunciation feature is corrected,
it is shown that each correction separately contributes to the reduction of foreign accent. This
effect is statistically significant for all L1 groups with two exceptions. For the N2 speakers
with British English as their L1, the native listeners judge the degree of foreign accent as
unaltered despite global intonation correction. For the N2 speakers with German as their L1,
the correction of global segment durations does not affect the perceived degree of foreign
accent.
It is thus clear that, in general, both global intonation and segment durations are significant
contributors to perceived degree of foreign accent. The interesting question is which of these
two corrections reduces the degree of foreign accent most effectively. This is shown to vary
between the L1s as presented in Table 1 below. The table also shows the relative size of the
accent reduction brought about by the corrections.

Table 1. The middle column shows, for each of the L1s, which correction contributes most to
the reduction of perceived foreign accent. The rightmost column shows the relative size of the
accent reductions. T = Tamil, C = Chinese, E = English, F = French, G = German; '>' = larger
effect.

L1                         Main contribution     Effect size
Tamil, Chinese, English    Segment durations     T > C > E
French, German             Global intonation     F > G
Russian, Persian           Equal effect          -

The table shows that the N2 speech produced by speakers with the native languages Tamil,
Chinese and English is affected more by the correction of global segment durations than by
the correction of global intonation, for the purpose of foreign accent reduction.
Conversely, the N2 produced by speakers with the native languages French and German is
perceived as having less foreign accent when the global intonation is corrected than when the
global duration is corrected. For the Russian and Persian participants there was no difference
between the two pronunciation features. This means that correcting global intonation reduces
the amount of perceived foreign accent to the same degree as correcting global segment
durations.
The L1s can thus be categorized according to which of the two investigated pronunciation
features reduces the foreign accent more than the other. There are however differences within
each of these two categories as the degree to which the foreign accent is reduced by a
correction differs between the L1s. Native speakers of Tamil, Chinese and English all benefit
most from duration correction, but the effect of this correction has greater impact on the
foreign accent for some L1 groups than for others: the correction reduces the foreign accent
more for the Tamil speakers' N2 than for the Chinese speakers' N2, and more for the Chinese
speakers' N2 than for the English speakers' N2. Likewise, speakers with French and German
as their native languages benefit most from intonation correction, but the foreign accent
reduction effect is larger for the French L1 group than for the German L1 group.
The native Norwegian listeners that participated in the experiment represented both low-
tone and high-tone dialects. No correlation was found between listener dialect and responses
in the perception experiment.

References
Boersma, P. & D. Weenink, 2006. Praat: doing phonetics by computer (Version 4.4.17)
[Computer program]. Retrieved April 19, 2006, from http://www.praat.org/.
Derwing, T. & M.J. Munro, 1997. Accent, intelligibility and comprehensibility: Evidence
from four L1s. Studies in second language acquisition 19, 1-16.
Flege, J.E., M.J. Munro & I.R.A. MacKay, 1995. Factors affecting strength of perceived
foreign accent in a second language. Journal of the Acoustical Society of America 97, 3125-
3134.
Munro, M.J. & T. Derwing, 1995. Processing time, accent and comprehensibility in the
perception of native and foreign accented-speech. Language and Speech 38, 289-306.
Munro, M.J. & T. Derwing, 2005. Second language accent and pronunciation teaching: A
research based approach. TESOL Quarterly 39(3), 379-397.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 65
Working Papers 52 (2006), 65–68

The Filler EH in Swedish


Merle Horne
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
merle.horne@ling.lu.se

Abstract
Findings from a pilot study on the distribution, function and phonetic realization of the filler
EH in interviews from SweDia2000 interviews are presented. The results show that EH
occurs almost exclusively after function words at the beginning of constituents. The phonetic
realization of EH was seen to be of three basic forms: a middle-high vowel (e.g. [ ], [e], [ ]),
a vowel+nasal (e.g. [ m], [ m], [ n]), and a vowel with creaky phonation (e.g. [ ]). The
vowel+nasal realization occurs as has been shown for English before other delays and is
associated with planning of complex utterances. Since creaky phonation is associated with
terminality, the creaky vowel realization of EH could be interpreted as signalling the juncture
between the filler and an upcoming disfluency.

1 Introduction
The filler, or ‘filled pause’ EH has often been termed a ‘disfluency’ (e.g. Shriberg, 2001),
since it constitutes a delay in the flow of speech associated with referential meaning.
However, since it can often be assigned pragmatic functions, such as signalling an upcoming
focussed word (Bruce, 1998), or need on the part of the speaker to plan or code his/her speech
and thus a desire to hold the floor, EH can also be considered to be an integral part of the
linguistic system (see e.g. Allwood (1994), and Clark & Fox Tree (2002) who refer to it as a
‘word’). In a study on English, Clark & Fox Tree (2002) found that its realization as Uh
signals a minor delay in speaking, whereas Um announces a major delay in speaking.
A number of studies on Swedish have reported some characteristics of EH in different
speaking styles, but none have focussed on the variation in the phonetic realization of EH as
far as I know. Hansson (1998), in a study on the relationship between pausing and syntactic
structure in a spontaneous narrative, found that the filled pauses (n=22) in her material
occurred at clause boundaries after conjunctions and discourse markers and before focussed
words. Lundholm (2000) in a study on pause duration in human-human dialogues found that
the filler EH (n=55) in authentic travel-bureau dialogues occurred in turn non-initial position
and had a duration similar to silent planning pauses (mean = 340 ms). Eklund (2004) in a
number of studies on simulated human-human and human-machine dialogues found that the
most common position of EH (n=2601) was utterance-initial before another disfluency and
that it was most often followed by jag ‘I’, and det/den ‘it’. The filled pauses were found to
have a mean duration of about 500 ms, thus considerably longer than those found by
Lundholm (2000) in authentic task dialogues.

2 Current study
The present study has been carried out to pursue the investigation of EH in spontaneous data
to get some better idea as to its distribution, function, and phonetic realization in authentic
interviews where the speech is basically of a monologue style. Spontaneous speech from the
SweDia 2000 interview material was used for the study (<http://www.swedia.nu/>). The
speech of two female speakers from Götaland (a young woman from Orust and an older
woman from Torsö) was transcribed and all EH fillers were labeled.

3 Results
3.1 Distribution of EH
The spontaneous SweDia data showed that EH occurs principally in non-utterance-initial
position. There were only two cases of EH in utterance-initial position in the data studied and
their mean duration was 899 ms.
EH occurs almost exclusively as a clitic to a preceding function word: 127 of the 137
instances of EH were cliticized to a preceding function word. The most frequent function
words preceding EH were the coordinate conjunctions och ‘and’ and men ‘but’ which often
have discourse functions, e.g. introducing topic continuations, new topics, etc. 52 cases of EH
occurred after these two function words. Och EH ‘and UH’ was the most common function
word+filler construction and was often (in 30 of 38 cases) preceded or followed by an
inhalation break, a clear indicator of a speech chunk boundary (see Horne et al., 2006). The
second most frequent function word category preceding EH was the subordinate conjunction
att ‘that’ which also sometimes functions as a discourse marker introducing a non-subordinate
clause. 24 instances of EH occurred after att. The other instances of EH were found after the
following function words: preposition (n=13), articles (n=9), pronouns (Subject) (n=9), basic
verbs or auxiliary verbs (n=9), demonstrative article (n=5), indefinite adjective (n=3),
subordinate conjunction (other than att ‘that’) (n=2), negation (n=1). Content words preceded
EH in only 7 cases. Finally, there was 1 case where EH was a repetition.
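The kind of tally underlying these counts can be sketched as follows, assuming the labelled material is available as a time-ordered list of word tokens in which the filler is marked "EH"; the token format and the (partial) function word list are invented for illustration.

from collections import Counter

FUNCTION_WORDS = {"och", "men", "att", "som", "i", "på", "det", "en"}  # partial list

def preceding_word_counts(tokens):
    """Count which word immediately precedes each EH token."""
    counts = Counter()
    for prev, tok in zip(tokens, tokens[1:]):
        if tok == "EH":
            counts[prev] += 1
    return counts

tokens = ["och", "EH", "då", "var", "det", "att", "EH", "men", "EH", "huset"]
counts = preceding_word_counts(tokens)
print(counts)
print("after function words:", sum(v for w, v in counts.items() if w in FUNCTION_WORDS))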

3.2 Phonetic realization of EH in non-initial position


Three basic realizations of the filler EH have been observed: (1) a middle-high front or central
vowel: e.g. [ ], [e], [ ] (see Fig. 1), (2) a nasalized vowel or vowel+nasal consonant: e.g. [ n],
[ m], [ n] (see Fig. 2), (3) a glottalized or creaky vowel: e.g. [ ] (see Fig. 3).

Figure 1. Example of the realization of EH as the middle high vowel [ ].

The vowel realizations of EH were the most frequent (n=61) and had a mean duration of 268
ms and a SD of 136 ms. The nasalized or vowel+nasal realizations were second in frequency
(n=43), and had a mean duration of 436 ms and a SD of 185 ms. These showed a distribution
like the vowel+nasal fillers in English that Clark & Fox Tree (2002) analysed, i.e. they were
always followed by other kinds of ‘delays’, sometimes several in sequence as in Figure 2 with
SWALLOW, SMACK, INHALE following EH.

Figure 2. Example of the realization of EH as a vowel+nasal [ m]. Notice the other delays
(SWALLOW, SMACK, INHALE) following [ m].

Figure 3. Example of the realization of EH as the creaky vowel [ ].

The creaky vowel realizations of EH were the fewest (n=31) and had a mean duration of 310
ms and a SD of 150 ms. Their duration thus overlaps with the durations of the vowel and
vowel+nasal realizations. Unlike the vowel+nasal realizations, the only other delay that was
observed to follow the creaky filler was a silent pause. Creaky fillers are in some sense
unexpected, since EH is often assumed to be a signal that the speaker wants to hold the floor,
whereas creak, on the other hand, is assumed to be a signal of finality (Ladefoged, 1982).
Nakatani & Hirschberg (1994), however, have observed that glottalization is not uncommon
before a speech repair, and thus the creaky EH could, therefore, be interpreted as a juncture
signal for an upcoming disfluency. Observation of the SweDia data shows in fact that the
creaky realizations have a tendency to occur before disfluencies, as in the following examples:
a) men den såg ju inte ut [ det var någon ‘but it did not look [ it was somebody’, b) å då
var det en [ en kar som heter Hans Nilsson som blev ordförande ‘and then there was a [ a
guy named Hans Nilsson who became chairman’. Creaky fillers also occur in environments
where the speaker seems to be uncertain or have problems in formulating an utterance, e.g. för
att då blev det ju så att PAUSE [ PAUSE Johannesberg det skulle ju läggas ner ‘since it
happened that PAUSE [ PAUSE Johannesberg it was going to be shut down’.

4 Summary and conclusion


This study on the distribution, function and phonetic realization of the filler EH has shown
that the occurrence of EH in the SweDia spontaneous speech studied here is mostly restricted
to a position following a function word at the beginning of an utterance. This supports and
generalizes the findings of Hansson (1998) and Lundholm (2000) who found the filler EH
most often in utterance internal position after conjunctions/discourse markers in spontaneous
speech, both of a monologue and dialogue type. This differs from the findings for simulated
task-related dialogues in Eklund (2004), where the filler EH occurred almost exclusively in
utterance-initial position. This difference is most likely due to the simulated nature of the
speech situation where the planning and coding of speech is more cognitively demanding.
As regards the phonetic realization of the filler EH, the patterning in Swedish is seen to be
partially similar to the findings of Clark & Fox Tree (2002) for English: A vocalic realization
of EH occurs before shorter delays in speech whereas a vowel+nasal realization correlated
with relatively longer delays in speech. The duration of the vocalic realizations in the present
data (mean = 268 ms) corresponds rather well with the median duration for EH found by
Lundholm (240 ms) in spontaneous dialogues; thus, we would expect that the fillers in her
data were realized mainly as a vowel such as ([ ], [e], [ ]). A third realization, with creaky
phonation, whose distribution overlaps with the other two realizations would appear to be
associated with relatively more difficulty in speech coding; the creaky phonation, associated
with termination, perhaps signals that the speaker has problems in formulating or coding
his/her speech, and was observed to sometimes occur before repairs and repetitions. More data
is needed, however, in order to draw more conclusive results.

Acknowledgements
This research was supported by a grant from the Swedish Research Council (VR).

References
Allwood, J., 1994. Om dialogreglering. In N. Jörgenson, C. Platzack & J. Svensson (eds.),
Språkbruk, grammatik och språkförändring. Dept. of Nordic Lang., Lund U., 3-13.
Bruce, G., 1998. Allmän och svensk prosodi. Dept. of Linguistics & Phonetics, Lund U.
Clark, H. & J. Fox Tree, 2002. Using uh and um in spontaneous speaking. Cognition 84, 73-
111.
Eklund, R., 2004. Disfluency in Swedish: Human-human and human-machine travel booking
dialogues. Ph.D. Dissertation, Linköping University.
Hansson, P., 1998. Pausering i spontantal. B.A. essay, Dept. of Ling. & Phonetics, Lund U.
Horne, M., J. Frid & M. Roll, 2006. Timing restrictions on prosodic phrasing. Proceedings
Nordic Prosody IX, Frankfurt am Main: P. Lang, 117-126.
Ladefoged, P., 1982. The linguistic use of different phonation types. University of California
Working Papers in Phonetics 54, 28-39.
Lundholm, K., 2000. Pausering i spontana dialoger: En undersökning av olika paustypers
längd. B.A. essay, Dept. of Ling. & Phonetics, Lund U.
Nakatani, C. & J. Hirschberg, 1994. A corpus-based study of repair cues in spontaneous
speech. Journal of the Acoustical Society of America 95, 1603-1616.
Shriberg, E., 2001. To ‘errrr’ is human: ecology and acoustics of speech disfluencies. Journal
of the International Phonetic Association 31, 153-169.
SweDia 2000 Database: http://www.swedia.nu/.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 69
Working Papers 52 (2006), 69–72

Modelling Pronunciation in Discourse Context


Per-Anders Jande
Dept. of Speech, Music and Hearing/CTT, KTH
jande@speech.kth.se

Abstract
This paper describes a method for modelling phone-level pronunciation in discourse context.
Spoken language is annotated with linguistic and related information in several layers. The
annotation serves as a description of the discourse context and is used as training data for
decision tree model induction. In a cross validation experiment, the decision tree pronuncia-
tion models are shown to produce a phone error rate of 8.1% when trained on all available
data. This is an improvement by 60.2% compared to using a phoneme string compiled from
lexicon transcriptions for estimating phone-level pronunciation and an improvement by
42.6% compared to using decision tree models trained on phoneme layer attributes only.

1 Introduction and background


The pronunciation of a word is dependent on the discourse context in which the word is ut-
tered. The dimension of pronunciation variation under study in this paper is the phone dimen-
sion and only variation such as the presence or absence of phones and differences in phone
identity are considered. The focus is on variation that can be seen as a property of the lan-
guage variety rather than as individual variation or variation due to chance.
Creating models of phone-level pronunciation in discourse context requires a detailed de-
scription of the context of a phoneme. Since the discourse context is the entire linguistic and
pragmatic context in which the word occurs, the description must include everything from
high-level variables such as speaking style and over-all speech rate to low-level variables such
as articulatory feature context.
Work on pronunciation variation in Swedish has been reported by several authors, e.g.
Gårding (1974), Bruce (1986), Bannert & Czigler (1999), Jande (2003; 2005). There is an ex-
tensive corpus of research on the influence of various context variables on the pronunciation
of words. Variables that have been found to influence the segmental realisation of words in
context are foremost speech rate, word predictability (or word frequency) and speaking style,
cf. e.g. Fosler-Lussier & Morgan (1999), Finke & Waibel (1997), Jurafsky et al. (2001) and
Van Bael et al. (2004).

2 Method
In addition to the variables mentioned above, the influence of various other variables on the
pronunciation of words has been studied, but these have mostly been studied in isolation or
together with a small number of other variables. A general discourse context description for
recorded speech data, including a large variety of linguistic and related variables, will enable
data-driven studies of the interplay between various information sources on e.g. phone-level
pronunciation. Machine learning methods can be used for such studies. A model of pronuncia-
tion variation created through machine learning can be useful in speech technology applica-
tions, e.g. for creating more dynamic and natural-sounding speech synthesis. It is possible to
create models which can predict the pronunciation of words in context and which are simulta-
neously descriptive and to some degree explain the interplay between different types of vari-
ables involved in the predictions. The decision tree induction paradigm is a machine learning
method that is suitable for training on variables of diverse types, as those that may be included
in a general description of discourse context. The paradigm also creates transparent models.
This paper describes the creation of pronunciation models using the decision tree paradigm.

2.1 Discourse context description


The speech databases annotated comprise ~170 minutes of elicited and scripted speech. Ca-
nonical phonemic word representations are collected from a pronunciation lexicon and the
phoneme is used as the central unit in the pronunciation models. The annotation is aimed at
giving a general description of the discourse context of a phoneme and is organised in six lay-
ers: 1) a discourse layer, 2) an utterance layer, 3) a phrase layer, 4) a word layer, 5) a syllable
layer and 6) a phoneme layer. Each layer is segmented into a linguistically meaningful type of
unit which can be aligned to the speech signal and the information included in the annotation
is associated with a particular unit in a particular layer. For example, in the word layer, infor-
mation about part of speech, word frequency, word length etc. is included. The information
associated with the units in the phoneme layer is instead phoneme identity, articulatory fea-
tures etc. For a more detailed description of the annotation, cf. Jande (2006).

2.2 Training data


Decision trees are induced from a set of training instances compiled from the annotation. The
training instances are phoneme-sized and can be seen as a set of context sensitive phonemes.
Each training instance includes a set of 516 attribute values and a phone realisation, which is
used as the classification key. The features of the current unit at each layer of annotation are
included as attributes in the training examples. Where applicable, information from the
neighbouring units at each annotation layer is also included in the attribute sets.
The key phone realisations are generated by a hybrid automatic transcription system using
statistical decoding and a posteriori correction rules. This means that there is a certain degree
of error in the keys. When compared to a small gold standard transcription, the automatic tran-
scription system was shown to produce a phone error rate (PER) of 15.5%. Classification is
not always obvious in manual transcription; there are, for example, many cases where the
transcriber has to choose between a full vowel symbol and a schwa. Defaulting to the system
decision whenever a human transcriber is
forced to make ad hoc decisions would increase the speed of manual checking and correction
of automatically generated phonetic transcripts without lowering the transcript quality. If this
strategy had been used at gold standard compilation, the estimation of the system accuracy
would have been somewhat higher. The 15.5% PER is thus a pessimistic estimate of the tran-
scription system performance.
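PER is conventionally computed as the Levenshtein distance (substitutions, insertions and deletions) between the hypothesis and the reference phone strings, divided by the length of the reference. The sketch below implements that standard convention; whether the present evaluation used exactly this alignment is not stated above.

def phone_error_rate(reference, hypothesis):
    """Levenshtein distance (sub + ins + del) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n] / m

# Toy example: hypothesis differs from the reference by one deletion.
print(phone_error_rate(list("bilen"), list("bien")))   # -> 0.2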

2.3 Decision tree model induction


Decision tree induction is non-iterative and trees are built level by level, which makes the
learning procedure fast. However, the optimal tree is not guaranteed. At each new level cre-
ated during the tree induction procedure, the set of training instances is split into subsets ac-
cording to the values of one of the attributes. The attribute selected is the attribute that best
meets a given criterion, generally based on entropy minimisation. Since training data mostly
contain some degree of noise, a decision tree may be biased toward the noise in the training
data (over-trained). However, a tree can be pruned to make it more generally applicable. The
idea behind pruning is that the most common patterns are kept in the model, while less com-
mon patterns, with high probability of being due to noise in the training data, are deleted.
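No particular toolkit is named above; as an illustration, the same ingredients (entropy-based splits and optional pruning) are available in scikit-learn's DecisionTreeClassifier, as sketched below with toy data.

from sklearn.tree import DecisionTreeClassifier

# Toy training instances: [phoneme id, mean phoneme duration in the word (s),
# is function word]; the keys are phone realisations. All values are invented.
X = [[1, 0.030, 1], [1, 0.045, 0], [1, 0.050, 0], [2, 0.040, 1], [2, 0.060, 0]]
y = ["null", "v", "v", "null", "a"]

unpruned = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.05,
                                random_state=0).fit(X, y)

print(unpruned.predict([[1, 0.048, 0]]), pruned.get_n_leaves())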

3 Model performance
A tenfold cross validation procedure was used for model evaluation. Under this procedure, the
data is divided into ten equally sized partitions using random sampling. Ten different decision
trees are induced, each with one of the partitions left out during training. The partition not
used for training is then used for evaluation. A pruned and an unpruned version of each tree
were created and the version with the highest prediction accuracy on the evaluation data was
used for calculating the average prediction accuracy. The annotation contains some prosodic
information (variables based on pitch and duration measures calculated from the signal),
which cannot be fully exploited in e.g. a speech synthesis context. Thus, it was interesting to
investigate the influence of the prosodic information on model performance. For this purpose,
a tenfold cross validation experiment where the decision tree inducer did not have access to
the prosodic information was performed. As a baseline, an experiment with trees induced
from phoneme layer information only was also performed.
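A schematic version of this evaluation procedure, again using scikit-learn and toy data as stand-ins for the actual attribute sets and keys, is sketched below: ten random folds, with the better of a pruned and an unpruned tree kept for each fold.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))                      # stand-in for the 516 attributes
y = rng.integers(0, 3, size=200)              # stand-in for phone realisations

scores = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    fold_best = 0.0
    for alpha in (0.0, 0.01):                 # unpruned vs. pruned
        tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha,
                                      random_state=0).fit(X[train], y[train])
        fold_best = max(fold_best, tree.score(X[test], y[test]))
    scores.append(fold_best)

print("mean accuracy:", sum(scores) / len(scores))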

3.1 Results
Attributes from all layers of annotation were used in the models with the highest prediction
accuracy. The topmost node of all trees was phoneme identity and other high ranking attrib-
utes were phoneme context, mean phoneme duration measured over the word and over the
phrase, and function word, a variable separating between a generic content word representa-
tion and the closed set of function words. The trees produced an average phone error rate of
8.1%, which is an improvement by 60.2% compared to using a phoneme string compiled from
a pronunciation lexicon for estimating the phone-level realisation.
The average PER of models trained on phoneme layer attributes only was 14.2%, which
means that the prediction accuracy was improved by 42.6% by adding attributes for units
above the phoneme layer to the training instances. A comparison between the models trained
on all attributes and the models trained without access to prosodic information showed that
the prosodic information gave a decrease in PER from 13.1 to 8.1% and thus increased model
performance by 37.8%.
The phonetic transcript generated by the models trained on all attributes was also evaluated
against actual target transcripts, i.e., the gold standard used to evaluate the automatic tran-
scription system. In this evaluation, the models produced a PER of 16.9%, which means that
the deterioration in performance when using an average decision tree model instead of the
automatic transcription system is only 8.5% and that the improvement using the model instead
of a phoneme string is 34.9%.

4 Model transparency
Figure 1 shows a pruned decision tree trained on all available data. The tree uses 58 of the 516
available attributes in 423 nodes on 12 levels. The transparency of the decision tree represen-
tation becomes apparent from the magnification of the leftmost sub-tree under the top node,
shown in the lower part of Figure 1.
The top node of the tree is phoneme identity and the magnified branch is the branch repre-
senting phoneme identity /v/. It can be seen that there are two possible realisations of the pho-
neme /v/, [v] and null (no realisation) and it is easy to understand the conditions under
which the respective realisations are used. If the mean phoneme duration over the word is less
than 35.1 ms, the /v/ is never realised. If the mean phoneme duration is between 35.1 and 38.2
ms, the current word is decisive. If the word is one of the function words vad, vi, vara, vid, or
av, the /v/ is not realised. If the word is any content word or the function word blev, the /v/ is
realised as [v]. Finally, if the mean phoneme duration over the word is more than 38.2 ms, the
/v/ is realised (as [v]) unless the phoneme to the right is also a /v/.
Figure 1. A pruned decision tree trained on all available data (top node: phoneme identity;
the full set of node labels is omitted here).

E0 A: E0 A: E0 A: null J E0 E word_duration_phonemes_absolute phrase_duration_phonemes_absolute I I: I phrase_duration_vowels_log_absolute I: phoneme_identity−1 Å null R null R null word_duration_phonemes_absolute+1 R syll_nucleus R null null discourse_duration_vowels_absolute E0 null A word_duration_phonemes_absolute E0 A Ö4 E0 null N null N word_part_of_speech+3 Ä: Ä phoneme_identity+2 Ä3 word_type_with_function−1 Ä3 Ä3 phrase_mel_pitch_range Ä4 phrase_duration_phonemes_absolute phoneme_identity−1 Ä3 Ä4 Ä4 E0 O E0 phoneme_identity+2 U Y Y:

<0.0158196 >0.0158196 <0.0343429 >0.0343429 <−4.26926 >−4.26926 sil,E0,I:,Å,M,R,junk,NG,RT J,S,P,N <0.0415884 >0.0415884 A:,E O:,E0,I:,Å,A,Ö4,E:,Å:,I,Ä3,O,U:,Ö3,U,Y,Ä4 − <0.0694128 >0.0694128 <0.0864187 >0.0864187 PP,NN,VB,PN,−,DT,JJ,PM,KN,HP,AB,PS,PLQS,SN,IE,IN HA,UO J,S,M,T,L,Ä3 A:,G,R,B,sil,P,N,F K −,jag,content,det,som,de,den,för,sånt,många är,han <56.3845 >56.3845 <0.0536095 >0.0536095 sil D,J,S,B,T,SJ,L,F K P,H A:,G,I:,S,Å,M,R,A,K,Ö4,E:,sil,V,Å:,E,T,N,O,F,Y,H,Ä,TJ,NG I,Ä3

null E0 E0 word_lexeme_repetitions E0 I Å E0 phrase_length_phonemes null phrase_type−1 R R syll_position_in_word word_duration_vowels_normalised null discourse_duration_vowels_absolute Ä: Ä phrase_duration_vowels_normalised+1 Ä4 Ä3 Ä4 E0 Ä4 E0 Ä4 Ä3 Ä phrase_duration_vowels_absolute E0 Ä4 E0 U

<7.5 >7.5 <2.5 >2.5 NP,PP,NOP,ADVP VC APMIN i,m f <−0.665706 >−0.665706 <0.0818528 >0.0818528 <−0.851477 >−0.851477 <0.02875 >0.02875

I: I null R null R null word_type_with_function E0 syll_position_in_word phrase_length_phonemes−1 word_duration_phonemes_log_absolute+1 A Ä4 Ä Ä word_part_of_speech+3

content,utanför,allt,varenda,varandra,allihopa,varandras man,började,bland i,m f <25.5 >25.5 <−2.36703 >−2.36703 PP,NN,VB,PN,−,DT,JJ,KN,HP,AB HA,IN

A E0 A word_type_with_function−1 A E0 E0 A Ä3 Ä4

−,jag,content,i,på,med,det,är,vi,skulle,och,den,upp,att,några,för,till,eller,alla,ville,sina,ska,kan,in,kunde,vilja,du,dom,törs som,en,man,av,mellan,har

phoneme_identity−1 A

V,D,J,G,S,R,K,N,T,RN,L,RT M,I,NG B,P,H

phoneme_identity+4 A A

A:,J,O:,E0,S,Å,M,R,A,K,E:,D,sil,Å:,E,T,I,N,SJ,L,Ä:,Ä3,V,O,F,Ö3,U,Y,H,− G,I:,NG

E0 A

Figure 1. The upper part of the figure shows the pruned version of a decision tree and the
lower part of the figure shows a magnification of a part of the tree.

Acknowledgements
The research reported in this paper is carried out at the Centre for Speech Technology, a com-
petence centre at KTH, supported by VINNOVA (the Swedish Agency for Innovation Sys-
tems), KTH and participating companies and organisations. The work was supported by the
Swedish Graduate School of Language Technology, GSLT.


Are Verbs Less Prominent?


Christian Jensen
Department of English, Copenhagen Business School
cje.eng@cbs.dk

Abstract
The perceived prominence of three parts of speech (POS), nouns, verbs and adjectives, in three utterance positions, initial, intermediate and final, was examined in a perceptual experiment to see whether previously observed reductions in prominence of intermediate
items were the result of positional effects or because words in this position belonged to the
same POS, namely verbs. It was found that the perceived prominence of all three POS was
reduced in intermediate position, and that the effect of POS membership was marginal,
although adjectives tended to be slightly more prominent than nouns and verbs.

1 Introduction
In a previous study of the perceived prominence of accented words in Standard Southern
British English (SSBE) (Jensen, 2003; 2004) it was found that, in short sentences, accented
words in utterance initial and utterance final position are generally perceived as more
prominent than accented words in an intermediate position. This is in accordance with
traditional descriptions of intonation in SSBE and has also been observed in German (Widera,
Portele & Wolters, 1997) and, at least with regard to utterance initial position, Dutch
(Streefkerk, 2001). In utterances with multiple intermediate accented lexical items these
seemed to form an alternating strong – weak pattern, and the complete pattern of the entire
utterance was explained (in part) as reflecting the intermediate accent rule, which states that
“any accented syllables between onset and nucleus are liable to lose their accent” (Knowles,
1987: 124).
However, it was suggested to me that the observed pattern might not be a general property
of the prosodic structure of utterances (or phrases), but rather a reflection of lexical/semantic
properties of the sentences employed in the study. Most of these were of the type Bill struck
Ann and Sheila examined the patient carefully, i.e. SVO structure with a verb as the second
lexical item. Some studies have noted a tendency for verbs to be perceived as less prominent
than other lexical items in various languages: Danish (Jensen & Tøndering, 2005), Dutch
(Streefkerk, 2001) and German (Widera, Portele & Wolters, 1997), so the reduction in
perceived prominence, which was particularly noticeable immediately following the first
accent of the utterance, could be the result of an inherent property of verbs.
The present study examines whether the tendency towards intermediate accent reduction
can be reproduced in utterances with varying lexico-syntactic structure and addresses the
following question: does the perceived prominence of a lexical item vary as a function of its
part of speech (POS) membership independently of the position of this item in an utterance?
Specifically, are verbs, in their function as main verbs in a clause, inherently less prominent
than (some) other parts of speech?

2 Method
Since the perceived prominence of words in utterances depends on factors other than the ones
studied here, most importantly information structure, it is necessary to find an experimental
design which limits the influence of these factors to a minimum. This
effectively rules out studies of spontaneous speech, since the influences of information
structure and the lexical content of the accented words are likely to mask the effects of
location within an utterance. A relatively large corpus of spontaneous speech would be
required to bring out these effects, which is not practical when measurements of perceived
prominence are elicited through the ratings of multiple listeners (see below). Instead, the
research question outlined above is addressed through a simple design involving read speech.
Verbs are compared with two other POS categories, namely nouns and adjectives. While
verbs are often found to be less prominent than other lexical items, nouns and adjectives are
consistently found to be among the most prominent words. The inclusion of these word
classes should therefore maximise any potential difference between verbs and “other lexical
items”. A number of sentences were constructed, each of which contained three lexical items,
one verb, one noun and one adjective, which were all expected to be accented. All six possible
combinations were used, with two examples of each type, giving a total of 12 different
sentences. Some examples of sentences from the material: The children claimed they were
innocent (noun – verb – adj); The little girl was crying (adj – noun – verb); He admitted she
was a beautiful woman (verb – adj – noun).
The decision to include all logical possibilities means that some of the sentence types are
more common, or “natural”, than others and also poses certain restrictions on verb forms, for
example when they occur in final position. However, this should not have any negative
influence on the research question as it is formulated above. Using this design, each POS
occurs four times in each of the three positions in the sentence.
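As a quick arithmetic check of this design (not part of the original paper; the snippet below is only an illustration), the counts follow directly from the six possible orderings:

from itertools import permutations

orders = list(permutations(["noun", "verb", "adjective"]))
print(len(orders))          # 6 possible POS orders
print(len(orders) * 2)      # 12 sentences (two examples per order)
print(len(orders) * 2 * 3)  # 36 utterances (three speakers)
# Each POS occupies a given position in 2 of the 6 orders, so with two example
# sentences per order it occurs 2 x 2 = 4 times in each of the three positions.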
The 12 sentences were recorded onto a computer by three speakers of Southern British
English, giving a total of 36 utterances, which were presented to the raters via a web page, one
utterance per page. The raters could hear the utterance as many times as they wanted by
pressing a “play” button, and indicated their judgment by selecting the appropriate scale point
for each lexical item and then clicking a “submit” button. A four-level scale was used, from 1
to 4, with 1 representing “low degree of emphasis” and 4 representing “high degree of
emphasis”. A four-level scale has been demonstrated to be preferable to commonly used
alternatives such as a binary scale or a 31-level scale (Jensen & Tøndering, 2005). The lower
end of the scale was represented by 1 rather than 0 here to signal that all words were expected
to have some degree of emphasis, since function words were excluded. Note also that the
word emphasis was used in the written instructions to the untrained, linguistically relatively
naive listeners, but refers to the phenomenon which elsewhere I call perceptual prominence
and not (just) to higher levels of prominence, such as contrastive emphasis. The notion of
“emphasis” (i.e. perceptual prominence) was both explained and exemplified in the online
instructions.
23 raters participated in the experiment, all students of English at the Copenhagen Business
School.

3 Results
The reliability of the data as a whole is good, with a Cronbach's alpha coefficient of 0.922.
However, reliability coefficients for any group of three or five raters were relatively low,
which indicates some uncertainty on the part of individual raters.
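For readers who wish to compute this kind of reliability themselves, a minimal sketch is given below. The matrix layout (one row per rated word, one column per rater) and the function name are assumptions made for illustration, not the author's actual procedure.

import numpy as np

def cronbach_alpha(ratings):
    # ratings: 2-D array, one row per rated word, one column per rater
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Made-up example: five words rated by four raters on the 1-4 scale.
example = [[3, 3, 4, 3], [2, 2, 2, 1], [4, 3, 4, 4], [1, 2, 1, 1], [3, 4, 3, 3]]
print(round(cronbach_alpha(example), 3))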
The overall ratings averaged over POS membership and position in the utterance are
displayed in Figure 1.

Figure 1. Prominence ratings based on 36 utterances (12 sentences × three speakers) averaged
over POS and position in utterance. Each bar represents average scores of 12 utterances as
perceived by all 23 raters on a scale from 1 to 4.

Average ratings for the three utterance positions and three parts of speech are all between 2.42
and 2.97 on the scale from 1 to 4. With regard to the effect of POS membership verbs were
not found to be less prominent than nouns, but they were rated slightly lower (by 0.16 on the
scale from 1 to 4) than adjectives. The difference is significant (one-tailed t-test, p < 0.05).
Adjectives were also in general found to be significantly more prominent than nouns (two-
tailed t-test, p < 0.001), which was not predicted (hence the use of two-tailed probability).
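A directional comparison of this kind can be run along the following lines; the rating vectors below are invented for illustration, and the choice of an unpaired test is an assumption rather than a claim about the test actually used in the paper.

import numpy as np
from scipy import stats

# Hypothetical per-utterance mean ratings for two parts of speech (values made up).
adjectives = np.array([2.9, 3.1, 2.8, 3.0, 2.7, 3.2, 2.9, 3.0, 2.8, 3.1, 2.9, 3.0])
verbs = np.array([2.7, 2.9, 2.6, 2.8, 2.6, 3.0, 2.7, 2.8, 2.6, 2.9, 2.7, 2.8])

t, p_two_tailed = stats.ttest_ind(adjectives, verbs)
# One-tailed probability for the directional hypothesis "adjectives > verbs":
p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2
print(t, p_one_tailed)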
As expected, words in second position are perceived as less prominent than words in initial
position by approximately 0.5 on the scale from 1 to 4. The difference is significant (one-
tailed t-test, p < 0.001). Somewhat surprisingly, words in final position are only slightly more
prominent (by 0.08) than words in second position, and the difference is only just significant
(one-tailed t-test, p < 0.05). The difference between initial and final position is highly
significant (two-tailed t-test, p < 0.001). This pattern, and in particular the low prominence
ratings of words in final position was not expected, but it is partly caused by the fact that so
far only utterance position has been taken into consideration. In some cases the speaker
(particularly one) divided these short utterances into two phrases, which may obviously have
an effect on the expected prominence relations (as produced by the speakers and perceived by
the listeners). Therefore, phrase boundaries were evaluated by three trained phoneticians
(including the author) and assigned to the material in those cases where at least two out of
three had perceived a boundary. This process divides all accented words up into three
categories in accordance with traditional British descriptions of English intonation: nucleus,
which is the last accented word of a phrase; onset, which is the first accented word in a phrase with more than one accent; and intermediate (my terminology), which is any accented word
between onset and nucleus. Figure 2 displays prominence ratings for these three positions
both across all three parts of speech and for each POS separately.
The overall pattern of prominence ratings according to phrase position is similar to the
patterning according to utterance position in Figure 1 (24 out of 36 cases are identical), but
words in intermediate position are more clearly less prominent than words in phrase final
(nucleus) position. All differences between onset, nucleus and intermediate position are highly
significant (p < 0.001). If we examine the results for the three parts of speech separately we
can see that verbs and adjectives behave similarly: onset and nucleus position are (roughly)
equally prominent (p > 0.1) but intermediate accents are less prominent (p < 0.01). The
difference is larger for verbs than for adjectives. For nouns, however, the onset position is
significantly more prominent than both intermediate and nucleus accents (p < 0.001) while the
latter are equally prominent (p > 0.1).

Figure 2. Prominence ratings for the three parts of speech in three different positions in the
intonation phrase.

4 Conclusion
There is a clear effect of phrase position on the perceived prominence of lexical items for all
three POS, nouns, verbs and adjectives, such that words in intermediate position are less
prominent than words in onset (initial) or nucleus (final) position (nouns in nucleus position
excepted). The effect noted in Jensen (2004) – reduction of perceived prominence of
intermediate accents – is therefore replicated here and is not likely to have been the result of a
certain syntactic structure with verbs in intermediate position.
With regard to the effect of POS membership it seems that adjectives are generally slightly
more prominent than verbs or nouns. This may be the result of a certain affective content of
(some or all of) the adjectives. Although care had been taken to avoid overly affective
adjectives, it is difficult to control for minor variations of this parameter.
The interpretation of the results is complicated by the fact that nouns were rated as very
prominent in onset position but markedly less so in nucleus position. Such a difference was
not found in similar sentences in Jensen (2004), and I have no immediate explanation for this
observation.
The question raised in the title and introduction of this paper must therefore be answered
somewhat tentatively: while verbs were found to be slightly less prominent than adjectives,
the difference was rather small. And while verbs were found to be as prominent as nouns
overall, they were less prominent in onset position but more prominent in nucleus position.
The implications of this surprising result await further investigation.

References
Jensen, C., 2003. Perception of prominence in standard British English. Proceedings of the
15th ICPhS 2003, Barcelona, 1815-1818.
Jensen, C., 2004. Stress and accent. Prominence relations in Southern Standard British
English. PhD thesis, University of Copenhagen.
Jensen, C. & J. Tøndering, 2005. Choosing a scale for measuring perceived prominence.
Proceedings of Interspeech 2005, Lisbon, 2385-2388.
Knowles, G., 1987. Patterns of spoken English. London and New York: Longman.
Streefkerk, B., 2001. Acoustical and lexical/syntactic features to predict prominence.
Proceedings 24, Institute of Phonetic Sciences, University of Amsterdam, 155-166.
Widera, C., T. Portele & M. Wolters, 1997. Prediction of word prominence. Proceedings of
Eurospeech '97, Rhodes, 999-1002.

Variation and Finnish Influence in Finland Swedish Dialect Intonation

Yuni Kim
Department of Linguistics, University of California at Berkeley
yuni@berkeley.edu

Abstract
Standard Finland Swedish is often described as having Finnish-like intonation, with
characteristic falling pitch accents. In this study, it is found that the falling pitch accent
occurs with varying degrees of frequency in different Finland Swedish dialects, being most
frequent in the dialects that have had the greatest amount of contact with Finnish, and less
frequent (though in many cases still part of the intonational system) elsewhere.

1 Introduction
It is generally known that the Swedish dialects of Finland, with the exception of western
Nyland (Selenius, 1972; Berg, 2002), have lost the historical word accent contrast between
Accent 1 and Accent 2. What is less clear is what kinds of intonational systems the dialects
have developed, and how these relate to the previous word-accent system on the one hand,
and contact with Finnish (often via Finnish-influenced prestige Swedish varieties) on the
other. In their prosodic typology of Swedish dialects, Bruce & Gårding (1978) classified
Helsinki Swedish as type 0 (Far East), with falling pitch throughout the word, and western
Nyland as type 2A (Central). As for other rural Finland Swedish dialects, subsequent research
(Selenius, 1978; Svärd, 2001; Bruce, 2005; Aho, ms.) has suggested that many fit neither
category straightforwardly.
The purpose of the present study is to gauge how widespread the falling pitch accent is in
Finland Swedish. It may be taken as a sign of Finnish influence, since it is the basic pitch
accent in Finnish (see e.g. Mixdorff et al., 2002) but generally not attested in Sweden. Since
the investigated dialects appeared to have intonational inventories with multiple pitch accents,
unlike the lexical-accent dialects of Sweden, a quantitative component was undertaken to
assess the frequency of falling pitch accents intradialectally. The results should be seen as
preliminary due to the limited size of the corpus, but they point to some interesting questions
for future research.

2 Materials and methods


The materials used here were archaic dialect recordings, consisting of interviews and
spontaneous narratives, from the CD accompanying Harling-Kranck (1998). The southern
dialects included in the study were, from east to west, Lappträsk (eastern Nyland; fi.
Lapinjärvi), Esbo (central Nyland; fi. Espoo), Kimito and Pargas (eastern Åboland; fi. Kemiö
and Parainen, respectively). The northern dialects, south to north, were Lappfjärd (southern
Österbotten; fi. Lapväärtti), Vörå (central Österbotten; fi. Vöyri), and Nedervetil (northern
Österbotten; fi. Alaveteli). There was one speaker per dialect. The speakers, all female, were
born between 1880 and 1914 and were elderly at the time of recording (1960s–1980s).

Between 1 and 3 minutes of speech from each dialect was analyzed using PRAAT. Accented
words of two or more syllables were identified and given descriptive labels according to the
shape of F0 in the tonic and post-tonic syllables. Monosyllables were not counted due to
difficulties in determining which pitch accents they had. Non-finally, falling pitch accents
were defined as those where the tonic syllable had a level or rising-falling F0, followed by a
lower F0 in the post-tonic syllable. In phrase-final position, only words with falling F0
throughout were counted as having a falling pitch accent (as opposed to words with rising F0
through the tonic syllable followed by a boundary L).
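A rough operationalization of the non-final criterion might look like the sketch below; the use of syllable-mean F0, the 10 Hz threshold and the array input are assumptions made for illustration, not the labelling procedure actually used in the study.

import numpy as np

def is_falling_accent(tonic_f0, posttonic_f0, threshold_hz=10.0):
    # Non-final case only: the post-tonic syllable is clearly lower in F0
    # than the tonic syllable. Inputs are vectors of voiced F0 samples (Hz),
    # e.g. extracted from a Praat pitch track for the two syllables.
    return np.nanmean(posttonic_f0) < np.nanmean(tonic_f0) - threshold_hz

# Made-up example values:
print(is_falling_accent(np.array([210.0, 215.0, 205.0]), np.array([170.0, 165.0, 160.0])))  # True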

3 Results
3.1 Eastern and central Nyland
Lappträsk (eastern Nyland) and Esbo (central Nyland) overwhelmingly used falling pitch
accents: 27 out of 30 total pitch accents and 33 out of 34, respectively. This result agrees with
Aho’s (forthcoming) study of the Liljendal dialect of eastern Nyland. Just over a minute of
speech was analyzed for each of these two speakers, which attests to the high density of
accented words, especially given that monosyllabic accented words were not counted. This
density is also characteristic of Finnish, which accents nearly every content word (Mixdorff et
al., 2002; see also Kuronen & Leinonen, 2001).


Figure 1. Falling pitch accents in the Esbo dialect: (ja tjö:rd) ti:e hö:lass ti sta:n men vå:ga
int gå opp å å:ka ‘(I drove) ten hay bales to town but didn’t dare go up and ride’.

For the Lappträsk speaker, the non-falling pitch accents consisted mainly of a low pitch on the
tonic syllable followed by a high post-tonic (hereafter Low-High), which was used on 2
tokens and additionally on 5 or 6 monosyllables that were not included in the main count. The
Esbo sample did not contain the Low-High accent, though a longer sample might have
revealed some tokens.
Interestingly, the Lappträsk and Esbo speech samples each contained two examples of an
emphasis intonation that is reminiscent of Central or Western Swedish Accent 2. This pitch
accent involves a sharp fall in the stressed syllable, followed by a rise culminating in a peak
that may lie several syllables after the tonic (cf. Bruce, 2003).

3.2 Eastern Åboland


Eastern Åboland presents a different picture, where the falling pitch accent is infrequent. In
the Kimito data (about 1 minute 40 seconds), 6 out of 40 pitch accents were counted as
falling, of which five were the last accented word in the phrase.
In the Pargas data (about two and a half minutes), none of the 35 pitch accents were
classified as falling. Seven of the 8 phrase-final tokens had an F0 rise in the tonic syllable
with a L% boundary tone, however, making at least the disyllables auditorily somewhat
similar to falling pitch accents.

In both Kimito and Pargas, the majority of the pitch accents were either Low-High, or had a
rise in the tonic syllable plateauing to a level high pitch over the next few syllables (hereafter
Rise-Plateau). The Kimito and Pargas samples also contained one instance each of an accent
with a sharp fall in the tonic syllable, like in Esbo and Lappträsk, but no subsequent rise.


Figure 2. Kimito dialect: pengar å allting annat mie: ‘money and everything else too’.
Pengar has a Low-High pitch accent (the initial peak is due to consonant perturbation) and
annat has a falling pitch accent. The peak in allting is due to background noise.

3.3 Österbotten
The results for Lappfjärd (southern Österbotten) were similar to those for eastern Åboland in
that only one of the 34 pitch accents (in about 2 minutes of material) was classified as falling.
The remaining accents had Low-High and Rise-Plateau shapes, along with 3 instances of
sharp falls.
In Vörå (central Österbotten), 3 of the 34 pitch accents (in about 2 minutes of material)
were falling, while the rest were Low-High or Rise-Plateau (plus one instance of sharp
falling). Both Lappfjärd and Vörå had boundary L% tones, as in Pargas. These results are
consistent with Aho’s (ms.) findings on intonation in the central Österbotten dialect of Solv.
Lastly, Nedervetil (northern Österbotten) had 13 falling pitch accents, in a variety of
sentence positions, out of 34 total accents (in about 2 minutes). This was a higher proportion
than any of the other Österbotten or Åboland dialects investigated. Of the remaining 21
accents, 20 were labeled as Rise-Plateau and only one was Low-High.

4 Discussion
The preliminary result of this study is that falling pitch accents of the Finnish type are very
frequent in Swedish dialects of eastern and central Nyland, common in northern Österbotten,
and less frequent or marginal elsewhere in Österbotten and in eastern Åboland. A natural
explanation for this is that the eastern and northern outposts of Swedish Finland – eastern
Nyland and northern Österbotten, respectively – have, as border regions, had the heaviest
contact with Finnish. For example, central and eastern Nyland were the first regions to lose
the word accent contrast (Vendell, 1897), and anecdotal evidence suggests that northern
Österbotten dialects have some Finnish-like phonetic/phonological features that are not found
elsewhere in Österbotten.
The attestation of dialects where the falling pitch accent exists but has a limited role
suggests that intonational variation in Finland Swedish might be fruitfully studied to form a
sociolinguistic or diachronic picture of how various dialects have made, or are in the process
of making, a gradual transition from Swedish-like to Finnish-like intonational systems. A
number of topics would need to be addressed that have been outside the scope of the present
study, such as the phonetics, pragmatics, and the positional distributions of the various pitch
accents. For instance, the intonational phonologies of eastern Åboland and Österbotten are
probably quite different, despite the fact that their pitch accents have phonetic similarities
which in this paper have been subsumed under the same descriptive labels.

Acknowledgements
I wish to thank audience members at ICLaVE 3 for helpful discussion on an earlier version of
this paper, and Eija Aho for sending copies of her unpublished work. This research was
funded by a Fulbright Grant and a Jacob K. Javits Graduate Fellowship.

References
Aho, E., forthcoming. Om prosodin i liljendaldialekten. In H. Palmén, C. Sandström & J-O.
Östman (eds.), volume in series Dialektforskning. Helsinki: Nordica.
Aho, E., ms. Sulvan prosodiasta. Department of Nordic Languages, University of Helsinki.
Berg, A-C., 2002. Ordaccenten – finns den? En undersökning av Snappertunamålets
ordaccent: produktionen, perceptionen och den akustiska länken. Pro gradu thesis, Åbo
Akademi University.
Bruce, G., 2003. Late pitch peaks in West Swedish. Proceedings of the 15th ICPhS, Barcelona,
vol. 1, 245-248.
Bruce, G., 2005. Word intonation and utterance intonation in varieties of Swedish. Paper
presented at the Between Stress and Tone conference, Leiden.
Bruce, G. & E. Gårding, 1978. A prosodic typology for Swedish dialects. In E. Gårding, G.
Bruce & R. Bannert (eds.), Nordic Prosody: Papers from a symposium. Lund University,
Department of Linguistics, 219-228.
Harling-Kranck, G., 1998. Från Pyttis till Nedervetil: tjugonio dialektprov från Nyland,
Åboland, Åland och Österbotten. Helsinki: Svenska Litteratursällskapet i Finland.
Kuronen, M. & K. Leinonen, 2001. Fonetiska skillnader mellan finlandssvenska och
rikssvenska. In L. Jönsson, V. Adelswärd, A. Cederberg, P. Pettersson & C. Kelly (eds.),
Förhandlingar vid Tjugofjärde sammankomsten för svenskans beskrivning. Linköping.
Mixdorff, H., M. Väiniö, S. Werner & J. Järvikivi, 2002. The manifestation of linguistic
information in prosodic features of Finnish. Proceedings of Speech Prosody 2002.
Selenius, E., 1972. Västnyländsk ordaccent. Studier i nordisk filologi 59. Helsinki: Svenska
Litteratursällskapet i Finland.
Selenius, E., 1978. Studies in the development of the 2-accent system in Finland-Swedish. In
E. Gårding, G. Bruce & R. Bannert (eds.), Nordic Prosody: Papers from a symposium.
Lund University, Department of Linguistics, 229-236.
Svärd, N., 2001. Word accents in the Närpes dialect: Is there really only one accent? Working
Papers 49. Lund University, Department of Linguistics, 160-163.
Vendell, H., 1897. Ordaksenten i Raseborgs härads svenska folkmål. Öfversigt av finska
vetenskapssocietetens förhandlingar, B. XXXIX. Helsinki.

Local Speaking Rate and Perceived Quantity: An Experiment with Italian Listeners

Diana Krull1, Hartmut Traunmüller1, and Pier Marco Bertinetto2
1 Department of Linguistics, Stockholm University
{diana.krull|hartmut}@ling.su.se
2 Scuola Normale Superiore, Pisa
bertinetto@sns.it

Abstract
We have shown in earlier studies that the local speaking rate influences the perception of
quantity in Estonian, Finnish and Norwegian listeners. In the present study, Italian listeners
were presented with the same stimuli. The results show that the languages differ not only in the
relative position – preceding or following – of the units that have the strongest influence on
the perception of the target segment, but seemingly also in the width of the reference frame.

1 Introduction
Earlier investigation using Estonian, Finnish and Norwegian listeners has shown that local
speaking rate affects listeners’ perception of quantity (Krull, Traunmüller & van Dommelen,
2003; Traunmüller & Krull, 2003). The results were compatible with a model of speech
perception where an “inner clock” handles variations in the speaking rate (Traunmüller,
1994). However, there were language dependent differences. The most substantial of these
was the narrower reference frame of the Norwegians when compared to the Estonians and
Finns.
The Estonian quantity system is the most complex one. In a disyllabic word of the form
C1V1C2V2 (such as the one used as stimulus) V1 and C2 are the carriers of the quantity
distinction: V1 as well as C2, both singly and as a VC unit can have three degrees of quantity:
short, long and overlong. Seven of the nine possible combinations are actually used in Estonian phonology. C1 and V2 act as preceding and following context and provide a cue to the local speaking rate. Finnish has two quantity degrees, similar to Estonian short and overlong.
In a C1V1C2V2 word all four possible V1C2 combinations are used. In Finnish, as in Estonian,
the duration of V2 is inversely dependent on the quantity degree of the preceding units. In
Norwegian, on the other hand, only V1 carries the quantity degree, while C2 is inversely
dependent on the quantity of V1. There are only two phonologically different possibilities:
short or long V1.
In all three languages, it is a following unit of context that exerts the strongest secondary
influence on the perception of the quantity degree. The question arises: is this generally valid
also for other languages? Are there any other contextual factors that make a segment
important for quantity perception, apart from relative position? The answer to these questions
can perhaps be found by investigating Italian listeners’ reaction to the same stimuli. In Italian,
it is the duration of C2 that is considered the most decisive for the distinction between
C1V1:C2V2 and C1V1C2:V2 – e.g. papa and pappa – while the duration of V1 is considered to
be inversely related to the duration of C2 when the vowel is stressed (Bertinetto & Vivalda,
1978).

This paper addresses the question of whether and how the reaction of Italian listeners to the
same stimuli differs from that of Estonians, Finns and Norwegians. Where will Italian
listeners place the boundary between [t] and [t:]? Which part(s) of the context will influence
the perception of C2?

2 Method
The stimuli were obtained by manipulating the duration of selected segments of the Estonian
word saate [sa:te] (‘you get’), read by a female Estonian speaker. The word was read both in
isolation and preceded by ja (‘and’) or followed by ka (‘also’). The [a:] and the [t] were
shortened or prolonged in proportionally equal steps as shown in Figure 1 (for the segment
durations of the original utterance and other details, see Traunmüller & Krull, 2003). The
durations of the [s] and the [e] were also manipulated up or down together with ja or ka when
present. The arrangement of stimuli in series is shown in Figure 1. The selection of stimulus
series was made according to which combinations could be possible in Italian. 20 students at
the University of Pisa listened to the stimuli.


Figure 1. Duration of the segments [a] and [t] in the stimuli without [ja] and [ka]. There were
three series of stimuli that differed in the duration of the [a], while the stimuli within each
series differed in the duration of the [t].

3 Results and discussion


Figure 2 shows the effect of changes in segment duration on the perception of quantity for
Italian listeners. For comparison, results from earlier investigations with Estonian, Finnish and
Norwegian listeners have been added.
As could be expected, increasing the duration of the [t] had a strong positive effect on the
perception of the sate-satte distinction, while increasing the durations of neighboring units
practically always had an opposite effect. The strongest negative effect on the perception of
the [t] duration for Italian listeners resulted from lengthening the immediately preceding [a].
The role of ja and ka was less obvious. Changing the duration of [jas] had a certain effect on
the perception of [t] in ja saate, while the duration of [s] alone in saate and saate ka had no
importance. Similarly, change in the duration of [eka] had an effect, but not that of [e] alone
in ja saate and saate. This can be explained by the durational variability of an utterance-final
vowel in Italian.

(Panels of Figure 2, left to right and top to bottom: Norwegian [a], Italian [t]; Finnish [a], Finnish [t]; Estonian [a], Estonian [t]; Estonian [a:], Estonian [t:]. Vertical axes: regression coefficient; horizontal axes: the segments s, a, t, e.)

Figure 2. Weights of the durations of segments as contributors to the perceived quantity of [a]
or [t], expressed in probit regression coefficients. Context: ja saate (black columns), saate
(white columns), and saate ka (grey columns).
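The probit weights plotted here can in principle be obtained by regressing listeners' binary quantity judgments on the durations of the four segments. The sketch below, using synthetic data and statsmodels, is meant only to illustrate that technique, not to reproduce the authors' actual analysis.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
# Synthetic durations (ms) for the segments s, a, t, e.
dur = rng.uniform([100, 100, 100, 80], [150, 300, 300, 130], size=(n, 4))
# Synthetic listener: a longer [t] favours "long /t/" responses, while longer
# neighbouring segments work against it (the coefficients are invented).
latent = 0.04 * dur[:, 2] - 0.02 * dur[:, 1] - 0.01 * dur[:, 3] - 3.0
responses = (latent + rng.normal(size=n) > 0).astype(int)

X = sm.add_constant(dur)                      # intercept + one weight per segment
fit = sm.Probit(responses, X).fit(disp=False)
print(dict(zip(["const", "s", "a", "t", "e"], fit.params.round(3))))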

The relatively strong effect of [jas] as compared with [s] alone may be difficult to explain, but
the same tendency appeared among Finnish and Estonian subjects. The [ja] could not be
interpreted as a separate word by Italian listeners. Therefore [jasa:te] was more likely to be
interpreted as one word, stressed on the second syllable. In this case the first [a] stands out as
longer than expected in an Italian word. In spite of that, changes in the duration of [jas]
influenced the perception of [t]. That the [s] of saate ka had no negative effect at all may be
due to its distance to the end of the utterance in relation to the length of the reference frame.
The word-final [e] in ja saate and saate had practically no effect on the perception of [t],
probably due to the fact that the duration of an utterance final vowel is highly variable in
Italian. However, changes in the duration of [e] had a substantial effect when followed by
[ka], in saate ka, which supports this assumption since in this case, there would not be so
much free variation in the duration of the [e].

A comparison with the results of Estonian, Finnish and Norwegian speakers revealed several
differences. For Finnish listeners, the negative effect of changes in the duration of [a] on the
perceived quantity of [t] was not statistically significant. In Finnish, the [a] is itself a possible
carrier of quantity distinction and is therefore not treated as ‘neighboring context’. (A similar
effect of [t] can be seen when [a] is the target). This is true also for Estonian. In the case of the
distinction between short and long [t], Estonian listeners behaved very much like the Italians.
However, when distinguishing between long and overlong, the lengthening of the preceding
vowel had a positive effect on the perceived quantity of [t:]. The reason for this is the
unacceptability of the combination of long vowel and overlong consonant in Estonian: [t:] can
be perceived as overlong only when the preceding vowel is either short or overlong. The same
effect is seen in the case where [a:] was the target.
Comparing the results of Italian listeners’ perception of [at] with that of the Norwegians’
revealed symmetry in the response patterns. In Italian, the consonant is the target which
carries the quantity distinction while the duration of the preceding vowel is inversely related
to it. This durational compensation can only be observed under sentence stress (Bertinetto &
Loporcaro, 2005). In Norwegian, it is the other way round: the vowel is the target and the
duration of the following consonant inversely related to it. In the present case, the negative
effect of [a] for the Italians and that of [t] for the Norwegians were of similar size.
A comparable effect of an inverse duration relation can be noted in the responses of
Estonian and – in a slightly weaker degree – Finnish listeners. Here it is the duration of the
vowel in the following syllable that is inversely related to the duration of V1, C2 or V1C2. As a
result, changes in the duration of [e] had a strong negative effect on the perceived quantity of
[a] and/or [t]. The data clearly show that segments whose duration can vary due to linguistic
or paralinguistic factors carry a lower weight (cf. the influence of [a] on [t] or vice versa in the
two Fennic languages and utterance final [e] in Italian).
To conclude, Italian listeners reacted generally in the same way as did Estonians, Finns and
Norwegians: changing the duration of the target segment itself had a strong positive effect
while changes in the durations of some neighboring segments had a weaker, negative effect. If
segment durations are to be measured by an “inner clock” whose pace depends on the speech
listened to, it is necessary to assume language specific reference windows. That of Norwegian
listeners must, clearly, be assumed to be shorter than that of the Fennic listeners (Traunmüller
& Krull, 2003). The length of the reference frame of Italian listeners is also shorter than that
of the Fennic speakers, but the data seem to indicate that it is longer than that of the Norwe-
gians. While the Italians’ location of their reference frame is clearly different from that of the
Norwegians if considered with respect to the target segment, the center of the reference frame
appears to be located close to the [a]/[t]-boundary in representatives of all four languages.

References
Bertinetto, P.M. & M. Loporcaro, 2005. The sound pattern of standard Italian, as compared
with the varieties spoken in Florence, Milan and Rome. Journal of the IPA 35, 131-151.
Bertinetto, P.M. & E. Vivalda, 1978. Recherches sur la perception des oppositions de quantité
en italien. Journal of Italian Linguistics 3, 97-116.
Krull, D., H. Traunmüller & W.A. van Dommelen, 2003. The effect of local speaking rate on
perceived quantity: a comparison between three languages. Proceedings XVth ICPhS,
Barcelona, 833-836.
Traunmüller, H., 1994. Conventional, biological, and environmental factors in speech
communication: A modulation theory. Phonetica 51, 170-183.
Traunmüller, H. & D. Krull, 2003. The Effect of Local Speaking Rate on the Perception of
Quantity in Estonian. Phonetica 60, 187-207.

A Case Study of /r/ in the Västgöta Dialect


Jonas Lindh
Department of Linguistics, Göteborg University
jonas.lindh@ling.gu.se

Abstract
This paper concentrates on the study of five young male speakers of the Swedish Västgöta
dialect. First, the classic phonological /r/ distribution between back and front /r/ was tested to
see whether the old descriptions of the dialect were valid for this group. Second, the
individual variation between the phonetic realizations was studied to see if it was possible to
distinguish between the five speakers solely on the basis of their /r/ distribution. This was
done by aural and spectrographic comparisons of /r/ in stressed and unstressed positions for
each speaker. Three /r/ categories were identified. Two speakers seem to have a classical
distribution of uvular /r/, two others use only the front version. The last speaker used the front
variant except in one focused instance. These results lead to some speculations on changes
occurring in the dialect. The speakers’ individual variation was studied by describing their /r/
realizations with phonological rules. This was done successfully and the five speakers were
rather easily distinguishable solely on the basis of their /r/ productions.

1 Background and introduction


1.1 Hypotheses
This pilot case study mainly has the goal to investigate two hypotheses:
1. The classic descriptions or rules are not valid for this group of five young male
speakers.
2. It is possible to separate five speakers phonologically based solely on their production
of /r/ in stressed and unstressed positions.
The first hypothesis is simply investigating a possible dialectal change by using diachronic
recordings and comparing the use of /r/. The second hypothesis is addressed in a pilot case study
investigating whether between-speaker variation for /r/, whether it is sociophonetic or
dialectal change, is enough to separate or individualize five speakers with the same sex, age
and similar dialectal background.

1.2 The phoneme /r/


The phoneme /r/ was chosen because of its reported intra- and interspeaker variance (Vieregge
& Broeders, 1993). The phoneme has been subject to several studies for English, both
concerning its phonology (Lindau, 1985) and acoustic properties (see Espy-Wilson & Boyce,
1993; 1999). The Swedish studies are mostly concentrated on the dialectal area descriptions,
such as Sjösted’s (1936) early dissertation on the /r/-sounds in south Scandinavia and Elert’s
(1981) description of the back uvular [ ] geographical frontier. In a recent study by
Muminovic & Engstrand (2001), they found that approximant variants outnumbered fricatives
and taps while trills were uncommon. Aurally, they identified four place categories and these
were also separated acoustically except for back and retroflex /r/.

1.3 /r/ in the Västgöta dialect


What is the Västgöta (or Göta) dialect? There are several variants. A quite common, but still
rough description is that the dialect contains four major variants: the Vadsbo, Skaraborg,
Älvsborg (except for the Mark – Marbo and Kind – Kindbo) and Göta-Älv variants. In one
major study, Götlind (1918) suggests around 450 different variants. However, there are
several different features that connect them all. One of these dialect features is the distribution
of the two /r/ allophones [r] and [ʀ], which both appear in different positions. The allophones [r] and [ʀ] are combinatory variants of the phoneme /r/ in the dialect. The general classic rules
can be described using SPE notation (Chomsky & Halle, 1968), choosing [r] as the underlying
representation:

Rule 1. /r/ → [ʀ] / # _
The phoneme /r/ is pronounced uvular in morpheme-initial position.
Ex. [ʀør] and [ eʀør]

Rule 2. /r/ → [ʀ] / V _ V(ː)[+stress]
The phoneme /r/ is pronounced uvular in medial position, i.e. after an unstressed syllable and preceding a stressed vowel.
Ex. [diˈʀɛkta]

Rule 3. /r/ → [ʀ] / V[+stress] _ {#, V}
The phoneme /r/ is pronounced uvular in final position after a short stressed vowel, or medially when followed by an unstressed vowel.
Ex. [døʀ] and [ˈbaʀa]
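Purely as an illustration of how Rules 1-3 combine (this is not from the paper), the classic distribution can be expressed as a small function. The string encoding, with '#' marking morpheme boundaries and stressed vowels written in upper case, is an assumption, and vowel length is ignored, so Rule 3's "short stressed vowel" condition is only approximated.

VOWELS = set("aeiouyåäö")

def is_vowel(ch):
    return ch.lower() in VOWELS

def classic_r_allophones(word):
    # Rewrite /r/ as 'R' (uvular) wherever Rules 1-3 above predict the uvular allophone.
    segs = list(word)
    out = []
    for i, ch in enumerate(segs):
        if ch != "r":
            out.append(ch)
            continue
        prev = segs[i - 1] if i > 0 else "#"
        nxt = segs[i + 1] if i + 1 < len(segs) else "#"
        rule1 = prev == "#"                                                  # morpheme-initial
        rule2 = is_vowel(prev) and prev.islower() and is_vowel(nxt) and nxt.isupper()
        rule3 = is_vowel(prev) and prev.isupper() and (nxt == "#" or (is_vowel(nxt) and nxt.islower()))
        out.append("R" if (rule1 or rule2 or rule3) else "r")
    return "".join(out)

print(classic_r_allophones("#dirEkt#"))  # '#diREkt#' (Rule 2)
print(classic_r_allophones("#bAra#"))    # '#bARa#'   (Rule 3)
print(classic_r_allophones("#rEtt#"))    # '#REtt#'   (Rule 1)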

Teleman (2005) hypothesizes about the development of the allophonic use being related to the
geographical border for the use of ‘thick’ (retroflex flap) versus ‘normal’ /l/. However,
Malmberg (1974) reports a similar allophonic use of /r/ in Puerto Rican Spanish and it is also
used in Brazilian Portuguese which might give other indications (Torp, 2001). Other common
features in the dialect, both grammatical and phonological, are not considered in this paper,
but there are several (for examples see Norén et al., 1998).

2 Method
First, older (between 1950-1970) recordings from the Swedish Institute for Dialectology,
Onomastics and Folklore research (the CD Västgötadialekter <http://www.sofi.se>) were used
as references to confirm the general/classic descriptions of /r/ distribution in the Västgöta
variants. Five young male (aged 20-30) speakers from the Swedia dialect database were then
analyzed (<http://www.swedia.nu>). The recordings for the Swedia database were done with a
portable DAT recorder and small portable microphones carried on the subjects' collars or similar. The recording situation was kept as close as possible to an informal talk in which the subjects told a story or recounted a memory. The mean length of each recording was approximately one minute. All
instances of /r/ were extracted using the software Praat (Boersma & Weenink, 2005).

Table 1. Number of /r/ and variants per young male speaker and older reference recording
used for diachronic comparison.
Speaker      N /r/ instances   N [r]   N [ʀ]   Older reference recording
Öxabäck      38                32      6       Torestorp
Östad        10                10      0       Humla & Lena
Torsö        17                15      2       Skara
Floby        29                29      0       Floby
Korsberga    35                34      1       Korsberga

Young males aged 20-30 were chosen as a group because they exist as such in the Swedia
database and because they accounted for 62% of the convicted criminals in Sweden last year (<http://www.bra.se>). Since the second investigation has forensic implications (Rose, 2002), this group was preferred as a pilot case group.

3 Results and discussion


3.1 Diachronic dialectal comparison for /r/
As can be seen in Table 1 above, the speakers from Östad and Floby consistently use the alveolar allophone, as no instances of uvular [ʀ] were found. The speakers from Öxabäck and Torsö follow the classical rule, using uvular [ʀ] word- (possibly morpheme-) initially. No instances of [ʀ] were found in other positions, though. For the speaker from Korsberga, only one instance of [ʀ] was found. The instance was observed word-initially in the focused word <riktigt>, pronounced [ˈʀɪktɪt].
First of all, [ʀ] does not exist at all after short stressed vowels in the material. Secondly,
only two speakers frequently use it word/morpheme initially. That the uvular is disappearing
is only a speculation because of the sparse data, and maybe this is an effect of the formal
recording situation leading to a sociophonetic variation. However, the distribution of /r/ is as
follows using broad phonological categories:
Category 1. /r/ → [ʀ] / # _
/r/ is pronounced with the uvular variant [ʀ] morpheme (or word) initially by the Öxabäck and Torsö speakers.
Category 2. /r/ → [r]
/r/ is always pronounced with an alveolar variant [r] for the two speakers from Östad and Floby.
Category 3. /r/ → [r], or possibly [ʀ] / # _ (+focus)
The Korsberga speaker uses an alveolar variant, but has a uvular variant [ʀ] word-initially when /r/ occurs in a focused syllable.

3.2 The individual variation between the speakers


The [ʀ] instances for the two speakers in Category 1 above contain the word <rätt>, which makes it the natural starting point for comparison. The two speakers can then be separated, as the speaker from Öxabäck uses a fricative phone articulated as a velar [ɣ] while the speaker from Torsö uses a uvular trill [ʀ].
For the two speakers in Category 2, the alveolar variant was naturally the basis for comparison, since neither used a uvular variant. Closer aural examination showed that the speaker from Östad in 7 out of 10 cases used an alveolar trill [r]. In the
three other cases the severely reduced sounds, in unstressed positions, were pronounced as
an approximant [ɹ]. The speaker from Floby never produced a trill, but shifted between a tap [ɾ] (in stressed position) and an approximant [ɹ].
As the Korsberga speaker was alone in his use of a uvular in focused position there is no
need to separate him further. His uvular variant is pronounced as a trill though, while the
alveolar variants are either tapped or approximantic.

4 Conclusions and future work


The uvular [ʀ] is less used in the Västgöta dialect, at least in the sparse data used for this study. This might mean that it has been replaced by the already existing alveolar variant after short stressed vowels and is slowly disappearing word- (or morpheme-) initially as well.
By aural and spectrographic examination leading to a narrow transcription and phonological
rules, it was easy to separate the speakers. More research on how well a larger group can be
separated using this method is recommended. Several aspects of interspeaker variation were
left out because of the small amount of data. Additional acoustic measurements, such as spectral studies of /r/ for different speakers, should also be investigated in the future.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.27)
[Computer program] Retrieved October 7, 2005, from http://www.praat.org/.
Brottsförebyggande Rådet. [www] Retrieved November 26, 2005, from http://www.bra.se/.
Chomsky, N. & M. Halle, 1968. The sound pattern of English. New York: Harper & Row.
Elert, C-C., 1981. Gränsen för det sydsvenska bakre r. Ljud och ord i svenskan 2. Stockholm:
Amquist & Wiksell International.
Espy-Wilson, C.Y. & S. Boyce, 1993. Context independence of F3 trajectories in American
English /r/’s. JASA 93, 2296 (A).
Espy-Wilson, C.Y. & S. Boyce, 1999. A simple tube model for American English /r/. Proc.
XIVth Int. Conf. Phon. Sci., San Francisco, 2137–2140.
Götlind, J., 1918. Studier i västsvensk ordbildning. De produktiva avledningssuffixen och
deras funktion hos substantiven i Göteve-målet. Stockholm.
Lindau, M., 1985. The story of /r/. In V. Fromkin (ed.), Phonetic linguistics. Orlando:
Academic Press.
Malmberg, B., 1974. Spansk fonetik. Lund: Liber Läromedel.
Muminovic, D. & O. Engstrand, 2001. /r/ in some Swedish dialects: preliminary observations.
Working Papers 49. Dept. of Linguistics, Lund University.
Norén, K., R. Gustafsson, B. Nilsson & L. Holgersson, 1998. Ord om orden i Västergötland.
Axvall: Aron-förlaget.
Rose, P., 2002. Forensic Speaker Identification. New York: Taylor & Francis.
Sjöstedt, G., 1936. Studier över r-ljuden i sydskandinaviska mål. Dissertation, Lund
University.
Swedia Dialect Database. [www] Retrieved during September, 2005, from
http://www.swedia.nu/.
Swedish Institute for Dialectology, Onomastics and Folklore Research. Västgötadialekter [CD]. http://www.sofi.se.
Teleman, U., 2005. Om r-fonemets historia i svenskan. Nordlund 25. Småskrifter från
Institutionen för Nordiska språk, Lund.
Torp, A., 2001. Retroflex consonants and dorsal /r/: mutually excluding innovations? On the
diffusion of dorsal /r/ in Scandinavian. In van de Velde & van Hout, 75-90.
Vieregge, W.H. & A.P.A. Broeders, 1993. Intra-and interspeaker variation of /r/ in Dutch.
Proc. Eurospeech ’93, vol. 1, 267–270.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 89
Working Papers 52 (2006), 89–92

Preliminary Descriptive F0-statistics for Young Male Speakers
Jonas Lindh
Department of Linguistics, Göteborg University
jonas.lindh@ling.gu.se

Abstract
This paper presents preliminary descriptive statistics for 109 young male speakers’
fundamental frequency. The recordings were taken from the Swedia dialect database with
speakers from different geographical areas of Sweden. The material consisted of spontaneous
speech ranging between seventeen seconds and approximately two minutes. F0 mean, median,
baseline and standard deviation distributions in Hertz are described using histograms. It is suggested that the median be used instead of the mean when measuring F0 in, for example, forensic cases, since it is more robust and less affected by octave jumps.

1 Background and introduction


1.1 Why young male speakers?
Young males aged 20-30 were chosen as a group because they exist as such in the Swedia database (<http://www.swedia.nu>) and because they accounted for 62% of the convicted criminals in Sweden last year (<http://www.bra.se>), which is important given the forensic implications of the descriptive statistics.

1.2 F0 and forensic phonetics


The within-speaker variation in F0 is affected by an enormous number of factors. Braun (1995) categorizes them as technical, physiological and psychological factors. Tape speed, which is surprisingly still an issue for forensic samples, and sample size are examples of technical factors. Smoking and age are examples of physiological factors, while emotional state and background noise are examples of psychological factors. However, fundamental frequency has been shown to be a successful forensic phonetic parameter (Nolan, 1983). To be able to study differences it is suggested to use long-term distribution measures such as the arithmetic mean and standard deviation (Rose, 2002). The duration of the samples should be more than 60 seconds according to Nolan (1983), but Rose (1991) reports that F0 measurements for seven Chinese speakers stabilised much earlier, implying that the values may be language specific (Rose, 2002). Positive skewing of the F0 distribution is typical (Jassem et al., 1973) and an argument for considering a base value (Fb) for F0 (Traunmüller, 1994). This base value is also described here together with the mean, median and standard deviation for the whole group. No Swedish F0 statistics more recent than Kitzing (1979) were found; he reports a mean of 110.3 Hz and a standard deviation of 3 semitones (cited in Traunmüller & Eriksson, 1995a) for 51 male speakers ranging from 21 to 70 years of age.

2 Method
The software Praat (Boersma & Weenink, 2005) was used to collect F0 data from 109 young
male speakers (20-30 years old). The recordings were taken from the Swedia database
(<http://www.swedia.nu>) and the durations of the recordings range from 17.4 to 116.8 seconds, with a mean duration of 52.3 s and a standard deviation of 15.2 s. The parameters extracted from the recordings were F0 mean, median, average baseline value (Fb), standard deviation, maximum and minimum in Hz. The range for the F0 tracker was set to 75-350 Hz in order to cover all possible frequency excursions while at the same time avoiding octave jumps.
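As a hedged sketch of the descriptive measures listed above (not the script used in the study), the following Python code computes the mean, median and standard deviation in Hz, the standard deviation in semitones, and an average baseline value Fb, approximated here as the mean minus 1.43 standard deviations as described later in the paper. How the frame values are obtained (e.g. exported from the Praat pitch analysis) is left out, and the example values are invented.

# Sketch only: descriptive F0 statistics for one speaker, given frame-level F0
# values in Hz (unvoiced frames coded as 0), e.g. exported from Praat (75-350 Hz).
import numpy as np

def f0_statistics(f0_hz):
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                          # keep voiced frames only
    mean = f0.mean()
    median = np.median(f0)
    sd_hz = f0.std(ddof=1)
    # Standard deviation on a semitone scale, relative to the speaker's mean
    sd_st = (12 * np.log2(f0 / mean)).std(ddof=1)
    # Approximate base value Fb: about 1.43 standard deviations below the mean
    fb = mean - 1.43 * sd_hz
    return {"mean": mean, "median": median, "sd_hz": sd_hz,
            "sd_st": sd_st, "fb": fb}

# Invented example frames; the 240 Hz value simulates a positive octave jump
print(f0_statistics([118, 122, 110, 105, 131, 127, 0, 115, 240, 119]))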

3 Results and discussion


3.1 F0 means, medians and average baselines
This section contains five histograms showing F0 distributions using mean, median, and
baseline in Hz.
[Figure 1: histogram "Mean distribution of F0 for YM": number of speakers per 10 Hz bin, 70-170 Hz.]

Figure 1. Histogram showing the distribution of F0 means for 109 young male speakers.

Approximately 65% of the speakers have a mean fundamental frequency between 100 and 130 Hz. The mean of the means is 120.8 Hz. There is a positive skewing (0.6) with five extreme outliers between 150 and 170 Hz. Since the automatic analysis had a tendency to make positive octave jumps, the median is suggested instead, as it is more robust (see Figure 2 below).
[Figure 2: histogram "Median distribution of F0 for YM": number of speakers per 10 Hz bin, 70-170 Hz.]

Figure 2. Histogram showing the distribution of F0 medians for 109 young male speakers.

The median distribution still has a positive skewing (still 0.6), but the mean of the medians has moved down to 115.8 Hz. Approximately 68% of the speakers now have a median between 100 and 130 Hz.

For comparison, the average baselines according to Traunmüller (1994) were calculated (see
Figure 3 below).
[Figure 3: histogram "Average Baseline frequencies for YM": number of speakers per 10 Hz bin, 30-130 Hz.]

Figure 3. Histogram showing the average F0 baseline distribution for 109 young male
speakers.

The baseline (Fb) is seen as a carrier frequency in the modulation theory (Traunmüller, 1994).
As there are no major changes in vocal effort, voice register, or emotions involved in this
material, Fb can be expected to be approximately 1.43 standard deviations below the average
(Traunmüller & Eriksson, 1995b). The mean average baseline is 86.3 Hz, which corresponds
quite well to Traunmüller & Eriksson’s (1995a) average per balanced speaker of European
languages (93.4 Hz for male speakers). The values show a slight negative skewing (-0.35) and
approximately 68% of the values range between 70-100 Hz.

3.2 F0 standard deviation


Finally, the standard deviation distributions can be studied in Figure 4 and 5 below.
[Figure 4: histogram "Standard deviations of F0 for YM": number of speakers per 5 Hz bin, 5-55 Hz.]

Figure 4. Histogram showing the F0 standard deviation distribution for 109 young male
speakers.

A perceptually motivated measure of liveliness is obtained by expressing the standard deviation in semitones (Traunmüller & Eriksson, 1995b).

[Figure 5: histogram "Standard deviations in semitones for YM": number of speakers per 0.5 semitone bin, 1-6 semitones.]

Figure 5. Histogram showing the F0 standard deviation distribution in semitones for 109
young male speakers.

4 Conclusions and future work


The preliminary statistics in this paper give an overview of the distributions of Swedish young males' fundamental frequency mean and standard deviation. The results suggest using the more robust median instead of the mean, since octave jumps influence the arithmetic mean. To be able to study between-speaker differences better, distributions for individual speakers should be compared and studied using different measures.
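To illustrate the robustness argument with a toy example (invented numbers, not data from the study), the sketch below doubles a small percentage of frame values to simulate positive octave jumps and compares how far the mean and the median move.

# Toy illustration: the effect of a few positive octave jumps on the arithmetic
# mean versus the median of frame-level F0 values (all numbers invented).
import numpy as np

rng = np.random.default_rng(0)
f0 = rng.normal(loc=115.0, scale=12.0, size=500)    # "clean" F0 frames in Hz

jumped = f0.copy()
idx = rng.choice(f0.size, size=10, replace=False)   # 2% of frames jump an octave
jumped[idx] *= 2.0

print(f"mean:   {f0.mean():6.1f} -> {jumped.mean():6.1f} Hz")
print(f"median: {np.median(f0):6.1f} -> {np.median(jumped):6.1f} Hz")
# The mean shifts upwards noticeably, while the median is nearly unchanged.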

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (Version 4.3.27)
[Computer program]. Retrieved October 7, 2005, from http://www.praat.org/.
Braun, A., 1995. Fundamental frequency – how speaker-specific is it? In Braun & Köster
(eds.), 9-23.
Brottsförebyggande Rådet. [www] Retrieved November 26, 2005, from <http://www.bra.se/>
Jassem, W., S. Steffen-Batog & M. Czajka, 1973. Statistical characteristics of short-term average F0 distributions as personal voice features. In W. Jassem (ed.), Speech Analysis and Synthesis, vol. 3. Warsaw: Polish Academy of Science, 209-25.
Kitzing, P., 1979. Glottografisk frekvensindikering: En undersökningsmetod för mätning av
röstläge och röstomfång samt framställning av röstfrekvensdistributionen. Malmö: Lund
University.
Nolan, F., 1983. The Phonetic Bases of Speaker Recognition. Cambridge: Cambridge
University Press.
Rose, P., 1991. How effective are long term mean and standard deviation as normalisation
parameters for tonal fundamental frequency? Speech Communication 10, 229-247.
Rose, P., 2002. Forensic Speaker Identification. New York: Taylor & Francis.
Swedia Dialect Database. [www] Retrieved during September, 2005, from
http://www.swedia.nu/.
Traunmüller, H., 1994. Conventional, biological, and environmental factors in speech
communication: A modulation theory. Phonetica 51, 170-183.
Traunmüller, H. & A. Eriksson, 1995a. The frequency range of the voice fundamental in the
speech of male and female adults. Unpublished Manuscript (can be retrieved from
http://www.ling.su.se/staff/hartmut/aktupub.htm).
Traunmüller, H. & A. Eriksson, 1995b. The perceptual evaluation of F0-excursions in speech
as evidenced in liveliness estimations. J. Acoust. Soc. Am. 97, 1905-1915.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 93
Working Papers 52 (2006), 93–96

L1 Residue in L2 Use: A Preliminary Study of Quantity and Tense-lax
Robert McAllister, Miyoko Inoue, and Sofie Dahl
Department of Linguistics, Stockholm University
bob@ling.su.se, tigermimmi@gmail.com, sofie.dahl@ling.su.se

Abstract
The main question addressed in this preliminary study is what traces of L1 have been
transferred to L2 use. The focus is on the durational aspects of the tense-lax and quantity
contrasts in English and Japanese. The results could be interpreted as support for the
hypothesis that an L1 durational pattern rather than a specific phonetic feature is the object
of transfer.

1 Introduction
As a rule, adults who learn a second language are not completely successful in learning to
produce and perceive L2 speech. Much of the recent research that has been done on the
acquisition of second language phonology and phonetics has been concerned with the question
of the source of foreign accent. A primary issue in both past and current studies of second
language (L2) speech acquisition is how and to what extent the first language (L1) influences
the learning of the L2. The existence of common terms such as “French accent” has supported the importance of what has become known as “L1 transfer” as a major contributor to foreign accent, and numerous studies have been conducted to demonstrate the importance of this phenomenon.
The aim of the present study is to contribute to the understanding of the role of native
language (L1) phonetic and phonological features in L2 speech acquisition. While
considerable research has been done with this aim which has contributed significantly to the
understanding of the nature of the phenomenon, there are still some important unanswered
questions to be addressed. Central among these concerns what aspects of the perception and
production of the L1 are actually transferred. One suggestion has been made by McAllister,
Flege & Piske (2003). In the discussion of their results the question was raised as to whether a
specific phonetic feature such as duration or an L1 durational pattern typical for the
phonology of a particular L1 could be what is actually transferred. If this were the case, a
durational pattern similar to that in L1 may be recognized in the use of the L2 contrast.

1.1 The pattern of durational relationships that can be found in Swedish and Japanese
quantity and the abstract feature of tense-lax in English
Traditionally, the primary phonetic difference underlying phonological quantity distinctions
has been attributed to durational differences in the vowels and/or consonants, hence the “long-
short” or “quantity” terminology. In Swedish there is a relatively complex interplay between
temporal dimensions (i.e., the duration of a vowel and that of the following consonant) and
spectral dimensions (i.e., formant values in the vowel). English is considered to have no
quantity distinction. The tense-lax feature is considered to be a property of English phonology
and is phonetically similar to some aspects of Swedish quantity. The phonetic characteristics
of the Japanese quantity distinction appear to be in some respects similar to the Swedish
distinction. The contrast is based on duration and there are stable relationships between the
long and short vowels and consonants in Japanese syllables.
We are not able, in this short paper, to give even a partial view of the scholarly discussion
of tense-lax and its relation to quantity. For an excellent review and discussion, please see
Schaeffler (2005).
In this preliminary study we have taken the liberty of focusing on the obvious, if somewhat unclear, relation between quantity and tense-lax. Our intent is to discover whether a residue of the
Swedish quantity contrast might be found in the use of an L2 by native Swedes. Our
hypothesis is that evidence of patterns characteristic of Swedish quantity can be seen in native
Swedes’ L2 use of the tense-lax feature in English and the quantity contrast in Japanese.

2 Method
2.1 Experimental subjects
For the English part of the study, 20 native speakers of standard Swedish were recruited.
These were speech pathology students at Stockholm University who were asked to read a list
of English sentences containing a sentence final word with a tense or a lax vowel.
As a control group, 8 native speakers of standard American English read sentences with the same tense and lax vowels as the native speakers of Swedish.
The subjects for the Japanese part of the study consisted of 11 Swedish speakers (3 females and 8 males) ranging from beginner to advanced levels of Japanese, including 2 speakers who each had a native Japanese parent.

2.2 Speech material


For the English part of the study the vowels in the tense-lax pairs /i:/ – /i/, /u:/ – /u/, and /e:/ –
/e/ each occurred in three different monosyllabic words read by both the native speakers of
American English and the native Swedes. All three occurrences of each vowel were placed in
an identical or very similar phonetic environment (a voiceless stop).
For the Japanese part of the study the speech materials were two-syllable non-words which
followed Japanese phonotactics. The stimulus words, written in Hiragana, were read 5 times in isolation. The words were also read three times each by the same informants in the carrier sentence “Kinou _____ o kaimasita” (I bought ______ yesterday). In this study we present only the results for the Japanese vowels /i:/, /i/, /u:/, /u/, /e:/ and /e/ to compare with the English part of the study.

3 Results and discussion


It should be pointed out at the outset of the discussion that the results presented here are a
preliminary version of this study. There are a number of additional measures that could be
relevant to the question of what aspects of the L1 are transferred in L2 use. Previous research
has shown that the V/C ratio is a robust and typical aspect of the Swedish quantity contrast, so we have decided to start with a presentation of these data and to present more data at Fonetik 2006 in Lund.
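As a hedged sketch of how the V/C duration ratios presented below can be computed (not the procedure of the study), the following Python snippet takes manually segmented vowel and consonant durations per token and averages the ratio per vowel category; the token data are invented for illustration.

# Sketch only: mean V/C duration ratio per vowel category from segmented
# tokens. The durations (ms) below are invented examples.
from collections import defaultdict

# (vowel category, vowel duration in ms, following consonant duration in ms)
tokens = [
    ("i:", 180, 120), ("i:", 175, 130),   # tense /i:/ tokens
    ("i", 95, 160), ("i", 100, 150),      # lax /i/ tokens
]

sums = defaultdict(lambda: [0.0, 0])
for vowel, v_dur, c_dur in tokens:
    sums[vowel][0] += v_dur / c_dur       # V/C ratio for this token
    sums[vowel][1] += 1

for vowel, (total, n) in sums.items():
    print(f"{vowel}: mean V/C ratio = {total / n:.2f} (n={n})")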

[Figure 1: bar chart "Swedes speaking English": V/C ratio (0-1.6) for the vowels i:, i, u:, u, e:, e; bars for swedL1, swedL2 and engL1.]

Figure 1 shows the calculated V/C duration ratios for all the tense-lax vowel pairs in English.
The three bars above the vowel symbols in the graph represent the V/C ratios for native
Swedish (unfilled bar), Swedes speaking English (black bar) and the native English speakers.

The native Swedish data in Figure 1 is taken from Elert (1964). Although the group data may
mask some of the potentially interesting individual behavior in the L2 users, it could reveal
some broad tendencies that are relevant to our question as to whether or not a durational
pattern is preserved in the English of the Swedish natives.
Figure 1 indicates that while the native speakers of Swedish were not able to produce the durational aspects of English authentically, neither were they simply using the patterns familiar from their L1, i.e. the Swedish norm. In terms of the V/C ratio, the L2 users as a group appear to approach the English pattern, but their values are somewhere in between the Swedish and the
English norms for all the vowels. This result is reminiscent of a VOT study by Flege and
Eefting (1986) where the VOT values of the native Spanish speakers speaking English were
between those of native English and native Spanish. Those authors interpreted this result as
equivalence classification, although this may imply a stricter adherence to the L1 pattern than can be seen in the results.
The L2 ratios in Figure 2 are compared to the Swedish native standard and the L2
(Japanese) native standard as in Figure 1. The Swedish L2 users’ realization of the Japanese
V:C syllables appears to be similar to the results for the realization of the English tense vowels in VC sequences seen in Figure 1, although the realization of /i:/ is better, i.e.
closer to the native Japanese values, than the other long vowels /u:/ and /e:/. In these cases the
L2 users have not been able to produce authentic Japanese syllables. The long /u:/ shows a result similar to those for English in Figure 1. The native Swedes produce a syllable with a V/C
ratio in between the standard Swedish and the standard Japanese values. The V:C sequence
with /e:/, however, was produced in a way similar to the Swedish standard. An interesting
aspect of Figure 2 is the realization of the short vowels in VC sequences: the native Japanese syllables and the native Swedish syllables are quite similar. The native Swedes' version of a
VC syllable with a short vowel is, with respect to the durational relationships, similar to the
authentic Japanese syllables. In this case it would seem that the application of the duration
rules for Swedish quantity could have yielded a rather good rendering of the Japanese
contrast. These results indicate that in the case of the realization of Japanese quantity, the
transfer of at least some of the aspects of the Swedish quantity contrast pattern is part of the
Swedes’ strategy in learning Japanese quantity. The durational aspects of the English tense-
lax contrast present a somewhat less clear picture of the transfer phenomenon. It looks like the
Swedish natives are attempting to render the contrast but could be unsuccessful because of
their tendency to continue to apply the L1 pattern in their L2 use.

[Figure 2: bar chart "Swedes speaking Japanese": V/C ratio (0-1.6) for the vowels i:, i, u:, u, e:, e; bars for swedL1, swedL2 and jpnL1.]

Figure 2 shows the calculated V/C duration ratios for the short and long vowels in Japanese, averaged over both isolated words and words which occurred in a sentence.

Further work on this material can give us a clearer idea of what residue from the L1 there
might be in the phonetic realization of an L2 contrast.

References
Elert, C-C., 1964. Phonologic Studies of Quantity in Swedish. Uppsala: Monografier utgivna
av Stockholms kommunalförvaltning 27.
Flege, J. & W. Eefting, 1986. The production and perception of English stops by Spanish
speakers of English. Journal of Phonetics 15, 67-83.
McAllister, R., J.L. Flege & T. Piske, 2003. The influence of L1 on the acquisition of Swedish
quantity by native speakers of Spanish, English and Estonian. Journal of Phonetics 30, 229-
258.
Schaeffler, F., 2005. Phonological Quantity in Swedish Dialects. PHONUM 10.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 97
Working Papers 52 (2006), 97–100

Cross-speaker Variations in Producing Attitudinally Varied Utterances in Japanese
Yasuko Nagano-Madsen1 and Takako Ayusawa2
1 Department of Oriental and African Languages, Göteborg University
yasuko.madsen@japan.gu.se
2 Department of Japan Studies, Akita International University
ayusawa@aiu.ac.jp

Abstract
Several acoustic phonetic parameters were analysed for six professional speakers of Japanese
who produced attitudinally-varied utterances. The results showed both agreement and
discrepancies among the speakers, implying that pragmatic information can be expressed in
at least a few alternative ways in Japanese and that this line of research needs more
attention.

1 Introduction
It is well known that pragmatic information can be conveyed by a set of tunes (or pitch-accents in more recent terminology) in a language like English, which has traditionally been called an intonational language. How such pragmatic information is conveyed in a tone or
pitch-accent language in which pitch shape is lexically determined is much less clear. For
Japanese, Maekawa & Kitagawa (2002) conducted pioneering research on the production and
perception of paralinguistic phenomena. We have earlier reported the F0 shape characteristics
to show how speakers choose pitch shapes and phrasing to convey pragmatic meanings in
Japanese (Nagano-Madsen & Ayusawa, 2005). In this paper, we will report other phonetic
cues used by the same speakers. The attitudes tested are NEU(tral), DIS(appointment),
SUS(picious), JOY, and Q(uestion). Three phonologically balanced short utterances were
produced as a reply by six speakers – three male and three female speakers. For details on
data, speakers, and procedure, see Nagano-Madsen & Ayusawa (2005).

2 F0 characteristics
2.1 Pitch range
In order to make the cross-speaker comparison more meaningful, F0 features are calculated on
a semitone scale rather than in absolute Hz values. The average pitch ranges for the female
and male speakers were 13.9 and 14.3 semitones respectively. Table 1 gives the average pitch range in semitones for the six speakers for the five attitude types; the overall average pitch range increases in ascending order as DIS<NEU<SUS<Q<JOY. Speakers
were uniform in using their narrowest pitch range, on average 10.9 semitones, in expressing
attitude DIS. The fact that attitude DIS had the narrowest pitch range agrees with the findings
reported in Maekawa & Kitagawa (2002), though the exact magnitude of range cannot be
compared with their data. The widest pitch range was used for JOY, with an average of 16.3
semitones. Considerable cross-speaker variation is found in the use of overall pitch range
indicating that the overall pitch range alone cannot be regarded as a reliable acoustic phonetic
cue for attitude types. The male speakers keep their pitch ranges for NEU and JOY closer to each other
than the female speakers do. It would be interesting to know if the three male speakers instead use
other phonetic cues more actively than female speakers.

Table 1. Cross-speaker variation in F0 range for attitude (F0 maxima minus (final) F0 minima
in semitones).
                    Female speakers         Male speakers
Attitude/speaker    U      V      W         X      Y      Z        All the speakers
Q                  11.4   14.7   14.5      16.0   15.8   17.2      15.0 (2.68)
SUS                12.0   19.2   17.7      12.0   13.4   14.7      14.9 (3.27)
JOY                15.5   16.3   20.1      13.2   17.1   15.5      16.3 (3.73)
DIS                10.8   10.8   10.4      11.3    9.5   12.3      10.9 (1.66)
NEU                12.5   13.1   12.4      14.5   15.6   16.4      14.1 (1.94)
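As a minimal illustration of the semitone scale used in Table 1 (not code from the study), the pitch range can be computed as 12*log2(F0max/F0min), with the (final) F0 minimum as reference; the example values below are invented.

# Sketch only: pitch range in semitones from an F0 maximum and a (final) F0
# minimum in Hz, as used for Table 1. Example values are invented.
import math

def range_semitones(f0_max_hz, f0_min_hz):
    return 12.0 * math.log2(f0_max_hz / f0_min_hz)

print(f"{range_semitones(320.0, 140.0):.1f} semitones")  # a wide, JOY-like range
print(f"{range_semitones(200.0, 110.0):.1f} semitones")  # a narrower, DIS-like range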

2.2 Pitch range for initial rise and final fall


The pitch range was calculated for the initial F0 rise and final F0 fall (cf. Figure 1 below). A
typical manifestation of the pitch range of initial F0 rise is in ascending order
DIS<NEU<Q<SUS<JOY. The cross-speaker pitch range variation for the initial F0 rise is far
more consistent than that of the final fall in relation to attitude type. Note that there is a
considerable cross-speaker variation in the manifestation of pitch range for the F0 fall for
attitudes SUS and JOY, but not for the initial F0 rise.

Figure 1. Pitch range in semitones for the initial F0 rise (left) and for the F0 fall (right).

2.3 Pitch range for final rise (Q and SUS)


Question utterances in Japanese are typically accompanied by a terminal rising contour. In the
present data, even SUS utterances had a regular terminal rise. However, the final F0 rises for Q and SUS were consistently differentiated in magnitude (cf. Figure 2). The average F0 rise
for Q was 9.1 semitones (SD=3.04) while that for SUS was 12.9 semitones (SD=4.72). The
difference in magnitude clusters around 2-4 semitones for most speakers, but speakers U and W had more
extreme values.

Figure 2 (left). Magnitude of utterance final F0 rise in semitone for Q and SUS.
Figure 3 (right). F0 peak values for the six speakers. U, V, W are female speakers.

2.4 F0 peak value


Figure 3 shows the F0 peak values for different attitude types. Four out of six speakers had the
same order in the F0 peak value, which in ascending order is DIS<NEU<SUS<Q<JOY. Note
that this order is similar to that of the pitch range of the initial F0 rise, except for the relative order of SUS and Q.
Two speakers, both male, had their highest F0 peak for Q rather than JOY.

2.5 F0 peak delay


The relevance of the F0 peak delay, i.e. when the F0 peak is not on the expected syllable with which the phonological accent is affiliated, has been discussed for some time in relation to pragmatics. In the present data, F0 peak delays were common even for NEU (cf. Figure 4). All except one case (speaker X for NEU) had a peak delay varying from one to six morae. All the speakers had the least peak delay for NEU, while the peak delay in relation to the other attitude types varied considerably across speakers, with SUS showing the most agreement in delay. Since the diversity among the speakers is great, it seems that the F0 peak delay per se is not a reliable correlate for attitude types.

Figure 4 (left). The timing of F0 peak with the mora. When it is 0, the F0 peak is in the
syllable (mora) to which the accent is phonologically affiliated and there is no delay.
Figure 5 (right). Intensity peak measurements in dB.

3 Intensity peak
There was good cross-speaker agreement in the intensity peak value with attitude type. The highest intensity peak value (average 75 dB) was found for JOY, while the lowest (average 68 dB) was found for DIS (cf. Figure 5 above). Intensity peaks in relation to the attitude types varied less across speakers. In contrast to the difference between JOY and DIS, the variation in the intensity peak value for the other types of attitudes is small (71-72 dB on average). However, speakers differ considerably in the magnitude of intensity peaks. Some speakers vary the intensity greatly across attitude types (speakers W and Z), while speaker U hardly varied
it. Intensity peaks and F0 peaks correlate to some extent, yet it is clear that the two parameters
should be treated separately. Note that speakers W and U have very similar F0 peak values but
different intensity peak values.

4 Duration (speaking rate) and pause


Average total utterance duration for the three utterances for each speaker is presented in
Figure 6. Of the three utterances, the utterance /a-soo-desuka/ permits the insertion of a pause
after the initial interjection /a/. When pause duration is included, this utterance shows the same durational pattern across the attitude types as the other two utterances, which have no pause. Therefore, we interpreted the pause as part of the durational manifestation and included it in the total utterance duration. The smallest cross-speaker variation was found for NEU, for which all except one speaker used the shortest duration, clustering around 600-800 ms. In absolute duration values, speakers were also uniform for SUS, which falls in the range between 1000 and 1200 ms. The greatest cross-speaker variation was found for DIS, for which the duration of the utterance varied from 800 ms to 1250 ms.

Figure 6 (left). Average utterance duration for each attitude type.


Figure 7 (right). Plotting of F1 and F2 for the vowel /a/ (speaker Z).

5 Vowel quality
Auditory impressions suggested considerable intra- and cross-speaker variation in the use of
voice quality in general as well as in the specifically tested attitude types. Since the acoustic
cues for voice quality are less straightforward than other acoustic cues, we only present the
differences in vowel quality in this paper. Figure 7 above shows the manifestation of vowel
quality by speaker Z. This speaker differentiated the vowel quality of /a/ in such a way that
SUS and JOY had a more front quality than NEU, Q, and DIS. The figure also shows the
formant values of /a/ in the nonsense word /mamamama/ spoken neutrally by the same speaker.

6 Summary and discussion


Together with our earlier report on F0 shape and phrasing (Nagano-Madsen & Ayusawa
2005), both agreement and discrepancies were observable among the six speakers in their
manifestation of attitudes. It seems that pragmatic information can be expressed in at least a
few alternative ways in Japanese and that this line of research needs more attention.

References
Maekawa, K. & N. Kitagawa, 2002. How does speech transmit paralinguistic information? (in
Japanese). Cognitive Studies 9(1), 46-66.
Nagano-Madsen, Y. & T. Ayusawa, 2005. Prosodic correlates of attitudinally-varied back
channels in Japanese. Proceedings FONETIK 2005, Department of Linguistics, Göteborg
University, 103-106.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 101
Working Papers 52 (2006), 101–104

Emotion Recognition in Spontaneous Speech


Daniel Neiberg1, Kjell Elenius1, Inger Karlsson1, and Kornel Laskowski2
1 Department of Speech, Music and Hearing, KTH, Stockholm
{neiberg|kjell|inger}@speech.kth.se
2 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
kornel@cs.cmu.edu

Abstract
Automatic detection of emotions has been evaluated using standard Mel-frequency Cepstral
Coefficients, MFCCs, and a variant, MFCC-low, that is calculated between 20 and 300 Hz in
order to model pitch. Plain pitch features have been used as well. These acoustic features
have all been modeled by Gaussian mixture models, GMMs, on the frame level. The method
has been tested on two different corpora and languages; Swedish voice controlled telephone
services and English meetings. The results indicate that using GMMs on the frame level is a
feasible technique for emotion classification. The two MFCC methods have similar perform-
ance, and MFCC-low outperforms the pitch features. Combining the three classifiers signifi-
cantly improves performance.

1 Introduction
Recognition of emotions in speech is a complex task that is furthermore complicated by the
fact that there is no unambiguous answer to what the “correct” emotion is for a given speech
sample (Scherer, 2003; Batliner et al., 2003). Emotion research can roughly be viewed as
going from the analysis of acted speech (Dellaert et al., 1996) to more "real" speech, e.g. from automated telephone services (Blouin & Maffiolo, 2005). The motivation for the latter is often to try to enhance the performance of such systems by identifying frustrated users.
A difficulty with spontaneous emotions is in their labeling, since the actual emotion of the
speaker is almost impossible to know with certainty. Also, emotions occurring in spontaneous
speech seem to be more difficult to recognize compared to acted speech (Batliner et al., 2003).
In Oudeyer (2002), a set of 6 features selected from 200 is claimed to achieve good accuracy
in a 2-person corpus of acted speech. This approach is adopted by several authors. They ex-
periment with large numbers of features, usually at the utterance level, and then rank each
feature in order to find a small golden set, optimal for the task at hand (Batliner et al., 1999).
Classification results reported on spontaneous data are sparse in the literature. In Blouin &
Maffiolo (2005), the corpus consists of recordings of interactions between users and an auto-
matic voice service. The performance is reported to flatten out when 10 out of 60 features are
used in a linear discriminant analysis (LDA) cross-validation test. In Chul & Narayanan
(2005), data from a commercial call centre was used. As is frequently the case, the results for
various acoustic features were only slightly better than a system classifying all exemplars as
neutral. Often authors use hundreds of features per utterance, meaning that most spectral
properties are covered. Thus, to use spectral features, such as MFCCs, possibly with addi-
tional pitch measures, may be seen as an alternative. Delta MFCC measures on the utterance
level have been used earlier, e.g. in Oudeyer (2002). However, we have chosen to model the
distribution of the MFCC parameters on the frame level in order to obtain a more detailed de-
scription of the speech signal.

In spontaneous speech the occurrence of canonical emotions such as happiness and anger is
typically low. The distribution of classes is highly unbalanced, making it difficult to measure
and compare performance reported by different authors. The difference between knowing and
not knowing the class distribution will significantly affect the results. Therefore we will in-
clude results from both types of classifiers.

2 Material
The first material used was recorded at 8 kHz at the Swedish company Voice Provider (VP), which runs more than 50 different voice-controlled telephone services. Most utterances are neutral (non-expressive), but some percent are frustrated, most often due to misrecognitions by the speech recognizer (Table 1). The utterances are labeled by an experienced, senior voice researcher into neutral, emphasized or negative (frustrated) speech. A subset of the material was labeled by 5 different persons and the pair-wise inter-labeler kappa was 0.75 – 0.80.
In addition to the VP data, we apply our approach to meeting recordings. The ISL Meeting Corpus consists of 18 meetings, with an average number of 5.1 participants per meeting and an average duration of 35 minutes. The audio is of 16 bit, 16 kHz quality, recorded with lapel microphones. It is accompanied by orthographic transcription and annotation of emotional valence (negative, neutral, positive) at the speaker contribution level (Laskowski & Burger, 2006). The emotion labels were constructed by majority voting (2 of 3) for each segment. Split decisions (one vote for each class) were removed. Finally, the development set was split into two subsets that were used for cross-wise training and testing.
Both corpora were split into a development and an evaluation set, as shown in Table 1.

Table 1. Materials used.

VP development set            VP evaluation set
Neutral    3865   94 %        Neutral    3259   93 %
Emphatic     94    2 %        Emphatic     66    2 %
Negative    171    4 %        Negative    164    5 %
Total      4130               Total      3489

ISL development set           ISL evaluation set
Neutral    6312   80 %        Neutral    3259   70 %
Negative    273    3 %        Negative    151    3 %
Positive   1229   16 %        Positive    844   19 %
Total      7813               Total      4666

3 Features
Thirteen Standard MFCC parameters were extracted from 24 Mel-scaled logarithmic filters
from 300 to 3400 Hz. Then we applied RASTA-processing (Hermansky & Morgan, 1994).
Delta and delta-delta features were added, resulting in a 39 dimensional vector. For the ISL
material we used 26 filters from 300 to 8000 Hz; otherwise the processing was identical.
MFCC-low features were computed similarly to the standard MFCCs but the filters ranged
from 20 to 300 Hz. We expected these MFCCs to model F0 variations.
Pitch was extracted using the Average Magnitude Difference Function (Ross et al., 1974), as reported by Langlais (1995). We used a logarithmic scale, subtracting the utterance mean. Delta features were also added.
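As a hedged sketch of the feature extraction (not the authors' implementation; in particular, the RASTA processing used in the paper is omitted), the following Python code uses librosa to compute 13 standard MFCCs from 24 mel filters over 300-3400 Hz and 13 "MFCC-low" coefficients from filters over 20-300 Hz, each with delta and delta-delta features; the file name and frame settings are assumptions.

# Sketch only: standard MFCCs (24 filters, 300-3400 Hz) and "MFCC-low"
# (filters over 20-300 Hz), each with delta and delta-delta features.
# RASTA processing, used in the paper, is omitted here.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)    # hypothetical VP-style file

def mfcc_stack(y, sr, fmin, fmax, n_mels=24, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_mels=n_mels, fmin=fmin, fmax=fmax)
    d1 = librosa.feature.delta(m)
    d2 = librosa.feature.delta(m, order=2)
    return np.vstack([m, d1, d2])                 # 39 x n_frames

mfcc_std = mfcc_stack(y, sr, fmin=300.0, fmax=3400.0)
mfcc_low = mfcc_stack(y, sr, fmin=20.0, fmax=300.0)
print(mfcc_std.shape, mfcc_low.shape)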

4 Classifiers
All acoustic features are modeled using Gaussian mixture models (GMMs) with diagonal co-
variance matrices measured over all frames of an utterance. First, using all the training data, a
root GMM is trained with the Expectation Maximization (EM) algorithm with a maximum
likelihood criterion, and then one GMM per class is adapted from the root model using the
maximum a posteriori criterion (Gauvin & Lee, 1994). We use 512 Gaussians for MFCCs and
64 Gaussians for pitch features. These numbers were empirically optimized. This way of us-
ing GMMs has proved successful for speaker verification (Reynolds et al., 2000). The outputs
from the three classifiers were combined using multiple linear regression, with the final class
selected as the argmax over the per-class least square estimators. The transform matrix was
estimated from the training data.
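The sketch below illustrates the general scheme described above with scikit-learn: a root GMM with diagonal covariances is trained on all training frames, class models are derived by a simplified mean-only MAP adaptation, and an utterance is classified by the summed frame log-likelihood under each class model. This is a schematic re-implementation under assumptions (a small number of Gaussians, a relevance factor of 16, random stand-in features), not the system used in the paper.

# Sketch only: root GMM (diagonal covariances) plus simplified mean-only MAP
# adaptation per class; classification by summed frame log-likelihood.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_root_gmm(frames, n_components=64):
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=200)
    gmm.fit(frames)                                 # EM on all training frames
    return gmm

def map_adapt_means(root, class_frames, relevance=16.0):
    """Copy of the root GMM with MAP-adapted (mean-only) components."""
    resp = root.predict_proba(class_frames)         # (n_frames, n_components)
    n_k = resp.sum(axis=0) + 1e-10
    x_bar = (resp.T @ class_frames) / n_k[:, None]  # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = copy.deepcopy(root)
    adapted.means_ = alpha * x_bar + (1.0 - alpha) * root.means_
    return adapted

def classify(utterance_frames, class_models):
    scores = {label: model.score_samples(utterance_frames).sum()
              for label, model in class_models.items()}
    return max(scores, key=scores.get)

# Hypothetical usage with random stand-in 39-dimensional feature vectors
rng = np.random.default_rng(0)
train = {"neutral": rng.normal(size=(2000, 39)),
         "negative": rng.normal(loc=0.3, size=(200, 39))}
root = train_root_gmm(np.vstack(list(train.values())))
models = {label: map_adapt_means(root, X) for label, X in train.items()}
print(classify(rng.normal(size=(120, 39)), models))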

5 Experiments
We ran our experiments with the features and classifiers described above. An acoustic combination was composed of the GMMs for MFCC, MFCC-low, and pitch. The combination matrix was estimated by first testing each GMM on its training data.

6 Results
Performance is measured as absolute accuracy, average recall (for all classes) and f1, computed from the average precision and recall for each classifier. The results are compared to two naïve classifiers: a random classifier that classifies everything with equal class priors, random with equal priors, and a random classifier knowing the true prior distribution over classes in the training data, random using priors. The combination matrix accounts for the prior distribution in the training data, heavily favoring the neutral class. Therefore a weight vector which forces the matrix to normalize to an equal prior distribution was also used. Thus we report two more results: acoustic combination with equal priors, which is optimized for the accuracy measure, and acoustic combination using priors, which optimizes the average recall rate. Thus, classifiers under the random equal priors heading do not know the a priori class distribution and should only be compared to each other. The same holds for the classifiers under random using priors. Note that the performance difference in percentages is higher for a classifier not knowing the prior distribution compared to its random classifier, than for the same classifier knowing the prior distribution compared to its random classifier. This is due to the skewed prior distributions.

Table 2. Results. Accuracy, Average Recall, f1.

VP: Neutral vs. Emphasis vs. Negative
Classifier                     Acc.   A.Rec.  f1
Random with equal priors       0.33   0.33    0.33
MFCC                           0.80   0.43    0.40
MFCC-low                       0.78   0.39    0.37
Pitch                          0.56   0.40    0.38
Acoustic combination           0.90   0.37    0.39
Random using priors            0.88   0.33    0.33
Acoustic comb. using priors    0.93   0.34    0.38

ISL: Negative vs. Neutral vs. Positive
Classifier                     Acc.   A.Rec.  f1
Random with equal priors       0.33   0.33    0.33
MFCC                           0.66   0.49    0.47
MFCC-low                       0.66   0.46    0.44
Pitch                          0.41   0.38    0.37
Acoustic combination           0.79   0.50    0.47
Random using priors            0.67   0.33    0.33
Acoustic comb. using priors    0.82   0.42    0.48
From Table 2 we note that all classifiers with equal priors perform substantially better than
the random classifier. The MFCC-low classifier is almost as good as the standard MFCC and
considerably better than the pitch classifier.
Regarding the ISL results in Table 2 we again notice that the pitch feature does not perform
on the same level as the MFCC features. When the distribution of errors for the individual
classes was examined, it revealed that most classifiers were good at recognizing the neutral
and positive class, but not the negative one, most probably due to its low frequency resulting
in poor training statistics.

7 Conclusion
Automatic detection of emotions has been evaluated using spectral and pitch features, all
modeled by GMMs on the frame level. Two corpora were used: telephone services and meet-
ings. Results show that frame level GMMs are useful for emotion classification.
The two MFCC methods show similar performance, and MFCC-low outperforms pitch
features. A reason may be that MFCC-low gives a more stable pitch measure. Also, it may be
due to its ability to capture voice source characteristics, see Syrdal (1996), where the level dif-
ference between the first and the second harmonic is shown to distinguish between phona-
tions, which in turn may vary across emotions.
The diverse results of the two corpora are not surprising considering their discrepancies.
A possible way to improve performance for the VP corpus would be to perform emotion
detection on the dialogue level rather than the utterance level, and also take the lexical content
into account. This would mimic the behavior of the human labeler.
Above we have indicated the difficulty to compare emotion recognition results. However, it
seems that our results are at least on par with those in Blouin & Maffiolo (2005).

Acknowledgements
This work was performed within CHIL, Computers in the Human Interaction Loop, an EU 6th
Framework IP (506909). We thank Voice Provider for providing speech material.

References
Batliner, A., J. Buckow, R. Huber, V. Warnke, E. Nöth & H. Niemann, 1999. Prosodic
Feature Evaluation: Brute Force or Well Designed? Proc. 14th ICPhS, 2315-2318.
Batliner, A., K. Fischer, R. Huber, J. Spilkera & E. Nöth, 2003. How to find trouble in
communication. Speech Communication 40, 117-143.
Blouin, C. & V. Maffiolo, 2005. A study on the automatic detection and characterization of
emotion in a voice service context. Proc. Interspeech, Lisbon, 469-472.
Chul, M.L. & S. Narayanan, 2005. Toward Detecting Emotions in Spoken Dialogs. IEEE,
Transactions on Speech and Audio Processing 13(2), 293-303.
Dellaert, F., T.S. Polzin & A. Waibel, 1996. Recognizing emotion in speech. Proc. ICSLP,
Philadelphia, 3:1970-1973.
Gauvin, J-L. & C.H. Lee, 1994. Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Trans. SAP 2, 291-298.
Hermansky, H. & N. Morgan, 1994. RASTA processing of speech. IEEE Trans. SAP 4, 578-
589.
Langlais, P., 1995. Traitement de la prosodie en reconnaissance automatique de la parole.
PhD-thesis, University of Avignon.
Laskowski, K. & S. Burger, 2006. Annotation and Analysis of Emotionally Relevant Behavior
in the ISL Meeting Corpus. LREC, Genoa.
Oudeyer, P., 2002. Novel Useful Features and Algorithms for the Recognition of Emotions in Human Speech. Proc. of the 1st Int. Conf. on Speech Prosody.
Reynolds, D., T. Quatieri & R. Dunn, 2000. Speaker verification using adapted Gaussian
mixture models. Digital Signal Processing 10, 19-41.
Ross, M., H. Shafer, A. Cohen, R. Freudberg & H. Manley, 1974. Average magnitude
difference function pitch extraction. IEEE Trans. ASSP-22, 353-362.
Scherer, K.R., 2003. Vocal communication of emotion: A review of research paradigms.
Speech Communication 40, 227-256.
Syrdal, A.K., 1996. Acoustic variability in spontaneous conversational speech of American
English talkers. Proc. ICSLP, Philadelphia.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 105
Working Papers 52 (2006), 105–108

Data-driven Formant Synthesis of Speaker Age


Susanne Schötz
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
susanne.schotz@ling.lu.se

Abstract
This paper briefly describes the development of a research tool for analysis of speaker age
using data-driven formant synthesis. A prototype system was developed to automatically ex-
tract 23 acoustic parameters from the Swedish word ‘själen’ [ˈɧɛːlən] (the soul) spoken by four
differently aged female speakers of the same dialect and family, and to generate synthetic
copies. Functions for parameter adjustment as well as audio-visual comparison of the natural
and synthesised words using waveforms and spectrograms were added to improve the synthe-
sised words. Age-weighted linear parameter interpolation was then used to synthesise a tar-
get age anywhere between the ages of 2 source speakers. After an initial evaluation, the sys-
tem was further improved and extended. A second evaluation indicated that speaker age may
be successfully synthesised using data-driven formant synthesis and weighted linear interpo-
lation.

1 Introduction
In speech synthesis applications like spoken dialogue systems and voice prostheses, the need
for voice variation in terms of age, emotion and other speaker-specific qualities is growing.
To contribute to the research in this area, as part of a larger study aiming at identifying pho-
netic age cues, a system for analysis by synthesis of speaker age was developed using data-
driven formant synthesis. This paper briefly describes the developing process and results.
Research has shown that acoustic cues to speaker age can be found in almost every phonetic
dimension, i.e. in F0, duration, intensity, resonance, and voice quality (Hollien, 1987; Jacques
& Rastatter, 1990; Linville, 2001; Xue & Deliyski, 2001). However, the relative importance
of the different cues has still not been fully explored. One reason for this may be the lack of an
adequate analysis tool in which a large number of potential age parameters can be varied sys-
tematically and studied in detail.
Formant synthesis generates speech from a set of rules and acoustic parameters, and is con-
sidered both robust and flexible. Still, the more natural-sounding concatenation synthesis is
generally preferred over formant synthesis (Narayanan & Alwan, 2004). Lately, formant syn-
thesis has made a comeback in speech research, e.g. in data-driven and hybrid synthesis with
improved naturalness (Carlson et al., 2002; Öhlin & Carlson, 2004).

2 Material
Four female non-smoking native Swedish speakers of the same family and dialect were se-
lected to represent different ages, and recorded twice over a period of 3 years: Speaker 1: girl (aged 6 and 9), Speaker 2: mother (aged 36 and 39), Speaker 3: grandmother (aged 66 and 69), and Speaker 4: great grandmother (aged 91 and 94). The isolated word ‘själen’ [ˈɧɛːlən] (the soul) was selected as a first test word, and the recordings were segmented into phonemes, resampled to 16 kHz, and normalized for intensity.

3 Method and procedure


The prototype system was developed in several steps (see Figure 1). First, a Praat (Boersma &
Weenink, 2005) script extracted 23 acoustic parameters every 10 ms. These were then used as
input to the formant synthesiser GLOVE, which is an extension of OVE III (Liljencrants,
1968) with an expanded LF voice source model (Fant et al., 1985). GLOVE was used by kind
permission of CTT, KTH. For a more detailed description, see Carlson et al. (1991).
[Figure 1: schematic of the prototype system: input natural speech, automatic parameter extraction, formant synthesiser, output synthetic speech; with parameter adjustment based on audio-visual comparison with the natural speech and the previously synthesised version.]
Figure 1. Schematic overview of the prototype system.

Next, the parameters were adjusted to generate more natural-sounding synthesis. To be able to
compare the natural speech to the synthetic versions, another Praat script was developed,
which first called the parameter extraction script, and then displayed waveforms and spectro-
grams of the original word, the resulting synthetic word, as well as the previous synthetic ver-
sion. By auditive and visual comparison of the three files, the user could easily determine
whether a newly added parameter or adjustment had improved the synthesis. If an adjustment
improved the synthesis, it was added to the adjustment rules. Formants, amplitudes and voice
source parameters (except F0) caused the most serious problems, which were first solved us-
ing fixed values, then by parameter smoothing.
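As a hedged illustration of the parameter smoothing step (the paper does not state which smoothing method was used), the sketch below applies a median filter to a formant track sampled every 10 ms; the track values and window length are invented.

# Sketch only: smoothing a formant track (one value per 10 ms frame) with a
# median filter. The method and window length are assumptions, not the
# smoothing actually used in the system.
import numpy as np
from scipy.signal import medfilt

f2_track = np.array([1650, 1660, 2410, 1655, 1648, 1640, 980, 1635, 1630],
                    dtype=float)              # invented track with two outliers
f2_smooth = medfilt(f2_track, kernel_size=5)
print(f2_smooth)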
[Figure 2: schematic of the age interpolation method: a target age is given as input; age weights are calculated; the parameter files of two source speakers are selected; durations are interpolated for each segment and the remaining parameters for each parameter track; the formant synthesiser produces the output synthesis of the target age.]

Figure 2. Schematic overview of the age interpolation method.

An attempt to synthesise speaker age was carried out using the system. The basic idea was to
use the synthetic versions of the words to generate new words of other ages by age-weighted
linear interpolation between two source parameter files. A Java program was developed to
calculate the weights and to perform the interpolations. For each target age provided as input
by the user, the program selects the parameter files of two source speakers (the older and
younger speakers closest in age to the target age), and generates a new parameter file from the
interpolations between the two source parameter files. For instance, for the target age of 51,
i.e. exactly half-way between the ages of Speaker 2 (aged 36) and Speaker 3 (aged 66), the
program selects these two speakers as source speakers, and then calculates the age weights as
0.5 for both source speakers. Next, the program calculates the target duration for each pho-
neme segment using the age weights and the source speaker durations. If the duration of a
particular segment is 100 ms for source Speaker 1, and 200 ms for source Speaker 2, the target
duration for the interpolation is 200 x 0.5 + 100 x 0.5 = 150 ms. All parameter values are then
interpolated in the same way. Finally, the target parameter file is synthesised using GLOVE,
and displayed (waveform and spectrogram) in Praat along with the two input synthetic words
for comparison. A schematic overview of the procedure is shown in Figure 2.
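A minimal Python sketch of the age-weighted interpolation described above (the actual system is a Java program; the data layout below, with one value per parameter, is a simplifying assumption standing in for the full per-frame parameter files):

# Sketch only: age-weighted linear interpolation between two source speakers.
def age_weights(target_age, age_younger, age_older):
    w_older = (target_age - age_younger) / (age_older - age_younger)
    return 1.0 - w_older, w_older             # (weight_younger, weight_older)

def interpolate(target_age, younger, older):
    w_y, w_o = age_weights(target_age, younger["age"], older["age"])
    return {key: w_y * younger[key] + w_o * older[key]
            for key in younger if key != "age"}

# Example: target age 51, half-way between Speaker 2 (36) and Speaker 3 (66)
speaker2 = {"age": 36, "segment_dur_ms": 100.0, "f0_hz": 210.0}
speaker3 = {"age": 66, "segment_dur_ms": 200.0, "f0_hz": 190.0}
print(interpolate(51, speaker2, speaker3))
# -> segment duration 0.5 * 100 + 0.5 * 200 = 150 ms, matching the example above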

4 Results
To evaluate the system’s performance, two perception tests were carried out to estimate direct
age and naturalness (on a 7-point scale, where 1 is very unnatural and 7 is very natural). Stim-
uli in the first evaluation consisted of natural and synthetic versions of the 6, 36, 66 and 91
year old speakers. The second evaluation was carried out at a later stage when the 9, 39, 69
and 94 year olds had been included, and when parameter smoothing and pre-emphasis filter-
ing (to avoid muffled quality) had improved the synthesis. 31 students participated in the first
evaluation test, also including interpolations for 8 decades (10 to 80 years), while 21 students
took part in the second, which also comprised interpolations for 7 decades (10 to 70 years).

4.1 First evaluation


In the first evaluation, the correlation curves between chronological age (CA, or simulated
“CA” for the synthetic words) and perceived age (PA) displayed some similarity for the natu-
ral and synthetic words, though the synthetic ones were judged older in most cases, as seen in
Figure 3. The interpolations were mostly judged as much older than both the natural and syn-
thetic words. As for naturalness, the natural words were always judged more natural than the
synthetic ones. Both the natural and synthetic 6 year old versions were judged least natural.
[Figure 3: left: perceived age plotted against "CA" (0-100 years) for natural (Nat), synthetic (Syn) and interpolated (Int) words; right: median perceived naturalness (scale 1-7) for natural and synthetic words at stimulus ages 6, 36, 66 and 91.]

Figure 3. Correlation between PA and CA for natural, synthetic and interpolated words (left),
and median perceived naturalness for natural and synthetic words in the first evaluation.

4.2 Second evaluation


Figure 4 shows that the correlation curves not only for the natural and synthetic words, but also for the interpolations, improved in similarity in the second evaluation compared to the
first one. However, the natural and synthetic versions of the 39, 66 and 69 year olds were
quite underestimated. All natural words were judged as more natural than the synthetic ones,
and all synthetic words except the 6 and 94 year old achieved a median naturalness value of 6.
[Figure 4: left: perceived age plotted against "CA" (0-100 years) for natural (Nat), synthetic (Syn) and interpolated (Int) words; right: median perceived naturalness (scale 1-7) for natural and synthetic words at stimulus ages 6, 9, 36, 39, 66, 69, 91 and 94.]

Figure 4. Correlation between PA and CA for natural, synthetic and interpolated words (left),
and median perceived naturalness for natural and synthetic words in the second evaluation.

5 Discussion and future work


The synthetic words obtained a reasonable resemblance with the natural words in most cases,
and the similarity in age was improved in the second evaluation. The interpolated versions
were often judged as older than the intended age in the first evaluation, but in the second
evaluation they had become more similar in age to the natural and synthetic versions, indicat-
ing that speaker age may be synthesised using data-driven formant synthesis. Still, some of the
age estimations were quite unexpected. For instance, the 39, 66 and 69 year olds were judged
as much younger than their CA. This may be explained by these voices being atypical for their age.
One very important point in this study is that synthesis of age by linear interpolation is in-
deed a crude simplification of the human aging process, which is far from linear. Moreover,
while some parameters may change considerably during a certain period of aging (i.e. F0 and
formant frequencies during puberty), others remain constant. Better interpolation techniques
will have to be tested. One should also bear in mind that the system is likely to interpolate not
only between two ages, but also between a number of individual characteristics, even when
the speakers are closely related.
Future work involves (1) improved parameter extraction for formants, (2) better interpola-
tion algorithms, and (3) expansion of the system to handle more speakers (of both sexes), as
well as a larger and more varied speech material. Further research with a larger material is
needed to identify and rank the most important age-related parameters. If further developed,
the prototype system may well be used in future studies for analysis, modelling and synthesis
of speaker age and other speaker-specific qualities, including dialect and attitude. The pho-
netic knowledge gained from such experiments may then be used in future speech synthesis
applications to generate more natural-sounding synthetic speech.

References
Boersma, P. & D. Weenink, 2005. Praat: doing phonetics by computer (version 4.3.04) [com-
puter program]. Retrieved March 8, 2005, from http://www.praat.org/.
Carlson, R., B. Granström & I. Karlsson, 1991. Experiments with voice modelling in speech
synthesis. Speech Communication 10, 481–489.
Carlson, R., T. Sigvardson & A. Sjölander, 2002. Data-driven formant synthesis. Proceedings
of Fonetik 2002, TMH-QPSR, 121–124.
Fant, G., J. Liljencrants & Q. Lin, 1985. A four-parameter model of glottal flow. STL-QPSR
4, 1–13.
Hollien, H., 1987. Old voices: What do we really know about them? Journal of Voice 1, 2–13.
Jacques, R. & M. Rastatter, 1990. Recognition of speaker age from selected acoustic features
as perceived by normal young and older listeners. Folia Phoniatrica (Basel) 42, 118–124.
Liljencrants, J., 1968. The OVE III speech synthesizer. IEEE Trans AU-16(1), 137–140.
Linville, S.E., 2001. Vocal Aging. San Diego: Singular Thomson Learning.
Narayanan, S. & A. Alwan (eds.), 2004. Text to Speech Synthesis: New Paradigms and Ad-
vances. Prentice Hall PTR, IMSC Press Multimedia Series.
Öhlin, D. & R. Carlson, 2004. Data-driven formant synthesis. Proceedings of Fonetik 2004,
Dept. of Linguistics, Stockholm University, 160–163.
Xue, S.A. & D. Deliyski, 2001. Effects of aging on selected acoustic voice parameters: Pre-
liminary normative data and educational implications. Educational Gerontology 21, 159-
168.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 109
Working Papers 52 (2006), 109–112

How do we Speak to Foreigners? – Phonetic Analyses of Speech Communication between L1 and L2 Speakers of Norwegian
Rein Ove Sikveland
Department of Language and Communication Studies,
The Norwegian University of Science and Technology (NTNU), Trondheim
rein.ove.sikveland@hf.ntnu.no

Abstract
The major goal of this study was to investigate which phonetic strategies we may actually use
when speaking to L2 speakers of our mother tongue (L1). The results showed that speech rate
in general was slower and that the vowel formants were closer to target values, in L2 directed
speech compared to L1 directed speech in Norwegian. These properties of L2 directed speech
correspond to previous findings for clear speech (e.g. Picheny et al., 1986; Krause & Braida,
2004). The results also suggest that level of experience may influence L2 directed speech;
teachers of Norwegian as a second language slowed down the speech rate more than the non-
teachers did, in L2 directed speech compared to L1 directed speech.

1 Introduction
When speaking to foreigners in our mother tongue (L1) it might be natural to speak more clearly than normal to make ourselves understood, which implies the use of certain phonetic strategies when speaking to these second language learners (L2 speakers).
Previous findings by Picheny et al. (1986) and Krause & Braida (2004) have shown that
clear speech can be characterized by a decrease in speech rate, more pauses, relatively more
energy in the frequency region of 1-3 kHz, less phonological reductions (e.g. less burst
eliminations), vowel formants closer to target values, longer VOT and a greater F0 span,
compared to conversational speech. What characterizes L2 directed speech has not been
subject to any previous investigations, but one might assume that strategies in L2 directed
speech correspond to the findings for clear speech. This has been investigated in the present
studies, and the results for speech rate and vowel formants will be presented.

2 Method
To be able to compare speech in L1 and L2 contexts directly, the experiment was carried out
by recording native speakers of Norwegian 1) in dialogue with L2 speakers, and 2) in
dialogue with other L1 speakers. The dialogue setting was based on a keyword manuscript, in order
to facilitate natural speech while still allowing phonetic parameters to be compared in identical
words and phonological contexts.

2.1 Subjects
Six native speakers of Norwegian (with eastern Norwegian dialect background) participated
as informants. Three of them were teachers in Norwegian as a second language, called P
informants (P for “professional”), and three of them were non-teachers, called NP informants
(NP for “non-professional”). Six other L1 speakers and six L2 speakers of Norwegian
participated as opponents to match each informant in the L1 and L2 contexts. Thus there were
18 subjects participating in the experiment, distributed across twelve recordings.

2.2 Procedure
Recordings were made by placing each informant in a studio while the dialogue opponents
were placed in the control room. They communicated through microphones and headphones.
The dialogue setting (though not the sound quality) was meant to represent a phone conversation
between two former roommates/partners, and the role of the informants was to suggest to the
opponent how to distribute their former possessions, written on a list in the manuscript. There
were no lines written in the manuscript, only suggestions of how questions might be asked.
The participants were told to carry out the dialogue naturally, but they were not told to speak
in any specific manner (e.g. “clearly” or “conversationally”). The recordings were analyzed
using spectrograms, spectra and waveforms in the software Praat. Only words from the list of
possessions were used for the analyses, and the
corresponding words/syllables/phonemes were measured for each informant in L1 and L2
contexts.

3 Results
3.1 Speech rate
Speech rate was investigated by measuring syllable duration and number of phonemes per
second, in ten words for each informant in L1 and L2 contexts (altogether 120 words). The
measured words contained four syllables or more. The results showed that syllable duration
was longer, and that the number of phonemes per second was lower, in L2 context compared
to L1 context. Pooled across informants, the average duration of syllables is 221 ms in L1
context and 239 ms in L2 context. This difference is highly significant (t (298) = - 4.790; p <
0.0001), and gives a strong general impression that the speech rate is slower in L2 context
compared to L1 context.
Figure 1. Average number of phonemes per second, for all six informants in L1 and L2
contexts. Error bars are standard deviations.

To describe speech rate more directly, the number of phonemes per second was also examined;
pooled across informants, it was significantly lower in L2 context than in L1 context (t (59) =
3.303; p < 0.002), with an average difference of 0.7 phonemes per second between contexts. Figure
1 above shows the number of phonemes per second for all informants in L1 and L2 context. It also
suggests that the speech rate effect is larger for “professional” (P) informants than for “non-
professional” (NP) informants; the interaction between level of experience and L1/L2 context on
speech rate is significant (F (1, 58) = 7.337; p < 0.009). These results indicate that speech rate
is slower in L2 context than in L1 context, and that the effect of context on speech rate depends
on the speaker’s level of experience.
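
As an illustration of this kind of speech-rate comparison, the sketch below computes phonemes per
second for matched words in the two contexts and runs a paired t-test. The word data are invented,
and scipy is used here only as an example of a statistics tool, not necessarily the one employed
in the study.

    # Sketch: phonemes per second in L1- vs. L2-directed speech (invented data).
    from scipy import stats

    # (number of phonemes, duration in seconds) for the same words in each context
    l1_words = [(9, 0.81), (11, 0.95), (8, 0.72), (10, 0.88)]
    l2_words = [(9, 0.90), (11, 1.07), (8, 0.78), (10, 0.99)]

    rate_l1 = [n / d for n, d in l1_words]   # phonemes per second, L1 context
    rate_l2 = [n / d for n, d in l2_words]   # phonemes per second, L2 context

    # Paired comparison over the matched words
    t, p = stats.ttest_rel(rate_l1, rate_l2)
    print(f"L1 mean = {sum(rate_l1) / len(rate_l1):.2f} phonemes/s, "
          f"L2 mean = {sum(rate_l2) / len(rate_l2):.2f} phonemes/s, t = {t:.2f}, p = {p:.3f}")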

3.2 Vowel formants


Formants F1, F2 and F3, in addition to F0, were measured for long and short vowels /a/, /i/
and /u/, representing the three most peripheral vowels in articulation. Since male and female
speakers have vocal tracts of different sizes and shapes, the results in Table 1 below are
presented for both genders separately. Bold type represents significant differences between
contexts, and the results suggest that F1 in /a:/ is generally higher in L2 directed speech than
in L1 directed speech, for female (t (25) = - 3.686; p < 0.001) and male (t (24) = - 3.806; p <
0.001) speakers. F1 is also significantly higher in L2 context than in L1 context in /a/, for
male speakers (t (20) = - 4.668; p < 0.0001), and in /i/ (t (19) = - 2.113; p < 0.048) and /u:/ (t
(35) = - 2.831; p < 0.008) for female speakers. A difference in F2 between contexts seems to
be evident only for the /i/ vowels, significantly so for male speakers, in /i:/ (t (23) = - 3.079; p
< 0.005) and /i/ (t (23) = - 5.520; p < 0.0001). F3 values are quite variable within vowels and
informants, but significantly higher values in L2 context than in L1 context were found in /i/
(t (23) = - 2.152; p < 0.042) and /u:/ (t (35) = - 3.313; p < 0.004) for male speakers.

Table 1. Average values for F1, F2 and F3 in Hz for female (F) and male (M) informants in
short and long /a/, /i/ and /u/ vowels. Standard deviations are in parentheses. Bold typing
represents statistical significance of differences between L1 and L2 context.
F1 F2 F3
L1 L2 L1 L2 L1 L2
/a:/ F (n=26) 663 (97) 719 (60) 1165 (89) 1192 (107) 2751 (224) 2724 (165)
M (n=25) 578 (66) 632 (46) 1014 (106) 1050 (93) 2562 (209) 2652 (252)
/a/ F (n=20) 729 (121) 738 (67) 1257 (168) 1273 (139) 2728 (226) 2685 (174)
M (n=21) 552 (77) 626 (59) 1076 (89) 1088 (123) 2408 (282) 2475 (288)
/i:/ F (n=24) 403 (77) 391 (76) 2362 (269) 2422 (194) 3044 (311) 3035 (299)
M (n=24) 317 (41) 326 (45) 2029 (124) 2093 (120) 2947 (261) 3026 (269)
/i/ F (n=20) 391 (56) 418 (51) 2287 (229) 2291 (197) 2933 (191) 2951 (154)
M (n=24) 361 (36) 362 (39) 1933 (109) 2036 (121) 2722 (155) 2795 (229)
/u:/ F (n=36) 379 (44) 404 (54) 861 (191) 854 (136) 2728 (209) 2790 (262)
M (n=36) 351 (32) 354 (38) 738 (135) 735 (138) 2476 (212) 2572 (199)
/u/ F (n=18) 394 (56) 418 (58) 1028 (164) 1012 (163) 2690 (216) 2638 (225)
M (n=19) 370 (40) 371 (35) 882 (143) 861 (136) 2387 (188) 2388 (146)

If F1 values correlate positively with degree of opening in vowel articulation, the general rise
in F1, especially for the /a/ vowels, might be interpreted as a result of a more open mouth/jaw
position in L2 context than in L1 context. As suggested by Ferguson & Kewley-Port (2002), a
rise in F1 might also be a result of increased vocal effort, which might provide an additional
explanation for the higher F1 values for /i/ and /u:/. Letting F2 represent the front-back
dimension of the vocal tract (high F2 values for front vowels), one might suggest that /i/
vowels (mostly for male speakers) are produced further front in the mouth in L2 context than
in L1 context. The tendencies toward higher F2 and F3 frequencies in L2 context compared to
L1 context might indicate that the informants do not use more lip rounding when producing
/u/ vowels in L2 context. Rather, this might support our suggestion above that informants in
general use a more open mouth position in L2 context than in L1 context.
According to Syrdal & Gopal (1986), one might expect the relative differences F3-F2 and F1-F0
to describe the front-back and open-closed dimensions (respectively) more precisely than the
absolute formant values. In the present investigations, the F1-F0 relations led to the same
interpretations as F1 alone regarding degree of mouth opening. The F3-F2 relation gave
additional information about the vowel /u:/, in that the F3-F2 difference was significantly
larger in L2 context than in L1 context (t (40) = - 2.302; p < 0.024). This might be
interpreted as /u:/ being produced further back in the mouth in L2 context than in L1 context.
Effects of level of experience on formant values or formant relations were not found, which
indicates that the differences in vowel formants between L1 and L2 contexts are general
among speakers.
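
The relative measures F1-F0 and F3-F2 referred to above can be computed as in the following
sketch. The per-token values are invented, and the paired t-test is only meant to mirror the kind
of per-vowel comparison reported in Table 1.

    # Sketch: relative formant measures F1-F0 (open-closed) and F3-F2 (front-back).
    import numpy as np
    from scipy import stats

    # Invented per-token measurements (Hz) for /u:/ in L1 and L2 context
    f0_l1, f1_l1, f2_l1, f3_l1 = (np.array(x) for x in (
        [210, 195, 220], [380, 372, 385], [870, 855, 860], [2700, 2735, 2720]))
    f0_l2, f1_l2, f2_l2, f3_l2 = (np.array(x) for x in (
        [212, 200, 218], [405, 398, 410], [850, 845, 866], [2790, 2810, 2775]))

    open_l1, open_l2 = f1_l1 - f0_l1, f1_l2 - f0_l2   # open-closed dimension
    back_l1, back_l2 = f3_l1 - f2_l1, f3_l2 - f2_l2   # front-back dimension

    for name, a, b in [("F1-F0", open_l1, open_l2), ("F3-F2", back_l1, back_l2)]:
        t, p = stats.ttest_rel(a, b)
        print(f"{name}: L1 mean {a.mean():.0f} Hz, L2 mean {b.mean():.0f} Hz, "
              f"t = {t:.2f}, p = {p:.3f}")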

4 Conclusions
The results show that L1 speakers modify their pronunciation when speaking to L2 speakers
compared to when speaking to other L1 speakers. We have seen that this was so for speech
rate, in that the informants had longer syllable durations and fewer phonemes per second in
L2 context than in L1 context. The formant values and formant relations indicated that
articulation of the peripheral vowels /a/, /i/ and /u/ was closer to target in L2 context
compared to L1 context, in both degree of opening and front-back dimensions.
The results for L2 directed speech correspond to those found for clear speech (e.g. Picheny
et al., 1986; Krause & Braida, 2004; Bond & Moore, 1994).
Level of experience seemed to play a role in speech rate, in that “professional” L1-L2
speakers differentiated more between L1 and L2 context than “non-professional” L1-L2
speakers did.

References
Bond, Z.S. & T.J. Moore, 1994. A note on the acoustic-phonetic characteristics of
inadvertently clear speech. Speech Communication 14, 325-337.
Ferguson, S.H. & D. Kewley-Port, 2002. Vowel intelligibility in clear and conversational
speech for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 112, 259-
271.
Krause, J.C. & L.D. Braida, 2004. Acoustic properties of naturally produced clear speech at
normal speaking rates. J. Acoust. Soc. Am. 115, 362-378.
Picheny, M.A., N.I. Durlach & L.D. Braida, 1986. Speaking clearly for the hard of hearing 2:
Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing
Research 29, 434-446.
Syrdal, A.K. & H.S. Gopal, 1986. A perceptual model of vowel recognition based on the
auditory representation of American English vowels. J. Acoust. Soc. Am. 79, 1066-1100.
Working Papers 52 (2006), 113–116

A Switch of Dialect as Disguise


Maria Sjöström1, Erik J. Eriksson1, Elisabeth Zetterholm2, and
Kirk P. H. Sullivan1
1 Department of Philosophy and Linguistics, Umeå University
kv00msm@cs.umu.se, erik.eriksson@ling.umu.se, kirk.sullivan@ling.umu.se
2 Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
elisabeth.zetterholm@ling.lu.se

Abstract
Criminals may purposely try to hide their identity by using a voice disguise such as imitating
another dialect. This paper empirically investigates the power of dialect as an attribute that
listeners use when identifying voices and how a switch of dialect affects voice identification.
In order to delimit the magnitude of the perceptual significance of dialect and the possible
impact of dialect imitation, a native bidialectal speaker was the target speaker in a set of four
voice line-up experiments, two of which involved a dialect switch. Regardless of which dialect
the bidialectal speaker spoke he was readily recognized. When the familiarization and target
voices were of different dialects, it was found that the bidialectal speaker was significantly
less well recognized. Dialect is thus a key feature for speaker identification that overrides
many other features of the voice. Whether imitated dialect can be used for voice disguise to
the same degree as native dialect switching demands further research.

1 Introduction
In the process of recognizing a voice, humans attend to particular features of the individual’s
speech being heard. Some of the identifiable features that we listen to when recognizing a
voice have been listed by, among others, Gibbons (2003) and Hollien (2002). The listed
features include fundamental frequency (f0), articulation, voice quality, prosody, vocal
intensity, dialect/sociolect, speech impediments and idiosyncratic pronunciation. The listener
may use all, more, or only a few, of these features when trying to identify a person, depending
on what information is available. Which of these features serve as the most important ones
when recognizing a voice is unclear. Of note, however, is that, according to Hollien (2002),
one of the first things forensic practitioners look at when trying to establish the speaker’s
identity is dialect.
During a crime, however, criminals may purposely try to hide their identity by disguising
their voices. Künzel (2000) reported that the statistics from the German Federal Police Office
show that annually 15-25% of the cases involving speaker identification include at least one
type of voice disguise; some of the perpetrators’ ‘favourites’ include falsetto, pertinent creaky
voice, whispering, faking a foreign accent and pinching one’s nose. Markham (1999)
investigated another possible method of voice disguise, dialect imitation. He had native
Swedish speakers attempt to produce readings in various Swedish dialects that were not their
native dialects. Both the speaker’s ability to consistently keep a natural impression and to
mask his or her native dialect were investigated. Markham found that some speakers are able
to successfully mimic a dialect and hide their own identity. Markham also pointed out that to
avoid suspicion it is as important to create an impression of naturalness as it is to hide one’s
identity when using voice disguise.
In order to baseline and delimit the potential impact on speaker identification by voice
alone due to dialect imitation a suite of experiments were constructed that used a native
bidialectal speaker as the speaker to be identified. The use of a native bidialectal speaker
facilitates natural and dialect consistent stimuli. The four perception tests presented here are
excerpted from Sjöström (2005). The baselining of the potential problem is of central
importance for forensic phonetics since, if listeners can be easily fooled, it undermines
earwitness identification of dialect and suggests that forensic practitioners who currently use
dialect as a primary feature during analysis would need to reduce their reliance on this feature.

2 Method
Four perception tests were constructed. The first two tests investigated whether the bidialectal
speaker was equally recognizable in both his dialects. The second two tests addressed whether
listeners were distracted by a dialect shift between familiarization and the recognition task.

2.1 Speech material


The target bidialectal speaker is a male Swede who reports that he speaks Scanian and a
variety of Stockholm dialect on a daily basis. He was born near Stockholm but moved to
Scania as a five-year-old. An acoustic analysis of the speaker’s dialect voices was performed,
which confirmed that his two varieties of Swedish carry the typical characteristics of the two
dialects and that he is consistent in his use of them.
Two recordings of The Princess and the Pea were made by the bidialectal speaker. In one
of them he read the story using the Stockholm dialect, and in the other he read it using his
Scanian dialect.
Four more recordings of The Princess and the Pea were made; two by two male mono-
dialectal speakers of the Stockholm dialect (ST) and two by two male mono-dialectal speakers
of the Scanian dialect (SC). These speakers (hereafter referred to as foils) were chosen with
regard to their similarities with the target voice in dialect, age, and other voice features such
as creakiness. For further details, see Sjöström (2005).

2.2 The identification tests


Four different earwitness identification tests were constructed for participants to listen to.
Each test began with the entire recording of The Princess and the Pea as the familiarization
voice, and was followed by a voice line-up of 45 stimuli. The 45 stimuli consisted of three
phrases selected from each recording presented three times for each speaker (3 x 3 x 5 = 45).
Each voice line-up contained the four foil voices and one of the target’s two dialect voices
(see Table 1). For example, the test ‘SC-ST’ uses the target’s Scanian voice as the
familiarization voice and the target’s Stockholm dialect voice in the line-up. Test SC-SC and
Test ST-ST were created as control tests. They afford investigation of whether the target’s
Stockholm and Scanian dialects can be recognized among the voices of the line-up, and to test
if the two different dialects are recognized to the same degree. Tests ST-SC and SC-ST
investigate if the target can be recognized even when a dialect shift between familiarization
and recognition occurs.
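
For illustration, the composition of one such line-up can be sketched as below. The speaker and
phrase labels are invented placeholders, but the 3 phrases x 3 repetitions x 5 speakers structure
follows the description above.

    # Sketch: building a randomized 45-stimulus voice line-up (4 foils + 1 target voice).
    import random

    speakers = ["Foil1", "Foil2", "Foil3", "Foil4", "TargetSC"]   # e.g. test SC-SC
    phrases = ["phrase1", "phrase2", "phrase3"]
    repetitions = 3

    lineup = [(spk, phr) for spk in speakers for phr in phrases] * repetitions
    assert len(lineup) == 45                  # 3 x 3 x 5 stimuli
    random.shuffle(lineup)                    # random presentation order
    print(lineup[:5])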
80 participants, ten in each listener test, took part in this study. All were native speakers of
Swedish and reported no known hearing impairment. Most of the listeners were students at
either Lund University or Umeå University, and all spoke a dialect from the southern or
northern part of Sweden.

Table 1. The composition of the voice identification tests showing which of the target’s
voices was used as familiarization voice and which voices were included in the voice line-up
for each of the four tests.
Test Familiarization voice Line-up voices
SC-SC TargetSC Foil 1-4 + TargetSC
ST-ST TargetST Foil 1-4 + TargetST
ST-SC TargetST Foil 1-4 + TargetSC
SC-ST TargetSC Foil 1-4 + TargetST

2.3 Data analysis


In this yes-no experimental design responses can be grouped into four different categories: hit
(when the listener correctly responds ‘yes’ to the target stimulus), miss (when the listener
responds ‘no’ to a target stimulus), false alarm (when the listener responds ‘yes’ to a non-
target stimulus) and correct rejection (when the listener correctly responds ‘no’ to a non-
target stimulus). By calculating the hit and false alarms rates as proportions of the maximum
possible number of hits and false alarms, the listeners’ discrimination sensitivity can be
determined, measured as d’. This measure is the difference between the hit rate (H) and the
false alarm rate (F), after first being transformed into z-values. The d’-equation is: d’ = z(H)-
z(F) (see Green & Swets, 1966).
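
A minimal sketch of this computation is given below. The adjustment of perfect hit or false-alarm
rates is a common convention assumed here; it is not something specified in the paper.

    # Sketch: discrimination sensitivity d' = z(H) - z(F) from yes/no line-up responses.
    from scipy.stats import norm

    def d_prime(hits, n_targets, false_alarms, n_nontargets):
        h = hits / n_targets                  # hit rate
        f = false_alarms / n_nontargets       # false alarm rate
        # Keep rates away from 0 and 1 so that z() stays finite (assumed correction)
        h = min(max(h, 1 / (2 * n_targets)), 1 - 1 / (2 * n_targets))
        f = min(max(f, 1 / (2 * n_nontargets)), 1 - 1 / (2 * n_nontargets))
        return norm.ppf(h) - norm.ppf(f)

    # Example: 9 target stimuli and 36 foil stimuli in a 45-item line-up
    print(d_prime(hits=8, n_targets=9, false_alarms=5, n_nontargets=36))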

3 Results and discussion


Participants of the control tests, SC-SC and ST-ST, show positive mean d’-values (1.87 and
1.93). It was shown through a two-tailed Student’s t-test that there was no significant
difference in identification of the two dialects and they can therefore be considered equally
recognizable (t(38)=-0.28, p>0.05). A one-sample t-test showed that the d’-values for both tests
are highly distinct from 0 (t(39)=18.45, p<0.001); a high degree of identification of both
dialects can therefore be concluded.
The responses for the dialect shifting tests, ST-SC (mean d’ = 0.44); SC-ST (mean d’ =
-0.07), did not significantly differ (t(38)=1.93, p>0.05). The target voice in these two tests can
be considered equally difficult to identify. A one-sample t-test was conducted and showed that
the mean d’-value of the two tests was not significantly separated from 0 (t(39)=1.36, p>0.05),
indicating random response. Combining the responses for the ‘control tests’ (ST-ST; SC-SC)
and the ‘dialect shifting tests’ (ST-SC; SC-ST) and comparing the results to each other
revealed a significant difference between the two test groups (t(78)=5.97, p<0.001) (see Fig.
1). Thus, dialect shift has a detrimental effect on speaker identification.

4 Conclusions
The results indicate that the attribute dialect is of high importance in the identification
process. It is clear that listeners find it much more difficult to identify the target voice when a
shift of dialect in the voice takes place. One possible reason for the results is that when
making judgments about a person’s identity, dialect as an attribute is strong and has a higher
priority than other features.
The baselining of the potential problem we have conducted here shows that a switch of
dialect can easily fool listeners. This undermines earwitness identification of dialect and
suggests that forensic practitioners who currently use dialect as a primary feature during
analysis need to reduce their reliance on this feature and be aware that they can easily be
misled.
Figure 1. Mean discrimination sensitivity (d’) and standard error for Control tests (SC-SC and
ST-ST combined) and Dialect shifting tests (ST-SC and SC-ST combined).

If used as a method of voice disguise, a perpetrator could use one native dialect at the time of
an offence and the other in the event of being forced to participate in a voice line-up as a
suspect. Needless to say, this method of voice disguise could have devastating effects on witness
accuracy: the witness would not be able to recognize the perpetrator’s voice when a different
dialect is used or, worse still, might make an incorrect identification and choose another person
whose dialect is more similar to the voice heard in the crime setting.
In order to assess whether voice disguise using imitated dialect can have as drastic an
impact upon speaker identification as voice disguise by switching between native dialects,
research using imitated dialect as a means of disguise is required.

Acknowledgements
Funded by a grant from the Bank of Sweden Tercentenary Foundation (Dnr K2002-1121:1-4)
to Umeå University for the project ‘Imitated voices: A research project with applications for
security and the law’.

References
Gibbons, J., 2003. Forensic Linguistics. Oxford: Blackwell Publishing.
Green, D.M. & J.A. Swets, 1966. Signal detection theory and psychophysics. New York: John
Wiley and sons, Inc.
Hollien, H., 2002. Forensic voice identification. San Diego: Academic Press.
Künzel, H.J., 2000. Effects of voice disguise on speaking fundamental frequency. Forensic
Linguistics 7, 1350-1771.
Markham, D., 1999. Listeners and disguised voices: the imitation and perception of dialectal
accent. Forensic Linguistics 6, 289-299.
Sjöström, M., 2005. Earwitness identification – Can a switch of dialect fool us? Masters
paper in Cognitive Science. Unpublished. Department of Philosophy and Linguistics, Umeå
University.
Working Papers 52 (2006), 117–120

Prosody and Grounding in Dialog


Gabriel Skantze, David House, and Jens Edlund
Department of Speech, Music and Hearing, KTH, Stockholm
{gabriel|davidh|edlund}@speech.kth.se

Abstract
In a previous study we demonstrated that subjects could use prosodic features (primarily peak
height and alignment) to make different interpretations of synthesized fragmentary grounding
utterances. In the present study we test the hypothesis that subjects also change their behavior
accordingly in a human-computer dialog setting. We report on an experiment in which
subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer
dialog in Swedish. The results show that two annotators were able to categorize the subjects’
responses based on pragmatic meaning. Moreover, the subjects’ response times differed
significantly, depending on the prosodic features of the grounding fragment spoken by the
system.

1 Introduction
Detecting and recovering from errors is an important issue for spoken dialog systems, and a
common technique for this is verification. However, verifications are often perceived as
tedious and unnatural when they are constructed as full propositions verifying the complete
user utterance. In contrast, humans often use fragmentary, elliptical constructions such as in
the following example: “Further ahead on the right I see a red building.” “Red?” (see e.g.
Clark, 1996).
In a previous experiment, the effects of prosodic features on the interpretation of such
fragmentary grounding utterances were investigated (Edlund et al., 2005). Using a listener test
paradigm, subjects were asked to listen to short dialog fragments in Swedish where the
computer replies after a user turn with a one-word verification, and to judge what was actually
intended by the computer by choosing between the paraphrases shown in Table 1.

Table 1. Prototype stimuli found in the previous experiment.


Position Height Paraphrase Class
Early Low Ok, red ACCEPT
Mid High Do you really mean red? CLARIFYUNDERSTANDING
Late High Did you say red? CLARIFYPERCEIVE

The results showed that an early, low F0 peak signals acceptance (display of understanding),
that a late, high peak is perceived as a request for clarification of what was said, and that a
mid, high peak is perceived as a request for clarification of the meaning of what was said. The
results are summarized in Table 1 and demonstrate the relationship between prosodic
realization and the three different readings. In the present study, we want to test the hypothesis
that users of spoken dialog systems not only perceive the differences in prosody of
synthesized fragmentary grounding utterances, and their associated pragmatic meaning, but
that they also change their behavior accordingly in a human-computer dialog setting.

2 Method
To test our hypothesis, an experiment was designed in which 10 subjects were given the task
of classifying colors in a dialog with a computer. They were told that the computer needed the
subject’s assistance to build a coherent model of the subject’s perception of colors, and that
this was done by having the subject choose among pairs of the colors green, red, blue and
yellow when shown various nuances of colors in-between (e.g. purple, turquoise, orange and
chartreuse). They were also told that the computer may sometimes be confused by the chosen
color or disagree. The experiment used a Wizard-of-Oz set-up: a person sitting in another
room – the Wizard – listened to the audio from a close talking microphone. The Wizard fed
the system the colors spoken by the subjects, as well as giving a go-ahead signal to the system
whenever a system response was appropriate. The subjects were informed about the Wizard
setup immediately after the experiment, but not before. A typical dialog is shown in Table 2.

Table 2. A typical dialog fragment from the experiment (translated from Swedish).
S1-1a [presents purple flanked by red and blue]
S1-1b what color is this
U1-1 red
S1-2 red (ACCEPT/CLARIFYUND/CLARIFYPERC) or
mm (ACKNOWLEDGE)
U1-2 mm
S1-3 okay
S2-1a [presents orange flanked by red and yellow]
S2-1b and this
U2-1 yellow perhaps
[…]

The Wizard had no control over what utterance the system would present next. Instead, this
was chosen by the system depending on the context, just as it would be in a system without a
Wizard. The grounding fragments (S1-2 in Table 2) came in four flavors: a repetition of the
color with one of the three intonations described in Table 1 (ACCEPT, CLARIFYUND or
CLARIFYPERC) or a simple acknowledgement consisting of a synthesized /m/ or /a/
(ACKNOWLEDGE) (Wallers et al., 2006). The system picked these at random so that for every
eight colors, each grounding fragment appeared twice.
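
The balanced random choice of grounding fragments can be sketched as follows. The function name is
invented, but the constraint that each of the four fragment types occurs twice per block of eight
colors follows the description above.

    # Sketch: choosing grounding fragments so that each type occurs twice per eight colors.
    import random

    FRAGMENT_TYPES = ["ACCEPT", "CLARIFYUND", "CLARIFYPERC", "ACKNOWLEDGE"]

    def fragment_schedule(n_blocks):
        schedule = []
        for _ in range(n_blocks):
            block = FRAGMENT_TYPES * 2        # each type twice per block of eight
            random.shuffle(block)
            schedule.extend(block)
        return schedule

    print(fragment_schedule(2))               # 16 trials, each type four times in total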
All system utterances were synthesized using the same voice as the experiment stimuli
(Filipsson & Bruce, 1997). Their prosody was hand-tuned before synthesis in order to raise
the subjects’ expectations of the computer’s conversational capabilities as much as possible.
Each of the non-stimuli responses was available in a number of varieties, and the system
picked from these at random. In general, the system was very responsive, with virtually no
delays caused by processing.

3 Results
The recorded conversations were automatically segmented into utterances based on the logged
timings of the system utterances. User utterances were then defined as the gaps in-between
these. Out of ten subjects, two did not respond at all to any of the grounding utterances. For
the other eight, responses were given in 243 out of 294 possible places. Since the object of our
analysis was the subjects’ responses, two subjects in their entirety and 51 silent responses
distributed over the remaining eight subjects were automatically excluded from analysis.
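
A sketch of this gap-based segmentation is given below, assuming the logged system utterances are
available as (start, end) times in seconds; all times are invented.

    # Sketch: user utterances defined as the gaps between logged system utterances.
    system_utterances = [(0.0, 2.1), (4.3, 5.0), (7.8, 9.2)]   # (start, end) in seconds

    user_utterances = []
    for (_, prev_end), (next_start, _) in zip(system_utterances, system_utterances[1:]):
        if next_start > prev_end:             # any gap is treated as a user turn
            user_utterances.append((prev_end, next_start))

    print(user_utterances)                    # [(2.1, 4.3), (5.0, 7.8)]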

User responses to fragmentary grounding utterances from the system were annotated with one
of the labels ACKNOWLEDGE, ACCEPT, CLARIFYUND or CLARIFYPERC, reflecting the
preceding utterance type.
In almost all cases subjects simply acknowledged the system utterance with a brief “yes” or
“mm” as the example U1-2 in Table 2. However, we felt that there were some differences in
the way these responses were realized. To find out whether these differences were dependent
on the preceding system utterance type, the user responses were cut out and labeled by two
annotators. To aid the annotation, three full paraphrases of the preceding system utterance,
according to Table 1, were recorded. The annotators could listen to each of the user responses
concatenated with the paraphrases, and select the resulting dialog fragment that sounded most
plausible, or decide that it was impossible to choose one of them. The result is a
categorization showing what system utterance the annotators found to be the most plausible to
precede the annotated subject response. The task is inherently difficult – sometimes the
necessary information simply is not present in the subjects’ responses – and the annotators
only agreed on a most plausible response in about 50% of the cases. The percentage of
preceding system utterance types for the classifications on which the annotators agreed is
shown in Figure 1.

Table 3. Average of subjects’ mean response times after grounding fragments.
Grounding fragment   Response time
ACCEPT               591 ms
CLARIFYUND           976 ms
CLARIFYPERC          634 ms

Figure 1. The percentage of preceding system utterance types for the classifications on which the
annotators agreed (y-axis: percentage of stimuli; x-axis: annotators’ selected paraphrase; bars
subdivided into Accept, ClarifyUnd and ClarifyPerc).

Figure 1 shows that responses to ACCEPT fragments are significantly more common in the
group of stimuli for which the annotators had agreed on the ACCEPT paraphrase. In the same
way, CLARIFYUND, and CLARIFYPERC responses are significantly overrepresented in their
respective classification groups (χ2=19.51; dF=4; p<0.001). This shows that the users’
responses are somehow affected by the prosody of the preceding fragmentary grounding
utterance, in line with our hypothesis.
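
The association between preceding fragment type and the annotators’ agreed classification can be
tested as sketched below. The 3x3 table is filled with invented counts; it is not the data
underlying the value reported above.

    # Sketch: chi-square test of fragment type vs. annotators' selected paraphrase.
    from scipy.stats import chi2_contingency

    # Rows: preceding fragment (ACCEPT, CLARIFYUND, CLARIFYPERC)
    # Columns: paraphrase the annotators agreed on (same three categories)
    counts = [[18,  4,  5],
              [ 3, 15,  6],
              [ 4,  5, 14]]                   # invented agreement counts

    chi2, p, df, expected = chi2_contingency(counts)
    print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")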
The annotators felt that the most important cue for their classifications was the user
response time after the paraphrase. For example, a long pause after the question “did you say
red?” sounds implausible, but not after “do you really mean red?”. To test whether the
response times were in fact affected by the type of preceding fragment, the time between the
end of each system grounding fragment and the user response (in the cases where there was a
user response) was automatically determined using /nailon/ (Edlund & Heldner, 2005), a
software package for extraction of prosodic and other features from speech. Silence/speech
detection in /nailon/ is based on a fairly simplistic threshold algorithm, and for our purposes, a
preset threshold based on the average background noise in the room where the experiment
took place was deemed sufficient. The results are shown in Table 3. The table shows that, just
in line with the annotators’ intuitions, ACCEPT fragments are followed by the shortest re-
sponse times, CLARIFYUND the longest, and CLARIFYPERC between these. The differences
are statistically significant (one-way within-subjects ANOVA; F=7.558; dF=2; p<0.05).
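
A much simplified sketch of threshold-based response-time estimation is given below. It is not
/nailon/'s implementation, only an illustration of detecting the first frame whose energy exceeds
a preset threshold after the end of the system fragment.

    # Sketch: response time as the first above-threshold frame after the fragment ends.
    import numpy as np

    def response_time(signal, sr, fragment_end_s, threshold, frame_s=0.01):
        frame_len = int(sr * frame_s)
        start = int(sr * fragment_end_s)
        for i in range(start, len(signal) - frame_len, frame_len):
            energy = float(np.mean(signal[i:i + frame_len] ** 2))   # frame energy
            if energy > threshold:                                  # speech onset
                return i / sr - fragment_end_s
        return None                                                 # no response found

    sr = 16000
    t = np.arange(2 * sr) / sr
    # Invented signal: low-level noise, then a 150 Hz "response" starting at 1.3 s
    sig = np.where(t > 1.3, 0.1 * np.sin(2 * np.pi * 150 * t), 0.001 * np.random.randn(t.size))
    print(response_time(sig, sr, fragment_end_s=0.7, threshold=1e-4))  # about 0.6 s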

4 Conclusions and discussion


In the present study, we have shown that users of spoken dialog systems not only perceive the
differences in prosody of synthesized fragmentary grounding utterances, and their associated
pragmatic meaning, but that they also change their behavior accordingly in a human-computer
dialog setting. The results show that two annotators were able to categorize the subjects’
responses based on pragmatic meaning. Moreover, the subjects’ response times differed
significantly, depending on the prosodic features of the grounding fragment spoken by the
system.
The response time differences found in the data are consistent with a cognitive load
perspective that could be applied to the fragment meanings ACCEPT, CLARIFYPERC and
CLARIFYUND. To simply acknowledge an acceptance should be the easiest, and it should be
nearly as easy, but not quite, for users to confirm what they have actually said. It should take
more time to reevaluate a decision and insist on the truth value of the utterance after
CLARIFYUND. This relationship is nicely reflected in the data.
Although we have not quantified other prosodic differences in the users’ responses, the
annotators felt that there were subtle differences in e.g. pitch range and intensity which may
function as signals of certainty following CLARIFYPERC and signals of insistence or
uncertainty following CLARIFYUND. More neutral, unmarked prosody seemed to follow
ACCEPT. When listening to the resulting dialogs as a whole, the impression is that of a natural
dialog flow with appropriate timing of responses, feedback and turntaking. To be able to
create spoken dialog systems capable of this kind of dialog flow, we must be able to both
produce and recognize fragmentary grounding utterances and their responses. Further work
using more complex fragments and more work on analyzing the prosody of user responses is
needed.

Acknowledgements
This research was supported by VINNOVA and the EU project CHIL (IP506909).

References
Clark, H.H., 1996. Using language. Cambridge: Cambridge University Press.
Edlund, J. & M. Heldner, 2005. Exploring Prosody in Interaction Control. Phonetica 62(2-4),
215-226.
Edlund, J., D. House & G. Skantze, 2005. The effects of prosodic features on the
interpretation of clarification ellipses. Proceedings of Interspeech 2005, Lisbon, 2389-2392.
Filipsson, M. & G. Bruce, 1997. LUKAS – a preliminary report on a new Swedish speech
synthesis. Working Papers 46, Department of Linguistics and Phonetics, Lund University.
Wallers, Å., J. Edlund & G. Skantze, 2006. The effect of prosodic features on the
interpretation of synthesised backchannels. Proceedings of Perception and Interactive
Technologies, Kloster Irsee, Germany.
Working Papers 52 (2006), 121–124

The Prosody of Public Speech – A Description of a Project
Eva Strangert1 and Thierry Deschamps2
1 Department of Comparative Literature and Scandinavian Languages, Umeå University
eva.strangert@nord.umu.se
2 Department of Philosophy and Linguistics, Umeå University
thierry.deschamps@ling.umu.se

Abstract
The project concerns prosodic aspects of public speech. A specific goal is to characterize
skilled speakers. To that end, acoustic analyses will be combined with subjective ratings of
speaker characteristics. The project has a bearing on how speech, and prosody in particular,
can be adjusted to the communicative situation, especially by speakers in possession of a rich
expressive repertoire.

1 Introduction
This paper presents a new project, the purpose of which is to identify prosodic features which
characterize public speech, both read and spontaneous. The purpose is moreover to reveal how
skilled public speakers use prosody to catch and keep the attention of their listeners, whether
it be to inform or argue with them. Combined with acoustic analyses of prosody, subjective
ratings of speakers will contribute to our knowledge of what characterizes a “good” or
“skilled” speaker. Thus, the project, though basically in the area of phonetics, has an
interdisciplinary character as it also addresses rhetoric issues.
The idea of approaching public speech has grown out of previous work in the field of
prosody including the recently completed project “Boundaries and groupings – the structuring
of speech in different communicative situations”, see Carlson et al. (2002) as well as studies
dealing specifically with the prosody of public speech, see below. Additional motivation for
the new project is the growing interest today in public speech, and rhetoric in particular.
The project should also be seen in the perspective of the significance given to the areas of
speaking style variation and expressive speech during the last decades. This research is
theoretically important, as it increases our knowledge of how human speech can be optimally
adjusted to the specific situation, and it contributes to learning about the limits of human
communicative capacity. Public speech offers a possibility to study speech that can be seen as
extreme in this respect. In politics and elsewhere when burning issues are at stake and where
often seriously committed individuals are involved, a rich expressive repertoire is made use
of. In this domain, prosody has a major role.

2 Background
Common to textbooks on rhetoric is their focus on aspects that do not concern the manner of
speaking, even though this is included in the concept of “rhetoric”. The emphasis is
rather on argumentation and planning of the speech act, the rhetoric process, as well as the
linguistic form; correctness, refinement, and clarity are demanded. The descriptions of how to
speak are considerably less detailed and very often even vague. The recommendations of
today are mostly similar to those given two thousand years ago; the voice of a skilled speaker
should be “smooth”, “flexible”, “firm”, “soft”, “clear” and “clean” (Johannesson, 1990/1998,
citing Quintilianus’ (ca AD 35-96) “Institutes of Oratory”).
As far as phonetically based investigations are concerned, Touati (1991) analyzed tonal and
temporal characteristics in the speech of French politicians. The analyses were undertaken
with a background in earlier studies of political rhetoric and, in addition, other types of speech
in public media in Sweden, see Bruce & Touati (1992). Other studies of public speech based
on Swedish include Strangert (1991; 1993) both dealing with professional news reading. A
study by Horne et al. (1995) concerned pausing and final lengthening in broadcasts on
stockmarket reports on Swedish Radio.
Analyses of interview speech made within the “Boundaries and groupings” project also
have relevance here. The purpose in this case was not to study public speech per se. However,
the results, in particular as concerns fluency and pausing (see e.g. Heldner & Megyesi, 2003;
Strangert 2004; Strangert & Carlson, 2006), may be assumed to reflect the fact that the speech
was produced by a very experienced speaker. A recent study with focus on “the skilled
professional speaker” (Strangert 2005) approaches problems sketched for the current project.
Braga & Marques (2004) focused on how prosodic features contribute to the listeners’
attention and interpretation of the message in political debate. The conception of a speaker as
“convincing”, “powerful” and “dedicated”, is assumed to be reflected in (combinations of)
prosodic features, or “maximes”. The study builds on the idea put forward by Gussenhoven
(2002) and developed further by Hirschberg (2002) of universal codes for how prosodic
information is produced by the speaker and perceived by the listener. Wichmann (2002) and
Mozziconacci (2002) belong to those dealing with the relations between prosody (f0 features
in particular) and what can be described as “affective functions”; a comprehensive survey of
expressive speech research can be found in Mozziconacci (2002).
Wichmann (2002) makes a distinction between “ways of saying” (properties or states
relating to the speaker) and “ways of behaving” (the speaker’s attitude to the listener). “Ways
of saying” includes first, how the speaker uses prosody in itself – stress and emphasis, tonal
features, speech rate, pausing etc. – and second, the emotional coloring of speech (e.g.
“happy”, “sad”, “angry”) as well as states such as “excited”, “powerful” etc. Examples of
“ways of behaving” are attitudes such as “arrogant” and “pleading”. In addition, the speaker
may use other argumentative and rhetorical means. All these functions of prosody make it a
complex, nuanced and powerful communicative tool.
To study the affective functions of prosody, auditive analyses must be combined with
acoustic measurements (see e.g. Mozziconacci, 2002). Also, listeners’ impressions have to be
categorized appropriately. A standard procedure is to have listeners judge samples of speech.
However, human speech very often conveys several states, attitudes and emotions at the same
time and this without doubt is true for the often quite elaborated speech produced in the public
domain. This complexity is examined in a study by Liscombe et al. (2003) through the use of
multiple and continuous scales for rating emotions. In their study, the subjective ratings are
also combined with acoustic analyses of prosodic features.

3 Work in progress
As a first step, we made a survey asking 22 students of logopedics at Umeå University what
kind of qualities they looked upon as important for a person regarded as a “good speaker”.
The students wrote down as many characteristics (in Swedish) as they could, guided only by
the definition of a good speaker as “A person who easily attracts listeners’ attention through
her/his way of speaking.”
On average, seven characteristics were given, ranging from 4 to 11. In addition to
personality/emotional and attitudinal/interaction features, the labels given also reflected
opinions about speech per se (articulation, voice characteristics and prosody), cf. Wichmann
(2002). Thus, even if both the personality and the attitudinal features are transferred to the
listeners through speech, the subjects did not refrain from having opinions about the speech
itself. Table 1 shows the distribution of labels after grouping into the three categories.
With this as a background we will proceed by having subjects judge short passages of
speech (spontaneous and prepared) for multiple speaker characteristics. These will include not
only positively valued qualities like those listed here; also other qualities, including more
negatively colored ones, need to be covered in an effort to characterize speaker behavior. We
are currently in the process of developing a test environment for this experiment. In this work
we lean on previous efforts (see Liscombe et al., 2003). Combined with acoustic analyses we
expect the multiple ratings to give insight into how different acoustic/prosodic features
contribute to the impression of skilled – and less skilled – speaking behavior.

Table 1. Characteristics of “a good speaker” grouped into three categories based on 22 subjects’
written responses (see text). Labels in Swedish with English translations.

Speaker characteristics                                            Number of responses
Speech features
tydlig artikulation clear articulation 7
god röststyrka, röstläge sufficient volume, voice level 6
icke-monoton röst non-monotonous voice 4
variation i röststyrka, röstläge variation of volume, voice level 3
rätt betoning, fokusering adequate prominence and focus 2
väl avvägd pausering well-adjusted pausing 2
bra taltempo, ej för snabbt well-adjusted tempo, not too fast 2
varierat taltempo varied speech tempo 1
talflyt fluency 1
varierad prosodi, uttrycksfullhet varied prosody, expressiveness 3
Personality features
inlevelse, entusiasm, engagemang involvement, enthusiasm, commitment 16
humor, lättsamhet sense of humour 12
karisma, utstrålning charisma, appeal 6
lugn, avslappnad stil calm, relaxed style of speaking 5
personlighet personality, individuality 4
positivt inställning positive attitude 3
ödmjukhet, självinsikt sense of humility 2
tydlighet distinctness, authority 2
självförtroende self-confidence 1
övertygelse conviction 1
Interaction features
förmåga att knyta an till lyssnarna ability to relate to audience 8
nivåanpassning relativt lyssnarna choosing the right communicative level 6
lyhördhet sensitivity 3
vilja till interaktion ability to interact with audience 3
utan överlägsenhet respectful, non-arrogant style 2

As the project in addition aims to characterize also other aspects of public speaking, a variety
of representative speech samples will be collected. In analyses of this material, fluency,
pausing, prominence, emphasis and voice characteristics will be central. Among the questions
we seek answers to are: What types of strategies are used for holding the floor? How does
speech perceived as fluent and disfluent respectively differ acoustically? How are prominence
and emphasis used in speech in media? What are the prosodic characteristics of agitation?
Answers to these questions, we believe, will add to our understanding of human
communicative capability and will also be useful in modeling speaking style variation.
Knowledge gained within the project may further be expected to be practically applicable.

Acknowledgements
This work was supported by The Swedish Research Council (VR).

References
Braga, D. & M.A. Marques, 2004. The pragmatics of prosodic features in the political debate.
Proc. Speech Prosody 2004, Nara, 321-324.
Bruce, G. & P. Touati, 1992. On the analysis of prosody in spontaneous speech with
exemplification from Swedish and French. Speech Communication 11, 453-458.
Carlson, R., B. Granström, M. Heldner, D. House, B. Megyesi, E. Strangert & M. Swerts,
2002. Boundaries and groupings – the structuring of speech in different communicative
situations: a description of the GROG project. TMH-QPSR 44, 65-68.
Gussenhoven, C., 2002. Intonation and interpretation: Phonetics and phonology. Proc. Speech
Prosody 2002, Aix-en-Provence, 11-13.
Heldner, M. & B. Megyesi, 2003. Exploring the prosody-syntax interface in conversations.
Proc. ICPhS 2003, Barcelona, 2501-2504.
Hirschberg, J., 2002. The pragmatics of intonational meaning. Proc. Speech Prosody 2002,
Aix-en-Provence, 65-68.
Horne, M., E. Strangert & M. Heldner, 1995. Prosodic boundary strength in Swedish: final
lengthening and silent interval duration. Proc. ICPhS 1995, Stockholm, 170-173.
Johannesson, K., 1998/1990. Retorik eller konsten att övertyga. Stockholm: Norstedts Förlag.
Liscombe, J., J. Venditti & J. Hirschberg, 2003. Classifying subject ratings of emotional
speech using acoustic features. Proc. Eurospeech 2003, Geneva, 725-728.
Mozziconacci, S., 2002. Prosody and emotions. Proc. Speech Prosody 2002, Aix-en-
Provence, 1-9.
Strangert, E., 1991. Phonetic characteristics of professional news reading. PERILUS XII.
Institute of Linguistics, University of Stockholm, 39-42.
Strangert, E., 1993. Speaking style and pausing. PHONUM 2. Reports from the Department of
Phonetics, University of Umeå, 121-137.
Strangert, E., 2004. Speech chunks in conversation: Syntactic and prosodic aspects. Proc.
Speech Prosody 2004, Nara, 305-308.
Strangert, E., 2005. Prosody in public speech: analyses of a news announcement and a
political interview. Proc. Interspeech 2005, Lisboa, 3401-3404.
Strangert, E., & R. Carlson. 2006. On modeling and synthesis of conversational speech. Proc.
Nordic Prosody IX, 2004, Lund, 255-264.
Touati, P., 1991. Temporal profiles and tonal configurations in French political speech.
Working Papers 38. Department of Linguistics, Lund University, 205-219.
Wichmann, A., 2002. Attitudinal intonation and the inferential process. Proc. Speech Prosody
2002, Aix-en-Provence, 11-22.
Working Papers 52 (2006), 125–128

Effects of Age on VOT: Categorical Perception of Swedish Stops by Near-native L2 Speakers
Katrin Stölten
Centre for Research on Bilingualism, Stockholm University
Katrin.Stoelten@biling.su.se

Abstract
This study is concerned with effects of age of onset of L2 acquisition on categorical percep-
tion of the voicing contrast in Swedish word initial stops. 41 L1 Spanish early and late
learners of L2 Swedish, who had carefully been screened for their ‘nativelike’ L2-proficiency,
as well as 15 native speakers of Swedish participated in the study. Three voicing continua
were created on the basis of naturally generated word pairs with /p t k b d g/ in initial
position. Identification tests revealed an overall age effect on category boundary placement in
the nativelike L2 speakers’ perception of the three voicing continua. Only a small minority of
the late L2 learners perceived the voicing contrast in a way comparable to native-speaker
categorization. Findings concerning the early learners suggest that most, but far from all,
early L2 speakers show nativelike behavior when their perception of the L2 is analyzed in
detail.

1 Introduction
From extensive research on infant perception it has become a well-known fact that children
during their first year of life tune in on the first language (L1) phonetic categories, leaving
them insensitive to contrasts not existing in their native language (e.g. Werker & Tees, 1984).
In a study by Ruben (1997) it was found that children who had suffered from otitis media
during their first year of life showed significantly less capacity for phonetic discrimination
compared to children with normal hearing during infancy when they were tested at the age of
nine years. Such findings do not only demonstrate the importance of early linguistic exposure,
they have also been interpreted as an indication for the existence of a critical period for
phonetic/phonological acquisition which may be over at the age of one year (Ruben, 1997).
In research on age effects on language acquisition, one classical issue is whether theories of a
critical period can be applied to second language (L2) acquisition. The question
is whether the capacity to acquire phonetic detail in L2 learning is weakened or lost due to
lack of verbal input during a limited time frame for phonetic sensitivity, or whether a
nativelike perception and an accent-free pronunciation is possible for any adult L2 learner.
The present study is part of an extensive project on early and late L2 learners of Swedish
with Spanish as their L1. The subjects have been selected on the criterion that they are
perceived by native listeners as mother-tongue speakers of Swedish in everyday oral
communication. Thereafter, the candidates’ nativelike L2 proficiency has been tested for
various linguistic skills. The present study focuses on the analysis of the nativelike subjects’
categorical perception of the voicing contrast in Swedish word initial stops.
Both Swedish and Spanish recognize a phonological distinction between voiced and
voiceless stops in terms of voice onset time (VOT) but they differ as to where on the VOT
continuum the stop categories separate. In contrast to languages like Swedish and English,
which treat short-lag stops as voiced and long-lag stops as voiceless, in Spanish short-lag
stops are recognized as voiceless, while stops with voicing lead are categorized as voiced (e.g.
Zampini & Green, 2001). Consequently, Spanish phoneme boundaries are perceptually
located at lower VOT values than in, for example, English (Abramson & Lisker, 1973).
Since language-specific category boundaries are established at a very early stage in
language development, a great amount of perceptual sensitivity is needed by a second language
learner in order to detect the categories present in the target language. The fact that L2
learners generally show difficulties in correctly perceiving and producing these language-
specific categories (e.g. Flege & Eefting, 1987) suggests that categorical perception may be
considered a good device for the analysis of nativelike subjects’ L2 proficiency.
For the present study the following research questions have been formulated:
(1) Is there a general age effect on categorical perception among apparently nativelike L2
speakers of Swedish?
(2) Are there late L2 learners who show category boundaries within the range of native-
speaker categorization?
(3) Do all (or most) early L2 learners show category boundaries within the range of native-
speaker categorization?

2 Method
2.1 Subjects
A total of 41 native speakers of Spanish (age 21-52 years), who had previously been screened
for their ‘nativelike’ proficiency of Swedish in three screening experiments (see Abrahamsson
& Hyltenstam, 2006), were chosen as subjects for the study. The participants’ age of onset
(AO) of L2 acquisition varied between 1 and 19 years and their mean length of residence in
Sweden was 24 years. The subjects had an educational level of no less than senior high school
and they all had acquired the variety of Swedish spoken in the greater Stockholm area.
The control subjects consisting of 15 native speakers of Swedish were carefully matched
with the experimental group regarding present age, sex, educational level and variety of
Swedish. All participants went through a hearing test (OSCILLA SM910 screening
audiometer) in order to ensure that none of the subjects suffered from any hearing impairment.

2.2 Stimuli
The speech stimuli were prepared on the basis of naturally generated productions of three
Swedish minimal word pairs: par /pɑːr/ (pair, couple) vs. bar /bɑːr/ ‘bar, bare, carried’, tal
/tɑːl/ ‘number, speech’ vs. dal /dɑːl/ ‘valley’, kal /kɑːl/ ‘naked, bald’ vs. gal /gɑːl/
‘crow(s), call(s)’. A female speaker of Stockholm Swedish with knowledge of phonetics was
recorded in an anechoic chamber while reading aloud the words in isolation. The speaker was
instructed to articulate the voiceless stops with an extended aspiration interval and the voiced
counterparts with a clear period of voicing prior to stop release. All readings were digitized at
22 kHz with a 16-bit resolution.
For all stop productions VOT was determined by measuring the time interval between the
release burst and the onset of voicing in the following vowel. Thereafter, the release bursts of
the voiceless stops were equalized to 5ms. The aspiration phase was then extended to +100ms
VOT by generating multiple copies of centre proportions of the voicing lag interval. The
stimuli for the perception test were created by shortening the aspiration interval in 5ms-steps.
Voicing lead was simulated by copying the prevoicing interval from the original production of
the corresponding voiced stop and placing it prior to the burst. The prevoicing maximum was
first put at -100ms and then varied in 5ms-steps. Finally, a set of 30 speech stimuli ranging
from +90ms to -60ms VOT for each stop continuum was considered appropriate for the study.
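
Roughly, the continuum construction can be sketched as below, assuming the burst, aspiration,
prevoicing and vowel portions have already been excised as sample arrays. The extension of the
aspiration interval by copying its centre portion is not shown, and all array contents are
placeholders.

    # Sketch: assembling a VOT continuum from pre-cut waveform pieces (numpy arrays).
    import numpy as np

    def make_stimulus(burst, aspiration, prevoicing, vowel, vot_ms, sr=22050):
        """Positive VOT keeps vot_ms of aspiration; negative VOT prepends prevoicing."""
        n = int(abs(vot_ms) / 1000 * sr)
        if vot_ms >= 0:
            return np.concatenate([burst, aspiration[:n], vowel])
        return np.concatenate([prevoicing[-n:], burst, vowel])

    sr = 22050
    burst = np.zeros(int(0.005 * sr))         # 5 ms burst placeholder
    aspiration = np.zeros(int(0.100 * sr))    # up to +100 ms of aspiration
    prevoicing = np.zeros(int(0.100 * sr))    # up to -100 ms of prevoicing
    vowel = np.zeros(int(0.300 * sr))         # rest of the word

    continuum = [make_stimulus(burst, aspiration, prevoicing, vowel, vot)
                 for vot in range(90, -65, -5)]   # 5 ms steps from +90 ms to -60 ms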

2.3 Testing procedure


A forced-choice identification task designed and run in E-Prime v1.0 (Schneider, Eschman &
Zuccolotto, 2002) was performed by each subject individually in a sound treated room. The
three voicing continua were tested separately. The speech stimuli were preceded by the carrier
phrase Nu hör du ‘Now you will hear’ and randomly presented through headphones (KOSS
KTX/PRO) one at a time. For each stimulus the listeners were told to decide whether they
heard the word containing a voiced or voiceless stop and confirm their answer by pressing a
corresponding button on the keyboard. The experimenter was a male native speaker of
Stockholm Swedish.

3 Results
Stop category boundaries (in ms VOT) were calculated for each subject and plotted against
their age of onset. Since category boundary locations vary with place of articulation (see, e.g.
Abramson & Lisker, 1973) the stop pairs were analyzed separately. Due to extreme VOT
values the results from one subject (AO 5) had to be discarded from further analysis.
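
The paper does not state how the crossover points were estimated; one common approach, sketched
below with invented response data, is to fit a logistic psychometric function to the proportion of
voiceless responses along the VOT continuum and take its 50% point as the category boundary.

    # Sketch: category boundary as the 50% point of a fitted logistic function.
    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(vot, boundary, slope):
        return 1.0 / (1.0 + np.exp(-slope * (vot - boundary)))

    vot_steps = np.arange(-60, 95, 5)                      # ms VOT, -60 ... +90
    # Invented proportion of 'voiceless' responses per continuum step
    p_voiceless = np.clip(
        logistic(vot_steps, 15.0, 0.25) + np.random.normal(0, 0.03, vot_steps.size), 0, 1)

    (boundary, slope), _ = curve_fit(logistic, vot_steps, p_voiceless, p0=[10.0, 0.2])
    print(f"estimated category boundary: {boundary:.1f} ms VOT")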

Figure 1. Age of onset (AO) in relation to mean category boundaries (in ms VOT) for the
bilabial, dental and velar stop continuum; the values at AO 0 represent category boundaries of
the 15 native speakers.

As can be seen in Figure 1, correlations between AO and perceived category boundary exist for
both the /p/-/b/ (r = -.468, p < .01) and the /t/-/d/ (r = -.340, p < .05) contrasts, whereas the
correlation for the /k/-/g/ contrast did not reach significance (r = -.291, p < .069).
In order to compare the nativelike candidates in a more systematic way, the subjects were
divided into a group of early (AO 1-11) and late learners (AO 13-19). Group comparisons
revealed that the control subjects change phoneme categories at the longest mean VOTs
(+7.23ms for /p/-/b/; +15.34ms for /t/-/d/; +24.62ms for /k/-/g/), while the late L2 listeners
show the shortest category crossover points (-17.57ms for /p/-/b/; +1.28ms for /t/-/d/;
+14.86ms for /k/-/g/). The group of early learners changes category boundaries at VOTs
somewhere in between those of the late learners and the controls (-2.93ms for /p/-/b/; +10.74ms for /t/-/d/; +20.17ms for /k/-/g/). An ANOVA confirmed that these group differences were highly significant for the bilabial (F(2,52) = 11.807, p < .001), the dental (F(2,52) = 7.847, p < .001) and the velar stop contrast (F(2,52) = 8.815, p < .001). Post-hoc comparisons (Fisher’s LSD) showed that, except for the comparison between the native speakers and the early learners in the case of the dental stop contrast, all remaining group differences were significant.
However, as can be seen in Figure 1, most of the nativelike candidates change categories at estimated VOTs within the range of native-speaker categorization. This holds for the bilabial (30 subjects), the dental (32 subjects) and the velar (29 subjects) voicing contrasts.

Whereas most of the early learners (21 out of 30) perceive category boundaries within the
range of native-speaker categorization for all three places of articulation, this applies to only
two of the ten late learners (AO 13 and 16). In the group of early learners nine subjects show
category boundaries within the range of native-speaker categorization for either one or two of
the Swedish minimal pairs. At the same time no early learner was found who exhibits non-
nativelike category crossover points for all three places of stop articulation. Finally, the
analysis of the group of late L2 learners shows that seven individuals change phoneme
category within the range of native-speaker categorization for either one or two of the three
places of articulation. In contrast, only one subject (AO 14) does not exhibit category
boundaries within the range of native-speaker categorization for any of the stops.

4 Summary and conclusions


The present study has shown that age of onset has an effect on apparently nativelike L2
speakers’ categorical perception of the voicing contrast in Swedish word initial stops. In
addition to negative correlations between AO and perceived category boundaries, significant
group differences were found. The late L2 learners change phoneme category at the shortest
crossover points, thereby deviating the most from the Swedish controls. In short, the data
confirm that there is a general age effect on categorical perception even among L2 speakers
who seem to have attained a nativelike L2 proficiency (Research Question 1).
Among the late L2 learners only two subjects (AO 13 and 16) change stop category within
the range of native-speaker categorization regarding all three places of articulation. Thus, only
a small minority of late, apparently nativelike L2 speakers show actual nativelike behavior
concerning the categorical perception of the voicing contrast (Research Question 2).
Most of the early L2 learners change category for the three stop continua at VOTs within
the range of native-speaker categorization. Moreover, no subject with an early AO was
identified who showed non-nativelike category boundaries for all three stop continua. Thus,
most, but far from all, early learners show nativelike behavior when their perception of the L2
is analyzed in detail (Research Question 3).

References
Abrahamsson, N. & K. Hyltenstam, 2006. Inlärningsålder och uppfattad inföddhet i
andraspråket – lyssnarexperiment med avancerade L2-talare av svenska. Nordisk tidskrift
for andrespråksforskning 1:1, 9-36.
Abramson, A. & L. Lisker, 1973. Voice-timing perception in Spanish word-initial stops.
Journal of Phonetics 1, 1-8.
Flege, J.E. & W. Eefting, 1987. Production and perception of English stops by native Spanish
speakers. Journal of Phonetics 15, 67-83.
Ruben, R.J., 1997. A time frame of critical/sensitive periods of language development. Acta
Otolaryngologica 117, 202-205.
Schneider, W., A. Eschman & A. Zuccolotto, 2002. E-Prime Reference Guide. Pittsburgh:
Psychology Software Tools, Inc.
Werker, J.F. & R.C. Tees, 1984. Cross-language speech perception: Evidence for perceptual
reorganization during the first year of life. Infant Behaviour and Development 7, 49-63.
Zampini, M.L. & K.P. Green, 2001. The voicing contrast in English and Spanish: the
relationship between perception and production. In J.L. Nicol (ed.), One Mind, Two
Languages. Oxford: Blackwell.

Stress, Accent and Vowel Durations in Finnish


Kari Suomi
Department of Finnish, Information Studies and Logopedics, Oulu University
kari.suomi@oulu.fi

Abstract
The paper summarises recent research on the interaction of prominence and vowel durations
in Finnish, a language with fixed initial stress and a quantity opposition in both vowels and
consonants; to be more accurate, the research has been conducted on Northern Finnish. It is
shown that, in one-foot words, there are four statistically distinct, non-contrastive duration
degrees for phonologically single vowels, and three such degrees for phonologically double
vowels. It is shown that the distributions of these duration degrees are crucially determined
by moraic structure. Also sentence accent has a moraic alignment, with a tonal rise occurring
on the word’s first mora and a fall on the second mora. It is argued that the durational
alternations are motivated by the particular way in which accent is realised.

1 Introduction
In Finnish, word stress is invariably associated with the initial syllable, and there is a binary
quantity opposition in both vowels and consonants, independent of stress, effectively signalled
by only durational differences. There are very good grounds for interpreting the quantity
oppositions syntagmatically, as distinctions between a single phoneme and a double phoneme,
i.e. a sequence of two identical phonemes (Karlsson, 1969). This interpretation is also
reflected in the orthography, and thus there are written words like taka, taaka, takka, taakka,
takaa, taakaa, takkaa, taakkaa. However, the orthography only indicates the contrastive,
phonemic quantity distinctions and, beyond this, it does not in any way reflect the actual
durations of phonetic segments. Thus, for example, neither the orthography nor a phonemic transcription in any way expresses the fact that, in e.g. the dialect discussed in this paper,
the second-syllable single vowel in taka has a duration that is almost twice as long as that in
taaka, takka and taakka. This paper summarises recent research on such non-contrastive
vowel duration alternations, and suggests their motivations. The paper only looks at vowel
durations, and only in words that consist of just one, primary-stressed foot, and thus the effect
of secondary stress on vowel durations, which has not been systematically examined, is
excluded.
As will be seen below, the mora is an important unit in Finnish prosody. The morae of a
syllable are counted as follows: the first vowel phoneme – the syllable nucleus – is the first
mora, and every phoneme segment following in the same syllable counts as an additional
mora. Below, reference will be made to a word’s morae, and e.g. the words taka, taaka and
taakka have the moraic structures CM1.CM2, CM1M2.CM3 and CM1M2M3.CM4, respectively
(where Mn refers to the word’s nth mora, and C is a non-moraic consonant).
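
As an illustration of this counting procedure, a small sketch in Python (treating orthographic letters as phoneme segments, a simplification that happens to work for the examples above):

VOWELS = set("aeiouyäö")  # Finnish vowel letters

def moraic_structure(word):
    # word is syllabified with '.'; onset consonants are non-moraic (C),
    # the first vowel of each syllable is a mora, and every following
    # segment in the same syllable adds one mora, counted across the word
    out, n = [], 0
    for syllable in word.split("."):
        labels, in_rhyme = [], False
        for seg in syllable:
            if not in_rhyme and seg not in VOWELS:
                labels.append("C")
            else:
                in_rhyme = True
                n += 1
                labels.append("M%d" % n)
        out.append("".join(labels))
    return ".".join(out)

# moraic_structure("ta.ka")   -> 'CM1.CM2'
# moraic_structure("taa.ka")  -> 'CM1M2.CM3'
# moraic_structure("taak.ka") -> 'CM1M2M3.CM4'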

2 Vowel duration patterns


Suomi & Ylitalo (2004) investigated segment durations in unaccented, trisyllabic nonsense
words that consist of one foot each. The segmental composition of the nonsense words was
fully counterbalanced. The word structures investigated were CV.CV.CV, CV.CVC.CV,
CV.CVV.CV, CV.CVV.CVV, CVC.CV.CV, CVC.CVC.CV, CVV.CV.CVV and
CVV.CVV.CVV, each represented by 18 different words. The words were spoken in the
frame sentence xyz, MINUN mielestäni xyz kirjoitetaan NÄIN (xyz, in MY opinion xyz is
written like THIS), where xyz represents the target word, the second occurrence of which was
measured. The five speakers were instructed to emphasise the capitalised words. Suomi &
Ylitalo only compared segment durations within the domain of the word’s first two morae
with those outside the domain, but the data have now been reanalysed in more detail. It turned
out that there are four statistically distinct, non-contrastive and complementary duration
degrees for single vowels, denoted as V(1) – V(4) in Table 1. The Table also shows the results
for three classes of double vowels (VV) with different moraic affiliations. The duration labels
given to the duration degrees are ad hoc.

Table 1. The mean durations (in ms) of the four duration degrees (DD) of phonologically
single vowels (V) and of three types of double vowels (VV) as observed in Suomi & Ylitalo
(2004) and in Suomi (in preparation); columns S & Y and S, respectively. In the column
Moraic status “M3+” means that the V is the word’s third or later mora, “M1.” that the V is M1
that is followed by a syllable boundary, “M1C” that the V is M1 that is followed by a
consonant in the same syllable, “CM2” that the V is M2 preceded by a consonant in the same
syllable, “M1M2” that the VV constitutes the sequence M1M2, “M2M3” that the VV constitutes
the sequence M2M3, and “M3+M” that the first segment in the VV sequence is M3+ or a later
mora. For further explanations see the text.
DD Duration label S&Y S Moraic status Example structures
V(1) “extra short” 48 75 M3+ CV.CV.CV, CVC.CV
V(2) “short” 58 104 M1. CV.CV(X)
V(3) “longish” 73 126 M1C CVC.CV(X)
V(4) “long” 84 158 CM2 CV.CV(C)
VV(1) “longish” + “longish” 149 - M1M2 CVV(X)
VV(2) “long” + “extra short” 142 - M2M3 CV.CVV
VV(3) “very long” 135 - M3+M CVC.CVV, CVV.CVV

Suomi (in preparation) measured durations in segmentally fully controlled, accented CV.CV
and CVC.CV nonsense words embedded in the frame sentence Sanonko ___ uudelleen?
(Shall I say ___ again?) and spoken by seven speakers. Suomi found the same four
statistically distinct duration degrees for phonologically single vowels, as reported in Table 1.
Three of the four single vowel duration degrees have been well documented earlier, e.g. by
Lehtonen (1970), but the existence of degree V(3) (“longish”) has not been previously
reported.
Below are the distributional rules of the observed duration degrees. The rules are to be
applied in the following manner: if a word contains a VV sequence, then an attempt to apply
the rule for VV duration should be made first. If this rule is not applicable, then the rule for V
should be applied to both members of the VV sequence (and of course to singleton V’s).

VV [very long] if the first V in the sequence constitutes M3+


V [extra short] if it constitutes M3+
[short] if it constitutes M1 that is not next to M2
[longish] if it occurs in the sequence M1M2
[long] if it constitutes M2 that is not next to M1
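
Read this way, the rules can be sketched as a small decision procedure (Python; ‘next to’ is interpreted here as ‘within the same syllable’, which is an interpretation of the wording above rather than a quotation from the paper):

def double_vowel_degree(first_mora_index):
    # a VV sequence is 'very long' only when its first member is M3 or later;
    # otherwise the single-vowel rule is applied to each member separately
    return "very long" if first_mora_index >= 3 else None

def single_vowel_degree(mora_index, tautosyllabic_m1m2):
    # mora_index: which mora of the word the vowel constitutes (1, 2, 3, ...)
    # tautosyllabic_m1m2: True if M1 and M2 belong to the same (heavy) initial syllable
    if mora_index >= 3:
        return "extra short"
    if tautosyllabic_m1m2:
        return "longish"          # a member of a tautosyllabic M1M2 sequence
    return "short" if mora_index == 1 else "long"

# e.g. CV.CV: M1 -> 'short', M2 -> 'long'; CVV.CV: both halves of VV -> 'longish',
# second-syllable V (M3) -> 'extra short'; CVC.CVV: VV starts at M3 -> 'very long'.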

As the rule for VV duration is formulated, it is only applicable to VV(3) and not to VV(1) or
VV(2). In these latter two cases, then, the rule for V duration has to be separately applied to
both segments in the sequence, and the correct durations are assigned. Thus VV is “very long”
in e.g. CVV.CVV.CVV and CVC.CVV, V is “extra short” in e.g. CVV.CV, CVC.CV and
CV.CV.CV, “short” in CV.CV(X), “longish” in CVC.CV and CVV.CV (both segments in VV
are “longish”), and “long” in CV.CV. In the structure CV.CVV, the first segment in the
second-syllable VV sequence (M2) is analysable as “long” and the second one (M3) as “extra
short”; the sum of these duration degrees is (84 ms + 48 ms =) 132 ms which is 10 ms less
than the observed duration for VV(2) (142 ms), but the difference was not significant. The
durational alternations under discussion of course entail complications to the realisation of the
phonemic quantity opposition, and in particular the durational difference in the second-
syllable vocalic segments in CV.CV and CV.CVV word structures is less than optimal.
Notice that the above rules explicitly refer to moraic structure only, and not e.g. to the
syllable. Notice further that M3+ is only referred to when the vowel is either “very long” or
“extra short”. These degrees represent the durations of double and single vowels in those
unstressed syllables in which nothing interferes with the realisation of the quantity opposition;
in these positions, the mean duration of double vowels is (135/48 =) 2.8 times that of single
vowels. But when a vowel constitutes M1, it can be either “short” or “longish”, and when it
constitutes M2, it can be either “longish” or “long”. This is because the durations of these
segments also signal prominence.

3 On the phonetic realisation of prominence


The distinction drawn by Ladd (1996, and elsewhere) between the association and the
alignment of prominence is very useful in Finnish. Primary word stress is unquestionably
phonologically associated with the word’s initial syllable, but its phonetic alignment with the
segmental material is more variable. Stress is signalled by greater segment duration, but not
necessarily on the stressed syllable only. Broadly speaking, stress is manifested as greater
duration of the segments that constitute M1 and M2, but exactly how the greater duration is
distributed depends on the structure of the initial syllable. If the initial syllable is light, i.e. in
(C)V.CV(C) words, the first-syllable vowel is “short” and the second-syllable vowel (M2) is
“long” (but both are longer than the third-syllable “extra short” vowel in (C)V.CV.CV(C)
words). But if the initial syllable is heavy, i.e. contains both M1 and M2, then both of these
segments are “longish” as in CVV.CV(C) words (and the second-syllable V is “extra short”).
As concerns sentence accent, it is normally realised as a tonal rise-fall that is also
moraically aligned: the rise is realised during the first mora, and (most of) the fall during the
second mora. Thus in (C)V.CV(C) words, the rise is realised during the first syllable and the
fall during the second one, whereas in words with a heavy initial syllable both the rise and the
fall are realised during the initial syllable. Strong (e.g. contrastive) accent involves a wider f0
movement than moderate accent, and it is also realised durationally, as an increase in the
durations of especially M1 and M2. But moderate accent is not realised durationally, i.e. the
unaccented and moderately accented versions of a word have equal durations.
In many languages, details of the tonal realisation of accent depend on the structure of the
accented syllable. Thus e.g. Arvaniti, Ladd & Mennen (1998) report that, in Greek, the slope
and duration of the (prenuclear) accentual tonal movement vary as a function of the structure
of the accented syllable. This is not so in Finnish. Instead, what has been observed repeatedly
is that, given a constant speech tempo and a given degree of accentuation, the rise-fall tune is
temporally and tonally uniform across different word and syllable structures (Suomi,
Toivanen & Ylitalo, 2003; Suomi, 2005; in press).

4 Motivating the durational alternations


Why are there so many non-contrastive vowel duration degrees in Finnish, alternations that
partly interfere with the optimal realisation of the quantity opposition? The answer seems to
be provided by the particular combination of prosodic properties in the language. Given the
uniformity of the accentual tune across different word structures, and given the moraic
alignment of the accentual tune, the durational alternations discussed above are necessary. If
the durational alternations did not exist but accent nevertheless had the moraic alignment that
it has, the uniformity of the accentual tune would not be possible. Why the tonal uniformity
exists is not clear, but there it is. It is somewhat paradoxical that, in a full-fledged quantity
language in which segment durations signal phonemic distinctions, segment durations
nevertheless also vary extensively to serve tonal purposes, while in non-quantity languages
like Greek the segmental composition of the accented syllable determines the tonal realisation.
The durational alternations are also observable in unaccented words. But this does not
undermine the motivation just suggested, because unaccented and moderately accented words
do not differ from each other durationally, and the alternations are directly motivated in
moderately accented words. Thus unaccented words are as if prepared for being accented. A
conceivable alternative would be that unaccented words would lack the alternations present in
accented words, but this state of affairs would further complicate the durational system.
To summarise, beyond the loci in which stress and accent are realised, i.e. when vowels do
not constitute M1 or M2, single vowels are “extra short” and double vowels “very long”,
which results in their clear separation. In (C)V.CV(X) words, the tonal rise is realised during
the initial syllable and it is sufficient that the vowel is “short”. The long fall is realised during
the second syllable, and therefore the vowel must be “long”. In (C)VV.CV(X) words, both the
rise and most of the fall is realised during the initial syllable, and therefore both segments in
the VV sequence must be “longish”. This paper is not about consonant durations but in
(C)VC.CV(X) words, in which M2 is a consonant, it too has to be “longish”; if the consonant
has relatively short intrinsic duration elsewhere, it is lengthened in this position. As a
consequence of these alternations, the accentual rise-fall can be uniform across different word
structures, and at the same time, the quantity oppositions are not jeopardised.

References
Arvaniti, A., D.R. Ladd & I. Mennen, 1998. Stability of tonal alignment: the case of Greek
prenuclear accents. Journal of Phonetics 26, 3-25.
Karlsson, F., 1969. Suomen yleiskielen segmentaalifoneemien paradigma. Virittäjä 73, 351-
362.
Ladd, D.R., 1996. Intonational phonology. Cambridge: Cambridge University Press.
Lehtonen, J. 1970. Aspects of quantity in standard Finnish. Jyväskylä: Jyväskylä University
Press.
Suomi, K., 2005. Temporal conspiracies for a tonal end: segmental durations and accentual f0
movement in a quantity language. Journal of Phonetics 33, 291-309.
Suomi, K., in press. On the tonal and temporal domains of accent in Finnish. Journal of
Phonetics.
Suomi, K., in preparation. Durational elasticity for accentual purposes (working title).
Suomi, K., J. Toivanen & R. Ylitalo, 2003. Durational and tonal correlates of accent in
Finnish. Journal of Phonetics 31, 113-138.
Suomi, K. & R. Ylitalo, 2004. On durational correlates of word stress in Finnish. Journal of
Phonetics 32, 35-63.

Phonological Demands vs. System Constraints in an L2 Setting

Bosse Thorén
Dept. of Linguistics, Stockholm University
bosse.thoren@skola.sundsvall.se

Abstract
How can system constraints and phonological output demands influence articulation in an L2 speaker? When measuring durations and articulator movements for some Swedish /VːC/ and /VCː/ words, pronounced by a Swedish and a Polish speaker, it appeared that phonological
vowel length was realized very similarly by both speakers, while complementary consonant
length was applied only by the native Swedish speaker. Furthermore, the tendency for
increased openness in short (lax) vowel allophones was manifested in analogous jaw and lip
movements in the Swedish speaker, but followed a different pattern in the Polish speaker.

1 Introduction
How is articulation influenced by system-based constraints and output-based constraints,
when a person uses a second language? According to the Hyper & Hypo speech theory
(Linblom 1990) the degree of articulatory effort in human speech is determined by mainly
two factors: 1) The limitations that inertia in the articulators poses upon speech, including the
tendency for economy in effort. 2) The demands of the listener, e.g. sufficient phonological
contrast. The former is assumed to result in unclear speech, or “under shoot”, and the latter to
“over shoot” or “perfect shoot” (clear speech). According to the H&H-theory, the output
demands vary depending on e.g. contextual predictability and the acoustic channel being
used, the presence of noise etc.
From a cross-linguistic point of view, the demands of a listener are to a high degree
determined by the phonologic system of the language in question. These demands are
supposed to be intuitively inherent in the native speaker of the language, i.e. the speaker has a
clear but probably unconscious picture of the articulatory goal. What happens to a L2-speaker
in this perspective? We can assume that the L2-speaker is influenced both by L1 and L2
demands on the output, as well as by system-based constraints.
Swedish has a quantity distinction in stressed syllables, manifested in most varieties as
either /Vː(C)/ or /VCː/. Elert (1964) has shown that the Swedish long-short phonological
distinction is accompanied by analogous differences in duration for the segments involved.
His study also shows that the differences in duration between long and short Swedish vowel
allophones are significantly greater (mean 35%) than durational differences between closed
and open vowels (5-15%). This predicts that output constraints for Swedish segment durations
would override the system constraints, i.e. the inherent differences in duration between open
and closed vowels.
Polish on the other hand, is a language without phonological quantity, and is not expected
to involve any output constraints on the duration of segments. Duration differences in Polish
are assumed to result mainly from vowel openness, in accordance with the “Extent of
Movement Hypothesis” (Fischer-Jörgensen, 1964). A native Polish speaker, who speaks
Swedish as a second language, is therefore expected to show more influence from the system
constraints in his/her Swedish production than a native Swedish speaker. In addition to the
longer inherent duration in open vowels, there is a clear connection between long/tense and
short/lax vowels, resulting in Swedish short vowel allophones being pronounced more open
than their long counterparts (cf. Fant 1959).
The present study examines what happens when a native speaker of Swedish and a native
speaker of Polish pronounce trisyllabic nonsense test words containing the following combinations: a long open vowel ([ɑː]), a short open vowel ([a]), a long closed vowel ([iː]) and a short closed vowel ([i]), in each case flanked by /p/.
In this study, the movements of mandible and lips are measured in addition to segment
durations, in order to compare the two speakers with regard to patterns of articulatory gestures
as results from output demands and system constraints respectively.
The question is: Will the duration of segments, produced in Swedish /VːC/ and /VCː/
contexts differ significantly when pronounced by a native Swedish speaker, and a native
Polish speaker? And will the timing and magnitude of lip- and jaw-movements differ in a
significant way between the two speakers, indicating more influence from output demands or
system constraints?

2 Method
Two adult male subjects, one native Swede and one native Pole, who had lived in Sweden for
22 years, were recorded while pronouncing the four nonsense test words described above, all of which are possible Swedish words according to Swedish phonotactics and
prosody. The Swedish speaker read the sequence of test words five times, and the Polish
speaker read it three times.
Measurements of lip and mandible movement as well as speech signal were carried out by
means of Move track: a magnetometer system with sender coils attached to the speaker’s lips
and lower incisors, and receiver coils placed in a light helmet on the speaker’s head. The
device measures variation in magnetic field that can be directly related to distance between
coils. The system produces data files with articulator movements synchronized with the
speech signal.

3 Results
3.1 Segment durations
The two speakers realized phonological vowel length in a similar way, making clear temporal
differences between long and short vowel allophones, as shown in Figure 1. The
complementary long consonant after short vowel in stressed syllables in Swedish is very clear in the native Swedish speaker, but non-existent in the native Polish speaker, whose consonant after the short vowel is in fact even shorter.
Differences in vowel duration ratios are illustrated in Figure 1b, where the differences in vowel length are seen as functions of phonological demands and system constraints respectively. Both speakers realize phonologically long vowels with more than double the duration of short vowels, the Polish speaker showing an even greater difference than the Swedish speaker. The Polish speaker also made a greater duration difference between /a/ and /i/ than the Swedish speaker did. This latter difference in duration ratios between speakers is significant (p < 0.05, ANOVA), whereas the inter-speaker difference in Vː/V ratios is not.

Figure 1a and 1b. a) Durations of long and short allophones produced by the Swedish (black
columns) and the Polish speaker (gray columns). Mean values from 10 realizations by the
Swedish speaker and 6 realizations by the Polish speaker. b) Inter-speaker differences for
long/short vowel ratios and open/closed vowel ratios (mean values).

3.2 Vowel durations and articulator movements


Two principal measures of articulator movements were taken to show possible differences
between the speakers; 1) vertical mandible displacement in relation to vowel openness and
phonological length, 2) vertical lower lip depression in relation to vowel openness and
phonological length. The pattern of jaw opening in the two speakers is shown in Figure 2a.
The Swedish speaker follows an expected pattern, where the jaw movement seems to reflect
vowel openness, with greater openness for /a/ than for /i/, but also more open articulation for
short allophones than for long allophones. The Polish speaker also shows greater jaw
lowering for /a/ than for /i/, but the smaller opening for short allophones compared to long allophones does not reflect the spectral vowel quality, i.e. the fact that, at least for /a/, the Polish speaker produces a higher F1 for [a] than for [ɑː]. Inspection via listening and spectral analysis shows that both speakers produce very similar F1 and F2 values.
The pattern of lip aperture, as shown in Figure 2b, follows roughly the pattern of jaw
lowering gestures, except for the Swedish speaker’s smaller lip aperture for short /i/ compared
to long /i/.

Figure 2a and 2b. Mandible (2a) and lower lip (2b) depression for long-short and open-
closed vowels, produced by the Swedish and the Polish speaker.

The timing pattern, in terms of lip aperture duration relative to vowel duration and the time lag
from vowel end to maximal lip closure, did not show any systematic differences between
speakers or vowel types.

4 Discussion
The segment duration patterns produced by the two speakers are not surprising. Starting with
vowel duration, the phonological vowel length is a well established and well known property
of Swedish, both as an important feature of Swedish pronunciation, and a way of accounting
for the double consonant spelling. As seen in Figure 1, both speakers realize long and short
vowel allophones quite similarly. The Swedish speaker, as shown in Figure 1, demonstrates in
addition a substantial prolonging of the /p/ segment after short vowel, which the Polish
speaker does not. The Polish speaker reports having encountered rules for vowel length as
well as consonant length while studying Swedish, implying that mere ignorance does not
account for his lack of complementary long consonant. Literature in phonetics, e.g. Ladefoged
& Maddieson (1996), gives the impression that phonological vowel length is utilized by a
greater number of the world’s languages than is consonant length. This suggests that
phonological consonant length is a universally more marked feature than is vowel length, and
hence more difficult to acquire.
The somewhat greater difference between long and short vowel allophones, demonstrated by the Polish speaker, can be interpreted as a compensation for the lack of complementary consonant length, which has been demonstrated to serve as a complementary cue for the listener when segment durations are in the borderland between /VːC/ and /VCː/ (Thorén 2005).
The between-speaker difference is not surprising, since the phonological quantity in
Swedish is a predominant phonetic feature, and can be expected to influence the temporal
organization of the native Swede’s speech from early age. The Polish speaker came to
Sweden as an adult and has acquired one important temporal feature, but his overall temporal
organization may still bear strong traces of the system constraints, concerning the duration of
segments.
The differences in lip and mandible movements between the speakers could be interpreted
as follows: Both speakers produce a higher F1 for short [a] than for long [ɑː] (e.g. Fant 1959), which typically correlates with a lower tongue and mandible. The Polish speaker, however, shows a clearly greater jaw and lip opening for long [ɑː] than for short [a], which suggests that the Polish speaker has a compensatory tongue height in [ɑː] to maintain correct spectral quality. The greater mandible excursion in [ɑː] cannot be the result of an articulatory goal for this vowel, but could possibly be interpreted as an inverse “Extent of Movement Hypothesis” (Fischer-Jörgensen 1964), letting the mandible make a greater excursion owing to the opportunity offered by the long duration of the [ɑː].

References
Elert, C-C., 1964. Phonologic studies of Swedish Quantity. Uppsala: Almqvist & Wiksell.
Fant, G., 1959. Acoustic analysis and synthesis of speech with application to Swedish.
Ericsson Technics. No. 15, 3-108.
Fischer-Jörgensen, E., 1964. Sound Duration and Place of articulation. Zeitschrift für
Sprachwissenschaft und Kommunikationsforschung 17, 175-207.
Ladefoged, P. & I. Maddieson, 1996. The sounds of the World’s Languages. Oxford:
Blackwell publishers.
Lindblom, B., 1990. Explaining phonetic variation: a sketch of the H&H theory. In Hardcastle
& Marchal (eds.), Speech production and speech modeling. Dordrecht: Kluwer, 403-439.
Thorén, B., 2005. The postvocalic consonant as a complementary cue to the quantity
distinction in Swedish – a revisit. Proceedings from FONETIK 2005, Göteborg University,
115-118.

Cross-modal Interactions in Visual as Opposed to Auditory Perception of Vowels

Hartmut Traunmüller
Department of Linguistics, Stockholm University
hartmut@ling.su.se

Abstract
This paper describes two perception experiments with vowels in monosyllabic utterances presented
auditorily, visually and bimodally with incongruent cues to openness and/or roundedness. In the first,
the subjects had to tell what they heard; in the second what they saw. The results show that the same
stimuli evoke a visual percept that may be influenced by audition and may be different from the
auditory percept that may be influenced by vision. In both cases, the strength of the influence of the
unattended modality showed between-feature variation reflecting the reliability of the information.

1 Introduction
Nearly all research on cross-modal interactions in speech perception has been focused on the
influence an optic signal may have on auditory perception. In modeling audiovisual
integration, it is common to assume three functional components: (1) auditory analysis, (2)
visual analysis and (3) audiovisual integration that is assumed to produce an ‘amodal’
phonetic output. Although details differ (Massaro, 1996; Robert-Ribes et al., 1996; Massaro &
Stork, 1998), the output was commonly identified with what the subjects heard, not having
been asked about what they saw. This experimenter behavior suggests the amodal
representations of phonetic units (concepts), which can be assumed to exist in the minds of
people, to be closely associated with auditory perception. The seen remains outside the scope
of these models unless it agrees with the heard.
The present experiments were done in order to answer the question of whether a visual
percept that may be influenced by audition can be distinguished from the auditory percept that
may be influenced by vision and whether the strength of such an influence is feature-specific.
Previous investigations (Robert-Ribes et al., 1998; Traunmüller & Öhrström, in press)
demonstrated such feature-specificity in the influence of optic information on the auditory
perception of vowels: the influence was strongest for roundedness, for which the non-attended
visual modality offered more reliable cues than the attended auditory modality. In analogy, we
could expect a much stronger influence of non-attended acoustic information on the visual
perception of vowel height or “openness” as compared with roundedness.

2 Method
2.1 Speakers and speech material
For the two experiments performed, a subset of the video recordings made for a previous
experiment (Traunmüller & Öhrström, in press) was used. It consisted of the 6 incongruent
auditory-visual combinations of the nonsense syllables /ɡiːɡ/, /ɡyːɡ/ and /ɡeːɡ/ produced by
each one of 2 male and 2 female speakers of Swedish. Synchronization had been based on the
release burst of the first consonant. In Exp. 1, each auditory stimulus was also presented alone
and in Exp. 2 each visual stimulus instead.

2.2 Perceivers
14 subjects believed to have normal hearing and vision (6 male, aged 20 to 60 years, and 8
female, aged 20 to 59 years) served as perceivers. All were native speakers of Swedish
pursuing studies or research at the Department of Linguistics.

2.3 Procedure
The subjects wore headphones AKG K25 and were seated with their faces at an arm’s length
from a computer screen. Each one of the 9 stimuli per speaker was presented twice in random
order, using Windows Media Player. The height of the faces, shown in the left half of the
screen, was roughly 12 cm. In Exp. 1, the subjects were instructed to look at the speaker when
shown, but to tell which vowel they heard. In Exp. 2, they were instructed to keep the
headphones on, but to tell which vowel they saw. Stimulus presentation was controlled
individually by the subjects, who were allowed to repeat each stimulus as often as they
wished. They gave their responses by clicking on orthographic symbols of the 9 Swedish
vowels arranged in the right half of the screen in the manner of an IPA chart. They were told to
expect a rather skewed vowel distribution. Prior to each experiment proper, three stimuli were
presented for familiarization. Together, the two experiments lasted roughly 30 minutes.

3 Results
The pooled results of each one of the two experiments are shown in Tables 1 and 2. It can be
noticed that in Exp. 1, where subjects had to report the vowels they heard, openness was
almost always perceived in agreement with the speaker’s intention (99.2%) even when
conflicting optic information was presented. Roundedness was perceived much less reliably:
14.7% errors when no face was presented. Many of these errors were evoked by one particular
speaker. In cases of incongruent information, roundedness was predominantly perceived in
agreement with the optic rather than the acoustic stimulus.
The picture that emerged from Exp. 2, where subjects had to report the vowels they saw by
lipreading, was the reverse: Presence vs. absence of roundedness was perceived correctly to
98.4% and the rounded vowels were only in 5.4% of the cases mistaken for inrounded
(labialized), while openness was perceived quite unreliably, with 28.3% errors when no
acoustic signal was presented. (One of the speakers elicited substantially fewer errors.) In
cases of incongruent information, openness was often perceived in agreement with the
acoustic rather than the optic stimulus, but the cross-modal influence was not as strong in
lipreading (Exp. 2) as in listening (Exp. 1). This can be seen more immediately in Table 3,
which shows the result of linear regression analyses in which the numerical mean of the
perceived openness and roundedness of each stimulus were taken as the dependent variables.
While the overall close to perfect performance in auditory perception of openness and in
visual perception of roundedness by all speakers does not allow any possible between-
perceiver differences to show up, such differences emerged, not unexpectedly, in auditory
roundedness perception and in visual openness perception (see Table 4). In auditory
roundedness perception, the case-wise reliance on vision varied between 31% and 100%. In
visual openness perception, the case-wise reliance on audition varied between 28% and 97%.
Despite the similarity in range, there was no significant correlation between these two
variables (r2=0.04, p=0.5), nor was there any significant gender difference in visual perception
(p>0.4). This means that the between-subject variation cannot be explained as due to a subject
specific (or gender specific) general disposition towards cross-modal integration. In auditory
perception, women showed a greater influence of vision, but the gender difference failed to
attain significance (two-tailed t-test, equal variance not assumed: p=0.15), and age was never
a significant factor.

Table 1. Confusion matrix for auditory perception. Stimuli: intended vowels presented
acoustically and optically. Responses: perceived vowels (letters). ; incorrect
openness; roundedness incorrect but agrees with optic stimulus. Boldface: majority response.
Stimuli Responses
Sound Face i y u o e ö å ä a *
i 9 1
y 17 1
e 23 1
i e 3
i y 85
y i 85 1
y e 84 6
e i 8
e y 1 73

Table 2. Confusion matrix for visual perception. As in Table 1, but incorrect roundedness;
openness incorrect but agrees with acoustic stimulus.
Stimuli Responses
Sound Face i y u o e ö å ä a *
i 54 2
y 1 7 3 28
e 9 2
e i 91 4 1
y i 2 31 1 1
i y 1 5 11
e y 3 1 46
i e 65 1
y e 52 2 1

Table 3. Weights of auditory and visual cues in the perception of openness and roundedness.
Auditory cues Visual cues
Heard openness 1.00 0.03
Heard roundedness 0.28 0.68
Seen openness 0.45 0.52
Seen roundedness 0.00 0.97

Table 4. Auditory roundedness and visual openness perception by subject (age in years, sex).
Percentage of responses in agreement with the acoustic (Aud) or optic (Vis) stimulus in cases
of incongruent information (n=32). Incorrect but agrees with the unattended modality.
Subject Age 34 27 30 20 41 34 21 23 60 27 20 23 25 59
Sex m m f m m f f f m f m f f f
Roundedness Aud
of vowels heard Vis 31 41 50 59 63 66 75 78 91 94 94 97 97 100
Openness Aud 59 47 28 34 72 69 97 69 63 72 50 75 28 59
of vowels seen Vis

The patterns of confusions can be modelled by weighted summation of the response probabilities for each vowel in the attended modality [listening (A), lipreading (V)] and a Bayesian auditory-visual integration (AV). For the pooled data, linear regression on this basis gives response probabilities P and determination coefficients r2 as follows:
Pheard = 0.01 + 0.26 A + 0.71 AV (r2 = 0.98) and Pseen = −0.00 + 0.57 V + 0.45 AV (r2 = 0.94).
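
A sketch of this model in Python (the coefficients are simply those reported above; the normalisation inside the multiplicative ‘Bayesian’ AV term is an assumption about how that term was computed):

import numpy as np

def bayesian_av(p_auditory, p_visual):
    # multiplicative integration of the unimodal response-probability vectors
    # over the same set of response vowels, renormalised to sum to one
    joint = np.asarray(p_auditory) * np.asarray(p_visual)
    return joint / joint.sum()

def predicted_responses(p_attended, p_av, w_attended, w_av, intercept):
    # weighted summation of the attended unimodal term and the integrated AV term
    return intercept + w_attended * np.asarray(p_attended) + w_av * np.asarray(p_av)

# 'heard' model: predicted_responses(A, bayesian_av(A, V), 0.26, 0.71, 0.01)
# 'seen'  model: predicted_responses(V, bayesian_av(A, V), 0.57, 0.45, -0.00)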

4 Discussion
As for auditory perception with and without conflicting visual cues and for visual perception
alone (lipreading), the patterns of confusion observed here agree closely with those obtained
previously (Traunmüller & Öhrström). Now, the novel results obtained in visual perception
with conflicting auditory cues demonstrate that a visual percept that may be influenced by
audition has to be distinguished from the auditory percept that may be influenced by vision
and that the strength of the cross-modal influence is feature-specific in each case.
Based on confusion patterns in consonant perception, it has been claimed that humans
behave in accordance with Bayes’ theorem (Massaro & Stork, 1998), which allows predicting
bimodal response probabilities by multiplicative integration of the unimodal probabilities.
Although some of our subjects behaved in agreement with this hypothesis in reporting what
they heard, the behaviour of most subjects refutes the general validity of this claim, since it
shows a substantial additive influence of the auditory sensation. When reporting what they
saw, all subjects except one showed a substantial additive influence of the visual sensation.
Given the unimodal data included in Tables 1 and 2, Bayesian integration lends prominence
to audition in the perception of openness and to vision in roundedness. The data make it clear
that an ideal perceiver should rely on audition in the perception of openness, as all subjects
did in their auditory judgments, and combine this with the roundedness sensed by vision, since
this is more reliable when the speaker’s face is clearly visible. Four female and two male
subjects behaved in this way to more than 90% in reporting what they heard but only one
other female subject in reporting what she saw.
The results can be understood as reflecting a weighted summation of sensory cues for
features such as openness and roundedness, whereby the weight attached reflects the feature-
specific reliability of the information received by each sensory modality (cf. Table 3). The
between-perceiver variation then reflects differences in the estimation of this reliability.

Acknowledgements
This investigation has been supported by grant 2004-2345 from the Swedish Research
Council. I am grateful to Niklas Öhrström for the recordings and for discussion of the text.

References
Massaro, D., 1996. Bimodal speech perception: a progress report. In D.G. Stork &
M.E.Hennecke (eds.), Speechreading by Humans and Machines. Berlin: Springer, 80-101.
Massaro, D.W. & D.G. Stork, 1998. Speech recognition and sensory integration. American
Scientist 86, 236-244.
Robert-Ribes, J., M. Piquemal, J-L. Schwartz & P. Escudier, 1996. Exploiting sensor fusion
architectures and stimuli complementarity in AV speech recognition. In D.G. Stork &
M.E.Hennecke (eds.), Speechreading by Humans and Machines. Berlin: Springer, 193-210.
Robert-Ribes, J., J-L. Schwartz, T. Lallouache & P. Escudier, 1998. Complementarity and
synergy in bimodal speech: Auditory, visual and audio-visual identification of French oral
vowels in noise. Journal of the Acoustical Society of America 103, 3677-3689.
Traunmüller, H. & N. Öhrström, in press. Audiovisual perception of openness and lip
rounding in front vowels. Journal of Phonetics.

Knowledge-light Letter-to-Sound Conversion for Swedish with FST and TBL

Marcus Uneson
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
marcus.uneson@ling.lu.se

Abstract
This paper describes some exploratory attempts to apply a combination of finite state
transducers (FST) and transformation-based learning (TBL, Brill 1992) to the problem of
letter-to-sound (LTS) conversion for Swedish. Following Bouma (2000) for Dutch, we employ
FST for segmentation of the textual input into groups of letters and a first transcription stage;
we feed the output of this step into a TBL system. With this setup, we reach 96.2% correctly
transcribed segments with rather restricted means (a small set of hand-crafted rules for the
FST stage; a set of 12 templates and a training set of 30kw for the TBL stage).
Observing that quantity is the major error source and that compound morpheme
boundaries can be useful for inferring quantity, we exploratively add good precision-low
recall compound splitting based on graphotactic constraints. With this simple-minded
method, targeting only a subset of the compounds, performance improves to 96.9%.

1 Introduction
A text-to-speech (TTS) system which takes unrestricted text as input will need some strategy
for assigning pronunciations to unknown words, typically achieved by a set of letter-to-sound
(LTS) rules. Such rules may also help in reducing lexicon size, permitting the deletion of
entries whose pronunciation can be correctly predicted from rules alone. Outside the TTS
domain, LTS rules may be employed for instance in spelling correction, and automatically
induced rules may be interesting for reading research.
Building LTS rules by hand from scratch is easy for some languages (e.g., Finnish,
Turkish), but turns out prohibitively laborious in most cases. Data-driven methods include
artificial neural networks, decision trees, finite-state methods, hidden Markov models,
transformation-based learning and analogy-based reasoning (sometimes in combination).
Attempts at fully automatic, data-driven LTS for Swedish include Frid (2003), who reaches
96.9 % correct transcriptions on segment level with a 42000-node decision tree.

2 The present study


The present study tries a knowledge-light approach to LTS conversion, first applied by Bouma
(2000) on Dutch, which combines a manually specified segmentation step (by finite-state
transducers, FST) and an error-driven machine learning technique (transformation-based
learning, TBL). One might think of the first step as redefining the alphabet size, by
introducing new, combined letters, and the second as automatic induction of reading rules on
that (redefined) alphabet, ordered in sequence of relevance.
For training and evaluation, we used disjoint subsets of a fully morphologically expanded
form of Hedelin et al. (1987). The expanded lexicon holds about 770k words (including
proper nouns; these and other words containing characters outside the Swedish alphabet in
lowercase were discarded).

2.1 Finite-state transduction (FST)


Many NLP tasks can be cast as string transformation problems, often conveniently attacked
with context-sensitive rewrite rules (which can be compiled directly into FST). Here, we first
use an FST to segment input into segments or letter groups, rather than individual letters. A
segment typically corresponds to a single sound (and may have one member only). Treating a
sequence of letters as a group is in principle meaningful whenever doing so leads to more
predictable behaviour. Clearly, however, there is an upper limit on the number of groups, if
the method should justifiably be called ‘knowledge-light’. For Swedish, some segments close
at hand are {[s,c,h], [s,s], [s,j], [s,h], [c,k], [k], [k,j]…}; the set used in the experiments
described here has about 75 members.
Segmentation is performed on a leftmost, longest basis, i.e., that rule is chosen which
results in as early a match as possible, the longest possible one if there are several candidates.
All following processing now deals with segments rather than individual letters.
After segmentation, markers for begin- and end-of-word are added, and the (currently
around 30) hand-written replace rules are applied, again expressed as transducers or
compositions of transducers. These context-sensitive replace rules may encode well-known
reading rules (in the case of Swedish, for instance ‘<k> is pronounced /ɕ/ in front of
<e,i,y,ä,ö> morpheme-initially’), or try to capture other partial regularities (Olsson 1998).
Most rules deal with vowel quantity and/or the <o> grapheme, reflecting typical difficulties in
Swedish orthography. The replacement transducer is implemented such that each segment can
be transduced at most once. A set (currently around 60) of context-less, catch-all rules
provide default mappings. To illustrate the FST steps, consider the word skärning ‘cut’ after
each transduction:

input: skärning
segment: sk-ä-r-n-i-ng
marker: #-sk-ä-r-n-i-ng-#
transduce: #-S+<:+r-n-I-N+#
remove marker: S<:rnIN
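
A greedy approximation of the leftmost-longest segmentation step might look as follows (Python; the inventory below is a small illustrative subset, not the roughly 75-member set actually used, with 'sk' and 'ng' included so that the skärning example comes out as above):

SEGMENTS = {"sch", "ss", "sj", "sh", "ck", "kj", "tj", "sk", "ng"}

def segment(word, segments=SEGMENTS, max_len=3):
    # scan left to right; at each position take the longest listed group
    # starting there, otherwise the single letter is its own segment
    out, i = [], 0
    while i < len(word):
        for n in range(max_len, 1, -1):
            group = word[i:i + n]
            if len(group) == n and group in segments:
                out.append(group)
                i += n
                break
        else:
            out.append(word[i])
            i += 1
    return out

# segment("skärning") -> ['sk', 'ä', 'r', 'n', 'i', 'ng']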

2.2 Transformation-based learning (TBL)


TBL was first proposed for part-of-speech tagging by Eric Brill (1992). TBL is, generally
speaking, a technique for automatic learning of human-readable classification rules. It is
especially suited for tasks where the classification of one element depends on properties or
features of a small number of other elements in the data, typically the few closest neighbours
in a sequence. In contrast to the opaque problem representation in stochastic approaches, such
as HMMs, the result of TBL training is a human-readable, ordered list of rules. Application of
the rules to new material can again be implemented as FSTs and thus be very fast.
For the present task, we employed the µ-TBL system (Lager 1999). It provides an interface
for scripting as well as an interactive environment, and Brill’s original algorithm is
supplemented by much faster Monte Carlo rule sampling. The templates were taken from Brill
(1992), omitting disjunctive contexts (e.g., “A goes to B when C is either 1 or 2 before”),
which are less relevant to LTS conversion than to POS tagging.
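
For illustration, applying an ordered list of learnt rules of the simplest template shape (‘change tag A to B when the tag at relative position p is C’) could be sketched as follows (Python; the rule format and the toy example are assumptions, not µ-TBL's actual interface):

def apply_tbl_rules(tags, rules):
    # rules: ordered list of (from_tag, to_tag, pos, context_tag) tuples;
    # for each rule, all matches are collected first and then rewritten,
    # so a rule does not feed itself within its own pass (conventions differ)
    tags = list(tags)
    for frm, to, pos, ctx in rules:
        hits = [i for i, t in enumerate(tags)
                if t == frm and 0 <= i + pos < len(tags) and tags[i + pos] == ctx]
        for i in hits:
            tags[i] = to
    return tags

# toy example: apply_tbl_rules(list("berg"), [("e", "E", 1, "r")]) -> ['b', 'E', 'r', 'g']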

2.3 Compound segmentation (CS)


The most important error source by far is incorrectly inferred quantity. In contrast to Dutch,
for which Bouma reports 99% with the two steps above (and a generally larger setup, with
500 TBL templates), quantity is not explicitly marked in Swedish orthography. One might
suspect that this kind of errors might be remedied if compounds and their morpheme
boundaries could be identified in a preprocessing step. Many rules are applicable in the
beginning or end of morphemes rather than words; we could provide context for more rules if
only we knew where the morpheme boundaries are. Compound segmentation (CS) could also
help in many difficult cases where the suffix of one component happens to form a letter group
when combined with the prefix of the following, as in <matjord>, <polishund>, <bokjägare>.
Ideally, segments should not span morpheme boundaries: <sch> should be treated as a
segment in <kvälls|schottis> but not in <kvälls|choklad>.
In order to explore this idea while still minimizing dependencies on lexical properties, we
implemented a simple compound splitter based on graphotactic constraints. An elaborate
variant of such a non-lexicalized method for Swedish was suggested by Brodda (1979). He
describes a six-level hierarchy for consonant clusters according to how much information they
provide about a possible segmentation point, from certainty (as -rkkl- in <kyrkklocka>
‘church bell’) to none at all (as -gr- in <vägren> ‘verge (road)’). For the purposes of this
study, we targeted the safe cases only (on the order of 30-40% of all compounds). Thus, recall
is poor but precision good, which at least should be enough to test the hypothesis.

3 Results
3.1 Evaluation measure
The most common LTS evaluation measure is Levenshtein distance between output string and
target. For the practical reason of convenient error analysis and comparability with Frid
(2003) we follow this, but we note that the measure has severe deficiencies. Thus, all errors
are equally important – exchanging [e] for [ɛ] is considered just as bad as exchanging [t] for
[a]. Furthermore, different lexica have different levels of granularity in their transcriptions,
leading to rather arbitrary ideas about what ‘right’ is supposed to mean. For future work, some
phonetically motivated distance measure, such as the one suggested by Kondrak (2000),
seems a necessary supplement.
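
For reference, the measure itself is the standard dynamic-programming edit distance over segment strings; a sketch in Python, with every substitution costing 1, which is exactly the weakness discussed above:

def levenshtein(a, b):
    # plain edit distance between two transcriptions (lists of segments)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

# e.g. levenshtein(list("S<:rnIN"), list("S<:rniN")) == 1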

Table 1. Results and number of rules for combinations of CS, FST, and TBL. 5-fold cross-
validation. Monte Carlo rule sampling. Score threshold (stopping criterion) = 2. The baselines
(omitting TBL) are 80.1% (default mappings); 86.6% (FST step only); 88.3% (CS + FST).
Training data TBL FST + TBL CS + FST + TBL
segments words results % #rules results % #rules results % #rules
49k 5k 93.8 820 94.9 503 95.5 513
98k 10k 94.1 1131 95.0 761 95.7 809
198k 20k 95.2 1690 95.7 1275 96.5 1250
300k 30k 95.7 2225 96.2 1862 96.9 1756

3.2 Discussion
Some results are given in Table 1. In short, both with and without the TBL steps, adding
handwritten rules to the baseline improves system performance (and TBL training time)
significantly, as does adding the crude CS algorithm. The number of learnt rules is sometimes
high. However, although space constraints do not allow the inclusion of a graph here, rule
efficiency declines quickly (as is typical for TBL), and the first few hundred rules are by far
the most important. We note that the major error source still is incorrectly inferred quantity.
We have stayed at the segmental level of lexical transcription, with no aim of modelling
contextual processes. Although this approach would need (at the very least) postprocessing for
many applications, it might be enough for others, such as spelling correction. Result-wise, it
seems that the current approach can challenge Frid’s (2003) results (96.9% on a much larger
(70kw) training corpus), while still retaining the advantage of the more interpretable rule
representation. Frid goes on to predict lexical prosody; we hope to get back to this topic.

4 Future directions
Outside incorporating more sophisticated compound splitting, there are several interesting
directions. The template set is currently small. Likewise, the feature set for each corpus
position may be extended in other ways, for instance by providing classes of graphemes – C and V is a good place to start, but place or manner of articulation for consonants and frontness for vowels might also be considered. Such classes might help in finding generalizing rules over, say, front vowels or nasals, and might help where data is sparse; the extracted rules are also likely to be more linguistically relevant. If so, segments should preferably be chosen such that they fall cleanly into classes.
Another, orthogonal approach is “multidimensional” TBL (Florian & Ngai 2001), i.e., TBL
with more than one variable. For instance, the establishment of stress pattern may determine
phoneme transcription, or the other way round. For most TBL systems, rules can change one,
prespecified attribute only (although many attributes may provide context). This is true for µ-TBL as well; however, we are currently considering an extension.
Also interesting is the idea of trying to predict quantity and stress reductively, with Constraint
Grammar-style reduction rules (i.e., “if Y, remove tag X from the set of possible tags”). Each
syllable is assigned an initial set of all possible stress levels, a set which is reduced by positive
rules (‘ending -<ör># has main stress; thus its predecessor does not’) as well as negative
(‘ending -<lig># never takes stress’). µ-TBL conveniently supports reduction rules.

References
Bouma, G., 2000. A finite state and data oriented method for grapheme to phoneme
conversion. Proceedings of the first conference on North American chapter of the
Association for Computational Linguistics, Seattle, WA.
Brill, E., 1992. A simple rule-based part of speech tagger. Third Conference on Applied
Natural Language Processing, ACL.
Brodda, B., 1979. Något om de svenska ordens fonotax och morfotax: Iakttagelse med
utgångspunkt från automatisk morfologisk analys. PILUS 38. Institutionen för lingvistik,
Stockholms universitet.
Florian, R. & G. Ngai, 2001. Multidimensional Transformation-Based Learning. Proceedings
of the Fifth Workshop on Computational Language Learning (CoNLL-2001), Toulouse.
Frid, J., 2003. Lexical and Acoustic Modelling of Swedish Prosody. PhD Thesis. Travaux de
l'institut de linguistique de Lund 45. Dept. of Linguistics, Lund University.
Hedelin, P., A. Jonsson & P. Lindblad, 1987. Svenskt uttalslexikon (3rd ed.). Technical report,
Chalmers University of Technology.
Kondrak, G., 2000. A new algorithm for the alignment of phonetic sequences. Proceedings of
the first conference on North American chapter of the ACL, Morgan Kaufmann Publishers
Inc, 288-295.
Lager, T., 1999. The µ-TBL System: Logic Programming Tools for Transformation-Based
Learning. Third International Workshop on Computational Natural Language Learning
(CoNLL-1999), Bergen.
Olsson, L-J., 1998. Specification of phonemic representation, Swedish. DEL 4.1.3 of EC
project “SCARRIE Scandinavian proof-reading tools” (LE3-4239).

The Articulation of Uvular Consonants: Swedish


Sidney Wood
Dept. of Linguistics and Phonetics, Centre for Languages and Literature, Lund University
sidney.wood@ling.lu.se

Abstract
The articulation of uvular consonants is studied with particular reference to quantal aspects
of speech production. Data from X-ray motion films are presented. Two speakers of Southern
Swedish give examples of [ ]. The traditional view, that uvular consonants are produced by
articulating the tongue dorsum towards the uvula, is questioned, and theoretical
considerations point instead to the same upper pharyngeal place of articulation as for [o]-
like vowels. The X-ray films disclose that these subjects did indeed constrict the upper
pharynx for [ ].

1 Introduction
1.1 The theory of uvular articulations
This study begins by questioning the classical account of uvular consonant production (e.g. Jones, 1964), according to which the tongue dorsum is raised towards the uvula, and the uvula vibrates for a rolled []. Firstly, it is not clear how a vibrating uvula would produce the acoustic energy of a typical rolled []. A likely process exploits a Bernoulli force in the constricted passage to chop the voiced sound into pulses when air pressure and tissue elasticity are suitably balanced, which requires that intermittent occlusion is possible between pulses. Unfortunately, there are free air passages on either side of the uvula that should prevent this from happening. Secondly, these same passages should likewise prevent complete occlusion for a uvular stop, and they should also prevent the Reynolds number from becoming large enough for the turbulence of uvular fricatives.
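For illustration only (standard fluid dynamics, not part of the original argument's wording): turbulence arises when the Reynolds number of the flow through a channel exceeds a critical value,

    Re = \frac{v \, d}{\nu} > Re_{crit},

where v is the flow velocity, d a characteristic dimension of the channel, ν the kinematic viscosity of air, and Re_crit is commonly taken to be on the order of a couple of thousand for speech-sized channels. Free passages on either side of the uvula would divide the airflow and keep the velocity through any single passage, and hence Re, low.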
If the uvula is not a good place for producing consonants known as “uvular”, how else
might they be produced? Wood (1974) observed that the spectra of vowel-to-consonant tran-
sitions immediately adjacent to uvular consonants were very similar to the spectra of [o ]-like
vowels, or to their respective counterparts [ ], and concluded that they shared the same
place of articulation, i.e. the upper pharynx, as confirmed for [o ]-like vowels by Wood (1979).
Mrayati et al. (1988) studied the spectral consequences of systematic deformations along an
acoustic tube, and also concluded that the upper pharynx was a suitable location for these
same consonants and vowels. Observations like these are clearly relevant for discussions of
the quantal nature of speech (Stevens 1972, 1989). Clarifying the production of uvular
consonants is not just a matter of correcting a possible misconception about a place of
articulation. It concerns fundamental issues of phonetic theory.

1.2 This investigation


The uvular articulations were analysed from cinefluorographic films, a method that enables
simultaneous articulatory activity to be observed in the entire vocal tract, and is therefore
suitable for studying the tongue manoeuvres associated with uvular consonants. Two
undisputed sources of uvular consonants are [] in southern Swedish, and [ q χ] in West
Greenlandic Inuit. The subjects of the films are native speakers of these languages. Examples
from the Greenlandic subject have been published in e.g. Wood (1996a-b, 1997). Examples
from one Swedish subject are reported in this paper. Examples from a second Swedish subject
will be presented at the conference.

2 Procedures
Wood (1979) gives details of how the films were made. One reel of 35mm film was exposed
per subject at an image rate of 75 frames/second (i.e. 1 frame every 13.3ms), allowing about
40 seconds per subject. Each frame received a 3ms exposure.

Figure 1. (a) Example of profile tracing and identification of prominent features. Note the
difference between the tongue midline and the edge contours. (b) Examples of tongue body
and tongue blade manoeuvres (five successive film frames in this instance).

In the film by SweA, Swedish sibilants were commuted through different vowel environ-
ments. The uvular variant of Swedish /r/ occurs in the present indicative verb ending {/ar/}
followed by the preposition {/i:/}, yielding several tokens of the sequence [ai].
In the film by SweB, the long vowels of Swedish (diphthongized in this dialect) were
placed in a /bVd/ environment. The uvular variant of Swedish /r/ occurs where the subject
recited the date and location of the film session. The word “four”, fyra (/fy:ra/), is reported
here, yielding the sequence [ ya].

3 Examples from the subject SweB


The frame-by-frame tongue body movement by SweB in [ya] is summarised in Figure 2. The
sequence of profiles from [y] through [] to [a] is shown in Figures 3-5 (every other film
frame, i.e. about 27ms between each illustration).

Figure 2. Subject SweB. Frame-by-frame tongue body movement (13.3 ms for each step) in
the transition from [y] to [] (left) and [] to [a] (right). The numbers refer to frames on the
film.

Figure 3. Every other profile from the sequence [ya], starting from the most complete /y:/
profile (left, frame 2298): tongue body raised towards the hard palate and lips rounded. The
tongue body was then retracted for the transition to [], and the lip rounding withdrawn (2300
centre, 2302 right). Continued in Figure 4.

Figure 4. Profiles from the sequence [ya], continued from Figure 3. The transition to []
continued to frame 2304 (left), concluding with a narrow pharyngeal constriction (circled).
This retraction was accompanied by slight depression, so that the tongue dorsum passed
below the uvula and was directed into the upper pharynx. The lip rounding of /y:/ is still being
withdrawn. Activity for /a/ then commenced, continuing through frames 2306 (centre) and 2308 (right). The tongue body gesture of /a/ is directed towards the lower pharynx, accompanied by mandibular depression. The velar port opened slightly in frame 2308 (right) (this sequence is phrase-final and was followed by a breathing pause). Continued in Figure 5.

Figure 5. Profiles from the sequence [ya], continued from Figure 4. The transition from []
to [a] continued through frame 2310 (left) to frame 2312 (right), concluding with a narrow
low pharyngeal constriction (circled), as expected from Wood (1979).

4 Discussion and conclusions


During the retracting tongue body manoeuvre from [y] to [], seen in Figure 2 (left) and in profiles 2300 to 2304 in Figures 3 and 4, the tongue body was depressed slightly. Consequently it passed below the uvula and continued into the pharynx. For this instance of [], the subject did not elevate the
tongue dorsum towards the uvula. Similar behaviour was exhibited by the West Greenlandic
subject for [ q χ], and by the second Swedish subject whose results will be presented at the
conference.
The target of the tongue body gesture of [] was the upper pharynx, as hypothesized. This
was also the case in the other data to be reported at the conference. The upper pharynx is also
the region that is constricted for [o] and []-like vowels, which means that this one place of
articulation is shared by all these consonants and vowels.
The upper pharynx is a more suitable place than the uvula for producing “uvular” stops,
fricatives and trills. The soft smooth elastic surfaces of the posterior part of the tongue and the
opposing posterior pharyngeal wall allow perfect occlusion, or the creation of apertures
narrow enough for the generation of turbulence.

References
Jones, D., 1964. An Outline of English Phonetics. Cambridge: W. Heffer & Sons Ltd. (9th
amended edition).
Mrayati, M., R. Carré & B. Guérin, 1988. Distinctive regions and modes: a new theory of
speech production. Speech Communication 7, 257-286.
Stevens, K.N., 1972. The quantal nature of speech: Evidence from articulatory-acoustic data.
In P.B. Denes & E.E. David, Jr. (eds.), Human Communication: A Unified View. New
York: McGraw Hill, 243-255.
Stevens, K.N., 1989. On the quantal nature of speech. In J. Ohala (ed.), On the Quantal
Nature of Speech. Journal of Phonetics 17, 3-45.
Wood, S.A.J., 1974. A spectrographic study of allophonic variation and vowel reduction in
West Greenlandic Eskimo. Working Papers 4, Dept. of Linguistics and Phonetics,
University of Lund, 58-94.
Wood, S.A.J., 1979. A radiographic analysis of constriction locations for vowels. Journal of
Phonetics 7, 25-43.
Wood, S.A.J., 1996a. Temporal coordination of articulator gestures: an example from
Greenlandic. Journal of the Acoustical Society of America (A). Poster presented at 131st
meeting of the Acoustical Society of America, Indianapolis.
Wood, S.A.J., 1996b. The gestural organization of vowels: a cinefluorographic study of
articulator gestures in Greenlandic. Journal of the Acoustical Society of America 100, 2689
(A). Poster presented at the Third Joint Meeting of the Acoustical Societies of America and
Japan, Honolulu.
Wood, S.A.J., 1997. A cinefluorographic study of the temporal organization of articulator
gestures: Examples from Greenlandic. In P. Perrier, R. Laboissière, C. Abry & S. Maeda
(eds.), Speech Production: Models and Data (Papers from the First ESCA Workshop on
Speech Modeling and Fourth Speech Production Seminar, Grenoble 1996). Speech
Communication 22, 207-225.
Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics 149
Working Papers 52 (2006), 149–152

Acoustical Prerequisites for Visual Hearing


Niklas Öhrström and Hartmut Traunmüller
Department of Phonetics, Stockholm University
{niklas.ohrstrom|hartmut}@ling.su.se

Abstract
The McGurk effect shows in an obvious manner that visual information from a speaker’s articulatory movements influences auditory perception. The present study concerns the robustness of such speech-specific audiovisual integration. What are the acoustical prerequisites for audiovisual integration to occur in speech perception? Auditory, visual and audiovisual syllables (phonated and whispered) were presented to 23 perceivers. In some of the stimuli, the auditory signal was replaced by a schwa syllable, a dynamic source signal or a constant source signal. The results show that dynamic spectral information from a source signal suffices as auditory input for speech-specific audiovisual integration to occur. The results also confirm that the type of lip rounding (and its absence) provides strong visual cues.

1 Introduction
Visual contribution to speech comprehension was for a long time ignored by theorists and
only accounted for when the auditory speech signal was degraded (Sumby & Pollack, 1954).
However, McGurk and MacDonald (1976) showed that auditory speech perception could be
altered by vision even when the auditory stimulus lacked ambiguity. They used [baba] and
[ɡaɡa] syllables dubbed on visual stimuli with different consonants. A visual [ɡaɡa] dubbed
on an auditory [baba] evoked the percept of /dada/. A visual [baba] dubbed on an auditory
[ɡaɡa] was often perceived as /ɡabɡa/ or /baɡba/. This demonstrated ordinary speech
perception to be a bimodal process in which optic information about a speaker’s articulatory
movements is integrated into auditory perception. Traunmüller and Öhrström (in press) have
demonstrated that this also holds for vowels. It has been shown experimentally that perception
of features such as labiality and lip rounding is dominated by the visual signal. In addition, it
is worth mentioning that crossmodal illusions are not necessarily restricted to speech
perception: Shams et al. (2000) demonstrated that the visual perception of the numerosity of
flashes can be altered by simultaneous auditory presentation of clicks.
Bimodal speech perception normally involves synchrony between the auditory and the
visual information from a speaker’s articulatory movements. Visual information can,
therefore, be expected to have a substantial influence on auditory speech perception, but visual hearing might require the presence of a more or less authentic acoustic speech signal. This study aims to explore the acoustical prerequisites for visually influenced auditory perception to occur.
How much information from the original acoustic signal can we remove and still evoke visual
hearing? In this study the four long Swedish vowels /i/, /u/, /ɛ/ and /ɒ/ will be tested
(appearing both phonated and whispered in a [b_d] frame). In the first condition the formant
frequencies of the vowel will be changed (in this case a [ə] will be used). In the second
condition the formant peaks will be flattened out, whereby an approximate source signal will
be obtained. In the third condition, the formant peaks will be flattened out and the acoustic
signal will be kept in a steady state. It can be expected that at least the visible type of lip
rounding will have an influence on auditory perception.

2 Method
2.1 Speakers and speech material
One male and one female lecturer from the Department of Linguistics served as speakers. The
recordings took place in an anechoic chamber using a Panasonic NVDS11 video camera and a Brüel & Kjær 4215 microphone. The speakers produced the more or less meaningful Swedish
syllables /bid/, /bud/, /bɛd/ and /bɒd/ in both phonated and whispered fashion. They were
also asked to produce [bəd]. The video recordings were captured in DV format and the
acoustic signal was recorded separately (sf = 44.1 kHz, 16 bit/sample, mono). The acoustic
recordings were subsequently manipulated in different ways using Praat (Boersma &
Weenink, 2006): Firstly, all acoustic syllables were resynthesized (sf = 11 kHz). The
resynthesis was carried out using the Praat algorithm “LPC-burg”. The [bəd] syllable was also
resynthesized with formant bandwidths expanded to Bn = 2 Fn. The spectrally flattened schwa
in this syllable is most similar to a source signal. Finally, to obtain a constant spectrally
flattened signal, one glottal period of this schwa was selected and iterated. To obtain a
constant whispered spectrally flattened signal, a window of 25 ms of the spectrally flattened
whispered schwa was subjected to LPC analysis and resynthesized with increased duration.
The final audiovisual stimuli were obtained by synchronizing the video signals with the
manipulated audio signals. The synchronization was based on the release burst of the first
consonant and performed in Premiere 6.5. The constant spectrally flattened signals were made
equal in duration to the whole visual stimuli (approximately 2 s). Each optic stimulus
(except [bəd]) was presented together with its corresponding auditory one, the acoustic schwa
vowel [ə], the spectrally flattened signal (SF) and the constant spectrally flattened signal
(CSF). Each visual syllable (except [bəd]) and each auditory stimulus was also presented
alone. In this way, 54 stimuli were obtained for each speaker. The total perception task
consisted of two blocks. Block one consisted of 92 audio (A) and audiovisual (AV) stimuli in
which each stimulus was presented once in random order. Block two consisted of 16 visual
(V) stimuli, each presented twice in random order.
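For readers who wish to approximate comparable stimuli, the following is a rough sketch only, not the authors' Praat procedure (they flattened the spectrum by expanding the formant bandwidths to Bn = 2Fn, whereas the sketch uses full LPC inverse filtering); the file name, LPC order and assumed fundamental frequency are illustrative assumptions.

    # Rough approximation of two of the manipulations described above:
    # (1) a spectrally flattened (SF-like) signal via LPC inverse filtering,
    # (2) a constant (CSF-like) signal by iterating one period for about 2 s.
    import numpy as np
    import librosa
    import scipy.signal as sig
    import soundfile as wav

    y, fs = librosa.load("bed_schwa.wav", sr=11000)   # hypothetical [bəd] recording

    # (1) Remove the formant envelope with the LPC prediction-error filter A(z).
    a = librosa.lpc(y, order=12)            # a = [1, a1, ..., a12]
    residual = sig.lfilter(a, [1.0], y)     # approximate source / flat spectrum

    # (2) Pick one period from the middle of the flattened signal and tile it.
    f0 = 120.0                              # assumed fundamental frequency (Hz)
    period = int(round(fs / f0))            # samples per period
    start = len(residual) // 2              # somewhere inside the vowel
    one_period = residual[start:start + period]
    n_copies = int(np.ceil(2.0 * fs / period))
    constant = np.tile(one_period, n_copies)[:int(2.0 * fs)]

    wav.write("sf_signal.wav", residual / np.max(np.abs(residual)), fs)
    wav.write("csf_signal.wav", constant / np.max(np.abs(constant)), fs)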

2.2 Perceivers
23 subjects who reported normal hearing and normal or corrected-to-normal vision (11 male,
aged 17 to 52 years, and 12 female, aged 20 to 50 years) served as perceivers. All were
phonetically naïve native listeners of Swedish.

2.3 Perception task


The perceivers wore AKG 135 headphones and were seated with their faces approximately 50 cm from a computer screen. All the stimuli were presented using Windows Media Player.
During block 1 (which contained AV and A stimuli), the perceivers were instructed to
report what they had heard. Nevertheless, they were instructed to always look at the speaker
when shown. The perceivers were allowed to repeat each stimulus as many times as they
wished. If they had heard a [bVd] syllable (which could appear in a very distinct or a vague manner), they were asked to report which one of the nine long Swedish vowels it resembled the most. They gave their responses by clicking on orthographic symbols of the Swedish vowels (a /ɒ/, e /e/, i /i/, o /u/, u /ʉ/, y /y/, å /o/, ä /ɛ/, ö /ø/) arranged in the right half of the screen in the manner of an IPA chart. There was a response alternative “hör ingen vokal” (‘hear no vowel’) right under the chart. This was to be used when no syllable was heard or when the sound was not heard as a human vowel.
During block 2 (which contained optic stimuli only) the perceivers were instructed to report
the vowel perceived through lipreading. As before, they were allowed to repeat each stimulus
as many times as they wished. The whole experiment lasted for approximately 30 minutes.

3 Results
The responses of all subjects to each stimulus combination, whispered and phonated versions
pooled, are shown in Table 1. It can be seen that the responses to congruent AV stimuli and
auditorily presented vowels are in accord with the speaker’s intention. With vowels presented
visually only, there were many confusions. The unrounded /i/ and /ɛ/ were mostly confused
with other unrounded vowels. The in-rounded /u/ was predominantly confused with other in-
rounded vowels (/ʉ/ and /o/). The out-rounded /ɒ/ was mostly confused with other out-
rounded vowels (in this case with /ø/) and, to some extent, with in-rounded vowels. The
auditory [ə] was almost exclusively categorized as an out-rounded vowel (/ɒ/ or /ø/) and
incongruent visual cues, such as absence of lip rounding, contributed only marginally to the
auditory perception.

Table 1. Responses from all subjects (in %) to all stimuli, whispered and phonated versions
pooled. Boldface: most frequent response. A: acoustic cues, V: optic cues, SF: spectrally
flattened [bəd], CSF: constant spectrally flattened [ə]. “*”: no audible vowel.
Stimuli Responses
A V /i/ /y/ /ʉ/ /u/ /e/ /ø/ /o/ /ɛ/ /ɒ/ *
/i/ /i/ 99 1
/u/ /u/ 89 11
/ɛ/ /ɛ/ 11 1 86 2
/ɒ/ /ɒ/ 100
/i/ - 99 1
/u/ - 86 14
/ɛ/ - 10 2 88
/ɒ/ - 100
- /i/ 70 24 1 4 2
- /u/ 1 1 7 87 4
- /ɛ/ 6 18 1 66 10
- /ɒ/ 1 1 19 5 74

[ə] - 48 52
[ə] /i/ 5 38 3 53
[ə] /u/ 49 2 49
[ə] /ɛ/ 1 35 3 60 1
[ə] /ɒ/ 50 50
SF - 7 1 1 9 1 22 1 2 48 9
SF /i/ 29 4 2 18 3 39 3
SF /u/ 2 48 9 4 1 27 9
SF /ɛ/ 10 4 4 13 13 48 8
SF /ɒ/ 2 9 21 2 1 61 4
CSF - 2 9 1 1 7 80
CSF /i/ 5 2 5 1 3 83
CSF /u/ 1 7 4 2 4 82
CSF /ɛ/ 2 5 4 2 86
CSF /ɒ/ 1 1 9 10 79

When presented in auditory mode alone, the spectrally flattened vowel (SF) was mostly
categorized as out-rounded (in 70% of the cases as /ɒ/ or /ø/) but also, to some extent, as an
unrounded or in-rounded vowel. When the auditory source signal was presented with different
visual cues, type of rounding was very often perceived in accord with the visual stimulus. The
constant source signal (CSF) was not very often identified as a human vowel or syllable, but
there were traces of influence from the visual signal.

4 Discussion
These experiments have shown that auditory perception of an approximate acoustic source
signal (SF) is sensitive to visual input. In this case, the type of rounding was often perceived
in accord with the visual signal. Interestingly, there was a perceptual bias towards /ø/ and /ɒ/ for the stimuli containing acoustic [bəd] syllables, SF signals and CSF signals, although [ə] is undefined with respect to its roundedness. Evidently, the SF and CSF signals still contain some acoustic traces of the [bəd]. In this study the [ə]s produced by the two
speakers were categorized as rounded vowels. A possible explanation is that the Swedish
phonological system does not offer any unrounded vowels, except possibly /ɛ/, in this region
of the vowel space. Therefore, it cannot be excluded that subjects actually heard an unrounded
vowel for which they lacked a response alternative, but coarticulation effects from the initial
labial consonant might also have caused a bias in favor of rounded vowels.
The auditory perception of the acoustic schwa vowel was not much influenced by the visual
signal. This could be due to the fact that a naturally articulated schwa has a definite place in
auditory vowel space since the formant peaks are distinct. This makes the acoustic cues quite
salient and leaves just a small space for visual influence. On the other hand, the approximate
acoustic source signal with preserved dynamics (SF) evoked the perception of a human
vocalization, although the place of the vowel in auditory vowel space was only vaguely
suggested by its acoustic cues, which were much less salient. This leaves room for the
visual information about a speaker’s articulatory movements to be integrated into the auditory
percept. The constant source signal (CSF) lacked both dynamic properties and a distinct place
in vowel space. It also lacked the temporal alignment with the visual signals that was present
in the other stimuli. It was, therefore, perceived as a sound that was separate from the human
utterance that the visible articulatory movements suggested. Thus, it appears that visual
hearing of speech requires the presence of an acoustic signal that can easily be interpreted as
belonging together with the visual signal.

Acknowledgements
This investigation has been supported by grant 2004-2345 from the Swedish Research
Council.

References
Boersma, P. & D. Weenink, 2006. Praat – a system for doing phonetics by computer.
http://www.fon.hum.uva.nl/praat/.
McGurk, H. & J. MacDonald, 1976. Hearing lips and seeing voices. Nature 264, 746-748.
Shams, L., Y. Kamitani & S. Shimojo, 2000. What you see is what you hear. Nature 408, 788.
Sumby, W.H. & I. Pollack, 1954. Visual contribution to speech intelligibility in noise. Journal
of the Acoustical Society of America 26, 212-215.
Traunmüller, H. & N. Öhrström, in press. Audiovisual perception of openness and lip
rounding in front vowels. Journal of Phonetics.
